Blockdiff: We built our own file format for VM disk snapshots

cognition.ai

81 points by cyanf 15 hours ago


ori_b - 5 hours ago

It's a bit surprising that they dismissed qcow2, because it does exactly what they want. It's also pretty easy to implement in a VM. The file format is a two-level page table per snapshot, with pointers to blocks. I suspect they didn't really look at it closely enough.
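For readers unfamiliar with the format, the two-level lookup amounts to index arithmetic on the guest offset. A sketch, assuming qcow2's default 64 KiB clusters and 8-byte table entries (this is only the addressing math, not the actual on-disk parsing):

```python
CLUSTER_BITS = 16                  # qcow2 default: 64 KiB clusters
CLUSTER_SIZE = 1 << CLUSTER_BITS
L2_ENTRIES = CLUSTER_SIZE // 8     # 8-byte entries per L2 table

def split_guest_offset(offset):
    """Split a guest offset into (L1 index, L2 index, offset within
    the cluster) -- the two-level page-table walk qcow2 performs."""
    within = offset & (CLUSTER_SIZE - 1)
    cluster = offset >> CLUSTER_BITS
    return cluster // L2_ENTRIES, cluster % L2_ENTRIES, within
```

The L1 entry points at an L2 table, and the L2 entry points at the cluster holding the data; unallocated entries fall through to the backing file.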

Here's the implementation I did for OpenBSD; it's around 700 lines, including the gunk to interface with the hypervisor.

https://github.com/openbsd/src/blob/master/usr.sbin/vmd/vioq...

It's not a good choice for computing diffs, but you can run your VM directly off a read-only base qcow2, with all deltas going into a separate file. That file can either be shipped around or discarded. And multiple VMs can share the same read only base.
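The base-plus-delta read path described above boils down to a copy-on-write overlay. A toy in-memory sketch (illustrative only, not qcow2's actual code):

```python
def read_block(block_no, delta, base):
    """Read one block: prefer the per-VM writable delta layer, fall
    back to the shared read-only base. `delta` maps block numbers to
    bytes; `base` is indexable by block number."""
    if block_no in delta:
        return delta[block_no]
    return base[block_no]

def write_block(block_no, data, delta):
    """All writes land in the delta, leaving the base untouched, so
    many VMs can share one base and ship or discard their deltas."""
    delta[block_no] = data
```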

So, it probably would have been better to write the code for the hypervisor, and end up with something far more efficient overall.

riedel - 10 hours ago

I wonder why EC2 is so slow on snapshots. We use Ceph internally, and if we wanted to we could export the diffs [0] (we use Proxmox Backup instead). Snapshots have always felt blazing fast. (I still often forget to take them; I need a way to trigger them from the guest.)

[0] https://ceph.io/en/news/blog/2013/incremental-snapshots-with...
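The workflow in [0] is built around `rbd export-diff`, which emits only the extents changed between two snapshots. A hedged sketch that only constructs the command lines (the pool, image, and snapshot names are made up for illustration):

```python
def export_diff_cmd(pool, image, from_snap, to_snap, out_path):
    """Build an `rbd export-diff` invocation exporting the delta
    between two snapshots of an RBD image to a file."""
    return ["rbd", "export-diff", f"{pool}/{image}@{to_snap}",
            "--from-snap", from_snap, out_path]

def import_diff_cmd(pool, image, diff_path):
    """Build the matching `rbd import-diff` to replay the delta
    onto a copy of the image elsewhere."""
    return ["rbd", "import-diff", diff_path, f"{pool}/{image}"]
```

These could be handed to `subprocess.run` on a host with RBD credentials.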

da-x - 3 hours ago

I wonder why they didn't use VDO thin provisioning with LVM2.

Also, a few years ago I implemented a VM management tool called 'vmess', whose concept is to maintain a tree of QCOW2 files in which R/W snapshots are the leaves and R/O snapshots are the interior nodes. Each node is connected up toward the root via the QCOW2 backing-file mechanism, so a newly created leaf starts at zero size. I did this because libvirt+qemu impose various annoying limitations around snapshots-within-qcow2, and I liked the idea of a file per snapshot.

VDO: https://docs.kernel.org/admin-guide/device-mapper/vdo.html (original project URL: https://github.com/dm-vdo/kvdo )

vmess: https://github.com/da-x/vmess
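The tree layout described above can be sketched with a small data structure (an illustrative model of the idea, not vmess's actual code):

```python
class QcowNode:
    """Node in a snapshot tree: interior nodes are read-only
    snapshots, leaves are writable overlays. Each child names its
    parent as its qcow2 backing file, so a fresh leaf starts empty."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    @property
    def read_only(self):
        # A node becomes R/O once something has been layered on it.
        return bool(self.children)

    def backing_chain(self):
        """The chain of backing files from this node up to the root,
        which is what qemu follows on a read miss."""
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain
```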

hanwenn - 11 hours ago

Thanks for writing the blog post; it was a fascinating read!

I was curious about a couple of things:

* Have you considered future extensions where you can start the VM before the FS copy has completed?

* You picked XFS over ZFS and BTRFS. Any reason why XFS in particular?

* You casually mention that you wrote 'otterlink', your own hypervisor. Isn't that by itself a complicated effort worthy of a blog post? Or is it just mixing and matching existing libraries from the Rust ecosystem?

pixelbeat__ - 11 hours ago

I see you use flags to determine if a file needs syncing. When we used fiemap within GNU cp we required FIEMAP_FLAG_SYNC to get robust operation.

(We have since removed the fiemap code from cp and replaced it with SEEK_DATA/SEEK_HOLE.)

petepete - 8 hours ago

I love this, and the post made a complex topic easy to follow for a mere mortal like me. Including images of tables with no alt text is a bit of a barrier though, especially when the company is called Cognition.

Tractor8626 - an hour ago

Interesting tool. Something like btrfs send/receive, but at the file level and filesystem-agnostic.

hugodutka - 11 hours ago

Have you considered https://github.com/containerd/overlaybd? It seems to offer very similar features to blockdiff.

stefanha - 6 hours ago

qemu-img convert supports copy_file_range(2) too. Was the `--copy-range-offloading` option used in the benchmark?

It would be helpful to share the command-line and details of how benchmarks were run.

polskibus - 12 hours ago

Can this be used with a public cloud provider to speed up VM provisioning in CI/CD pipelines? I'm looking for ways to speed up app provisioning for e2e tests.

imiric - 8 hours ago

This is interesting. Is it hypervisor-agnostic?

Ideally, I would like to use something like this without being forced onto a specific file system. That is essentially what qcow2 does, and it's a shame it isn't supported by all hypervisors. But then the implementation would need to be much more complex and reimplement what CoW filesystems give you for free, so I appreciate that this is possible in 600 lines of code.

Also, your repo doesn't have a license, which technically makes it unusable.

cyanf - 10 hours ago

The blog's title can be misleading here: "we" refers to the Cognition team. I don't work at Cognition; I just thought this was interesting.

UomoNeroNero - 8 hours ago

I don’t know how to express how stupid, inadequate, and envious this level of competence makes me feel. To me this article has the density of a plutonium ingot. It’s moving to read (and “maybe” understand, given how well it’s written). Wow, maximum respect, truly.