CDC File Transfer
github.com | 352 points by GalaxySnail | 16 hours ago
I’ve also been doing lots of experimenting with Content Defined Chunking since last year (for https://bonanza.build/). One of the things I discovered is that the most commonly used algorithm FastCDC (also used by this project) can be improved significantly by looking ahead. An implementation of that can be found here:
This lookahead is very similar to the "lazy matching" used in Lempel-Ziv compressors! https://fastcompression.blogspot.com/2010/12/parsing-level-1...
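To make the analogy concrete, here's a rough Go sketch of one way lookahead can be grafted onto a GEAR-based chunker. This is purely illustrative — it is not necessarily what the linked implementation or FastCDC does; the masks, size limits, and table are all made up for the example.

    package cdc

    const (
        minSize   = 2 << 10  // 2 KiB minimum chunk size (illustrative)
        maxSize   = 64 << 10 // 64 KiB maximum chunk size (illustrative)
        lookahead = 512      // how far to peek past the first candidate cut

        maskLoose = uint64(0x0003590703530000)      // ordinary cut condition (made-up bits)
        maskTight = maskLoose | 0x0000000018000000  // stricter condition = "stronger" cut
    )

    // gear is the 256-entry table of random 64-bit constants used by the
    // GEAR hash, assumed to be populated elsewhere (e.g. from a seeded RNG).
    var gear [256]uint64

    // cutWithLookahead returns the length of the next chunk of data.
    // A plain FastCDC-style chunker cuts at the first offset where the
    // rolling hash matches maskLoose. This variant, like lazy matching in
    // an LZ parser, remembers that first candidate but keeps scanning a
    // bounded window for an offset matching the stricter maskTight,
    // preferring the "stronger" boundary when one exists nearby.
    func cutWithLookahead(data []byte) int {
        n := len(data)
        if n <= minSize {
            return n
        }
        if n > maxSize {
            n = maxSize
        }
        var h uint64
        candidate := -1
        for i := 0; i < n; i++ {
            h = (h << 1) + gear[data[i]]
            if i < minSize {
                continue
            }
            if h&maskTight == 0 {
                return i + 1 // strong boundary: cut immediately
            }
            if candidate < 0 && h&maskLoose == 0 {
                candidate = i + 1 // first ordinary boundary; keep looking
            }
            if candidate > 0 && i+1-candidate >= lookahead {
                return candidate // no stronger cut within the window; fall back
            }
        }
        if candidate > 0 {
            return candidate
        }
        return n
    }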
Did you compare it to Buzhash? I assume gearhash is faster given the simpler per iteration structure. (also, rand/v2's seeded generators might be better for gear init than mt19937)
Yeah, GEAR hashing is simple enough that I haven't considered using anything else.
Regarding the RNG used to seed the GEAR table: I don't think it actually makes that much of a difference. You only use it once to generate 2 KB of data (256 64-bit constants). My suspicion is that using some nothing-up-my-sleeve numbers (e.g., the first 2048 binary digits of π) would work as well.
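For what it's worth, seeding the table really is this small; here's a minimal Go sketch (the seed, the choice of PCG from math/rand/v2, and the names are just for illustration):

    package main

    import (
        "fmt"
        "math/rand/v2"
    )

    // buildGearTable fills the 256-entry table of 64-bit constants that a
    // GEAR hash indexes by input byte. Any fixed, reasonably random-looking
    // constants should do; here they come from a seeded PCG so that every
    // producer using the same seed gets the same table.
    func buildGearTable(seed uint64) [256]uint64 {
        rng := rand.New(rand.NewPCG(seed, seed))
        var table [256]uint64
        for i := range table {
            table[i] = rng.Uint64()
        }
        return table
    }

    func main() {
        table := buildGearTable(0x1234)
        fmt.Printf("table[0] = %#016x\n", table[0])
    }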
The random number generator could just as well produce the first 2048 binary digits of pi, so the real question is whether it works with _any_ random numbers.
If it doesn't work equally well with any random numbers, then some tables work better than others, and intuitively you could search for a best seed (or a set of best seeds).
I just wanted to let you know, this is really cool. Makes me wish I still used Bazel.
What would you estimate the performance implications of using go-cdc instead of fastcdc in their cdc_rsync are?
In my case I observed a ~2% reduction in data storage when attempting to store and deduplicate various versions of the Linux kernel source tree (see link above). But that also includes the space needed to store the original version.
If we take that out of the equation and only measure the size of the additional chunks being transferred, it's a reduction of about 3.4%. So it's not an order of magnitude difference, but not bad for a relatively small change.
I wonder whether there's a role for AI here.
(Please don't hurt me.)
AI turns out to be useful for data compression (https://statusneo.com/creating-lossless-compression-algorith...) and RF modulation optimization (https://www.arxiv.org/abs/2509.04805).
Maybe it'd be useful to train a small model (probably of the SSM variety) to find optimal chunking boundaries.
Yeah, that's true. Having some kind of chunking algorithm that's content/file format aware could make it work even better. For example, it makes a lot of sense to chunk source files at function/scope boundaries.
In my case I need to ensure that all producers of data use exactly the same algorithm, as I need to look up build cache results based on Merkle tree hashes. That's why I'm intentionally focusing on having algorithms that are not only easy to implement, but also easy to implement consistently. I think that MaxCDC implementation that I shared strikes a good balance in that regard.
I am quite confused; doesn't rsync already use content-defined chunk boundaries, with a condition on the rolling hash to define boundaries?
https://en.wikipedia.org/wiki/Rolling_hash#Content-based_sli...
The speed improvements over rsync seem related to a more efficient rolling hash algorithm, and possibly by using native windows executables instead of cygwin (windows file systems are notoriously slow, maybe that plays a role here).
Or am I missing something?
In any case, the performance boost is interesting. Glad the source was opened, and I hope it finds its way into rsync.
> doesn't rsync already use content-defined chunk boundaries, with a condition on the rolling hash to define boundaries?
No, it operates on fixed size blocks over the destination file. However, by using a rolling hash, it can detect those blocks at any offset within the source file to avoid re-transferring them.
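For anyone curious what that rolling hash looks like, here's a rough Go sketch of an rsync-style weak rolling checksum. The real rsync code differs in details; this just shows why sliding the window one byte at a time is O(1):

    package rolling

    // Sum is a weak rolling checksum over a fixed-size window, in the
    // style of rsync's weak checksum: a is the sum of the window's bytes,
    // b is the sum of the running prefix sums, both kept modulo 2^16.
    type Sum struct {
        a, b uint32
        size uint32
    }

    // New computes the checksum of the initial window from scratch.
    func New(window []byte) Sum {
        var s Sum
        s.size = uint32(len(window))
        for i, x := range window {
            s.a += uint32(x)
            s.b += uint32(len(window)-i) * uint32(x)
        }
        s.a &= 0xffff
        s.b &= 0xffff
        return s
    }

    // Roll slides the window one byte: drop `out`, append `in`.
    // This constant-time update is what makes it affordable to test for
    // matching blocks at every byte offset of the source file.
    func (s *Sum) Roll(out, in byte) {
        s.a = (s.a - uint32(out) + uint32(in)) & 0xffff
        s.b = (s.b - s.size*uint32(out) + s.a) & 0xffff
    }

    // Digest packs the two halves into the 32-bit weak checksum that gets
    // looked up against the destination's per-block checksums.
    func (s Sum) Digest() uint32 { return s.a | s.b<<16 }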
rsync seems frozen in time; it’s been around for ages and there are so many basic and small quality of life improvements that could have been made that haven’t been. I have always assumed it’s like vim now: only really maintained in theory, not in practice.
Please bear in mind that there are [now] two distinct rsync codebases.
The original is the GPL variant [today displaying "Upgrade required"]:
The second is the BSD clone:
The BSD version would be used on platforms that are intolerant of later versions of the GPL (Apple, Android, etc.).
So you haven't used vim or neovim in the last 10 years?
To be fair, there was a roughly 6 year period when vim saw one very minor release. That slow development period was the impetus for the fork of Neovim.
I know. I use Neovim. But since then, and thanks to Neovim, Vim has sped up and gained some improvements.
Nice to see Stadia had some long-term benefit. It's a shame they don't make a self-hosted version, but if you did that it would just be piracy in today's DRM world.
for self-hosted game streaming you can use moonlight + sunshine, they work really well in my experience.
Exactly my experience too. I easily get 60fps at 1080p over wireless LAN with moonlight + sunshine. Parsec is also another option
Probably wouldn’t have been feasible - I heard developers had to compile their games with Stadia support. Maybe it was an entirely different platform with its own alternative to DirectX, or maybe it had some kind of lightweight emulation layer (such as Proton), but I vaguely remember that the few games I played had custom Stadia key bindings (with Stadia symbols). They would display like that within the game. So some customization definitely did happen.
This is unlike the model that PlayStation, Xbox and even Nvidia are following - I don’t know about Amazon Luna.
Stadia games were just run on Linux with Vulkan + some extra Stadia APIs for their custom swapchain and other bits and pieces. Stadia games were basically just Linux builds.
As I understand it, GeForce Now actually does require changes to the game to run in the standard and until recently only option of "Ready To Play". This is the supposed reason that new updates to games sometimes take time to get released on the service, since either the developers themselves or Nvidia needs to modify it to work correctly on the service. I have no idea if this is true, but it makes sense to me.
They recently added "Install to Play" where you can install games from Steam that aren't modified for the service. They charge for storage for this though.
Sadly, there are still tons of games unavailable because publishers need to opt in and many don't.
They did have a dev console based on a Lenovo workstation, as well as off-menu AMD V340L 2x8GB GPUs, both later leaked into Internet auctions. So some hardware and software customizations had definitely happened.
For self-hosted remote streaming of games, look at Moonlight / Sunshine (Apollo).
Stadia required special versions of games, so it wouldn't be that useful.
It's a shame that virtual / headless displays are such a mess on both Linux and Windows. I use a 32:9 ultrawide and stream to 16:9/16:10 devices, and even with hours of messing around with an HDMI dummy and kscreen-doctor[1] it was still an unreliable mess. Sometimes it wouldn't work when the machine was locked, and sometimes Sunshine wouldn't restore the resolution on the physical monitor (and there's no session timeout either).
Artemis is a bit better, but it still requires per-device setup of displays since it somehow doesn't disable the physical output next to the virtual one. Those drivers also add latency to the capture (the author of looking glass really dislikes them because they undo all the hard work of near-zero latency).
[1]: https://github.com/acuteaura/universe/blob/main/systems/_mod...
On Linux with an AMD i/dGPU, you can set the `virtual_display` module parameter for `amdgpu`[1] and do what you want without the need for an HDMI dummy or weird software. It's also hardware accelerated.
> virtual_display (charp)
> Set to enable virtual display feature. This feature provides a virtual display hardware on headless boards or in virtualized environments. It will be set like xxxx:xx:xx.x,x;xxxx:xx:xx.x,x. It’s the pci address of the device, plus the number of crtcs to expose. E.g., 0000:26:00.0,4 would enable 4 virtual crtcs on the pci device at 26:00.0. The default is NULL.
[1]https://www.kernel.org/doc/html/latest/gpu/amdgpu/module-par...
Use Apollo (a fork of Sunshine) : https://github.com/ClassicOldSong/Apollo
> Built-in Virtual Display with HDR support that matches the resolution/framerate config of your client automatically
It includes a virtual screen driver and handles all the crap (it can disable your physical screen when streaming and re-enable it after, it can generate the virtual screen per client to match that client's needs, or do it per game, or ...)
I stream from my main pc to both my laptop and my steamdeck, and each get the screen that matches them without having to do anything more than connect to it with moonlight.
Artemis/Apollo are mentioned in the post above - yeah they work better than the out of box experience, but you still have to configure your physical screen to be off for every virtual display. It unfortunately only runs on Windows and my machine usually doesn't. I also only have one dGPU and a Raphael iGPU (which are sensitive to memory overclocks) and I like the Linux gaming experience for the most part, so while I did have a working gaming VM, it wasn't for me (or I'd want another GPU).
I don't understand; "self-hosted Stadia" is literally what a myriad of services and tools already provide.
Steam has game streaming built in, and it works very well. Both Nvidia and AMD built this into their GPU drivers at one point or another (I think the AMD one was shut down?)
Those are just the solutions I accidentally have installed despite not using that functionality. You can even stream games from the steam deck!
Sony even has a system to let you stream your PS4 to your computer anywhere and play it. I think Microsoft built something similar for Xbox.
What do you mean, piracy in today's DRM world? Like being able to share your own PC games through the cloud?
You can share the games you authored all you like. If you bought a license to play them that's another story.
Stadia was sadly engineered in such a way that this is impossible.
Speaking of which, who thought up the idea to use custom hardware for this that would _already be obsolete_ a year later? Who considered using Linux native instead of a compat layer? Why did the original Stadia website not even have a search bar??
> it’s just piracy in today’s drm world
...which is more important / needed than ever. I encourage everyone who asks to get my music from BitTorrent instead of Spotify.
Why not something like Bandcamp, or other DRM-free purchase options?
I'm not above piracy if there's no DRM free option (or if the music is very old or the artist is long dead), but I still believe in supporting artists who actively support freedom.
Yep, I put everything on bandcamp. https://justinholmes.bandcamp.com/
Even better though, is a P2P service that is censorship resistant.
But yeah I like Bandcamp plenty.
> artists who actively support freedom.
The bluegrass world is quickly becoming this.
So you create and seed your torrents with your music, and present them prominently on your site?
I was doing that for a while, and running a seedbox. However, on occasions when the seedbox was the only seeder, clients were unable to begin the download, for reasons I've never figured out. If I also seeded from my desktop, then fan downloads were being fed by both the desktop and the seedbox. But without the desktop, the seedbox did nothing.
I need to revisit this in the next few weeks as I release my second record (which, if I may boast, has an incredible ensemble of most of my favorite bluegrass musicians on it; it was a really fun few days at the studio).
Currently I do pin all new content to IPFS and put the hashes in the content description, as with this video of Drowsy Maggie with David Grier: https://www.youtube.com/watch?v=yTI1HoFYbE0
Another note: our study of Drowsy Maggie was largely made possible by finding old-and-nearly-forgotten versions in the Great78 project, which of course the industry attempted to sue out of existence on an IP basis. This is another example of how IP is a conceptual threat to traditional music - we need to be able to hear the tradition in order to honor it.
If anyone else was left wondering about the details of how CDC actually generates chunks, I found these two blog posts explained the idea pretty clearly:
Thanks, I was puzzled by that. They kind of gloss over it in the original link.
Looking forward to reading those.
Key sentence: "The remote diffing algorithm is based on CDC [Content Defined Chunking]. In our tests, it is up to 30x faster than the one used in rsync (1500 MB/s vs 50 MB/s)."
This is actually kind of cool. I've implemented my own version of this for my job, and it seems to be something that's important when the numbers get tight. But if I remember correctly, for their case I guess, wouldn't it have been easier to work from rsync?
> scp always copies full files, there is no "delta mode" to copy only the things that changed, it is slow for many small files, and there is no fast compression.
I haven't tried it myself, but doesn't this already suit that requirement? https://docs.rc.fas.harvard.edu/kb/rsync/
> Compression If the SOURCE and DESTINATION are on different machines with fast CPUs, especially if they’re on different networks (e.g. your home computer and the FASRC cluster), it’s recommended to add the -z option to compress the data that’s transferred. This will cause more CPU to be used on both ends, but it is usually faster.
Maybe it's not fast enough, but seems a better place to start than scp imo.
> The remote diffing algorithm is based on CDC. In our tests, it is up to 30x faster than the one used in rsync (1500 MB/s vs 50 MB/s).
rsync in my experience is not optimized for a number of use cases.
Game development, in particular, often involves truly enormous sizes and numbers of assets, particularly for dev build iteration, where you're sometimes working with placeholder or unoptimized assets, and debug symbol bloated things, and in my experience, rsync scales poorly for speed of copying large numbers of things. (In the past, I've used naive wrapper scripts with pregenerated lists of the files on one side and GNU parallel to partition the list into subsets and hand those to N different rsync jobs, and then run a sync pass at the end to cleanup any deletions.)
Just last week, I was trying to figure out a more effective way to scale copying a directory tree that was ~250k files varying in size between 128b and 100M, spread out across a complicatedly nested directory structure of 500k directories, because rsync would serialize badly around the cost of creating files and directories. After a few rounds of trying to do many-way rsync partitions, I finally just gave the directory to syncthing and let its pregenerated index and watching handle it.
Try this: https://alexsaveau.dev/blog/projects/performance/files/fuc/f...
> The key insight is that file operations in separate directories don’t (for the most part) interfere with each other, enabling parallel execution.
It really is magically fast.
EDIT: Sorry, that tool is only for local copies. I just remembered you're doing remote copies. Still worth keeping in mind.
Does anyone know if there’s work being done to integrate this into the standard rsync tool (even as an optional feature)? It seems like a very useful improvement that ought to be available widely. From this website it seems a bit disappointing that it’s not even available for Linux to Linux transfers.
You can find some thoughts on it not working for Linux to Linux, and broader compatibility, here[1] and here[2].
[1] - https://github.com/google/cdc-file-transfer/issues/56#issuec...
> Download the precompiled binaries from the latest release to a Windows device and unzip them. The Linux binaries are automatically deployed to ~/.cache/cdc-file-transfer by the Windows tools. There is no need to manually deploy them.
Interesting, so unlike rsync there is no need to set up a service on the destination Linux machine. That always annoyed me a bit about rsync.
The most common use for rsync is to run it over ssh where it starts the receiving side automatically. cdc is doing the exact same thing.
You were misinformed if you thought using rsync required setting up an rsync service.
Is this how IBM Aspera works too? I was working QA at a game publisher a while ago, and they used it to upload some screen recordings. I didn't understand how it worked, but it was exceeding the upload speeds of the regular office internet.
I've read lots about content defined chunking and recently heard about monoidal hashing. I haven't tried it yet, but monoidal hashing reads like it would be all around better, does anyone know why or why not?
I wonder if this could be applied to git.
A git blob is hashed with a header containing its decimal length, so if you change even a small bit of content you have to recalculate the hash from the start.
Something like CDC would improve this a lot.
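To make that concrete, this is roughly how git names a blob (a small Go sketch; the "blob <length>\0" header format is git's documented object format, everything else is just for illustration):

    package main

    import (
        "crypto/sha1"
        "fmt"
    )

    // gitBlobID reproduces how git names a blob: SHA-1 over the header
    // "blob <decimal length>\x00" followed by the full file contents.
    // Because the whole object is hashed as one unit, any edit forces a
    // re-hash of the entire file, which is what a CDC-based object store
    // would avoid by hashing content-defined chunks independently.
    func gitBlobID(content []byte) string {
        h := sha1.New()
        fmt.Fprintf(h, "blob %d\x00", len(content))
        h.Write(content)
        return fmt.Sprintf("%x", h.Sum(nil))
    }

    func main() {
        fmt.Println(gitBlobID([]byte("hello world\n")))
    }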
It's done in xet as a replacement for git lfs: https://huggingface.co/blog/from-files-to-chunks
Backup tools like restic/borg do this, I wonder if anyone has used them to replace git yet.
They should have duck ducked the initialism. CDC is Control Data Corporation.
It's dead and archived atm, but it looks like a good candidate for revival as an actual active open source project. If you ever wanted to work on something that looks good on your resume, then this looks like your chance. Basically just get it running and released on all major platforms.
the name reminds me of Microsoft's RDC, Remote Differential Compression.
https://en.wikipedia.org/wiki/Remote_Differential_Compressio...
> cdc_rsync is a tool to sync files from a Windows machine to a Linux device, similar to the standard Linux rsync.
Does this work Linux to Linux too?
Does Steam do something like this for game updates?
Steam unfortunately doesn't use a rolling hash like this (fastcdc, buzhash, etc.), but rather slices files into 1MB chunks, hashes them, and updates at that granularity.
https://partner.steamgames.com/doc/sdk/uploading#AppStructur...
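For contrast, a minimal sketch of that fixed-size scheme (the 1 MB chunk size comes from the linked docs; the hash choice and everything else here is illustrative):

    package chunk

    import "crypto/sha256"

    const chunkSize = 1 << 20 // 1 MiB, per the depot docs cited above

    // fixedChunks hashes a file in fixed 1 MiB slices. A one-byte
    // insertion near the start shifts every later slice, so all of their
    // hashes change and the whole tail gets re-transferred;
    // content-defined boundaries avoid exactly this.
    func fixedChunks(data []byte) [][32]byte {
        var sums [][32]byte
        for off := 0; off < len(data); off += chunkSize {
            end := off + chunkSize
            if end > len(data) {
                end = len(data)
            }
            sums = append(sums, sha256.Sum256(data[off:end]))
        }
        return sums
    }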
You can see something similar in use in the borg backup tool -- content-defined chunking, before deduplication and encryption.
This CDC is "Content Defined Chunking" - fast incremental file transfer.
The use case is copying a file over a slow network when the previous version is already there, so you can save time by only sending the changed parts of the file.
Not to be confused with USB CDC ("communications device class"), a USB device protocol used to present serial ports and network cards. It can also be used to transfer files; the old PC-to-PC cables used it by implementing two network cards connected to each other.
The clever trick is how it recognizes insertions. The standard trick of computing hashes on fixed sized blocks works efficiently for substitutions but is totally defeated by an insertion or deletion.
Instead, with CDC the block boundaries are defined by the content, so an insertion doesn't change the block boundaries, and it can tell that the subsequent blocks are unchanged. I haven't read the CDC paper, but I'm guessing they just use some probabilistic hash function to define certain strings as block boundaries.
Probably worth noting that ordinary rsync can also handle insertions/deletions because it uses a rolling hash. Rsync's method is bandwidth-efficient, but not especially CPU-efficient.
> I haven’t read the CDC paper but I’m guessing they just use some probabilistic hash function to define certain strings as block boundaries.
You choose a number of bits (say, 12) and then evenly distribute these in a 48-bit mask; if the hash at any point has all these bits on, that defines a boundary.
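A minimal Go sketch of that boundary rule (the bit placement, seed, and table here are illustrative, not taken from any particular implementation):

    package cdc

    import "math/rand/v2"

    // spreadMask builds a mask with `bits` bits spread roughly evenly
    // across the top 48 bits of a 64-bit word, as described above.
    func spreadMask(bits int) uint64 {
        var mask uint64
        for i := 0; i < bits; i++ {
            mask |= uint64(1) << uint(16+i*48/bits)
        }
        return mask
    }

    // gear is a fixed table of 256 random 64-bit constants that all
    // producers share (here seeded arbitrarily for the example).
    var gear = func() (t [256]uint64) {
        rng := rand.New(rand.NewPCG(42, 42))
        for i := range t {
            t[i] = rng.Uint64()
        }
        return
    }()

    // boundaries returns the offsets at which the rolling GEAR hash has
    // all the mask bits set, i.e. the content-defined cut points.
    func boundaries(data []byte, mask uint64) []int {
        var cuts []int
        var h uint64
        for i, b := range data {
            h = (h << 1) + gear[b]
            if h&mask == mask {
                cuts = append(cuts, i+1)
            }
        }
        return cuts
    }

With spreadMask(12) as the mask, a boundary fires roughly once every 2^12 bytes on average (assuming the hash bits are close to uniform), so chunks come out around 4 KiB.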
Not to be confused with the Centers for Disease Control.
Especially in the context of the recent (that is, last 10 years) removal of data from Centers for Disease Control sources due to changing political winds.
Tailscale and python3 -m http.server 1337 and then navigating the browser to ip:1337 is a nice way to transfer files too (without chunking). I've made an alias for it alias serveit="python3 -m http.server 1337"
Great initiative, especially the new sync algorithm, but giant hurdles to adoption:
- only works on a weird combo of (src platform / dst platform). Why???? How hard is it to write platform-independent code to read/write bytes and send them over the wire in 2025?
- uses bazel, an enormous, Java-based abomination, to build.
Fingers crossed that these can be fixed, or this project is dead in the water.
Hey, the repo is archived, and as I read it, the tool was meant to solve one specific scenario. Not everything has to please the public.
The great thing is that Googlers could make such a tool and publish it in the first place. So you can improve it to use in your scenario, or become the maintainer of such a tool.
> only works on a weird combo of (src platform / dst platform). Why????
Stadia ran on linux, and 99.9999999% of game development is done on windows (and cross compiled for linux).
> Fingers crossed that these can be fixed, or this project is dead in the water.
The project was archived 9 months ago, and hasn't had a commit in 2 years. It's already dead.
The first thing might be considered a bug by Googlers, but everyone I have talked to LOVED their Bazel, or at least thought of it as superior to any other tool that does the same stuff.
Literally tonight my buddy was talking about his months-long plan to introduce Bazel into his company's infra.
Having dabbled in trying to make a quick delta patch system like Steam's, which required me to understand delta patching methods and could apply small patches to big files in a 10 GB+ installation in a few seconds, this sure is quite interesting!
I wonder if Steam ever decides to supercharge their content handling with some user-space filesystem stuff. With fast connections, there isn't really a reason they couldn't launch games in seconds, streaming data on-demand with smart pre-caching steering based on automatically trained access pattern data. And especially with finely tuned delta patching like this, online game pauses for patching could be almost entirely eliminated. Stop & go instead of a pit stop.
Someone already created that[1] using a custom kernel driver and their own CDN, but they seem to have abandoned it[2], maybe because they would have attracted Valve's wrath by trying to monetize it.
[1] https://web.archive.org/web/20250517130138/https://venusoft....
That's actually quite interesting. Not entirely what I had in mind but close! My version would have only the first boot be a bit slow, but the aspect of dynamically replacing local content there is cool.
This would be extra cool for LAN parties with good network hardware
Steam game installs are bottlenecked by CPU speed these days due to the heavy compression, so I doubt it'd be much faster.
Well, the amount of compression isn't set in stone; obviously a system like this would run with a less compressed dataset to balance game boot time against the time compression takes away from running the game, and would scale with available bandwidth.
With low bandwidth, just downloading the whole thing with enough compression to 80% saturate the local system would be optimal instead, sure.
Cygwin? Does anyone still use that?
Cygwin has its benefits over WSL. For example, it does not run in a VM and therefore does not suffer from the resulting performance penalty.
I'm curious: what does MUC stand for? :)
I ran into some of those issues with the chunk size and hash misses when writing bitsync [1], but at the time I didn't want to get too clever with it because I was focused on rsync algorithm compatibility.
This is a cool idea!
As I've gotten further in my career I've started to wonder - how many engineering quarters did it take to build this for their customers? How did they manage to get this on their own roadmap? This seems like a lot of code surface area for a fairly minimal optimization that would be redundant with a different development substrate (like running Windows on Stadia like how Amazon Luna worked...)
You are thinking like a manager, but this (as with most of the good things in life) has been built by doers, artisans, and engineers (developers).
This is an interesting enough problem, with huge potential benefits for humanity if it manages to improve anything, which it did.
It's easy to get work on this problem. Any effort that shortens game deploy time will be highly visible. It's something every game needs, and every member of the team deals with.
I'm sympathetic to this idea, but it seems like this is a situation that most game developers don't have because they just develop locally. Sometimes they do need to push to a console, which this could help with if Microsoft or Sony built it into their dev kit tooling.
CDC is an unfortunately chosen name