CDC File Transfer

github.com

352 points by GalaxySnail 16 hours ago


EdSchouten - 13 hours ago

I’ve also been doing lots of experimenting with Content Defined Chunking since last year (for https://bonanza.build/). One of the things I discovered is that the most commonly used algorithm, FastCDC (also used by this project), can be improved significantly by looking ahead. An implementation of that can be found here:

https://github.com/buildbarn/go-cdc

MayeulC - 10 hours ago

I am quite confused; doesn't rsync already use content-defined chunk boundaries, with a condition on the rolling hash to define boundaries?

https://en.wikipedia.org/wiki/Rolling_hash#Content-based_sli...

The speed improvements over rsync seem related to a more efficient rolling hash algorithm, and possibly to using native Windows executables instead of Cygwin (Windows file systems are notoriously slow; maybe that plays a role here).

Or am I missing something?

In any case, the performance boost is interesting. Glad the source was opened, and I hope it finds its way into rsync.

rekttrader - 15 hours ago

Nice to see Stadia had some long-term benefit. It’s a shame they don’t make a self-hosted version, but in today’s DRM world that would just be treated as piracy.

wheybags - 11 hours ago

If anyone else was left wondering about the details of how CDC actually generates chunks, I found these two blog posts explained the idea pretty clearly:

https://joshleeb.com/posts/content-defined-chunking.html

https://joshleeb.com/posts/gear-hashing.html
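
For the impatient, here is a rough Python sketch of the gear-hash chunking those posts describe. The gear table, mask, and size limits are arbitrary illustration values, and real implementations (FastCDC, this project) add normalization and other tricks on top:

    import random

    random.seed(0)
    # 256 random 64-bit values; real implementations ship a fixed, precomputed table.
    GEAR = [random.getrandbits(64) for _ in range(256)]

    MASK = (1 << 13) - 1          # boundary hits roughly every 2^13 bytes past MIN_SIZE
    MIN_SIZE, MAX_SIZE = 2048, 65536

    def chunk(data: bytes):
        """Yield content-defined chunks of `data` using a gear rolling hash."""
        start, h = 0, 0
        for i, b in enumerate(data):
            h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
            size = i - start + 1
            # Cut when the low bits of the hash are all zero (within the size limits).
            if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]

    print([len(c) for c in chunk(random.randbytes(200_000))][:10])

Because the cut decision depends only on nearby bytes, inserting or deleting data shifts boundaries locally instead of invalidating every chunk after the edit, which is what makes the dedup and diffing work.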

tgsovlerkhgsel - 10 hours ago

Key sentence: "The remote diffing algorithm is based on CDC [Content Defined Chunking]. In our tests, it is up to 30x faster than the one used in rsync (1500 MB/s vs 50 MB/s)."

bilekas - 9 hours ago

This is actually kind of cool. I've implemented my own version of this for my job, and it seems to be something that's important when the numbers get tight. But if I remember their case correctly, wouldn't it have been easier to work from rsync?

> scp always copies full files, there is no "delta mode" to copy only the things that changed, it is slow for many small files, and there is no fast compression.

I haven't tried it myself, but doesn't this already suit that requirement? https://docs.rc.fas.harvard.edu/kb/rsync/

> Compression: If the SOURCE and DESTINATION are on different machines with fast CPUs, especially if they’re on different networks (e.g. your home computer and the FASRC cluster), it’s recommended to add the -z option to compress the data that’s transferred. This will cause more CPU to be used on both ends, but it is usually faster.

Maybe it's not fast enough, but it seems a better place to start than scp, IMO.

AnonC - 12 hours ago

Does anyone know if there’s work being done to integrate this into the standard rsync tool (even as an optional feature)? It seems like a very useful improvement that ought to be widely available. Judging from the website, it’s a bit disappointing that it’s not even available for Linux-to-Linux transfers.

velcrovan - 3 hours ago

> Download the precompiled binaries from the latest release to a Windows device and unzip them. The Linux binaries are automatically deployed to ~/.cache/cdc-file-transfer by the Windows tools. There is no need to manually deploy them.

Interesting, so unlike rsync there is no need to set up a service on the destination Linux machine. That always annoyed me a bit about rsync.

charleshwang - 3 hours ago

Is this how IBM Aspera works too? I was working QA at a game publisher a while ago, and they used it to upload some screen recordings. I didn't understand how it worked, but it was exceeding the upload speeds of the regular office internet.

https://www.ibm.com/products/aspera

shae - 3 hours ago

I've read lots about content defined chunking and recently heard about monoidal hashing. I haven't tried it yet, but monoidal hashing reads like it would be all-around better; does anyone know why or why not?
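
I'm not sure which construction you've seen, but as I understand it the idea is a hash whose values form a monoid, so hashes of adjacent spans compose without re-reading the bytes. A toy illustration (a polynomial hash carried together with its length; not any particular library's API, and not collision-resistant against an adversary):

    MOD = (1 << 61) - 1     # a Mersenne prime modulus
    BASE = 257

    def hash_bytes(data: bytes):
        h = 0
        for b in data:
            h = (h * BASE + b) % MOD
        return (h, len(data))

    def combine(left, right):
        """Monoid operation: the hash of the concatenation of two spans."""
        (h1, l1), (h2, l2) = left, right
        return ((h1 * pow(BASE, l2, MOD) + h2) % MOD, l1 + l2)

    a, b = b"hello ", b"world"
    assert combine(hash_bytes(a), hash_bytes(b)) == hash_bytes(a + b)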

est - 12 hours ago

I wonder if this could be applied to git.

A git blob is hashed with a header containing its decimal length; if you change even a small bit of the content, you have to recalculate the hash from the start.

Something like CDC would improve this a lot.
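
To make that concrete, here is roughly how a (SHA-1) blob ID is computed; a one-byte edit still means feeding the whole object through the hash again:

    import hashlib

    def git_blob_id(content: bytes) -> str:
        # Git (with SHA-1) hashes "blob <decimal length>\0" followed by the content,
        # so a one-byte change still requires re-hashing from the first byte.
        header = b"blob %d\0" % len(content)
        return hashlib.sha1(header + content).hexdigest()

    data = b"a" * 10_000_000
    print(git_blob_id(data))                  # the whole 10 MB goes through SHA-1
    print(git_blob_id(data[:-1] + b"b"))      # one byte changed: full re-hash again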

ksherlock - 3 hours ago

They should have duck ducked the initialism. CDC is Control Data Corporation.

Sammi - 9 hours ago

It's dead and archived atm, but it looks like a good candidate for revival as an actual active open source project. If you ever wanted to work on something that looks good on your resume, then this looks like your chance. Basically just get it running and released on all major platforms.

0xfeba - 4 hours ago

The name reminds me of Microsoft's RDC, Remote Differential Compression.

https://en.wikipedia.org/wiki/Remote_Differential_Compressio...

mikae1 - 13 hours ago

> cdc_rsync is a tool to sync files from a Windows machine to a Linux device, similar to the standard Linux rsync.

Does this work Linux to Linux too?

modeless - 13 hours ago

Does Steam do something like this for game updates?

phyzome - 7 hours ago

You can see something similar in use in the borg backup tool -- content-defined chunking, before deduplication and encryption.

theamk - 15 hours ago

This CDC is "Content Defined Chunking" - fast incremental file transfer.

The use case is copying a file over a slow network when a previous version already exists on the other side, so you can save time by sending only the changed parts of the file.
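
A rough sketch of that idea, assuming both sides chunk their copy with the same content-defined chunker and compare chunk hashes (an illustration, not the exact protocol rsync or this project uses):

    import hashlib

    def digest(chunk: bytes) -> str:
        return hashlib.sha256(chunk).hexdigest()

    def delta(src_chunks, dst_hashes):
        """Return a rebuild recipe plus only the chunks the destination lacks."""
        have = set(dst_hashes)
        recipe, payload = [], {}
        for c in src_chunks:
            h = digest(c)
            recipe.append(h)
            if h not in have:
                payload[h] = c        # new or changed chunk: its bytes go over the wire
        return recipe, payload

    def rebuild(recipe, payload, old_chunks):
        old = {digest(c): c for c in old_chunks}
        return b"".join(payload[h] if h in payload else old[h] for h in recipe)

    old = [b"aaaa", b"bbbb", b"cccc"]
    new = [b"aaaa", b"XXXX", b"cccc"]
    recipe, payload = delta(new, [digest(c) for c in old])
    assert rebuild(recipe, payload, old) == b"aaaaXXXXcccc"
    assert len(payload) == 1          # only the changed chunk is transferred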

Not to be confused with USB CDC ("communications device class"), a USB device protocol used to present serial ports and network cards. It can also be used to transfer files; the old PC-to-PC cables used it by presenting two network cards connected to each other.

janpmz - 12 hours ago

Tailscale plus python3 -m http.server 1337, then pointing the browser at ip:1337, is a nice way to transfer files too (without chunking). I've made an alias for it: alias serveit="python3 -m http.server 1337"

ur-whale - 13 hours ago

Great initiative, especially the new sync algorithm, but giant hurdles to adoption:

- only works on a weird combo of (src platform / dst platform). Why???? How hard is it to write platform-independent code to read/write bytes and send them over the wire in 2025?

- uses bazel, an enormous, Java-based abomination, to build.

Fingers crossed that these can be fixed, or this project is dead in the water.

maxlin - 13 hours ago

Having dabbled in building a quick delta patch system like Steam's, which required me to understand delta patching methods and made small patches to big files in a 10 GB+ installation in a few seconds, this sure is quite interesting!

I wonder if Steam ever decides to supercharge their content handling with some user-space filesystem stuff. With fast connections, there isn't really a reason they couldn't launch games in seconds, streaming data on-demand with smart pre-caching steering based on automatically trained access pattern data. And especially with finely tuned delta patching like this, online game pauses for patching could be almost entirely eliminated. Stop & go instead of a pit stop.

supportengineer - 13 hours ago

Cygwin? Does anyone still use that?

exikyut - 9 hours ago

I'm curious: what does MUC stand for? :)

claytongulick - 14 hours ago

I ran into some of those issues with the chunk size and hash misses when writing bitsync [1], but at the time I didn't want to get too clever with it because I was focused on rsync algorithm compatibility.

This is a cool idea!

[1] https://github.com/claytongulick/bit-sync

laidoffamazon - 12 hours ago

As I've gotten further in my career, I've started to wonder: how many engineering quarters did it take to build this for their customers? How did they manage to get this onto their own roadmap? This seems like a lot of code surface area for a fairly minimal optimization, one that would have been redundant with a different development substrate (like running Windows on Stadia, the way Amazon Luna worked...).

syngrog66 - 7 hours ago

CDC is an unfortunately chosen name