Content-defined chunking (CDC) plays a central role in enabling deduplication within Xet-backed repositories. The idea is simple: split the data in each file into chunks, store only the unique ones, and enjoy the benefits.
In reality, it's more complicated. If you focus solely on maximizing deduplication, the design pushes you toward the smallest possible chunk size, and that choice creates significant overhead for the infrastructure and for builders on the Hub.
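To make the idea concrete, here is a minimal, purely illustrative sketch of CDC in Python. The real chunker lives in xet-core and is written in Rust; the rolling hash, mask, and size bounds below are assumptions chosen only to show the shape of the algorithm.

```python
# Toy content-defined chunking: boundaries are derived from the data itself,
# so a local edit only shifts nearby chunk boundaries instead of all of them.
import hashlib

BOUNDARY_MASK = (1 << 16) - 1                # ~64KB average chunk (assumption)
MIN_CHUNK, MAX_CHUNK = 8 * 1024, 128 * 1024  # guard rails (assumption)

def chunk(data: bytes):
    """Yield (sha256_hex, chunk_bytes) pairs using a toy rolling condition."""
    rolling, start = 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) ^ byte) & 0xFFFFFFFF   # toy rolling hash
        length = i - start + 1
        boundary = (rolling & BOUNDARY_MASK) == 0 and length >= MIN_CHUNK
        if boundary or length >= MAX_CHUNK:
            piece = data[start:i + 1]
            yield hashlib.sha256(piece).hexdigest(), piece
            start = i + 1
    if start < len(data):
        piece = data[start:]
        yield hashlib.sha256(piece).hexdigest(), piece

# Deduplication is then just "store each chunk once, keyed by its hash".
unique_chunks: dict[str, bytes] = {}
for digest, piece in chunk(b"example data " * 10_000):
    unique_chunks.setdefault(digest, piece)
```

Smaller chunks find more duplicate data, but as noted above they also multiply the number of objects the infrastructure has to track.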
Hugging Face's Xet team is bringing CDC from theory to production, giving AI builders significantly faster uploads and downloads (in some cases 2-3x faster). Our guiding principle is simple: enable rapid experimentation and collaboration for teams building and iterating on models and datasets. That means focusing on more than just deduplication; it means optimizing how data moves across the network, how it is stored, and the overall development experience.
The reality of scaling deduplication
Imagine uploading a 200GB repository to the Hub. Today there are several ways to do this, and all of them use a file-centric approach. To speed up file transfers to the Hub, we have open-sourced xet-core and hf_xet, an integration with huggingface_hub that takes a chunk-based approach and is written in Rust.
Chunking a 200GB repository with unique data produces roughly 3 million entries (at approximately 64KB per chunk) in the content-addressed store (CAS) that backs all repositories. If a new version of the model is uploaded, or a branch of the repository is created with different data, more unique chunks are added, driving even more entries into the CAS.
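A quick back-of-envelope check of that figure:

```python
# Rough math for the figures above: ~64KB chunks against a 200GB repository.
chunk_size = 64 * 1024           # ~64KB average chunk
repo_size = 200 * 1000**3        # 200GB
print(repo_size // chunk_size)   # 3_051_757 -> roughly 3 million CAS entries
```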
With nearly 45PB spread across 2 million model, dataset, and Space repositories on the Hub, a purely chunk-based approach could result in 690 billion chunks. Managing this volume of content with chunks alone is simply not feasible due to:
Network overhead: If each chunk is downloaded or uploaded individually, every upload and download generates millions of requests, overwhelming both the client and the server. Even batching queries simply shifts the problem to the storage layer.
Infrastructure overhead: A naive CAS that tracks chunks individually would require billions of entries and lead to eye-watering monthly bills for services like DynamoDB and S3. At Hugging Face's scale, this adds up quickly.
In short, network requests balloon, the database struggles to manage the metadata, and the cost of tracking every individual chunk skyrockets while you wait for your files to transfer.
Design Principles for Large-Scale Deduplication
These challenges led to two important realizations:
Deduplication is a performance optimization, not the ultimate goal.
The ultimate goal is to improve the experience of builders who repeatedly iterate and collaborate on models and datasets. No component of the system, from the client to the storage layer, needs to guarantee deduplication. Instead, each leverages deduplication as one tool among many in service of that goal.
Loosening the deduplication constraint naturally leads to the second design principle:
Avoid communication or storage strategies that scale 1:1 with the total number of chunks.
What does this mean in practice? Scaling through aggregation.
Scaling deduplication through aggregation
Aggregation takes chunks, groups them together, and references them intelligently in ways that provide clever (and practical) benefits:
Blocks: Instead of transferring and storing individual chunks, group data into blocks of up to 64MB after deduplication. Blocks are still content-addressed, but this reduces CAS entries by a factor of 1,000.
Shards: Provide the mapping between files and chunks (referencing blocks as they do so). This lets you determine which parts of a file have changed by consulting shards generated from past uploads. Chunks that are already known to exist in the CAS are skipped, cutting unnecessary transfers and queries.
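As a concrete (and heavily simplified) sketch of the relationship, the snippet below packs already-deduplicated chunks into blocks of up to 64MB and records the shard-style references needed to reassemble a file later. The type names and layout are assumptions for illustration, not the actual xet-core formats.

```python
# Illustrative sketch of block aggregation and shard metadata (assumed layout).
from dataclasses import dataclass, field

BLOCK_LIMIT = 64 * 1024 * 1024  # blocks hold up to 64MB of deduplicated chunks

@dataclass
class Block:
    chunks: list[bytes] = field(default_factory=list)  # chunk payloads
    size: int = 0

@dataclass
class ChunkRef:
    chunk_hash: str
    block_id: int    # which block holds the chunk
    offset: int      # where the chunk starts inside that block

def pack(deduped_chunks: dict[str, bytes]):
    """Pack unique chunks into blocks and return shard-style references."""
    blocks, refs = [Block()], {}
    for chunk_hash, payload in deduped_chunks.items():
        block = blocks[-1]
        if block.size + len(payload) > BLOCK_LIMIT:
            blocks.append(Block())
            block = blocks[-1]
        refs[chunk_hash] = ChunkRef(chunk_hash, len(blocks) - 1, block.size)
        block.chunks.append(payload)
        block.size += len(payload)
    return blocks, refs

# A "shard" for one file is then its ordered list of chunk hashes plus the
# ChunkRef entries above: enough to reassemble the file from blocks and to
# tell, on the next upload, which chunks already exist in the CAS.
```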
Together, blocks and shards unlock significant benefits. But when someone uploads a new file, how do you know whether a chunk has been uploaded before, so unnecessary requests can be skipped? Running a network query for every chunk is not scalable and violates the "no 1:1" principle above.
The solution is key chunks: a 0.1% subset of all chunks, selected by a simple modulo condition on the chunk hash. We provide a global index over these key chunks and the shards in which they appear; when a key chunk is queried, the associated shard is returned, enabling further local deduplication. This lets us exploit spatial locality: if a key chunk is referenced in a shard, other related chunk references are likely to be in the same shard, further improving deduplication while reducing network and database requests.
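A sketch of what that selection might look like, assuming the condition is a modulo over the chunk hash; the modulus of 1,024 is an assumption chosen to give roughly a 0.1% subset, and `global_index` stands in for the server-side index from key chunks to shards.

```python
# Key-chunk selection and lookup (illustrative; constants are assumptions).
def is_key_chunk(chunk_hash_hex: str, modulus: int = 1024) -> bool:
    # ~1/1024 of chunks (about 0.1%) satisfy the condition.
    return int(chunk_hash_hex, 16) % modulus == 0

def shards_to_consult(chunk_hashes: list[str],
                      global_index: dict[str, str]) -> list[str]:
    """Query the global index only for key chunks; each hit names a shard
    whose neighbouring chunk references enable further local deduplication."""
    return [global_index[h] for h in chunk_hashes
            if is_key_chunk(h) and h in global_index]
```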
Aggregated deduplication in practice
The Hub currently stores over 3.5PB of .gguf files, most of which are quantized versions of other models on the Hub. Quantized models present an interesting opportunity for deduplication because of the scaled nature of quantization, where values are limited to a smaller integer range. This constrains the range of values in the weight matrices, which naturally leads to more repetition. Moreover, many quantized model repositories store multiple variants (Q4_K, Q3_K, Q5_K, etc.) with significant overlap between them.
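The effect is easy to see in a toy example (a crude uniform quantizer, not the actual Q4_K/Q5_K schemes): collapsing float weights onto 16 levels leaves only 16 distinct values, so byte patterns, and therefore chunks, repeat far more often.

```python
# Illustrative only: quantization collapses the value space, producing repetition.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=1_000_000).astype(np.float32)

# Crude 4-bit-style quantization: 16 uniform levels over the observed range.
scale = (weights.max() - weights.min()) / 15
quantized = np.round((weights - weights.min()) / scale).astype(np.uint8)

print(np.unique(weights).size)    # ~1,000,000 distinct float32 values
print(np.unique(quantized).size)  # 16 distinct values
```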
A good example of this in practice is bartowski/gemma-2-9b-it-GGUF, which contains 29 quantizations of google/gemma-2-9b-it totaling 191GB. The upload uses hf_xet, integrated with huggingface_hub, to perform chunk-level deduplication locally, then aggregates and stores the result at the block level.
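From the client side, such an upload is an ordinary huggingface_hub call. A minimal sketch, assuming the hf_xet package is installed alongside huggingface_hub so the chunk-based path is used, and with a placeholder repository id:

```python
# Sketch of an upload with huggingface_hub; with hf_xet installed, chunk-level
# deduplication happens locally before data is aggregated into blocks.
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="./gemma-2-9b-it-GGUF",          # local directory of .gguf files
    repo_id="your-username/gemma-2-9b-it-GGUF",  # placeholder repository id
    repo_type="model",
)
```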
Once uploaded, some interesting patterns emerge! We included a visualization showing the deduplication ratio for each block: the darker the block, the more frequently it is referenced across the model's variants. In the Space that hosts this visualization, hovering over a heatmap cell highlights every reference to that block in orange across all files, and clicking a cell selects all the other files that share blocks with the selected file.
A single deduplicated block may represent only a few MB of savings, but as you can see there are many overlapping blocks, and those savings add up quickly. Instead of uploading 191GB, the Xet-backed version of the gemma-2-9b-it-GGUF repository stores roughly 97GB of unique blocks in our test CAS environment, a saving of about 94GB.
Improved storage efficiency matters, but the real benefit is what this means for contributors to the Hub. At 50 Mbps, that deduplication translates into roughly a 4-hour difference in upload time, approximately a 2x speedup:
| Repository | Size | Upload time @ 50 Mbps |
|---|---|---|
| Original | 191 GB | 509 minutes |
| Xet-backed | 97 GB | 258 minutes |
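Those times follow directly from the link speed (assuming 1GB = 1,000MB = 8,000 megabits):

```python
# Transfer time at a sustained 50 Mbps connection.
def upload_minutes(gigabytes: float, mbps: float = 50) -> float:
    return gigabytes * 8_000 / mbps / 60

print(int(upload_minutes(191)))  # 509 minutes (original repository)
print(int(upload_minutes(97)))   # 258 minutes (Xet-backed unique blocks)
```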
Similarly, a local chunk cache dramatically speeds up downloads. If a file is modified, or a new quantization is added that overlaps heavily with chunks already in the local cache, there is no need to re-download the unchanged chunks. This is in contrast to a file-based approach, where every new or updated file must be downloaded in full.
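A sketch of that download path, assuming a simple content-addressed cache directory and a `download_chunk` callable standing in for the network fetch (both are illustrative, not the actual hf_xet cache layout):

```python
# Reassemble a file from its chunk hashes, fetching only what the cache lacks.
from pathlib import Path

CACHE = Path.home() / ".cache" / "chunk-cache"   # hypothetical cache location

def fetch_file(chunk_hashes: list[str], download_chunk) -> bytes:
    CACHE.mkdir(parents=True, exist_ok=True)
    parts = []
    for h in chunk_hashes:
        cached = CACHE / h
        if not cached.exists():                  # unchanged chunks are reused
            cached.write_bytes(download_chunk(h))
        parts.append(cached.read_bytes())
    return b"".join(parts)
```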
Taken together, this shows how local chunk-level deduplication, combined with block-level aggregation, dramatically streamlines not just storage but development on the Hub. By making file transfers this efficient, AI builders can move faster, iterate more quickly, and worry less about hitting infrastructure bottlenecks. Anyone pushing large files to the Hub, whether quantizations of a new model or an updated version of a training set, can shift their focus back to building and sharing instead of waiting and troubleshooting.
We are hard at work deploying the first Xet-backed repositories in the coming weeks and months! As we do, we will release more updates that bring these speeds to every builder on the Hub, with the aim of making file transfers feel invisible.
Follow us on the Hub to learn more about our progress!