Hugging Face stores over 30 PB of models, datasets, and Spaces in Git LFS repositories. Because Git stores and versions at the file level, any change to a file requires re-uploading the entire asset. With the average Parquet or CSV file on the Hub weighing in at 200-300 MB, the average Safetensors file at around 1 GB, and GGUF files often exceeding 8 GB, this becomes an expensive operation. Imagine changing a single line of metadata in a GGUF file and waiting for a multi-gigabyte upload. On top of the user time and transfer costs, Git LFS also has to store both versions of the file in full, which drives up storage costs.
The plot below shows LFS storage growth for model, dataset, and Space repositories on the Hub from March 2022 to September 2024.
The Xet team at Hugging Face is taking a different approach to storage: storing files as chunks. By transferring only the chunks that changed, we can significantly improve both storage efficiency and iteration speed while still providing reliable access to evolving datasets and models. Here’s how it works:
Fundamentals of content-defined chunking
The method we use to chunk files is called Content-Defined Chunking (CDC). Rather than treating a file as an indivisible unit, CDC uses the data itself to define boundaries, dividing the file into variable-sized chunks. To compute the chunks, a rolling hash algorithm scans over the file’s byte sequence.
Consider a file with the following content:
transformer transformer transformer
We’re using text here for illustration, but this could be any sequence of bytes.
A rolling hash algorithm computes a hash over a sliding window of data. Here, with a window of length 4, the hash is computed first over tran, then rans, then ansf, and so on until the end of the file.
A chunk boundary is set whenever the hash meets a predefined condition, such as:
Hash(data) % 2^12 == 0
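As a rough sketch of this scan, the loop below slides a 4-byte window over the file and tests each window against a boundary condition. The hash function and the modulus here are stand-ins chosen so the snippet stays short and boundaries can plausibly appear in a tiny string; a production implementation would use a true rolling hash (one that updates incrementally as the window slides) and a much larger modulus like the 2^12 above.

```python
import hashlib

WINDOW = 4        # bytes per sliding window, matching the example above
MODULUS = 2 ** 4  # toy modulus for illustration; the text uses 2^12

def window_hash(window: bytes) -> int:
    # Stand-in for a rolling hash: any deterministic hash of the window
    # works for illustration, though real CDC hashes update incrementally.
    return int.from_bytes(hashlib.sha256(window).digest()[:8], "big")

def is_boundary(window: bytes) -> bool:
    return window_hash(window) % MODULUS == 0

data = b"transformer transformer transformer"
for i in range(WINDOW, len(data) + 1):
    window = data[i - WINDOW:i]   # b"tran", b"rans", b"ansf", ...
    if is_boundary(window):
        print(f"boundary after byte {i}, window {window!r}")
```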
If the sequence mer followed by a space (the end of each transformer) produces a hash that satisfies this condition, the file is split into three chunks:
transformer | transformer | transformer
The content of each chunk is hashed, creating a mapping from chunk hash to bytes that is ultimately stored in the Content-Addressed Store (CAS). Since all three chunks are identical, only one chunk is stored in the CAS thanks to built-in deduplication. 🪄
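To make that chunk-hash-to-bytes mapping concrete, here is a minimal sketch building on the boundary test above: it cuts the file at content-defined boundaries and stores each chunk in a dictionary keyed by its hash, standing in for the CAS. Identical chunks hash to the same key, so they are stored only once.

```python
cas: dict[str, bytes] = {}  # toy CAS: chunk hash -> chunk bytes

def chunk(data: bytes) -> list[bytes]:
    """Split data into variable-sized chunks at content-defined boundaries."""
    chunks, start = [], 0
    for i in range(WINDOW, len(data) + 1):
        if is_boundary(data[i - WINDOW:i]):
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the final chunk
    return chunks

def store(data: bytes) -> list[str]:
    """Store a file as chunks; return the chunk hashes that reference it."""
    hashes = []
    for c in chunk(data):
        h = hashlib.sha256(c).hexdigest()
        cas[h] = c  # a no-op if an identical chunk is already stored
        hashes.append(h)
    return hashes

v1 = b"transformer transformer transformer"
refs = store(v1)
print(len(refs), "chunk references,", len(cas), "unique chunks stored")
```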
Insertions and deletions
When a file’s contents change, CDC enables fine-grained updates and handles insertions and deletions robustly. Let’s modify the file by inserting super, giving the new file contents:
transformer transformer super transformer
If we apply the rolling hash again with the same boundary condition, the new chunks look like this:
transformer | transformer | super transformer
There’s no need to store the chunks we’ve already seen; they are already in the CAS. The chunk super transformer, however, is new, so the cost of saving the updated version of this file is just uploading and storing that one new chunk.
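Continuing the sketch, storing the edited file adds only the chunks whose content is actually new; the chunks before and after the insertion re-align to the same content-defined boundaries and deduplicate against what is already in the toy CAS.

```python
before = len(cas)
v2 = b"transformer transformer super transformer"
store(v2)
print("new chunks stored for v2:", len(cas) - before)
```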
To validate this optimization in the real world, we benchmarked a previous implementation of CDC-backed storage at XetHub against Git LFS and found consistent improvements of around 50% in storage and transfer performance across three iterative development use cases. One example is the CORD-19 dataset, a curated collection of COVID-19 research papers with 50 incremental updates between 2020 and 2022. Below is a comparison of the Xet-backed repository and the Git LFS-backed repository:
| Metric | Git LFS-backed repository | Xet-backed repository |
|---|---|---|
| Average download time | 51 minutes | 19 minutes |
| Average upload time | 47 minutes | 24 minutes |
| Storage used | 8.9 GB | 3.52 GB |
By simply transferring and storing the modified chunks, Xet-backed repositories using CDC (combined with techniques to improve compression and streamline network requests) significantly reduce upload/download times and the amount of storage needed to capture every version of a dataset. Want to know more? Read the full benchmark.
What CDC means for the Hub
How does CDC handle the kinds of files stored on the Hugging Face Hub? To visualize the potential storage savings of applying CDC to a collection of files, we built a simple deduplication estimation tool. Running this tool against two versions of the model.safetensors file in openai-community/gpt2, taken from the repository’s commit history, returned the following results:
In the tool’s visualization, the green regions reflect the significant overlap between the two versions, an opportunity to deduplicate both within each version and across versions.
| | Git LFS storage required | Xet-backed storage required |
|---|---|---|
| Version 1 | 664 MB | 509 MB |
| Version 2 | 548 MB | 136 MB |
| Total | 1.2 GB | 645 MB |
In this case, an Xet-backed storage backend significantly reduces the upload/download time for the second version and cuts the total storage footprint by 53%. We estimate that compression could bring an additional 10% in savings.
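The same idea can be sketched in a few lines (an illustration in the spirit of the estimator, not its actual implementation): chunk both versions with the same routine, then count how many bytes of the second version are not already covered by chunks of the first. The chunk helper is the toy chunker from the earlier sketch, and the file paths in the usage comment are placeholders.

```python
def dedup_estimate(v1: bytes, v2: bytes) -> tuple[int, int]:
    """Return (total bytes in v2, bytes of v2 that still need storing)."""
    v1_hashes = {hashlib.sha256(c).hexdigest() for c in chunk(v1)}
    new_bytes = sum(
        len(c) for c in chunk(v2)
        if hashlib.sha256(c).hexdigest() not in v1_hashes
    )
    return len(v2), new_bytes

# Example usage with two local versions of a file (paths are placeholders):
# total, new = dedup_estimate(open("model_v1.safetensors", "rb").read(),
#                             open("model_v2.safetensors", "rb").read())
```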
Initial research across repositories on the Hub has returned promising results for many fine-tuned models and model checkpoints. Fine-tuning changes only a subset of a model’s parameters, so most of the model is unchanged between versions, making fine-tuned models good candidates for deduplication. Model checkpoints, which capture incremental training states, are also good targets because the changes between checkpoints are often minimal. Both show deduplication rates in the range of 30-85%. PyTorch model checkpoints alone account for roughly 200 TB of storage on the Hub; at a 50% deduplication rate, we could save up to 100 TB of storage immediately and roughly 7-8 TB every month going forward.
In addition to reducing storage costs, chunk-level deduplication also improves upload/download speeds because only changed chunks are transferred. This is a huge benefit for teams working with multiple versions of models or datasets, as it minimizes user and machine latency.
Our team is currently working on a proof of concept of Xet-backed storage for the Hub and hopes to roll out the first Xet-backed repositories in early 2025. Follow along as we share what we learn on topics like scaling CDC across globally distributed repositories, balancing network performance and privacy boundaries, and parallelizing the chunking algorithm.