From files to chunks: Improving HF storage efficiency

By versatileai · November 21, 2024

Hugging Face stores over 30 PB of models, datasets, and Spaces in Git LFS repositories. Git stores and versions at the file level, so any change to a file requires re-uploading the entire asset. This gets expensive: the average Parquet or CSV file on the Hub is 200-300 MB, the average Safetensors file is about 1 GB, and GGUF files can exceed 8 GB. Imagine changing a single line of metadata in a GGUF file and then waiting for a multi-gigabyte upload. On top of the user's time and transfer costs, Git LFS also has to store both full versions of the file, which drives up storage costs.

The plot below shows LFS storage growth for model, dataset, and Space repositories on the Hub from March 2022 to September 2024.

[Figure: Git LFS storage growth across model, dataset, and Space repositories, March 2022 – September 2024]

The Xet team at Hugging Face takes a different approach to storage: files are stored as chunks. By transferring only the chunks that changed, we can significantly improve both storage efficiency and iteration speed while ensuring reliable access to evolving datasets and models. Here's how it works:

Fundamentals of content-defined chunking

The method we use to chunk files is called content-defined chunking (CDC). Rather than treating a file as an indivisible unit, CDC lets the data itself define the boundaries, dividing the file into variable-sized chunks. To compute the chunks, we apply a rolling hash algorithm that scans over the file's byte sequence.

Consider a file with the following content:

transformers transformers transformers

I’m using text for illustration purposes, but you can use any sequence of bytes.

A rolling hash algorithm computes a hash over a sliding window of data. Here, with a window of length 4, the hash is first computed over tran, then rans, then ansf, and so on until the end of the file.
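Here is a minimal sketch of such a rolling hash in Python. The window size, base, and modulus are toy values chosen for illustration, not the parameters the Xet backend actually uses.

```python
# A minimal sketch of a rolling hash over a sliding 4-byte window.
# Toy parameters for illustration only -- not the Xet backend's values.
WINDOW = 4
BASE = 257
MOD = 1 << 61

def window_hashes(data: bytes, window: int = WINDOW):
    """Yield (offset, hash) for every window-sized slice of `data`."""
    if len(data) < window:
        return
    h = 0
    for b in data[:window]:                      # hash of the first window
        h = (h * BASE + b) % MOD
    yield 0, h
    top = pow(BASE, window - 1, MOD)             # weight of the byte that slides out
    for i in range(window, len(data)):
        h = (h - data[i - window] * top) % MOD   # drop the oldest byte
        h = (h * BASE + data[i]) % MOD           # shift and add the new byte
        yield i - window + 1, h

content = b"transformers transformers transformers"
for offset, h in window_hashes(content):
    print(content[offset:offset + WINDOW], hex(h))   # tran, rans, ansf, ...
```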

A chunk boundary is placed wherever the hash satisfies a predefined condition, such as:

Hash(data) % 2^12 == 0

If the hash of the sequence mers satisfies this condition, the file is split into three chunks:

transformers | transformers | transformers

The contents of these chunks are hashed to create a mapping from each chunk's hash to its bytes, and the chunks are ultimately stored in a content-addressed store (CAS). Since all three chunks are identical, only one chunk is actually stored in the CAS, thanks to built-in deduplication. 🪄
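To make this concrete, here is a toy content-defined chunker and a dict standing in for the CAS, built on the window_hashes sketch above (it reuses WINDOW and window_hashes). The boundary mask, SHA-256 chunk keys, and random test data are illustrative assumptions, not how the Xet backend is implemented, and a production chunker would also enforce minimum and maximum chunk sizes, which this sketch omits.

```python
import hashlib
import random

BOUNDARY_MASK = (1 << 6) - 1   # boundary when hash & mask == 0, i.e. hash % 2**6 == 0

def cdc_chunks(data: bytes, window: int = WINDOW) -> list[bytes]:
    """Partition `data` into variable-sized, content-defined chunks."""
    chunks, start = [], 0
    for offset, h in window_hashes(data, window):
        end = offset + window
        if h & BOUNDARY_MASK == 0 and end > start:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])              # remaining bytes form the final chunk
    return chunks

def store_file(data: bytes, cas: dict[str, bytes]) -> list[str]:
    """Chunk a file, add unseen chunks to the CAS, and return its chunk manifest."""
    manifest = []
    for chunk in cdc_chunks(data):
        key = hashlib.sha256(chunk).hexdigest()  # chunk hash -> chunk bytes
        cas.setdefault(key, chunk)               # identical chunks are stored only once
        manifest.append(key)
    return manifest

random.seed(0)
part_a = bytes(random.getrandbits(8) for _ in range(60_000))
part_b = bytes(random.getrandbits(8) for _ in range(60_000))
v1 = part_a + part_b + part_a                    # repeated content, like the example above

cas: dict[str, bytes] = {}
manifest_v1 = store_file(v1, cas)
print(f"v1: {len(manifest_v1)} chunk references, {len(cas)} unique chunks stored")
```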

Inserts and deletes

When a file's contents change, CDC enables fine-grained updates, which makes handling inserts and deletes robust. Let's modify the file by inserting super, giving the new contents:

transformers transformers super transformers

If we apply the rolling hash again with the same boundary condition, the new chunks look like this:

transformers | transformers | super transformers

There's no need to store the chunks we've already seen; they are saved in the CAS. The chunk containing super transformers, however, is new. So the cost of saving this updated version of the file is uploading and storing just one new chunk.
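Continuing the same toy sketch, we can check how an insertion behaves: add a small block of new bytes to the middle of v1 and store the result against the same CAS. Because chunk boundaries are defined by content, the chunks downstream of the insertion re-synchronize, and only the chunks around the insertion point are new.

```python
# Continuing the sketch: insert 1 KB of new bytes into the middle of v1.
random.seed(1)
inserted = bytes(random.getrandbits(8) for _ in range(1_000))
v2 = v1[:90_000] + inserted + v1[90_000:]

before = len(cas)
manifest_v2 = store_file(v2, cas)    # reuses the CAS already populated with v1's chunks
print(f"v2: {len(manifest_v2)} chunk references, {len(cas) - before} newly stored chunks")
# Expect only a handful of new chunks: boundaries after the insertion land on the same
# byte content as before, so those chunks already exist in the CAS.
```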

To validate this optimization in the real world, we benchmarked an earlier CDC-backed storage implementation at XetHub against Git LFS and found that storage and transfer performance consistently improved by 50% across three iterative development use cases. One example is the CORD-19 dataset, a curated collection of COVID-19 research papers with 50 incremental updates between 2020 and 2022. Below is a comparison of a Git LFS-backed repository and a Xet-backed repository:

Metric                  Git LFS-backed repository   Xet-backed repository
Average download time   51 minutes                  19 minutes
Average upload time     47 minutes                  24 minutes
Storage used            8.9 GB                      3.52 GB

By transferring and storing only the modified chunks, Xet-backed repositories using CDC (paired with a few other techniques to improve compression and streamline network requests) significantly reduce upload/download times and shrink the storage needed to capture every version of the dataset. Want to know more? Read the full benchmark.

What CDC means for the Hub

How does CDC handle the kinds of files stored on the Hugging Face Hub? To visualize the potential storage savings of applying CDC to a collection of files, we built an estimation tool that uses simple chunk-level deduplication. Running this tool against two versions of the model.safetensors file in openai-community/gpt2, uploaded through the repository's commit history, returned the following results:

[Figure: deduplication estimate between the two versions of model.safetensors]

The green regions show the significant overlap between the two versions, creating opportunities to deduplicate both within each file and across versions.

            Git LFS required storage   Xet-backed required storage
Version 1   664 MB                     509 MB
Version 2   548 MB                     136 MB
Total       1.2 GB                     645 MB

In this case, an Xet-based storage backend significantly reduces upload/download time for the second version and brings the total storage footprint down to roughly 53% of what Git LFS requires (645 MB versus 1.2 GB). Compression is estimated to yield a further 10% savings.
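For intuition, the toy chunker above can also produce a rough deduplication estimate between local copies of two file versions by measuring how many chunk bytes are unique across both. This is only a sketch in the spirit of the estimation tool mentioned earlier, not that tool itself, and the file paths below are hypothetical placeholders.

```python
from pathlib import Path

def dedupe_estimate(paths: list[str]) -> None:
    """Rough chunk-level deduplication estimate across several versions of a file."""
    seen: set[str] = set()
    total_bytes = unique_bytes = 0
    for path in paths:
        data = Path(path).read_bytes()
        for chunk in cdc_chunks(data):           # toy chunker from the sketch above
            total_bytes += len(chunk)
            key = hashlib.sha256(chunk).hexdigest()
            if key not in seen:
                seen.add(key)
                unique_bytes += len(chunk)
    print(f"raw: {total_bytes:,} bytes  deduplicated: {unique_bytes:,} bytes  "
          f"savings: {1 - unique_bytes / total_bytes:.1%}")

# Hypothetical local copies of the two model.safetensors revisions:
dedupe_estimate(["model.safetensors.v1", "model.safetensors.v2"])
```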

Initial research on repositories across the Hub has yielded promising results, particularly for fine-tuned models and model checkpoints. Because fine-tuning changes only a subset of parameters, most of a model remains unchanged between versions, making it a good candidate for deduplication. Model checkpoints that capture incremental training states are also good targets, since changes between checkpoints are often minimal. Both show deduplication rates in the range of 30-85%. PyTorch model checkpoints alone take up approximately 200 TB of storage on the Hub; at 50% deduplication, that is up to 100 TB of storage saved immediately and roughly 7-8 TB more each month going forward.

In addition to reducing storage costs, chunk-level deduplication also improves upload/download speeds because only changed chunks are transferred. This is a huge benefit for teams working with multiple versions of models or datasets, as it minimizes user and machine latency.

Our team is currently working on a proof of concept of Xet-backed storage for the Hub and hopes to roll out the first Xet-backed repositories in early 2025. Follow along as we share what we learn about future topics like scaling CDC across globally distributed repositories, balancing network performance, respecting privacy boundaries, and parallelizing the chunking algorithm.
