Improved deduplication of Hugging Face Hub Parquet

By Di Xiao | January 9, 2025 (updated February 13, 2025)

Hugging Face’s Xet team is working on improving the efficiency of the Hub’s storage architecture so that users can easily and quickly store and update data and models. Hugging Face hosts nearly 11PB of datasets, and Parquet files alone account for over 2.2PB of that storage, so optimizing Parquet storage is a very high priority.

Most Parquet files are bulk exports from various data analysis pipelines or databases, and often appear as complete snapshots rather than incremental updates. Deduplication is what makes regular updates efficient: it lets us store every version as compactly as possible without requiring users to re-upload everything for each update. Ideally, storing all versions of a growing dataset should take only slightly more space than the size of its largest version.

The default storage algorithm uses byte-level Content-Defined Chunking (CDC), which generally deduplicates well across insertions and deletions, but the Parquet layout presents some challenges. To see how, we run experiments on a 2GB Parquet file with 1,092,000 rows from the FineWeb dataset, using the deduplication estimation tool to visualize how some simple changes affect the file.
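
To make the idea concrete, here is a toy sketch of byte-level content-defined chunking; the rolling hash, mask, and chunk-size bounds are illustrative choices, not the parameters of the Hub’s actual storage backend.

```python
# Toy byte-level content-defined chunking (CDC). Boundaries depend on the
# bytes themselves (via a rolling hash), not on absolute positions, so an
# insertion or deletion only disturbs the chunks near the edit.
def cdc_chunks(data: bytes, mask: int = 0xFFF,
               min_size: int = 2_048, max_size: int = 65_536) -> list[bytes]:
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shift-and-add hash: the low 12 bits depend only on recent bytes.
        h = ((h << 1) + byte) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Deduplication then amounts to hashing each chunk and uploading only chunks
# whose hashes have not been stored before.
```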

Background

A Parquet file stores a table by splitting it into row groups, each containing a fixed number of rows (for example, 1,000). Each column within a row group is then compressed and saved separately.
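
As an illustration (the column names and the 1,000-row group size are made up to match the example above), pyarrow lets you control and inspect this layout directly:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write a table with fixed-size row groups of 1,000 rows each.
df = pd.DataFrame({"id": range(5_000), "text": ["example"] * 5_000})
pq.write_table(pa.Table.from_pandas(df), "example.parquet", row_group_size=1_000)

# Each row group stores its columns as separately compressed column chunks.
meta = pq.ParquetFile("example.parquet").metadata
print(meta.num_row_groups)                       # 5
print(meta.row_group(0).num_rows)                # 1000
print(meta.row_group(0).column(1).compression)   # e.g. SNAPPY
```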

[Figure: Parquet file layout]

Intuitively, this means that operations that do not disturb the row grouping, such as modifications or appends, should deduplicate fairly well. So let’s test this!

Appends

We’ll append 10,000 new rows to the file and compare the result to the original version. Green represents fully deduplicated blocks, red represents entirely new blocks, and shading in between indicates intermediate levels of deduplication.

[Figure: Deduplication visualization after appending rows]

Almost the entire file deduplicates, with new chunks appearing only at the end of the file where the rows were appended. The new file is 99.1% deduplicated and requires only 20 MB of additional storage. This agrees well with our intuition.

Modifications

Given the layout, we would expect row modifications to be fairly isolated, but apparently that’s not the case. Here we make a small change to row 10,000. While the majority of the file deduplicates, the new data appears in many small, regularly spaced sections.

[Figure: Deduplication visualization after modifying a row]

A quick look at the Parquet file format reveals that absolute file offsets are part of the Parquet column headers (see the ColumnChunk and ColumnMetaData structures). This means that any modification that shifts data can rewrite all downstream column headers. So while the data itself deduplicates nicely (mostly green), every column header contains new bytes.
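
These offsets are visible in the footer metadata; for example, with pyarrow (an illustrative inspection, reusing the hypothetical example.parquet from the sketch above):

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("example.parquet").metadata
col = meta.row_group(0).column(0)

# Absolute byte positions recorded in the column chunk metadata. Any edit
# that shifts earlier bytes changes these values, so the headers no longer
# match the previous version byte-for-byte even if the column data itself
# is unchanged.
print(col.data_page_offset)
print(col.dictionary_page_offset)
print(col.total_compressed_size)
```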

In this case, the new file is only 89% deduplicated and requires 230MB of additional storage.

Deletions

Here we’ll delete a row from the middle of the file (note: insertion behaves similarly). Because this reorganizes the entire row group layout (each row group holds 1,000 rows), the first half of the file deduplicates while the rest of the file consists of completely new blocks.

[Figure: Deduplication visualization after deleting a row]

This is primarily because the Parquet format aggressively compresses each column. Turning off compression allows for more aggressive deduplication.

[Figure: Deduplication visualization after deleting a row, with column compression disabled]

However, if you save the data uncompressed, the file size will be nearly twice as large.
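
As a rough way to see that trade-off yourself (file names and data are made up, and the exact ratio depends on the dataset):

```python
import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(
    pd.DataFrame({"text": ["some moderately repetitive text"] * 100_000})
)

pq.write_table(table, "compressed.parquet")                        # Snappy by default
pq.write_table(table, "uncompressed.parquet", compression="NONE")  # no column compression

print(os.path.getsize("compressed.parquet"))
print(os.path.getsize("uncompressed.parquet"))  # typically much larger
```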

Is it possible to enjoy the benefits of deduplication and compression at the same time?

Content-Defined Row Groups

One possible solution is to apply CDC not only at the byte level but also at the row level: we split row groups based on a hash of a designated “key” column rather than on an absolute count (1,000 rows). That is, we close a row group whenever the hash of the key column modulo the target row count equals zero, while enforcing a minimum and maximum row group size.
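
A minimal sketch of this splitting rule, assuming pyarrow and a hash-based boundary test (the hashing scheme, size bounds, and function name are illustrative, not the Xet team’s actual implementation):

```python
import hashlib

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def write_content_defined_row_groups(df: pd.DataFrame, path: str, key_column: str,
                                     target_rows: int = 1_000,
                                     min_rows: int = 100, max_rows: int = 10_000) -> None:
    """Close a row group whenever hash(key) % target_rows == 0, within size bounds."""
    table = pa.Table.from_pandas(df, preserve_index=False)
    with pq.ParquetWriter(path, table.schema) as writer:
        start = 0
        for i, key in enumerate(df[key_column].astype(str)):
            size = i - start + 1
            digest = hashlib.md5(key.encode("utf-8")).digest()
            at_boundary = int.from_bytes(digest[:8], "little") % target_rows == 0
            if (size >= min_rows and at_boundary) or size >= max_rows:
                writer.write_table(table.slice(start, size))  # emit one row group
                start = i + 1
        if start < len(df):
            writer.write_table(table.slice(start))            # trailing row group
```

Because the boundaries depend only on the key values, inserting or deleting a row re-splits only the row groups near the change; downstream row groups keep their exact bytes and continue to deduplicate.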

We hacked up a quick, inefficient experimental demonstration of this approach.

This allows efficient deduplication across compressed Parquet files even when rows are deleted. In the visualization you can clearly see the large red block representing the rewritten row group, followed by small changes to every column header.

[Figure: Deduplication visualization after deleting a row, with content-defined row groups]

Based on these experiments, we could consider improving Parquet file deduplication in a couple of ways:

  • Use relative offsets instead of absolute offsets for file structure data. This would make the Parquet structures position-independent and easy to “memcpy” around, although it is an involved file format change that is probably difficult to push through.
  • Support content-defined chunking of row groups. The format actually supports this today, since it does not require row groups to be uniformly sized, so it can be done with minimal blast radius: only Parquet format writers would need to be updated.

While we continue to explore ways to improve Parquet storage performance (for example, could we optionally rewrite Parquet files before uploading? Strip absolute file offsets on upload and restore them on download?), we would love to work with the Apache Arrow project to see if there is interest in implementing some of these ideas in the Parquet/Arrow code base.

In the meantime, we are also investigating the behavior of the data deduplication process on other common file types. We encourage you to try our deduplication estimation tool and let us know your results.
