Hugging Face’s Xet team is working on improving the efficiency of the Hub’s storage architecture so that users can easily and quickly store and update data and models. Hugging Face hosts nearly 11PB of datasets, and Parquet files alone account for over 2.2PB of that storage, so optimizing Parquet storage is a very high priority.
Most Parquet files are bulk exports from various data analysis pipelines or databases, and they often appear as full snapshots rather than incremental updates. If users want to update their datasets regularly, data deduplication becomes critical for efficiency: only with deduplication can all versions be stored as compactly as possible, without everything having to be uploaded again on every update. Ideally, we should be able to store every version of a growing dataset using only slightly more space than the size of its largest version.
The default storage algorithm uses byte-level Content-Defined Chunking (CDC), which generally dedupes well across insertions and deletions, but the Parquet layout introduces some challenges. For our experiments we use a 2GB Parquet file with 1,092,000 rows from the FineWeb dataset, along with our deduplication estimation tool to visualize how some simple changes interact with the Parquet format. Let's run a few experiments to see how it behaves.
Background
A Parquet table is divided into row groups, each containing a fixed number of rows (for example, 1000 rows). Each column within a row group is then compressed and stored.
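As a rough illustration of this layout, here is a minimal pyarrow sketch (the table and file name are made up, not from the original experiment):

```python
# Write a Parquet file with fixed-size row groups: every 1000 rows form one
# row group, and each column inside a row group is compressed independently.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": list(range(10_000)),
    "text": [f"row {i}" for i in range(10_000)],
})
pq.write_table(table, "example.parquet", row_group_size=1000)

# The resulting file contains 10 row groups of 1000 rows each.
print(pq.ParquetFile("example.parquet").metadata.num_row_groups)
```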
Intuitively, this layout means that modifications or appends that do not touch the row grouping should dedupe quite well. So let's test this!
Appends
We append 10,000 new rows to the file and compare the result with the original version. Green represents all deduplicated blocks, red represents all new blocks, and shades in between indicate varying levels of deduplication.
We can indeed dedupe nearly the entire file, with new data appearing only at the end of the file where the rows were appended. The new file is 99.1% deduped, requiring only 20MB of additional storage. This matches our intuition well.
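For reference, an appended variant like the one in this experiment can be produced with a few lines of pyarrow (a sketch, not the exact script we used; the file names are placeholders and the extra rows are simply sampled from the original file):

```python
import pyarrow as pa
import pyarrow.parquet as pq

original = pq.read_table("fineweb.parquet")        # placeholder file name
extra = original.slice(0, 10_000)                  # stand-in for 10,000 new rows
appended = pa.concat_tables([original, extra])
pq.write_table(appended, "fineweb_appended.parquet", row_group_size=1000)
```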
Modifications
Next, we make a small modification to row 10,000. Given the layout, we would expect row modifications to be fairly isolated, but that is apparently not the case: while most of the file dedupes, there are many small, regularly spaced sections of new data.
A quick look at the Parquet file format reveals that absolute file offsets are part of the Parquet column headers (see the ColumnChunk and ColumnMetaData structures). This means that any modification that shifts data in the file can rewrite all of the column headers. So while the data itself dedupes well (it is mostly green), every column header picks up new bytes.
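You can see these absolute offsets directly in the column chunk metadata; an illustrative pyarrow snippet (placeholder file name):

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("fineweb.parquet").metadata
col = meta.row_group(0).column(0)
# Absolute positions within the file; any edit that shifts later data changes
# these values and therefore rewrites the headers that contain them.
print(col.file_offset, col.data_page_offset, col.dictionary_page_offset)
```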
As a result, the new file is only 89% deduped, requiring 230MB of additional storage.
Deletions
Here we delete a single row from the middle of the file (note: insertions behave similarly). Because this reorganizes the entire row group layout (each row group holds 1000 rows), the first half of the file dedupes while the rest of the file consists of completely new blocks.
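The deleted variant can be produced much like the appended one; a sketch (placeholder file name, arbitrary middle row):

```python
import pyarrow as pa
import pyarrow.parquet as pq

original = pq.read_table("fineweb.parquet")
# Drop a single row from the middle of the table.
deleted = pa.concat_tables([original.slice(0, 500_000),
                            original.slice(500_001)])
pq.write_table(deleted, "fineweb_deleted.parquet", row_group_size=1000)
```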
The poor dedupe here is mostly because the Parquet format compresses each column aggressively. If we turn off compression, we can dedupe much more of the file.
However, storing the data uncompressed makes the files nearly twice as large.
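Disabling compression is just a writer option, for example (again a sketch with a placeholder file name):

```python
import pyarrow.parquet as pq

table = pq.read_table("fineweb.parquet")
pq.write_table(table, "fineweb_uncompressed.parquet",
               row_group_size=1000, compression="NONE")
```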
Is it possible to get the benefits of both deduplication and compression?
Content Defined Row Groups
One potential solution is to apply CDC not only at the byte level but also at the row level: we split row groups not on an absolute count (1000 rows), but on a hash of a chosen “key” column. In other words, we cut a row group whenever hash(key column) % (target row count) == 0, with allowances for a minimum and maximum row group size.
Here’s a quick, inefficient experimental demonstration of the idea, sketched below.
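One way such a writer could look with pyarrow (illustrative only; the hash function, key handling, and size thresholds are assumptions, not the exact code behind the experiment):

```python
import zlib
import pyarrow as pa
import pyarrow.parquet as pq

TARGET_ROWS = 1000   # cut, on average, every ~1000 rows
MIN_ROWS = 500       # never cut groups smaller than this
MAX_ROWS = 2000      # never let a group grow larger than this

def write_content_defined(table: pa.Table, path: str, key: str) -> None:
    """Write `table` to `path`, cutting row groups where the key hash hits 0."""
    keys = table.column(key).to_pylist()
    with pq.ParquetWriter(path, table.schema) as writer:
        start = 0
        for i, value in enumerate(keys):
            size = i - start + 1
            at_boundary = zlib.crc32(str(value).encode()) % TARGET_ROWS == 0
            if size >= MAX_ROWS or (size >= MIN_ROWS and at_boundary):
                writer.write_table(table.slice(start, size))  # one row group
                start = i + 1
        if start < len(keys):
            writer.write_table(table.slice(start))            # final partial group
```

Because the boundaries depend only on the key values rather than on absolute row counts, deleting or inserting rows only reshuffles the row groups near the change and leaves the rest byte-identical, which is what makes them dedupe-friendly.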
With this, we can dedupe efficiently across compressed Parquet files even with deletions. Here we clearly see the large red block representing the rewritten row group, followed by small changes in every column header.
Based on these experiments, there are a few ways we could consider improving Parquet file dedupe-ability:
1. Use relative offsets instead of absolute offsets for file structure data. This would make the Parquet structure position independent and easy to “memcpy” around, but it is an involved file format change that is probably difficult to pull off.
2. Support content-defined chunking of row groups. The format actually supports this today, since it does not require row groups to be uniformly sized, so this could be done with minimal blast radius: only Parquet format writers would need to be updated.
While we continue to explore ways to improve Parquet storage performance (for example: could we optionally rewrite Parquet files before uploading? Strip absolute file offsets on upload and restore them on download?), we would love to work with the Apache Arrow project to see if there is interest in implementing some of these ideas in the Parquet/Arrow code base.
In the meantime, we are also investigating how our data deduplication process behaves on other common file types. We encourage you to try out our deduplication estimation tool and let us know your findings!