Streaming datasets: 100x more efficient

By versatileai | October 28, 2025

We enhanced load_dataset("dataset", streaming=True) so you can stream a dataset without downloading it, in a single line of code.

You can immediately start training on multi-TB datasets without complicated setups, downloads, "out of disk space" errors, or "please stop sending requests!" complaints.
It's also super fast: 256 workers downloading data while training on 64xH100 outperforms local SSDs. The improved streaming delivers 100x fewer startup requests, 10x faster data file resolution, 2x more samples per second, and zero worker crashes with 256 concurrent workers.

Loading data, especially terabytes of data, is a huge pain in machine learning workflows. I encountered this issue while training SmolLM3 and at one point had to wait 3 hours before each run to download enough data.

Streaming has always been possible with the datasets library, but training at scale on large datasets remained a challenge. That changes today 🔥. We spent several months improving the backend, making it faster and more efficient with a focus on streaming datasets.

What exactly did we do? ⤵️

Streaming: Same simple API

First of all, the changes are backward compatible. You can stream any dataset from the Hub using the same simple streaming=True flag. It's as easy as ever. 🚀

from datasets import load_dataset

dataset = load_dataset("HuggingFaceM4/FineVisionMax", split="train", streaming=True)

print(next(iter(dataset)))

Thousands of AI developers around the world use the datasets library every day, and they get these performance improvements without any extra work.

Challenge: Streaming at scale

Streaming has been a savior for quickly exploring datasets, but training models at scale typically meant downloading data locally or using cloud storage services like S3. That's what we were doing for SmolVLM training: all of our data was on S3 and we streamed directly from there.

We wanted to change this, so we decided to stream from the Hub when developing nanoVLM. We quickly discovered a major problem: the test run generated over 100,000 requests within 1 minute and the IP was blocked by the Hub. 😅 This happened because every DataLoader worker was initializing its dataset individually. Upon further investigation, I found that this produced a large number of redundant requests, many of them unnecessary. Our changes ultimately reduced startup requests by a factor of 100. In total, our improvements yield the following benefits:

  • Data file resolution time: 10x faster
  • Launch requests: up to 100x more efficient
  • Streaming speed: up to 2x faster
  • Requests during streaming: up to 2x more efficient

Under the hood: What’s improved

So what has changed? We focused on two phases: startup and streaming.

1. Launch ⚡️ Initial resolution of data files created a large number of requests. We’ve made two major changes:

  • Persistent data file caching: We now cache the list of data files across all DataLoader workers. The first worker resolves the file list from the Hub; every other worker reads directly from this local cache, effectively eliminating startup requests and significantly reducing resolution time (a minimal sketch of this pattern is shown after this list).
  • Optimized resolution logic: We also minimized the number of API calls required for the first worker to fetch the file list, bundling the necessary requests as efficiently as possible to further reduce latency.
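For illustration only, here is a minimal sketch of that caching pattern: the first DataLoader worker resolves the file list and writes it to a local cache, and later workers read from that cache instead of hitting the Hub. This is not the actual implementation in the datasets library; the cache path, lock file, and list_repo_files callable are hypothetical placeholders.

import json
import os
from filelock import FileLock  # third-party lock to coordinate workers on one machine

CACHE_PATH = "/tmp/data_files_cache.json"  # hypothetical local cache location

def resolve_data_files(repo_id, list_repo_files):
    """Return the repo's data file list, resolving it from the Hub only once."""
    with FileLock(CACHE_PATH + ".lock"):  # serialize workers around the first resolution
        if os.path.exists(CACHE_PATH):  # later workers read the cached list locally
            with open(CACHE_PATH) as f:
                return json.load(f)
        files = list_repo_files(repo_id)  # only the first worker calls the Hub
        with open(CACHE_PATH, "w") as f:
            json.dump(files, f)
        return files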

2. Streaming 🏎️ We’ve introduced two new features to improve the throughput of streaming itself.

  • Parquet prefetching: Prefetching is now enabled for Parquet datasets. While the model is processing the current chunk of data, the datasets library is already fetching the next chunk in the background. This keeps the data pipeline full and prevents the GPU from sitting idle waiting for data.
  • Configurable buffering: Advanced users can now fine-tune streaming performance for their specific hardware and network setup. We expose options to configure the buffer block size and prefetch volume, giving you maximum control to optimize I/O.

Here's how to increase the minimum request size when streaming from 32 MiB (the default) to 128 MiB and configure prefetching:

import pyarrow
import pyarrow.dataset
from datasets import load_dataset

fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(
    cache_options=pyarrow.CacheOptions(
        prefetch_limit=1,
        range_size_limit=128 << 20,  # 128 MiB
    ),
)
ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)

These improvements double your data throughput, allowing you to train faster and more efficiently.

Faster than regular S3: Xet

Hugging Face uses Xet, a deduplication-based storage system that enables fast, deduplicated uploads and downloads. Unlike traditional remote storage, Xet transfers duplicated data only once, resulting in faster transfers. For example, uploading large datasets to Hugging Face leverages Xet to speed up the upload, and once your dataset is uploaded, you can stream it immediately.

Parquet deduplication is enabled through Parquet Content Defined Chunking (CDC). Thanks to Parquet CDC and Xet deduplication, uploading datasets to Hugging Face is faster than with traditional remote storage.
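As a minimal sketch of that upload-then-stream flow (the local file name and repo id below are placeholders, and this assumes an environment where the Xet-backed upload path is available):

from datasets import load_dataset

# Build a Dataset from a local file (placeholder path), then upload it to the Hub.
ds = load_dataset("csv", data_files="my_local_data.csv", split="train")
ds.push_to_hub("username/my-dataset")  # upload benefits from Parquet CDC + Xet deduplication

# Once uploaded, the dataset can be streamed right away.
streamed = load_dataset("username/my-dataset", split="train", streaming=True)
print(next(iter(streamed)))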

This is also supported by the pyspark_huggingface package, a Spark data source for reading and writing Hugging Face datasets. It includes support for Parquet CDC and Xet, dramatically speeding up data transfers to and from the Hub.
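For example, here is a minimal sketch of reading and writing a Hub dataset from Spark with pyspark_huggingface (assuming a working Spark setup; the repo ids below are placeholders):

from pyspark.sql import SparkSession
import pyspark_huggingface  # registers the "huggingface" data source

spark = SparkSession.builder.appName("hf-demo").getOrCreate()

# Read a dataset repository from the Hub as a Spark DataFrame.
df = spark.read.format("huggingface").load("username/source-dataset")

# Write a DataFrame back to the Hub; the upload goes through Parquet CDC and Xet.
df.write.format("huggingface").mode("overwrite").save("username/destination-dataset")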

Need a custom streaming pipeline?

Some data file formats aren't supported by the datasets library, and you might want more control, so we've also made it easy to build custom streaming pipelines. This has been thoroughly tested with the LeRobot library for sampling video frames and the WebDataset library for streaming TAR archives.

We improved HfFileSystem in the huggingface_hub library to efficiently read files from remote Hugging Face dataset repositories and stream their data.

from huggingface_hub import HfFileSystem

path = f"hf://datasets/{dataset_id}/{path_in_repo}"
with HfFileSystem().open(path) as f:
    ...

Passing an HfFileSystem to a PyTorch DataLoader reuses the cached results of .ls() and .glob() across workers, eliminating additional requests when listing data files.
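Here is a minimal sketch of that pattern: an IterableDataset that lists files once with HfFileSystem and streams them inside a PyTorch DataLoader. It is illustrative only; the repo id, the glob pattern, and the assumption that the files are line-delimited (e.g. JSONL) are all placeholders.

from huggingface_hub import HfFileSystem
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class HubLines(IterableDataset):
    """Stream lines from files in a Hub dataset repo (hypothetical example)."""

    def __init__(self, pattern: str):
        self.fs = HfFileSystem()            # its .ls()/.glob() cache travels with it to workers
        self.files = self.fs.glob(pattern)  # resolved once, reused by every worker

    def __iter__(self):
        info = get_worker_info()
        rank, world = (info.id, info.num_workers) if info else (0, 1)
        for file in self.files[rank::world]:  # shard files across DataLoader workers
            with self.fs.open(file, "rb") as f:
                for line in f:
                    yield line

ds = HubLines("hf://datasets/username/my-dataset/data/*.jsonl")
loader = DataLoader(ds, num_workers=4, batch_size=None)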

Pushing streaming to its limits

We are already using these streaming improvements in nanoVLM to train the next generation of SmolVLM. With these adjustments, streaming performs better than the cluster's tiered hard disk setup; in fact, streaming is now as fast as reading data from a local SSD. Previously, transferring data to the local SSD delayed each training run by 3 hours. For more information, please visit GitHub.

Get started and see the difference

These powerful new features are available in the datasets and huggingface_hub libraries. To take advantage of them, update your libraries and check the documentation.

pip install --upgrade datasets huggingface_hub

To celebrate, we've pre-concatenated and shuffled all FineVision data sources into FineVisionMax. This single combined dataset can be used to train a VLM, with no need to manually process multiple datasets.

from datasets import load_dataset

dataset = load_dataset("HuggingFaceM4/FineVisionMax", split="train", streaming=True)

print(next(iter(dataset)))

And check out nanoVLM to see how to do it at scale.

Enjoy streaming! 🤗
