If you’re working on data-intensive research or machine learning projects, you need a reliable way to share and host your datasets. Public datasets such as Common Crawl, ImageNet, and Common Voice are important to the open ML ecosystem, but can be difficult to host and share.
Hugging Face Hub makes hosting and sharing datasets seamless and is trusted by many leading research institutions, businesses, and government agencies, including Nvidia, Google, Stanford, NASA, THUDM, and the Barcelona Supercomputing Center.
Host your dataset on Hugging Face Hub and get instant access to features that help you maximize the impact of your work.
Generous limits
Support for large datasets
The Hub can host terabyte-scale datasets, with high per-file and per-repository limits. If you have data to share, the Hugging Face datasets team can help suggest the best format for uploading it for community use. The 🤗 Datasets library makes it easy to upload and download your files, or even create a dataset from scratch. 🤗 Datasets also enables dataset streaming, so you can work with large datasets without downloading the entire thing. This is invaluable for researchers with limited computational resources who want to work with a dataset, or to select small portions of a huge dataset for testing, development, or prototyping.
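As a rough illustration of streaming, the sketch below opens a dataset lazily and reads only the first few rows. It assumes the `datasets` package is installed and that the repository has a `train` split; the helper names `stream_dataset` and `take` are made up for this example.

```python
from itertools import islice


def stream_dataset(repo_id, split="train"):
    """Open a Hub dataset lazily; rows are fetched only as they are iterated."""
    # Imported here so the helper below stays usable without the library installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset instead of downloading all files.
    return load_dataset(repo_id, split=split, streaming=True)


def take(rows, n):
    """Return the first n items of any iterable without reading further."""
    return list(islice(rows, n))
```

For instance, `take(stream_dataset("neuralwork/arxiver"), 2)` would pull just two rows, assuming that repository exposes a `train` split.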
Hugging Face Hub can host large datasets often created for machine learning research.
Note: The Xet team is currently working on backend updates that will increase the per-file limit from the current 50 GB to 500 GB while also improving storage and transfer efficiency.
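Before uploading a large folder, it can help to check locally for files that exceed the per-file limit. The sketch below uses the 50 GB figure from the note above (interpreted as decimal gigabytes, which is an assumption); `files_over_limit` is a hypothetical helper, not part of any Hugging Face library.

```python
from pathlib import Path

# 50 GB, the current per-file limit quoted in the note (decimal GB assumed).
PER_FILE_LIMIT_BYTES = 50 * 10**9


def files_over_limit(folder, limit_bytes=PER_FILE_LIMIT_BYTES):
    """Return (relative path, size in bytes) for every file larger than limit_bytes."""
    root = Path(folder)
    oversized = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                oversized.append((path.relative_to(root).as_posix(), size))
    return oversized
```

Files reported by this check would need to be split into smaller shards before upload.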
Dataset viewer
Beyond hosting your data, the Hub provides powerful tools for exploring it. With the Dataset Viewer, users can explore and interact with datasets hosted on the Hub directly in their browser. This gives others an easy way to view and explore your data without having to download it first.
Hugging Face datasets support a variety of modalities (audio, images, video, etc.), file formats (CSV, JSON, Parquet, etc.), and compression formats (Gzip, Zip, etc.). For more information, see the Dataset File Formats page.
Dataset viewer for the Infinity-Instruct dataset.
The dataset viewer also includes several features that make it easier to explore datasets.
Full-text search
Built-in full-text search is one of the dataset viewer’s most powerful features. Text columns in your dataset are immediately searchable.
The Arxiver dataset contains 63.4k rows of arXiv research papers converted to Markdown. Full-text search makes it easy to find papers by a specific author, such as Ilya Sutskever below.
Sorting
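The same kind of search can, in principle, be issued programmatically. The sketch below only builds a request URL; the base URL, the `/search` path, and the parameter names reflect our understanding of the public dataset viewer API and should be treated as assumptions to verify against its documentation, as should the `default` config and `train` split.

```python
from urllib.parse import urlencode

# Assumed base URL of the public dataset viewer API; check the current docs.
VIEWER_API = "https://datasets-server.huggingface.co"


def build_search_url(dataset, query, config="default", split="train", offset=0, length=10):
    """Build a full-text search request URL for one split of a Hub dataset."""
    params = {
        "dataset": dataset,
        "config": config,
        "split": split,
        "query": query,
        "offset": offset,
        "length": length,
    }
    # urlencode percent-escapes the slash in the repo id and the space in the query.
    return f"{VIEWER_API}/search?{urlencode(params)}"
```

Calling `build_search_url("neuralwork/arxiver", "Ilya Sutskever")` yields a URL that could then be fetched with any HTTP client.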
The dataset viewer allows you to sort the dataset by clicking the column headers. This makes it easy to find the most relevant examples in your dataset.
Below is an example of a dataset sorted in descending order by the Helpfulness column from the HelpSteer2 dataset.
Third-party library support
Hugging Face is fortunate to have third-party integrations with the leading open source data tools. By hosting your dataset on the Hub, it instantly becomes compatible with the tools your users are most familiar with.
Below are some of the libraries that Hugging Face supports out of the box.
| Library | Description | Monthly PyPI downloads (2024) |
| --- | --- | --- |
| Pandas | Python data analysis toolkit. | 258M |
| Spark | Real-time, large-scale data processing tool in a distributed environment. | 29M |
| Datasets | 🤗 Datasets is a library for accessing and sharing audio, computer vision, and natural language processing (NLP) datasets. | 17M |
| Dask | Parallel and distributed computing library that extends the existing Python and PyData ecosystem. | 12M |
| Polars | A DataFrame library on top of an OLAP query engine. | 8.5M |
| DuckDB | In-process SQL OLAP database management system. | 6M |
| WebDataset | Library for creating I/O pipelines for large datasets. | 871k |
| Argilla | Collaboration tool for AI engineers and domain experts who value high-quality data. | 400k |
Most of these libraries allow you to load or stream datasets with a single line of code.
Below are some examples using Pandas, Polars, and DuckDB.
import pandas as pd
df = pd.read_parquet("hf://datasets/neuralwork/arxiver/data/train.parquet")

import polars as pl
df = pl.read_parquet("hf://datasets/neuralwork/arxiver/data/train.parquet")

import duckdb
duckdb.sql("SELECT * FROM 'hf://datasets/neuralwork/arxiver/data/train.parquet' LIMIT 10")
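The snippets above all read the same `hf://` path. A small sketch of how such paths are put together, and of reading only selected columns with pandas, might look like this. The helper names are hypothetical; `columns` is a standard `pd.read_parquet` argument, and resolving `hf://` paths in pandas assumes the `huggingface_hub` package is installed.

```python
def hf_dataset_path(repo_id, path_in_repo):
    """Build the hf:// path for a file inside a dataset repository."""
    return f"hf://datasets/{repo_id}/{path_in_repo.lstrip('/')}"


def load_columns(repo_id, path_in_repo, columns):
    """Read only the named columns of a Parquet file hosted on the Hub."""
    # Lazy import so the path helper works without pandas installed.
    import pandas as pd

    return pd.read_parquet(hf_dataset_path(repo_id, path_in_repo), columns=columns)
```

Selecting a few columns can reduce how much data is transferred for wide Parquet files, though the exact savings depend on the file layout.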
For more information about the integrated libraries, see the Datasets documentation. In addition to the libraries listed above, a number of community-supported tools also work with Hugging Face Hub, including Lilac and Spotlight.
SQL console
The SQL Console provides an interactive SQL editor that runs entirely in your browser, so you can explore your data instantly without any setup. The main features are:
- One-click: Open the SQL Console and query your dataset with a single click.
- Shareable and embeddable results: Share and embed interesting query results.
- Full DuckDB syntax: Use full SQL syntax, with built-in functions for regular expressions, lists, JSON, embeddings, and more.
You should see the new SQL Console badge on all public datasets. Open the SQL console and query that dataset with just one click.
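The same DuckDB syntax can be tried locally from Python. The sketch below builds a query that filters rows with DuckDB's `regexp_matches` function and runs it with the `duckdb` package; the helper names are hypothetical, and any column name passed in (for example `authors`) is an assumption about the dataset's schema.

```python
def sql_literal(value):
    """Quote a Python string as a SQL string literal, doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_regex_query(parquet_url, column, pattern, limit=10):
    """Build a DuckDB query returning rows whose column matches a regular expression."""
    # Double quotes mark an identifier in SQL; escape any embedded ones.
    quoted_column = '"' + column.replace('"', '""') + '"'
    return (
        f"SELECT * FROM {sql_literal(parquet_url)} "
        f"WHERE regexp_matches({quoted_column}, {sql_literal(pattern)}) "
        f"LIMIT {int(limit)}"
    )


def run_query(sql):
    """Execute the query with the duckdb package and return all rows."""
    import duckdb  # lazy import so query building works without duckdb installed

    return duckdb.sql(sql).fetchall()
```

Keeping query construction separate from execution makes it easy to inspect the SQL before sending it to a remote Parquet file.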
Security
While it’s important to have access to datasets, it’s equally important to protect sensitive data. Hugging Face Hub offers robust security features that help you maintain control of your data while sharing it with the right people.
Access control
Hugging Face Hub supports several access control options that determine who can access a dataset.
- Public: Anyone can access the dataset.
- Private: Only you and members of your organization can access the dataset.
- Gated: Control access to the dataset with two options:
  - Automatic approval: Users must provide required information (such as name and email) and agree to terms before gaining access.
  - Manual approval: You review each access request and manually approve or deny it.
For more information about gated datasets, see the gated dataset documentation. For more granular control, there are features in the Enterprise plan that allow organizations to create resource security groups and use SSO.
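To make the three options concrete, here is a toy model of the rules as described above. It is purely illustrative and is not how the Hub implements access checks; the names `Visibility` and `can_access` are invented for this sketch, and treating owners and organization members as always having access to gated datasets is an assumption.

```python
from enum import Enum


class Visibility(Enum):
    PUBLIC = "public"
    PRIVATE = "private"
    GATED = "gated"


def can_access(visibility, is_owner_or_org_member=False, request_approved=False):
    """Toy model of the public, private, and gated access options."""
    if visibility is Visibility.PUBLIC:
        return True
    if visibility is Visibility.PRIVATE:
        return is_owner_or_org_member
    # Gated: access follows approval, whether it was automatic or manual.
    return is_owner_or_org_member or request_approved
```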
Built-in security scan
In addition to access control, Hugging Face Hub offers several security scanners.
| Feature | Description |
| --- | --- |
| Malware scanning | Scans files for malware and suspicious content on every commit and visit. |
| Secrets scanning | Blocks datasets that contain hard-coded secrets and environment variables. |
| Pickle scanning | Scans Pickle files and displays vetted imports for PyTorch weights. |
| ProtectAI | Uses ProtectAI Guardian technology to scan for Pickle, Keras, and other exploits. |
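It can also be worth checking files for hard-coded secrets locally before pushing them. The sketch below is a hypothetical pre-upload check with a few illustrative regular expressions; it is not the Hub's scanner, and the patterns are heuristics chosen for this example.

```python
import re

# Illustrative patterns only; these are assumptions, not the Hub's rules.
SECRET_PATTERNS = {
    "hugging_face_token": re.compile(r"\bhf_[A-Za-z0-9]{30,}\b"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "assigned_secret": re.compile(
        r"(?i)\b(?:api_key|secret|password)\s*=\s*['\"][^'\"]{8,}['\"]"
    ),
}


def find_secrets(text):
    """Return sorted (line number, pattern name) pairs for lines that look like secrets."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return sorted(hits)
```

A check like this only catches obvious cases, so it complements rather than replaces the scanning done on the Hub.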

Reach and visibility
While having a secure platform with powerful features is valuable, the true impact of your research comes from reaching the right audience. Reach and visibility are critical for researchers sharing datasets: they maximize the impact of research, enable reproducibility, foster collaboration, and ensure that valuable data benefits the broader scientific community.
With over 5 million builders actively using the platform, Hugging Face Hub provides researchers with a powerful tool for community engagement and visibility. Here’s what you can expect:
Better community participation
- Built-in discussion tab on each dataset for community engagement
- Organizations as a central place to group and collaborate on multiple datasets
- Metrics on dataset usage and impact
Wider reach
- Access to a large and active community of researchers, developers, and practitioners
- SEO-optimized URLs that make your datasets easy to discover
- Integration with a broad ecosystem of models, datasets, and libraries
- Clear links between datasets and associated models, papers, and demos
Improved documentation
- Customizable README file for comprehensive documentation
- Detailed dataset descriptions and support for proper academic citations
- Links to related research papers and publications
The hub makes it easy to ask questions and discuss datasets.
How do I host a dataset on Hugging Face Hub?
Now that you understand the benefits of hosting your datasets on the Hub, you may be wondering how to get started. Here are some comprehensive resources to guide you through the process.
If you want to share large datasets, the following pages may be helpful.
If you need further assistance uploading your dataset to the hub, or if you would like to upload a particularly large dataset, please contact us at datasets@huggingface.co.