The AI and ML community has shared more than 180,000 public datasets on the Hugging Face Dataset Hub. Researchers and engineers use these datasets for a variety of tasks, from training LLMs that chat with users to evaluating automatic speech recognition or computer vision systems. Dataset discoverability and visualization are key challenges in letting AI builders find, explore, and transform datasets to fit their use cases.
At Hugging Face, we are building the Dataset Hub as the place where the community collaborates on open datasets. To that end, we have built tools such as Dataset Search and the Dataset Viewer, along with a rich open source ecosystem of tools. Today we are announcing four new features that take dataset search on the Hub to the next level.
Search by modality
The modality of a dataset corresponds to the type of data it contains. For example, the most common types of data on Hugging Face are text, image, audio, and tabular data.
We have released a set of filters that let you select datasets with one or more modalities from this list:
Text, Image, Audio, Tabular, Time-Series, 3D, Video, Geospatial
For example, you can search for a dataset that contains both text and image data.
The modalities of each dataset are detected automatically from the contents and extensions of its files.
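To give a feel for the extension-based part of this detection, here is a minimal sketch. The mapping table and the `detect_modalities` helper are hypothetical and only illustrate the idea; the Hub's actual rules, which also inspect file contents, are not described in this post.

```python
from pathlib import PurePosixPath

# Hypothetical extension-to-modality table (an assumption for illustration,
# not the Hub's real mapping).
EXTENSION_TO_MODALITY = {
    ".txt": "text",
    ".png": "image",
    ".jpg": "image",
    ".wav": "audio",
    ".mp3": "audio",
    ".csv": "tabular",
    ".mp4": "video",
}


def detect_modalities(filenames):
    """Return the sorted list of modalities implied by the file extensions."""
    found = set()
    for name in filenames:
        suffix = PurePosixPath(name).suffix.lower()
        modality = EXTENSION_TO_MODALITY.get(suffix)
        if modality is not None:
            found.add(modality)
    return sorted(found)


files = ["train/0001.JPG", "train/0001.txt", "README.md"]
print(detect_modalities(files))  # ['image', 'text']
```

A dataset with several file types simply ends up with several modalities, which is what makes a combined text-and-image search possible.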
Search by size
We recently released a new feature in the interface that displays the number of rows of each dataset.
Building on this, you can now search for datasets by number of rows by specifying a minimum and maximum row count. This lets you look for the largest datasets in existence (for example, the datasets used to pretrain LLMs).
Row count information is available for all datasets in supported formats. Even for the largest datasets, where the row count is not included in the metadata, the total number of rows is estimated accurately from the content of the first 5GB.
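This post does not say how the estimate is computed. One plausible approach is linear extrapolation from the scanned prefix, sketched below under the assumption that rows have a similar size throughout the dataset. The function name and the numbers are hypothetical.

```python
def estimate_total_rows(rows_in_sample, sample_bytes, total_bytes):
    """Extrapolate a row count from a scanned prefix of the data.

    Assumes the average row size in the prefix holds for the whole dataset.
    """
    if sample_bytes <= 0:
        raise ValueError("sample_bytes must be positive")
    if total_bytes <= sample_bytes:
        # The whole dataset was scanned, so the count is exact.
        return rows_in_sample
    return round(rows_in_sample * total_bytes / sample_bytes)


GB = 10**9
# Hypothetical numbers: 20 million rows counted in the first 5 GB of a 200 GB dataset.
estimate = estimate_total_rows(20_000_000, 5 * GB, 200 * GB)
print(estimate)  # 800000000
```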
For example, if you are looking for the datasets with the highest number of rows on Hugging Face, you can search for datasets with more than 10B (10^10) rows.
Search by format
The same dataset can be stored in a variety of formats. For example, text datasets are often stored as Parquet or JSON Lines, but they may also be plain text files; image datasets are often a single directory of images, but they may also be in WebDataset format (a format based on TAR archives).
Each format has its advantages and disadvantages. For example, Parquet offers nested data support (unlike CSV), efficient filtering and analysis, and good compression ratios, but accessing one particular row requires decoding the full row group. Another example is WebDataset, which provides the best data streaming speed but lacks some metadata, such as the number of rows per file, which is often needed to distribute data efficiently in a multi-node training setup.
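The missing row count is easy to see in practice. In the WebDataset convention, files in a TAR archive that share the same key (the name up to the first dot) form one sample, and a plain TAR carries no header stating how many samples it holds. The sketch below uses only the standard library on an in-memory shard; the helper names and sample contents are made up for illustration.

```python
import io
import tarfile
from collections import defaultdict


def make_shard(samples):
    """Build an in-memory TAR shard from {key: {extension: bytes}}."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key, fields in samples.items():
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def count_samples(shard):
    """Count samples by walking every member; there is no shortcut in the archive."""
    keys = defaultdict(list)
    with tarfile.open(fileobj=shard, mode="r") as tar:
        for member in tar:
            key, _, ext = member.name.partition(".")
            keys[key].append(ext)
    return len(keys), dict(keys)


shard = make_shard({
    "000001": {"jpg": b"<image bytes>", "txt": b"a cat"},
    "000002": {"jpg": b"<image bytes>", "txt": b"a dog"},
})
num_samples, layout = count_samples(shard)
print(num_samples)  # 2
print(layout)       # {'000001': ['jpg', 'txt'], '000002': ['jpg', 'txt']}
```

Because the count requires reading the whole archive, a training job that needs per-file row counts up front has to compute and store them separately.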
The dataset format therefore indicates which use cases it is best suited for and whether the data needs to be reformatted to fit your needs.
Here you can see datasets in WebDataset format.
Search by library
There are many great libraries and tools for loading datasets and preparing them for training, such as Pandas, Dask, and the 🤗 Datasets library. The Hub lets you use your favorite tools and filter for datasets that are compatible with your library. For example, you can search for datasets that are compatible with pandas.
Dataset compatibility is based on dataset format and size (for example, Dask can load big JSON Lines datasets, unlike pandas, which requires loading the full dataset into memory). In addition, we provide a code snippet to load any dataset in your favorite tool.
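Part of the reason format matters here is that JSON Lines stores one record per line, so it can be consumed incrementally. The sketch below is plain standard-library Python, not Dask or pandas code, and only illustrates that record-at-a-time property on a small in-memory stand-in for a large file.

```python
import io
import json


def iter_json_lines(stream):
    """Yield one parsed record at a time, skipping blank lines."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# A small in-memory stand-in for a large .jsonl file.
data = io.StringIO(
    '{"text": "hello", "label": 1}\n'
    '{"text": "world", "label": 0}\n'
)

# Only one record is held in memory at any point during the aggregation.
positive = sum(record["label"] for record in iter_json_lines(data))
print(positive)  # 1
```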
If you would like your library to appear in the list of supported libraries, please open a pull request on huggingface.js!
Combine filters
These four new dataset search tools can be used together with existing filters such as languages, tasks, and licenses. All of these filters can also be combined with the text search bar to find the specific dataset you are looking for.
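Conceptually, combining filters means a dataset must satisfy every condition you set. The sketch below models this over a tiny in-memory catalog. The `DatasetInfo` record, its field names, and the example entries are all hypothetical and do not reflect the Hub's schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    # Hypothetical record; field names are illustrative only.
    name: str
    modalities: frozenset
    num_rows: int
    fmt: str
    libraries: frozenset


def search(catalog, text="", modalities=(), min_rows=0, max_rows=None,
           fmt=None, library=None):
    """Return names of datasets that match every supplied filter (logical AND)."""
    results = []
    for d in catalog:
        if text.lower() not in d.name.lower():
            continue
        if not set(modalities) <= d.modalities:
            continue
        if d.num_rows < min_rows:
            continue
        if max_rows is not None and d.num_rows > max_rows:
            continue
        if fmt is not None and d.fmt != fmt:
            continue
        if library is not None and library not in d.libraries:
            continue
        results.append(d.name)
    return results


catalog = [
    DatasetInfo("example/captions", frozenset({"text", "image"}), 2_000_000,
                "webdataset", frozenset({"datasets"})),
    DatasetInfo("example/reviews", frozenset({"text"}), 50_000,
                "parquet", frozenset({"datasets", "pandas", "dask"})),
    DatasetInfo("example/web-text", frozenset({"text"}), 12_000_000_000,
                "parquet", frozenset({"datasets", "dask"})),
]

print(search(catalog, modalities=["text"], fmt="parquet", library="pandas"))
# ['example/reviews']
print(search(catalog, modalities=["text", "image"]))
# ['example/captions']
print(search(catalog, text="web", min_rows=10**10))
# ['example/web-text']
```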