Announcement of new dataset search capabilities

By versatileai, April 12, 2025

The AI and ML communities share over 180,000 public datasets on the Hugging Face Dataset Hub. Researchers and engineers use these datasets for a variety of tasks, from training LLMs that chat with users to evaluating automatic speech recognition or computer vision systems. Dataset discoverability and visualization are key challenges in letting AI builders find, explore, and transform datasets to fit their use cases.

At Hugging Face, we are building the Dataset Hub as the place where the community collaborates on open datasets. To that end, we have built tools such as Dataset Search and the Dataset Viewer, along with a rich open source ecosystem of tools. Today we are unveiling four new features that take dataset search on the Hub to the next level.

Search by modality

The modality of a dataset corresponds to the type of data it contains. For example, the most common types of data on Hugging Face are text, image, audio, and tabular data.

We have released a set of filters that let you filter datasets by one or more modalities from this list:

Text, Image, Audio, Tabular, Time-Series, 3D, Video, Geospatial

For example, you can search for a dataset that contains both text and image data.

The modalities of each dataset are detected automatically, based on the contents and extensions of its files.
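
As a rough illustration of extension-based detection, here is a minimal sketch. The `EXTENSION_MODALITY` table and both function names are hypothetical and are not the Hub's actual detection logic, which also inspects file contents; this sketch looks at extensions only.

```python
from pathlib import PurePosixPath

# Hypothetical extension-to-modality table, for illustration only.
# The Hub's real detection also inspects file contents.
EXTENSION_MODALITY = {
    ".txt": "text",
    ".jpg": "image",
    ".png": "image",
    ".wav": "audio",
    ".mp3": "audio",
    ".csv": "tabular",
    ".mp4": "video",
}


def detect_modalities(filenames):
    """Return the sorted modalities implied by the file extensions."""
    found = set()
    for name in filenames:
        suffix = PurePosixPath(name).suffix.lower()
        modality = EXTENSION_MODALITY.get(suffix)
        if modality is not None:
            found.add(modality)
    return sorted(found)


def matches(filenames, required):
    """True if the files cover every modality listed in `required`."""
    return set(required) <= set(detect_modalities(filenames))


files = ["train/0001.jpg", "train/0001.txt", "README.md"]
print(detect_modalities(files))
print(matches(files, ["text", "image"]))
```

Files with unknown extensions (such as `README.md` here) are simply ignored, so a dataset is only tagged with modalities the table recognizes.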

Search by size

We recently released a new feature in the interface that displays the number of rows of each dataset.

Number of rows in each dataset

Building on this, you can now search for datasets by number of rows by specifying a minimum and maximum row count. This lets you look for the largest datasets that exist (for example, the ones used to pretrain LLMs).

Row count information is available for all datasets in supported formats. Even for the largest datasets, where the row count is not included in the metadata, the total number of rows is accurately estimated from the content of the first 5GB.
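
One plausible way to make such an estimate is proportional extrapolation: count the rows in a prefix of the data and scale by the total size. The sketch below assumes newline-delimited rows and a hypothetical `estimate_row_count` helper; the Hub's actual estimator is not described here and may differ.

```python
import io


def estimate_row_count(stream, total_bytes, sample_bytes):
    """Estimate rows in a newline-delimited file from its first `sample_bytes`.

    Assumption: rows in the prefix are representative of the whole file.
    """
    sample = stream.read(sample_bytes)
    if not sample:
        return 0
    if len(sample) >= total_bytes:
        # The whole file fit in the sample, so the count is exact.
        return sample.count(b"\n") + (0 if sample.endswith(b"\n") else 1)
    last_newline = sample.rfind(b"\n")
    if last_newline == -1:
        raise ValueError("sample too small to contain a complete row")
    # Ignore the trailing partial row, then scale up by the size ratio.
    complete = sample[: last_newline + 1]
    rows_in_sample = complete.count(b"\n")
    return round(rows_in_sample * total_bytes / len(complete))


data = b'{"id": 0}\n' * 1000
print(estimate_row_count(io.BytesIO(data), len(data), sample_bytes=105))
```

The estimate is only as good as the assumption that row sizes in the prefix match the rest of the file, which is why a large prefix is helpful.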

For example, if you are looking for the datasets with the highest number of rows on Hugging Face, you can search for datasets with more than 10B (10^10) rows.

The largest data set

Search by format

The same dataset can be stored in a variety of formats. For example, text datasets are often stored as Parquet or JSON Lines but may also be plain text files, and image datasets are often a simple directory of images but may also be in WebDataset format (a format based on TAR archives).

Each format has its advantages and disadvantages. For example, Parquet offers nested data support (unlike CSV), efficient filtering and analysis, and good compression ratios, but to access one particular row you need to decode the full row group. Another example is WebDataset, which provides the best data streaming speed but lacks some metadata, such as the number of rows per file, which is often needed to distribute data efficiently in a multi-node training setup.

Therefore, the format of a dataset indicates which use cases it suits best and whether the data needs to be reformatted to fit your needs.
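
To make the WebDataset trade-off concrete, the sketch below streams a TAR shard with only the Python standard library. It relies on the WebDataset convention that files sharing the same basename form one sample. The shard contents are placeholder bytes, the helper names are hypothetical, and this is not the `webdataset` library's own reader.

```python
import io
import tarfile


def build_demo_shard():
    """Build a tiny in-memory TAR shard with two samples (placeholder bytes)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in [
            ("000.jpg", b"fake-image-0"),
            ("000.txt", b"caption 0"),
            ("001.jpg", b"fake-image-1"),
            ("001.txt", b"caption 1"),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def iter_samples(fileobj):
    """Stream samples from a TAR shard, grouping consecutive files by key.

    Simplification: the key is everything before the first dot in the name.
    """
    current_key, sample = None, {}
    # Mode "r|" reads the archive sequentially, without seeking.
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, extension = member.name.partition(".")
            if key != current_key and sample:
                yield current_key, sample
                sample = {}
            current_key = key
            sample[extension] = tar.extractfile(member).read()
    if sample:
        yield current_key, sample


for key, sample in iter_samples(build_demo_shard()):
    print(key, sorted(sample))

# The sample count is not stored in the archive: it requires a full pass.
print(sum(1 for _ in iter_samples(build_demo_shard())))
```

Reading is purely sequential, which is what makes streaming fast, but the last line also shows the cost mentioned above: counting samples means scanning the whole shard.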

Here you can see the datasets in WebDataset format.

Datasets in WebDataset format

Search by library

There are many great libraries and tools for loading datasets and preparing them for training, such as Pandas, Dask, and the 🤗 Datasets library. The Hub lets you use your favorite tools and filter for datasets that are compatible with your library. For example, you can search for datasets that are compatible with Pandas.

Pandas-compatible datasets

Dataset compatibility is based on the dataset's format and size (for example, Dask can load big JSON Lines datasets, unlike Pandas, which requires loading the full dataset into memory). In addition, we provide code snippets for loading datasets with your favorite tools.

Load FineWeb-Edu into Dask

If you would like your library to appear in the list of supported libraries, please let us know on huggingface.js!

Combine filters

These four new dataset search tools can be used together with the existing filters, such as language, task, and license. The filters can also be combined with the text search bar to find the specific dataset you are looking for.
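
Conceptually, combining filters means a dataset must satisfy every active criterion at once. The sketch below shows this over a small local catalog. The `DatasetInfo` record, the `search` function, and the catalog entries are all hypothetical placeholders, not the Hub's API or real datasets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    """Hypothetical metadata record, for illustration only."""
    name: str
    modalities: frozenset
    num_rows: int
    data_format: str
    libraries: frozenset


def search(catalog, modalities=(), min_rows=None, max_rows=None,
           data_format=None, library=None, text=""):
    """Return datasets that satisfy every filter that was provided."""
    results = []
    for info in catalog:
        if not set(modalities) <= info.modalities:
            continue
        if min_rows is not None and info.num_rows < min_rows:
            continue
        if max_rows is not None and info.num_rows > max_rows:
            continue
        if data_format is not None and info.data_format != data_format:
            continue
        if library is not None and library not in info.libraries:
            continue
        if text.lower() not in info.name.lower():
            continue
        results.append(info)
    return results


# Placeholder entries; names and numbers are made up for the example.
catalog = [
    DatasetInfo("example/pdf-pages", frozenset({"image", "text"}),
                2_000_000, "webdataset", frozenset({"webdataset"})),
    DatasetInfo("example/small-table", frozenset({"tabular"}),
                5_000, "csv", frozenset({"pandas", "dask"})),
    DatasetInfo("example/web-text", frozenset({"text"}),
                20_000_000_000, "parquet", frozenset({"dask"})),
]

hits = search(catalog, modalities=["image"], data_format="webdataset", text="pdf")
print([info.name for info in hits])
```

Filters left at their defaults are skipped, so each additional argument can only narrow the result set.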

Search for WebDataset for PDF Images
