Versa AI hub

Announcement of new dataset search capabilities

By versatileai, April 12, 2025

The AI and ML community shares over 180,000 public datasets on the Hugging Face Dataset Hub. Researchers and engineers use these datasets for a variety of tasks, from training LLMs to chat with users to evaluating automatic speech recognition or computer vision systems. Dataset discoverability and visualization are key challenges in letting AI builders find, explore, and transform datasets to fit their use cases.

At Hugging Face, we are building the Dataset Hub as the place where the community collaborates on open datasets. To that end, we have built tools such as dataset search and the dataset viewer, alongside a rich open source ecosystem of tools. Today we are announcing four new features that take dataset search on the Hub to the next level.

Search by modality

The modality of a dataset corresponds to the type of data it contains. For example, the most common types of data on Hugging Face are text, image, audio, and tabular data.

We have released a set of filters that let you filter datasets by one or more modalities from this list:

Text, Image, Audio, Tabular, Time-Series, 3D, Video, Geospatial

For example, you can search for a dataset that contains both text and image data.

The modalities of each dataset are detected automatically from the contents and extensions of its files.
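
As a rough illustration of the extension-based half of that detection, the sketch below maps file extensions to modalities. The `EXTENSION_TO_MODALITY` table and the `detect_modalities` function are hypothetical names made up for this example; the Hub's actual detection also inspects file contents and is not reproduced here.

```python
from pathlib import PurePosixPath

# Hypothetical, deliberately incomplete mapping, for illustration only.
EXTENSION_TO_MODALITY = {
    ".txt": "text",
    ".png": "image",
    ".jpg": "image",
    ".wav": "audio",
    ".mp3": "audio",
    ".csv": "tabular",
    ".mp4": "video",
}


def detect_modalities(filenames):
    """Return the sorted list of modalities suggested by file extensions."""
    found = set()
    for name in filenames:
        # Compare extensions case-insensitively (".PNG" and ".png" match).
        suffix = PurePosixPath(name).suffix.lower()
        modality = EXTENSION_TO_MODALITY.get(suffix)
        if modality is not None:
            found.add(modality)
    return sorted(found)


files = ["train/0001.PNG", "train/0001.txt", "README.md"]
print(detect_modalities(files))
```

A dataset whose files suggest more than one modality, as in this example, would match a search for text and image data together.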

Search by size

We recently released a new feature in the interface that displays the number of rows in each dataset.

Number of rows in each dataset

Building on this, you can now search datasets by number of rows by specifying a minimum and maximum row count. This lets you look for the largest datasets that exist (for example, the datasets used to pretrain LLMs).

Row count information is available for all datasets in supported formats. Even for the largest datasets, where the row count is not included in the metadata, the total number of rows is estimated accurately based on the content of the first 5GB.

For example, if you are looking for the datasets with the most rows on Hugging Face, you can look for datasets with at least 10B (10^10) rows.

The largest data set
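
One simple way to picture the row-count estimate for datasets without row counts in their metadata is a proportional extrapolation from a prefix of the data. The sketch below is only that: an assumed approach under the stated uniform-row-size assumption, not a description of the Hub's actual estimator. The function name `estimate_total_rows` and the example figures are made up for illustration.

```python
def estimate_total_rows(rows_in_sample, sample_bytes, total_bytes):
    """Extrapolate a total row count from a prefix of the data.

    Assumes rows in the sampled prefix have roughly the same average
    size as rows in the rest of the dataset.
    """
    if rows_in_sample <= 0 or sample_bytes <= 0:
        raise ValueError("sample must contain at least one row and one byte")
    if total_bytes <= sample_bytes:
        # The sample covers the whole dataset, so the count is exact.
        return rows_in_sample
    avg_row_bytes = sample_bytes / rows_in_sample
    return round(total_bytes / avg_row_bytes)


GB = 10**9
# Made-up figures: 2 million rows found in the first 5 GB of a 50 GB dataset.
estimate = estimate_total_rows(2_000_000, 5 * GB, 50 * GB)
print(estimate)
```

The estimate is only as good as the assumption that the sampled prefix is representative of the rest of the data.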

Search by format

The same dataset can be stored in a variety of formats. For example, text datasets are often stored as Parquet or JSON Lines but can also be plain text files, and image datasets are often a single directory of images but can also be in WebDataset format (a format based on TAR archives).

Each format has its advantages and disadvantages. For example, Parquet offers nested data support (unlike CSV), efficient filtering and analysis, and good compression ratios, but to access one particular row you need to decode the full row group. Another example is WebDataset, which provides the best data streaming speed but lacks some metadata, such as the number of rows per file, which is often needed to distribute data efficiently in a multi-node training setup.

The dataset format therefore tells you which use cases it suits best and whether the data needs to be reformatted to fit your needs.

Here you can see the datasets in WebDataset format.

webdatasets
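
The Parquet trade-off mentioned above (decoding a full row group to reach one row) can be made concrete with a small calculation. The sketch below does not read Parquet files; it only shows, for assumed row group sizes, which row group holds a given row and how many rows would have to be decoded to reach it. `locate_row` and the sizes are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_row(row_group_sizes, row_index):
    """Return (row_group_index, rows_to_decode) for a global row index."""
    # ends[i] is the number of rows contained in row groups 0..i inclusive.
    ends = list(accumulate(row_group_sizes))
    if not ends or not 0 <= row_index < ends[-1]:
        raise IndexError("row_index out of range")
    # The first row group whose cumulative end exceeds row_index holds the row.
    group = bisect_right(ends, row_index)
    return group, row_group_sizes[group]


sizes = [1000, 1000, 500]  # made-up row group sizes
print(locate_row(sizes, 1200))
```

Here a single-row lookup lands in the second row group and costs decoding all 1000 of its rows, which is why smaller row groups favor random access while larger ones favor scans and compression.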

Search by library

There are many great libraries and tools for loading datasets and preparing them for training, such as Pandas, Dask, and the 🤗 Datasets library. The Hub lets you keep using your favorite tools by filtering for datasets that are compatible with your library. For example, you can search for datasets that are compatible with Pandas.

Pandas-compatible datasets

Dataset compatibility is based on dataset format and size (for example, Dask can load big JSON Lines datasets, unlike Pandas, which requires loading the full dataset into memory). In addition, we provide code snippets for loading a dataset with your favorite tool.

Load FineWeb-Edu into Dask
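
To see why size matters for compatibility, compare loading every JSON Lines record into memory with processing records one at a time. The sketch below uses only the Python standard library; it illustrates the general idea of eager versus incremental loading and is not how Pandas or Dask are implemented. The function names and sample data are made up.

```python
import io
import json


def load_all(lines_file):
    """Eager approach: materialize every record in memory at once."""
    return [json.loads(line) for line in lines_file if line.strip()]


def stream_records(lines_file):
    """Lazy approach: yield one record at a time."""
    for line in lines_file:
        if line.strip():
            yield json.loads(line)


# Tiny in-memory stand-in for a JSON Lines file.
data = '{"text": "a", "score": 1}\n{"text": "b", "score": 3}\n'

# Only one record is held in memory at a time while summing.
total = sum(record["score"] for record in stream_records(io.StringIO(data)))
print(total)
```

Memory use for `load_all` grows with the file, whereas the streaming version needs room for a single record, which is the property that makes very large JSON Lines files workable for tools that process data in chunks.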

If you would like your library to appear in the list of supported libraries, let us know on huggingface.js!

Combine filters

These four new dataset search features can be used together with the existing filters, such as language, task, and license. Filters can also be combined with the text search bar to find the specific dataset you are looking for.

Search for WebDataset for PDF Images
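
Combining filters amounts to keeping only the datasets that satisfy every condition that is set. The sketch below models this over a small in-memory catalog; `DatasetInfo`, `search`, and the catalog entries are all hypothetical and exist only to show the logic, not the Hub's implementation or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    name: str
    modalities: frozenset
    num_rows: int
    fmt: str
    libraries: frozenset


def search(datasets, modalities=(), min_rows=0, max_rows=None, fmt=None, library=None):
    """Return names of datasets that satisfy every filter that is set."""
    results = []
    for ds in datasets:
        if not set(modalities) <= ds.modalities:
            continue  # missing at least one requested modality
        if ds.num_rows < min_rows:
            continue
        if max_rows is not None and ds.num_rows > max_rows:
            continue
        if fmt is not None and ds.fmt != fmt:
            continue
        if library is not None and library not in ds.libraries:
            continue
        results.append(ds.name)
    return results


# Made-up catalog entries for illustration.
catalog = [
    DatasetInfo("example/captions", frozenset({"text", "image"}), 50_000,
                "webdataset", frozenset({"datasets"})),
    DatasetInfo("example/reviews", frozenset({"text"}), 2_000_000,
                "parquet", frozenset({"datasets", "pandas", "dask"})),
    DatasetInfo("example/web-crawl", frozenset({"text"}), 10**10,
                "parquet", frozenset({"datasets", "dask"})),
]

print(search(catalog, modalities=["text"], fmt="parquet", library="pandas"))
print(search(catalog, min_rows=10**10))
```

Each additional filter can only narrow the result set, so starting broad and adding filters one at a time is a reasonable way to home in on a dataset.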
