Dataset usage is exploding, and Hugging Face has become the default home for many datasets. Each month, as the number of datasets uploaded to the Hub grows, so does the need to query, filter, and discover them.
Datasets created on the Hugging Face Hub each month
We are very excited to announce that you can now run SQL queries on your datasets directly in the Hugging Face Hub!
Introducing the SQL Console for Datasets
Every dataset now displays a SQL Console badge. With just one click, you can open the SQL Console and query that dataset.
Everything runs in the browser, and the console comes with some neat features:
- 100% local: The SQL Console is powered by DuckDB WASM, so you can query datasets without any dependencies.
- Full DuckDB syntax: DuckDB has full SQL syntax support, along with many built-in functions for regex, lists, JSON, embeddings, and more. You'll find DuckDB syntax very similar to PostgreSQL.
- Export results: You can export the results of a query to Parquet.
- Shareable: You can share query results of public datasets with a link.
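To give a flavor of those built-in functions, here is a minimal sketch; the columns url, text, and metadata are hypothetical examples and won't exist in every dataset:

```sql
-- A minimal sketch of DuckDB's built-in functions in one query.
-- The columns 'url', 'text', and 'metadata' are hypothetical examples.
SELECT
  regexp_extract(url, 'https?://([^/]+)', 1) AS domain,    -- regex functions
  string_split(text, ' ')[1:5] AS first_tokens,            -- list functions and slicing
  json_extract_string(metadata, '$.language') AS language  -- JSON functions
FROM train
LIMIT 10;
```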
How it works
Parquet conversion
Most Hugging Face datasets are stored in Parquet, a columnar data format optimized for performance and storage efficiency. The Dataset Viewer on Hugging Face and the SQL Console load data directly from a dataset's Parquet files. If a dataset is in another format, the first 5GB is automatically converted to Parquet. You can find more information about the Parquet conversion process in the Dataset Viewer Parquet API documentation.
Using these Parquet files, the SQL Console creates views for you to query, based on the dataset's splits and configs.
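For example (a sketch only; the view names depend on a dataset's actual splits and configs), a dataset with a single default config and a train split is typically queried like this:

```sql
-- List the views the SQL Console created from the dataset's splits/configs
SHOW TABLES;

-- Query a split through its view (here assuming a 'train' split exists)
SELECT COUNT(*) AS num_rows FROM train;
```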
DuckDB WASM 🦆
DuckDB WASM is the engine that powers the SQL Console. It is an in-process database engine that runs on WebAssembly in the browser; no server or backend is required.
By running entirely in the browser, it gives users maximum flexibility to query data without any dependencies. It also makes it very easy to share reproducible results with a simple link.
You may be wondering, "Does it work for large datasets?" The answer is "Yes!"
Here is a query of the OpenCo7/upvoteweb dataset, which has 12.6M rows in its Parquet conversion.
As you can see, the results of this simple filter query came back in under 3 seconds.
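The exact query isn't reproduced here, but a simple filter of that shape looks like the sketch below; the score column is an assumption for illustration, not necessarily the dataset's real schema:

```sql
-- Illustrative filter query; 'score' is a hypothetical column name
SELECT *
FROM train
WHERE score > 100
LIMIT 50;
```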
Queries will take longer depending on the size of the dataset and the complexity of the query, but you'll be surprised by how much you can do with the SQL Console. As with any technology, there are limitations.
- The SQL Console works for many queries. However, the memory limit is ~3GB, so it is possible to run out of memory and be unable to process a query (tip: try using filters and a LIMIT clause to reduce the amount of data you query; see the sketch after this list).
- While DuckDB WASM is very powerful, it does not have full feature parity with DuckDB. For example, DuckDB WASM does not yet support the hf:// protocol for querying datasets.
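Here is a quick sketch of that tip; instruction and output are example column names, and the idea is simply to select fewer columns, filter early, and cap the row count:

```sql
-- Keep memory usage low: project only the columns you need, filter, and LIMIT
SELECT instruction, output
FROM train
WHERE instruction IS NOT NULL
LIMIT 1000;
```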
Example: Converting a dataset from Alpaca to conversations
Now that we've introduced the SQL Console, let's explore a practical example. When fine-tuning a large language model (LLM), you often need to work with different data formats. One particularly popular format is the conversational format, where each row represents a multi-turn dialogue between a user and the model. The SQL Console can help us transform data into this format efficiently. Let's see how to convert an Alpaca dataset to a conversational format using SQL.
Typically, developers would tackle this task with a Python pre-processing step, but we can show how to accomplish the same thing in less than 30 seconds using the SQL Console.
In the dataset above, click on the SQL Console badge to open the SQL Console. You should see the query below automatically populated. When you are ready, click the Run Query button to execute it.
```sql
-- Convert the Alpaca format into a conversational format
WITH
source_view AS (
  SELECT * FROM train  -- view for the dataset's 'train' split
)
SELECT
  [
    struct_pack(
      "from" := 'user',
      "value" := CASE
                   WHEN input IS NOT NULL AND input != ''
                   THEN instruction || '\n\n' || input
                   ELSE instruction
                 END
    ),
    struct_pack("from" := 'assistant', "value" := output)
  ] AS conversation
FROM source_view
WHERE instruction IS NOT NULL
AND output IS NOT NULL;
```
In this query, each struct_pack call creates a STRUCT for one message, and the surrounding brackets collect the two messages into a conversation list.
DuckDB has great documentation on the STRUCT data type and its functions. You'll also find that many datasets contain columns with JSON data, and DuckDB provides functions to easily parse and query these columns.
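As a quick, hedged sketch of those JSON functions (assuming a hypothetical JSON column named metadata):

```sql
-- 'metadata' is a hypothetical JSON column used for illustration
SELECT
  json_extract_string(metadata, '$.source') AS source,  -- extract a string field
  json_extract(metadata, '$.tags') AS tags              -- extract a nested value
FROM train
WHERE json_valid(metadata)
LIMIT 20;
```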
Once you have the results, you can download them as a Parquet file. You can see what the final output looks like below:
Please give it a try!
As another example, you can try a SQL Console query on SkunkWorksAI/Reasoning-0.01 to see instructions with more than 10 reasoning steps.
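We haven't verified that dataset's exact schema here, so treat the following as a sketch only; the list column reasoning_chains is a hypothetical name for wherever the reasoning steps live:

```sql
-- Sketch: 'reasoning_chains' is a hypothetical list column; check the
-- dataset's actual schema in the Dataset Viewer before running this
SELECT *
FROM train
WHERE len(reasoning_chains) > 10
LIMIT 100;
```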
SQL Snippets
DuckDB has a ton of use cases that we are still exploring. We created a SQL Snippets space to showcase what you can do with the SQL Console.
Here are some really interesting use cases we’ve found:
Remember, it's one click to download your SQL results as a Parquet file and use them for your dataset!
We would love to hear what you think of the SQL Console. If you have any feedback, please comment on this post!
Resources