tl;dr: Today we are releasing LeRobotDataset v3.0! In previous LeRobotDataset v2.x releases, each episode was saved in its own file, which ran into filesystem limits when scaling datasets to millions of episodes. LeRobotDataset v3.0 packs multiple episodes into a single file and uses relational metadata to retrieve information at the individual episode level from multi-episode files. The new format also natively supports accessing datasets in streaming mode, allowing large datasets to be processed on the fly. We provide a one-liner utility for converting any dataset in the LeRobotDataset format to the new format, and we are extremely excited to share this milestone with the community ahead of the next stable release.
LeRobotDataset v3.0
LeRobotDataset is a standardized dataset format designed to address the specific needs of robot learning, providing unified and convenient access to robotic data across modalities, including sensorimotor readings, multiple camera feeds, and teleoperation status. The format also stores general information about how the data was collected (metadata), such as a text description of the task being performed, the type of robot used, and measurement details such as the frames per second at which both image and robot-state streams are sampled. Metadata helps you index and search robotics datasets across the Hugging Face Hub!
The LeRobot library provides a unified interface for manipulating this multimodal time-series data, seamlessly integrating with both the Hugging Face and PyTorch ecosystems. The dataset format is designed to be easily extensible and customizable across a wide range of embodiments, from SO-100 arms and ALOHA-2 setups to real-world humanoid data, simulation datasets, and even non-manipulator platforms such as self-driving vehicle data, with datasets already available from a wide range of embodiments. You can use the Dataset Visualizer to explore the datasets the community has contributed! 🔗
In addition to scale, this new release of LeRobotDataset also enables support for streaming, allowing you to process batches of data from large datasets on the fly. You can access and use v3.0 datasets in streaming mode via the dedicated StreamingLeRobotDataset interface. Streaming datasets are an important milestone towards more accessible robot learning, and we are excited to share them with our community.

Install LeRobot and record a dataset
LeRobot is an end-to-end robotics library developed by Hugging Face, supporting real-world robotics as well as state-of-the-art robot learning algorithms. The library lets you record datasets directly on real-world robots and store them on the Hugging Face Hub. You can learn more about the robots we currently support here.
LeRobotDataset v3.0 will become part of the LeRobot library starting with lerobot-v0.4.0, and we are extremely excited to share it with the community early. You can install the latest lerobot-v0.3.x, which already supports this new dataset format, directly with pip using:
```
pip install "https://github.com/huggingface/lerobot/archive/33cad37054c2b594ceba57463e8f11ee374fa93c.zip"
```
Follow the community’s progress towards a stable release of the library here 🤗
Once you have installed a version of LeRobot that supports the new dataset format, you can record a dataset on the SO-101 robot arm via teleoperation using the following command:
```
lerobot-record \
    --robot.type=so101_follower \
    --robot.port=/dev/tty.usbmodem585a0076841 \
    --robot.id=my_awesome_follower_arm \
    --robot.cameras="{front: {type: opencv, index_or_path: 0, width: 1920, height: 1080, fps: 30}}" \
    --teleop.type=so101_leader \
    --teleop.port=/dev/tty.usbmodem58760431551 \
    --teleop.id=my_awesome_leader_arm \
    --display_data=true \
    --dataset.repo_id=${HF_USER}/record-test \
    --dataset.num_episodes=5 \
    --dataset.single_task="Grab the black cube"
```
Visit the official documentation to learn how to record a dataset for your use case.
The core design choice behind LeRobotDataset is to separate the underlying data storage from the user-facing API. This allows for efficient serialization and storage while presenting the data in an intuitive, ready-to-use format. The dataset is organized into three main components:
Tabular data: Low-dimensional, high-frequency data such as joint states and actions are stored in efficient Apache Parquet files, with loading typically delegated to the mature datasets library, providing fast, memory-mapped or streaming-based access.

Visual data: To handle the large volume of camera data, frames are concatenated and encoded into MP4 files. Frames from the same episode are always grouped into the same video, and videos are grouped by camera. To reduce filesystem stress, groups of videos from the same camera view are further split across multiple subdirectories.

Metadata: A collection of JSON files that describe the dataset's structure, serving as the relational counterpart to both the tabular and visual sides of the data. Metadata includes the feature schemas, frame rates, normalization statistics, and episode boundaries.

To support datasets with potentially millions of episodes (resulting in hundreds of millions of individual frames), we merge data from different episodes into the same high-level structures. Concretely, this means that a given Parquet file or video does not contain information about only one episode, but rather a concatenation of the information for multiple episodes. This keeps the dataset manageable both for local file systems and for remote storage providers like the Hugging Face Hub. You can then use the metadata to retrieve episode-specific information, for example the timestamps at which a given episode starts and ends within a video.
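To make this concrete, here is a minimal, purely illustrative sketch of how relational metadata can resolve an episode index to its location inside a concatenated file. The field names and layout here are simplified assumptions for illustration, not the actual LeRobotDataset schema:

```python
# Illustrative sketch only: the real metadata schema differs, but the
# lookup principle is the same. Each episode entry points to the file it
# lives in and to its frame/timestamp boundaries within that file.
episodes_meta = [
    {"episode_index": 0, "file": "data/chunk-000/file-000.parquet",
     "video": "videos/front/chunk-000/file-000.mp4",
     "from_frame": 0, "to_frame": 250, "video_from_ts": 0.0, "video_to_ts": 8.33},
    {"episode_index": 1, "file": "data/chunk-000/file-000.parquet",
     "video": "videos/front/chunk-000/file-000.mp4",
     "from_frame": 250, "to_frame": 430, "video_from_ts": 8.33, "video_to_ts": 14.33},
]

def locate_episode(episode_index: int) -> dict:
    """Return the storage location and boundaries of one episode."""
    meta = episodes_meta[episode_index]
    return {
        "parquet_file": meta["file"],
        "row_range": (meta["from_frame"], meta["to_frame"]),
        "video_file": meta["video"],
        "ts_range": (meta["video_from_ts"], meta["video_to_ts"]),
    }

# Both episodes live in the same files; only their boundaries differ.
print(locate_episode(1)["row_range"])  # (250, 430)
```

Note how the two episodes share the same Parquet and MP4 files: only the recorded boundaries distinguish them, which is what lets a single file hold many episodes.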
The dataset is organized as a repository that includes:
meta/info.json: This is the central metadata file. It contains the complete dataset schema, defining all features (for example, observation.state, action), their shapes, and data types. It also stores important information such as the dataset's frames per second (FPS), the codebase version, and the path templates used to locate data and video files.

meta/stats.json: This file stores aggregate statistics (mean, std, min, max) for each feature across the dataset. These are used for data normalization and are accessible via dataset.meta.stats.

meta/tasks.jsonl: Contains the mapping from natural-language task descriptions to integer task indices, used for task-conditioned policy training.

meta/episodes/: This directory contains per-episode metadata, such as each episode's length, its corresponding task, and pointers to where its data is stored. For scalability, this information is stored in chunked Parquet files rather than a single large JSON file.

data/: Contains the core frame-by-frame tabular data in Parquet files. Data from multiple episodes is concatenated into larger files to improve performance when handling large datasets, and these files are organized into chunked subdirectories to keep file sizes manageable. A single file therefore usually contains data from multiple episodes.

videos/: Contains the MP4 video files for all visual observation streams. As with the data/ directory, footage from multiple episodes is concatenated into single MP4 files. This strategy significantly reduces the number of files in the dataset, which modern file systems handle much more efficiently. The path structure (videos/…/file_….mp4) lets the data loader find the correct video file and seek to the exact timestamp of a given frame.
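As an illustration of how the aggregate statistics in meta/stats.json are typically used, here is a small, self-contained sketch of mean/std normalization. The in-memory dictionary stands in for the JSON file's contents, and its exact field layout is a simplified assumption, not the precise LeRobot schema:

```python
import json

# A simplified stand-in for the contents of meta/stats.json.
stats = json.loads(json.dumps({
    "observation.state": {"mean": [0.1, -0.2], "std": [0.5, 0.25],
                          "min": [-1.0, -1.0], "max": [1.0, 1.0]},
}))

def normalize(feature: str, values: list) -> list:
    """Standardize raw values using the dataset-level mean/std for a feature."""
    mean = stats[feature]["mean"]
    std = stats[feature]["std"]
    return [(v - m) / s for v, m, s in zip(values, mean, std)]

# Each dimension is shifted by its mean and scaled by its std.
print(normalize("observation.state", [0.6, -0.45]))  # ≈ [1.0, -1.0]
```

Policies are usually trained on such standardized inputs, which is why the statistics are precomputed once over the whole dataset and shipped alongside it.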
Migrate a v2.1 dataset to v3.0
LeRobotDataset v3.0 will ship with lerobot-v0.4.0, and you can easily convert a dataset currently hosted on the Hugging Face Hub to the new v3.0 format with:
```
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id=<repo_id>
```
We are extremely excited to share this new format with the community early! While we develop lerobot-v0.4.0, you can install the latest lerobot-v0.3.x, which supports the new dataset format, directly with pip, and convert your datasets to the new version using:
```
pip install "https://github.com/huggingface/lerobot/archive/33cad37054c2b594ceba57463e8f11ee374fa93c.zip"
```
```
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id=<repo_id>
```
Note that this is a pre-release and, as such, a generally unstable version. You can follow the development status of the next stable release here:
The conversion script, convert_dataset_v21_to_v30.py, merges the per-episode files (episode-0000.mp4, episode-0001.mp4, episode-0002.mp4, … and episode-0000.parquet, episode-0001.parquet, episode-0002.parquet, …) into larger concatenated files, and updates the metadata accordingly so that episode-specific information can be retrieved from the higher-level files.
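The merging step can be sketched in plain Python. This is an illustrative analogue of what such a conversion does, not the actual script: per-episode tables are concatenated into one larger table while each episode's row boundaries are recorded in the metadata:

```python
# Illustrative sketch: three small per-episode tables become one
# concatenated table, and the metadata records where each episode
# starts and ends inside it.
per_episode_tables = {
    "episode-0000": [{"frame": i, "state": 0.0} for i in range(3)],
    "episode-0001": [{"frame": i, "state": 1.0} for i in range(5)],
    "episode-0002": [{"frame": i, "state": 2.0} for i in range(2)],
}

merged_rows = []
episodes_meta = []
for episode_index, name in enumerate(sorted(per_episode_tables)):
    start = len(merged_rows)
    merged_rows.extend(per_episode_tables[name])
    episodes_meta.append({
        "episode_index": episode_index,
        "from_row": start,           # inclusive
        "to_row": len(merged_rows),  # exclusive
    })

print(len(merged_rows))   # 10
print(episodes_meta[1])   # {'episode_index': 1, 'from_row': 3, 'to_row': 8}
```

The same idea applies to the videos: episode clips are concatenated into one MP4 per chunk, with start/end timestamps stored in the metadata instead of row indices.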
Code example: using LeRobotDataset with torch.utils.data.DataLoader
Any dataset on the Hugging Face Hub, including all three pillars described above (tabular data, visual data, and relational metadata), can be accessed in a single line of code.
Most robot learning algorithms, whether based on reinforcement learning (RL) or behavioral cloning (BC), tend to operate on stacks of observations and actions. For example, RL algorithms usually use a history of previous observations o_{t-h_o:t}, and BC algorithms are typically trained to regress chunks of multiple actions. To accommodate these specifics of robot learning, LeRobotDataset provides native windowing operations, letting you retrieve frames from seconds before or after a given observation via the delta_timestamps argument.
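Under the hood, mapping delta_timestamps to frame offsets is just a multiplication by the dataset's FPS. The following is a hedged sketch of that arithmetic (not LeRobot's internal code; the function name is made up for illustration):

```python
def deltas_to_frame_indices(t, delta_timestamps, fps):
    """Convert time offsets (seconds) around frame t into absolute frame indices.

    Illustrative helper, not part of the LeRobot API.
    """
    return [t + round(dt * fps) for dt in delta_timestamps]

# At 10 FPS, offsets of -0.2s and -0.1s reach back 2 and 1 frames.
print(deltas_to_frame_indices(100, [-0.2, -0.1, 0.0], fps=10))  # [98, 99, 100]
```

This is why a delta_timestamps entry with three offsets yields a stack of three frames per sample.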
Conveniently, wrapping a LeRobotDataset in a PyTorch DataLoader automatically collates the individual sample dictionaries from the dataset into a single dictionary of batched tensors.
```python
import torch

from lerobot.datasets.lerobot_dataset import LeRobotDataset

repo_id = "yaak-ai/L2D-v3"
dataset = LeRobotDataset(repo_id)

sample = dataset[100]
print(sample)

delta_timestamps = {
    "observation.images.front_left": [-0.2, -0.1, 0.0]
}
dataset = LeRobotDataset(repo_id, delta_timestamps=delta_timestamps)
sample = dataset[100]
print(sample["observation.images.front_left"].shape)

batch_size = 16
data_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)
num_epochs = 1
device = "cuda" if torch.cuda.is_available() else "cpu"

for epoch in range(num_epochs):
    for batch in data_loader:
        observation = batch["observation.state.vehicle"].to(device)
        action = batch["action.continuous"].to(device)
        images = batch["observation.images.front_left"].to(device)
        ...
```
Streaming
You can also use the StreamingLeRobotDataset class to stream any dataset in the v3.0 format without downloading it locally.
```python
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

repo_id = "yaak-ai/L2D-v3"
dataset = StreamingLeRobotDataset(repo_id)
```
Conclusion
LeRobotDataset v3.0 is a stepping stone towards scaling up the robot datasets supported by LeRobot. By providing a format designed for storing and accessing large collections of robotics data, we move towards democratizing robotics, allowing the community to train on potentially millions of episodes without ever downloading the data locally!
You can install the latest lerobot-v0.3.x to try out the new dataset format, and share your feedback on GitHub or our Discord server. 🤗
Acknowledgments
We would like to thank the fantastic team at yaak.ai for providing valuable support and feedback during the development of LeRobotDataset v3.0. Go follow their organization on the Hugging Face Hub! We constantly work with our community and try to share features early. If you would like to collaborate, please reach out!

