Object manipulation with NVIDIA Isaac GR00T N1.
At its annual GTC conference, NVIDIA announced a trio of open source releases aimed at accelerating physical AI development: Cosmos Transfer, a new World Foundation Model (WFM) with multimodal control; a highly curated physical AI dataset; and NVIDIA Isaac GR00T N1, the first open foundation model for generalist humanoid reasoning and skills. Together, these releases represent a significant step forward for physical AI, supporting developers building robotic systems and autonomous vehicle technology.
New World Foundation Model – Cosmos Transfer
The latest addition to NVIDIA's Cosmos™ World Foundation Models (WFMs), Cosmos Transfer introduces a new level of control and accuracy when generating scenes for virtual worlds.
Available at a 7-billion-parameter size, the model uses multimodal control to drive high-fidelity world scene generation from structured inputs, ensuring accurate spatial alignment and scene composition.
How it works
The model is built by individually training a ControlNet for each sensor modality used to capture the simulated world.
Input types include 3D bounding box maps, trajectory maps, depth maps, and segmentation maps.
During inference, developers can use a variety of input types, including structured visual or geometric data such as segmentation maps, depth maps, edge maps, human motion keypoints, LiDAR scans, trajectories, HD maps, and 3D bounding boxes. The control signals from each control branch are multiplied by a corresponding adaptive spatiotemporal control map and summed before being added to the transformer blocks of the base model. The generated output is a photorealistic video sequence with controlled layout, object placement, and motion. Developers can steer the output in multiple ways, for example preserving both structure and appearance, or allowing appearance to vary while structure is preserved.
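The branch-fusion step described above can be sketched in a few lines. This is a minimal illustration, not NVIDIA's implementation: the function name, tensor shapes, and weight-map layout are assumptions chosen to show the idea of modulating each control branch by a spatiotemporal weight map before summing.

```python
import numpy as np

def fuse_control_branches(branch_signals, weight_maps):
    """Fuse per-modality control signals with adaptive spatiotemporal weights.

    branch_signals: list of arrays, each (T, H, W, C) -- one per control branch
    weight_maps:    list of arrays, each (T, H, W, 1) -- adaptive control maps
    Returns the weighted sum, ready to be added to the base model's activations.
    """
    fused = np.zeros_like(branch_signals[0])
    for signal, weight in zip(branch_signals, weight_maps):
        fused += signal * weight  # elementwise spatiotemporal modulation
    return fused

# Toy example: two branches (say, depth and segmentation) over a 2-frame clip.
rng = np.random.default_rng(0)
depth_branch = rng.normal(size=(2, 4, 4, 8))
seg_branch = rng.normal(size=(2, 4, 4, 8))

# Give depth full weight in frame 0 and segmentation full weight in frame 1,
# so each frame of the fused output is driven by a different modality.
w_depth = np.zeros((2, 4, 4, 1)); w_depth[0] = 1.0
w_seg = np.zeros((2, 4, 4, 1));   w_seg[1] = 1.0

fused = fuse_control_branches([depth_branch, seg_branch], [w_depth, w_seg])
```

Because the weight maps vary over time and space, this scheme lets different modalities dominate different regions or frames of the generated video.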
Cosmos Transfer output across a variety of environments and weather conditions.
Cosmos Transfer, combined with the NVIDIA Omniverse platform, enables controllable synthetic data generation at scale for robotics and autonomous vehicle development. Find more Cosmos Transfer examples on GitHub.
Cosmos Transfer samples built with the post-trained base model are also available for autonomous vehicle development.
Open Physical AI Dataset
NVIDIA has also released the Physical AI Dataset, an open source dataset on Hugging Face for developing physical AI. This commercial-grade, pre-validated dataset consists of 15 terabytes of data representing more than 320,000 trajectories for robotics training, plus up to 1,000 Universal Scene Description (OpenUSD) assets.
The dataset is designed for post-training foundation models, such as the Cosmos Predict World Foundation Model, providing developers with high-quality, diverse data to enhance their AI models.
Generalist Humanoid Foundation Model – NVIDIA Isaac GR00T N1
Another exciting announcement is the release of NVIDIA Isaac GR00T N1, the world's first open foundation model for generalized humanoid robot reasoning and skills. This cross-embodiment model takes multimodal inputs, including language and images, to perform manipulation tasks in a variety of environments. The NVIDIA Isaac GR00T-N1-2B model is available on Hugging Face.
Isaac GR00T N1 was trained on a vast humanoid dataset consisting of real captured data, synthetic data generated using components of the NVIDIA Isaac GR00T Blueprint, and internet-scale video data. It can be post-trained for specific embodiments, tasks, and environments.
Isaac GR00T N1 uses a single model and set of weights to enable manipulation behaviors on a variety of humanoid robots, such as the Fourier GR-1 and 1X NEO. It shows robust generalization across a range of tasks, including grasping and manipulating objects with one or both arms, as well as transferring items between arms. It also handles complex, multi-step tasks that require sustained context understanding and the integration of diverse skills. These capabilities make it suitable for applications in material handling, packaging, and inspection.
Isaac GR00T N1 features a dual-system architecture inspired by human cognition, consisting of the following complementary components:
Vision-Language Model (System 2): This deliberate, methodical thinking system is based on NVIDIA Eagle with SmolLM2-1.7B. It interprets the environment through vision and language instructions, allowing the robot to reason about its surroundings and instructions and to plan appropriate actions.
Diffusion Transformer (System 1): This action model generates continuous actions to control the robot's movements, translating the action plan produced by System 2 into precise, continuous robot motions.
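The two-system flow can be sketched as a simple plan-then-act loop. This is a hypothetical illustration of the architecture's shape, not the actual GR00T N1 API: the class names, embedding size, action dimension, and the toy "denoising" update are all assumptions made for the sketch.

```python
import numpy as np

class System2Planner:
    """Stand-in for the vision-language model: observation + instruction -> plan embedding."""
    def plan(self, image, instruction):
        # A real VLM would encode the image and text; here we return a fixed embedding.
        return np.ones(16)

class System1Policy:
    """Stand-in for the diffusion transformer: plan embedding -> chunk of continuous actions."""
    def __init__(self, action_dim=7, horizon=8, steps=4):
        self.action_dim, self.horizon, self.steps = action_dim, horizon, steps

    def act(self, plan, rng):
        # Start from noise and iteratively refine toward the plan,
        # mimicking diffusion-style action decoding.
        actions = rng.normal(size=(self.horizon, self.action_dim))
        target = plan[: self.action_dim]
        for _ in range(self.steps):
            actions = actions - 0.5 * (actions - target)  # toy denoising step
        return actions

rng = np.random.default_rng(0)
plan = System2Planner().plan(image=None, instruction="pick up the cup")
actions = System1Policy().act(plan, rng)  # an (8, 7) chunk of joint-space actions
```

The split mirrors the description above: the slow, deliberate System 2 reasons and plans, while the fast System 1 turns that plan into continuous low-level motion.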
Moving forward
Post-training is the path to advancing autonomous systems, producing specialized models for downstream physical AI tasks.
Check out GitHub for the Cosmos Predict and Cosmos Transfer inference scripts. For more information, see the Cosmos Transfer research paper.
The NVIDIA Isaac GR00T-N1-2B model is available on Hugging Face, along with a sample dataset and PyTorch scripts for post-training on custom user datasets. It is compatible with the Hugging Face LeRobot format. For more information about the Isaac GR00T N1 model, see the research paper.
For more updates, follow NVIDIA on Hugging Face.