Author: Nigel Nelson, Lukas Zbinden, Mostafa Tolui, Sean Huber
Healthcare AI has been primarily perception-based, focused on models that interpret signals and classify or segment pathology and anatomy. But medicine involves action, and static, perception-only datasets that lack embodiment, contact dynamics, and closed-loop control are insufficient. The field needs standardized robot embodiments; synchronized vision, force, and kinematic data; simulation-reality pairing; and cross-embodiment benchmarks to build the foundation for physical AI.
1. Open-H-Embodiment
Open-H-Embodiment is a community-driven dataset initiative building an open, shared foundation for training and evaluating world models for AI autonomy in surgical robotics and ultrasound. The initiative, started by a steering committee including Professor Axel Krieger (Johns Hopkins University), Professor Nasir Navab (Technical University of Munich), and Dr. Mahdi Azizian (NVIDIA), has now grown to 35 organizations.
Participants from around the world came together to build the first large-scale dataset aimed at advancing physical AI in medical robotics.
Open-H-Embodiment sample data
Participants
Balgrist, CMR Surgical, Chinese University of Hong Kong, Great Bay University, Hong Kong Baptist University, Hamlyn Centre, ImFusion, Johns Hopkins University, University of Leeds, Mohamed bin Zayed University of Artificial Intelligence, Moon Surgical, NVIDIA, Northwell Health, Óbuda University, Hong Kong Polytechnic University, Qilu Hospital of Shandong University, Rob Surgical, Sano Science, Surgical Data Science Collective, Semaphor Surgical, Stanford, Technical University of Dresden, Technical University of Munich, Tuodao, Turin, University of British Columbia, University of California, Berkeley, University of California, San Diego, University of Illinois at Chicago, University of Tennessee, University of Texas at Austin, Vanderbilt University, and Virtual Incision.
Dataset
The dataset consists of 778 hours of CC-BY-4.0 healthcare robot training data, primarily for surgical robots but also including ultrasound and colonoscopy autonomy data. It ranges from simulations and benchtop exercises (such as suturing) to actual clinical procedures, and spans commercial robots (CMR Surgical, Rob Surgical, Tuodao) and research robots (dVRK, Franka, KUKA). It was released alongside two new permissively licensed open source models post-trained on this data.
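The post does not specify a loading API, but the synchronized vision, force, and kinematic streams described above suggest a record layout along the following lines. This is a minimal sketch with hypothetical field names and shapes, not the released schema:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class EpisodeStep:
    """One synchronized timestep of a hypothetical Open-H-Embodiment episode.

    Field names and shapes are illustrative assumptions, not the released schema.
    """
    rgb: np.ndarray              # (H, W, 3) endoscope or camera frame
    joint_positions: np.ndarray  # (D,) joint angles; D varies per embodiment
    eef_pose: np.ndarray         # (7,) end-effector position (xyz) + quaternion
    force: np.ndarray | None     # (6,) wrench, or None if the platform lacks sensing
    timestamp: float             # seconds since episode start

@dataclass
class Episode:
    embodiment: str              # e.g. "dvrk", "franka" (illustrative IDs)
    source: str                  # "simulation", "benchtop", or "clinical"
    task_prompt: str             # natural-language task description
    steps: list[EpisodeStep] = field(default_factory=list)
```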
2. GR00T-H: A Vision-Language-Action model for surgical robotics
The first is GR00T-H, a derivative of the Isaac GR00T N series Vision-Language-Action (VLA) models. Trained on approximately 600 hours of Open-H-Embodiment data, GR00T-H is the first policy model for surgical robotic tasks.
Built on NVIDIA’s open source ecosystem, GR00T-H leverages Cosmos Reason 2 2B as its Vision-Language Model (VLM) backbone.

Architectural design choices
Surgical robotics requires high precision, but specialized hardware (e.g., cable-driven systems) makes imitation learning (IL) difficult. To address this, GR00T-H makes four main design choices:
- Per-embodiment projector: A dedicated learnable MLP maps each robot’s unique kinematics into a shared, normalized action space (see the sketch after this list).
- State dropout (100%): Proprioceptive inputs are always dropped, so the state pathway reduces to a learned bias term per embodiment, which yields better real-world results.
- Relative EEF actions: Training uses a common relative end-effector (EEF) action space to overcome kinematic mismatches between robots.
- Task prompt metadata: The embodiment name and control-index mapping are inserted directly into the VLM task prompt.
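To make the first two design choices concrete, here is a minimal PyTorch sketch. The class, layer sizes, and per-robot keys are illustrative assumptions, not the released GR00T-H implementation:

```python
import torch
import torch.nn as nn

class EmbodimentProjector(nn.Module):
    """Per-robot MLP mapping embodiment-specific inputs into a shared,
    normalized action space (illustrative sketch, not GR00T-H code)."""

    def __init__(self, robot_dims: dict[str, int], shared_dim: int = 64):
        super().__init__()
        # One learnable MLP per embodiment, keyed by robot name.
        self.projectors = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(dim, 128), nn.GELU(), nn.Linear(128, shared_dim)
            )
            for name, dim in robot_dims.items()
        })
        # With 100% state dropout, proprioception is never observed, so the
        # state pathway collapses to a learned per-embodiment bias term.
        self.state_bias = nn.ParameterDict({
            name: nn.Parameter(torch.zeros(shared_dim)) for name in robot_dims
        })

    def forward(self, robot: str, action: torch.Tensor) -> torch.Tensor:
        return self.projectors[robot](action) + self.state_bias[robot]

# Usage: project actions from two different embodiments into the shared space.
proj = EmbodimentProjector({"dvrk": 14, "franka": 7})
shared = proj("dvrk", torch.randn(1, 14))  # -> shape (1, 64)
```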
The GR00T-H prototype demonstrated complete end-to-end suturing on the SutureBot benchmark, highlighting its long-horizon dexterity.

GR00T-H performs end-to-end suturing.
3. Cosmos-H-Surgical-Simulator
Cosmos-H-Surgical-Simulator is an action-conditioned World Foundation Model (WFM) for surgical robotics. Traditional simulators fail to capture real-world complexities such as soft tissue, reflections, blood, and smoke.
Main features
- Overcoming the sim-to-real gap: Fine-tuned from NVIDIA Cosmos Predict 2.5 2B to generate physically plausible surgical video directly from kinematic actions.
- Improved efficiency: 600 policy rollouts took only 40 minutes in simulation, compared to 2 days with the physical benchtop approach (see the rollout sketch after this list).
- WFM as a physics simulator: Implicitly learns tissue deformation and tool interactions from data.
- Synthetic data generation: Produces realistic synthetic video-action pairs to augment underrepresented datasets.
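The efficiency gain is easiest to see as a rollout loop in which the world model stands in for the physical benchtop: the policy proposes actions, and the WFM predicts the next frames conditioned on those actions. A minimal sketch with stand-in interfaces (the callables and their signatures are assumptions, not the Cosmos API):

```python
import numpy as np

def rollout(policy, world_model, first_frame: np.ndarray, task: str,
            horizon: int = 200) -> list[np.ndarray]:
    """Evaluate a policy entirely inside an action-conditioned world model.

    `policy(frame, task) -> action` and
    `world_model(frame, action) -> next_frame` are stand-in interfaces;
    the released models' real signatures may differ.
    """
    frames, frame = [first_frame], first_frame
    for _ in range(horizon):
        action = policy(frame, task)        # e.g. relative EEF deltas
        frame = world_model(frame, action)  # WFM predicts the next frame
        frames.append(frame)
    return frames

# Because each rollout is pure inference, many evaluations can run in
# parallel on GPUs: per the post, 600 rollouts in ~40 minutes versus
# ~2 days on a physical benchtop.
```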

Fine-tuning details
The model was fine-tuned on the Open-H-Embodiment dataset (9 robot embodiments, 32 datasets) using 64 A100 GPUs for approximately 10,000 GPU hours. It uses a unified 44-dimensional action space, sketched below.
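One common way to realize a unified fixed-width action space across embodiments with different degrees of freedom is to place each robot’s action vector into designated slots of the shared vector, zero-pad the rest, and carry a validity mask. A sketch under that assumption (the slot assignments are invented for illustration; the actual 44-dimensional layout is not specified in the post):

```python
import numpy as np

UNIFIED_DIM = 44

# Hypothetical slot assignments: which unified indices each embodiment fills.
SLOTS = {
    "dvrk":   np.arange(0, 14),   # two 7-DoF arms
    "franka": np.arange(14, 21),  # single 7-DoF arm
    "kuka":   np.arange(21, 28),
}

def to_unified(robot: str, action: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Embed a robot-specific action into the 44-dim unified space.

    Returns (unified_action, mask); masked-out dims carry no signal.
    """
    unified = np.zeros(UNIFIED_DIM, dtype=np.float32)
    mask = np.zeros(UNIFIED_DIM, dtype=bool)
    idx = SLOTS[robot]
    unified[idx] = action
    mask[idx] = True
    return unified, mask

# Usage: a 7-DoF Franka action occupies only its assigned slots.
u, m = to_unified("franka", np.random.randn(7).astype(np.float32))
```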
4. What’s Next: Toward Reasoning in Surgical Robotics
The goal of version 2 of the Open-H-Embodiment effort is to move beyond perceptual control to reasoning autonomy (surgical robotics’ ChatGPT moment), where systems can explain, plan, and adapt throughout long surgeries. This requires extending Open-H-Embodiment to reasoning-ready data with annotated task traces that capture intentions, outcomes, and failure modes. This effort depends on community participation, so please join us: visit the Open-H GitHub repository and help shape the future of medical robotics.
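The post does not define what these annotated task traces will look like; purely as a hedged illustration, a trace might pair each subtask with its intent, outcome, and any failure mode:

```python
from dataclasses import dataclass

@dataclass
class TaskTrace:
    """Hypothetical annotation record for one surgical subtask.

    Field names are illustrative only, not a proposed Open-H schema.
    """
    subtask: str                      # e.g. "drive needle through tissue"
    intent: str                       # why the surgeon/policy took this step
    outcome: str                      # "success", "partial", or "failure"
    failure_mode: str | None = None   # e.g. "needle slip", if it failed
    start_s: float = 0.0              # timestamps into the episode video
    end_s: float = 0.0
```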
5. Start today
To start working with the Open-H-Embodiment datasets and models, access the following resources:

