The First Healthcare Robotics Dataset and Foundational Physical AI Models for Healthcare Robotics

A large-scale community collaboration has launched Open-H-Embodiment, the first open dataset for healthcare robotics, alongside two new foundational Physical AI models designed to advance surgical and ultrasound robotics.

Introducing Open-H-Embodiment: The first healthcare robotics open dataset, built by a community collaboration

Authors: Nigel Nelson, Lukas Zbinden, Mostafa Toloui, Sean Huver

Healthcare AI has mainly been perception-based, focusing on models that interpret signals and classify or segment pathology and anatomy. However, healthcare involves "doing," and the static, perception-only datasets of the past, which lack embodiment, contact dynamics, and closed-loop control, are insufficient for it. The field needs standardized robot bodies, synchronized vision–force–kinematics data, sim-to-real pairing, and cross-embodiment benchmarks to build the foundation for Physical AI.

Open-H-Embodiment is a community‑driven dataset initiative building the open, shared foundation needed to train and evaluate AI autonomy and world foundation models for surgical robotics and ultrasound. Started by a steering committee including Prof. Axel Krieger (Johns Hopkins), Prof. Nassir Navab (Technical University of Munich), and Dr. Mahdi Azizian (NVIDIA), the effort now spans 35 organizations.

Participants from around the world came together to build the first large-scale dataset for advancing physical AI in healthcare robotics.

Balgrist, CMR Surgical, The Chinese University of Hong Kong, Great Bay University, Hong Kong Baptist University, Hamlyn, ImFusion, Johns Hopkins University, Leeds University, Mohamed bin Zayed University of Artificial Intelligence, Moon Surgical, NVIDIA, Northwell Health, Obuda University, The Hong Kong Polytechnic University, Qilu Hospital of Shandong University, Rob Surgical, Sanoscience, Surgical Data Science Collective, Semaphor Surgical, Stanford, Dresden University of Technology, Technical University of Munich, Tuodao, Turin, University of British Columbia, UC Berkeley, UC San Diego, University of Illinois Chicago, University of Tennessee, University of Texas, Vanderbilt, and Virtual Incision.

The dataset comprises 778 hours of CC-BY-4.0 healthcare robotics training data, mostly surgical robotics but also ultrasound and colonoscopy autonomy data.
It spans simulation, benchtop exercises (e.g., suturing), and real clinical procedures.
It covers commercial robots (CMR Surgical, Rob Surgical, Tuodao) and research robots (dVRK, Franka, Kuka).
It is released alongside two new, permissively licensed open-source models post-trained on this data.

The first model is GR00T-H, a derivative of the Isaac GR00T N series of Vision-Language-Action (VLA) models. Trained on roughly 600 hours of Open-H-Embodiment data, GR00T-H is the first policy model for surgical robotics tasks.

Building on NVIDIA’s open-source ecosystem, Isaac GR00T-H leverages Cosmos Reason 2 2B as its Vision-Language Model (VLM) backbone.

Surgical robotics requires high precision, but specialized hardware (like cable-driven systems) makes imitation learning (IL) difficult. To handle this, GR00T-H uses four key design choices:

Unique Embodiment Projectors: A unique, learnable MLP maps each robot's specific kinematics to a shared, normalized action space.
State Dropout (100%): Proprioceptive input is dropped during inference to create a learned bias term for each system, yielding better real-world results.
Relative EEF Actions: Training uses a common relative End-Effector (EEF) action space to overcome kinematic inconsistencies.
Metadata in Task Prompts: Instrument names and control index mapping are injected directly into the VLM task prompt.
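
The four design choices above can be illustrated with a small, self-contained sketch. This is not the GR00T-H implementation: the class and function names, the layer sizes, the shared dimension of 8, and the prompt format are all hypothetical, and the weights are random rather than learned. It only shows the shape of each idea: one projector per robot, fully dropped state collapsing to a constant per-robot output, per-step EEF deltas, and metadata prepended to the task text.

```python
import random

SHARED_DIM = 8  # hypothetical size of the shared action space


class EmbodimentProjector:
    """Tiny two-layer MLP; one instance per robot (random, untrained weights)."""

    def __init__(self, in_dim, hidden_dim=16, out_dim=SHARED_DIM, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
                   for _ in range(in_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [[rng.gauss(0.0, 0.1) for _ in range(out_dim)]
                   for _ in range(hidden_dim)]
        self.b2 = [0.0] * out_dim

    @staticmethod
    def _affine(x, w, b):
        # Computes x @ w + b for plain Python lists.
        return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
                for j in range(len(b))]

    def __call__(self, x):
        hidden = [max(v, 0.0) for v in self._affine(x, self.w1, self.b1)]  # ReLU
        return self._affine(hidden, self.w2, self.b2)


def relative_eef_positions(positions):
    """Convert absolute EEF positions (T x 3) into per-step deltas ((T-1) x 3)."""
    return [[b - a for a, b in zip(prev, curr)]
            for prev, curr in zip(positions, positions[1:])]


def apply_state_dropout(state, drop_prob, rng):
    """Zero the whole proprioceptive vector with probability drop_prob."""
    if rng.random() < drop_prob:
        return [0.0] * len(state)
    return list(state)


def build_task_prompt(task, instruments):
    """Prepend instrument names and their control indices to the task text."""
    mapping = ", ".join(f"arm {i}: {name}" for i, name in enumerate(instruments))
    return f"[{mapping}] {task}"


# Hypothetical input sizes: each robot gets its own projector into the same space.
projectors = {
    "dvrk": EmbodimentProjector(in_dim=14, seed=1),
    "franka": EmbodimentProjector(in_dim=7, seed=2),
}
state = apply_state_dropout([1.0] * 14, drop_prob=1.0, rng=random.Random(0))
shared = projectors["dvrk"](state)  # length SHARED_DIM regardless of robot
deltas = relative_eef_positions([[0, 0, 0], [1, 0, 0], [1, 2, 0]])
prompt = build_task_prompt("suture the incision", ["needle driver", "forceps"])
```

With the state fully dropped, the projector's output no longer depends on the input and reduces to a constant set by its bias parameters, which is one way to read the "learned bias term for each system" described above.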

A prototype of GR00T-H has demonstrated the ability to execute a complete, end-to-end suture in the SutureBot benchmark, highlighting robust long-horizon dexterity.

The second model, Cosmos-H-Surgical-Simulator, is a World Foundation Model (WFM) for action-conditioned surgical robotics. Traditional simulators struggle with real-world complexities such as soft tissue, reflections, blood, and smoke.

Overcoming the Sim-to-Real Gap:

Fine-tuned from NVIDIA Cosmos Predict 2.5 2B, it generates physically plausible surgical video directly from kinematic actions.
Efficiency Gains: 600 rollouts took only 40 minutes in simulation, versus 2 days with real-world benchtop methods.
WFM as a Physics Simulator: Implicitly learns tissue deformation and tool interaction from data.
Synthetic Data Generation: Generates realistic synthetic video-action pairs to augment underrepresented datasets.
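
To put the efficiency figures above in perspective, the short calculation below works out the implied speedup and the average time per rollout. It assumes that "2 days" means 48 hours of elapsed time, which the text does not state explicitly; the function names are illustrative only.

```python
MINUTES_PER_DAY = 24 * 60


def speedup(real_minutes, sim_minutes):
    """How many times faster the simulated evaluation is."""
    return real_minutes / sim_minutes


def seconds_per_rollout(total_minutes, n_rollouts):
    """Average time spent on a single rollout, in seconds."""
    return total_minutes * 60 / n_rollouts


real_minutes = 2 * MINUTES_PER_DAY  # assumption: "2 days" = 48 h elapsed
sim_minutes = 40
n_rollouts = 600

print(speedup(real_minutes, sim_minutes))              # 72.0
print(seconds_per_rollout(sim_minutes, n_rollouts))    # 4.0
print(seconds_per_rollout(real_minutes, n_rollouts))   # 288.0
```

Under that assumption, the simulated evaluation is about 72 times faster: roughly 4 seconds per rollout instead of nearly 5 minutes.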

The model was fine-tuned on the Open-H-Embodiment dataset (9 robot embodiments, 32 datasets) using 64 A100 GPUs for approximately 10,000 GPU-hours. It uses a unified 44-dimensional action space.
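
The sketch below makes two of these numbers concrete. The wall-clock estimate assumes all 64 GPUs ran concurrently for the whole job. The padding helper is only one plausible way to fit robots with different action sizes into a single 44-dimensional vector; the source does not describe how the unified space is actually laid out, so `pad_action` and its mask are hypothetical.

```python
UNIFIED_DIM = 44  # dimension stated in the text


def wall_clock_hours(gpu_hours, n_gpus):
    """Elapsed hours if every GPU runs in parallel for the whole job."""
    return gpu_hours / n_gpus


def pad_action(action, dim=UNIFIED_DIM):
    """Zero-pad a robot-specific action to `dim` and return a validity mask.

    Hypothetical layout: real values first, padding last.
    """
    if len(action) > dim:
        raise ValueError(f"action has {len(action)} dims, more than {dim}")
    n_pad = dim - len(action)
    padded = list(action) + [0.0] * n_pad
    mask = [1] * len(action) + [0] * n_pad
    return padded, mask


hours = wall_clock_hours(10_000, 64)
print(hours, hours / 24)  # 156.25 hours, about 6.5 days

# A hypothetical 7-dimensional action padded into the unified space.
padded, mask = pad_action([0.1, -0.2, 0.0, 0.3, 0.0, 0.0, 1.0])
print(len(padded), sum(mask))  # 44 7
```

A mask of this kind would let a training loss ignore the padded dimensions, so that a robot with fewer degrees of freedom is not penalized for outputs it cannot execute.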

The goal for version 2 of the Open-H-Embodiment effort is to move beyond perceptual control to reasoning-capable autonomy, a "ChatGPT moment" for surgical robotics, in which systems can explain, plan, and adapt across long procedures. This requires extending Open-H-Embodiment into reasoning-ready data with annotated task traces capturing intents, outcomes, and failure modes. This effort needs community engagement, and we invite you to get involved.

Source: Hugging Face Blog