Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration

Stanford University

Human2Sim2Robot trains dexterous manipulation policies from one human RGB-D video.

We use the demonstrated object trajectory and pre-manipulation hand pose to guide RL in simulation, bridging the human-robot embodiment gap and achieving zero-shot sim-to-real transfer.

Abstract

Teaching robots dexterous manipulation skills often requires collecting hundreds of demonstrations using wearables or teleoperation, which is challenging to scale. Videos of human-object interactions are easier to collect and scale, but leveraging them directly for robot learning is difficult due to the lack of explicit action labels and morphological differences between robot and human hands.

We propose Human2Sim2Robot, a novel real-to-sim-to-real framework for training dexterous manipulation policies using only one RGB-D video of a human demonstrating a task. Our method uses reinforcement learning (RL) in simulation to cross the human-robot embodiment gap without relying on the wearables, teleoperation, or large-scale data collection typically required by imitation learning methods. From the demonstration, we extract two task-specific components: (1) the object pose trajectory, which defines an object-centric, embodiment-agnostic reward function, and (2) the pre-manipulation hand pose, which initializes and guides exploration during RL training. We find that these two components are highly effective for learning the desired task, eliminating the need for task-specific reward shaping and tuning. We demonstrate that Human2Sim2Robot significantly outperforms trajectory retargeting and one-shot imitation learning across a wide range of tasks, including grasping, non-prehensile manipulation, and extrinsic manipulation.

Method
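To make the two extracted components concrete, here is a minimal sketch of one plausible form of the object-centric, embodiment-agnostic tracking reward: it scores the simulated object's pose against the demonstrated object pose at the current timestep, so any embodiment that moves the object along the demonstrated trajectory is rewarded. The function name, weights, and error terms below are our own illustrative assumptions, not the released implementation.

import numpy as np

def object_tracking_reward(obj_pos, obj_quat, demo_traj, t,
                           pos_weight=1.0, rot_weight=0.5):
    """Reward the simulated object for tracking the demonstrated trajectory.

    demo_traj is a list of (position, quaternion) object poses extracted
    from the human RGB-D video. All weights and scales are assumptions.
    """
    target_pos, target_quat = demo_traj[min(t, len(demo_traj) - 1)]

    # Position error: Euclidean distance to the demonstrated position.
    pos_err = np.linalg.norm(obj_pos - target_pos)

    # Orientation error: rotation angle between current and target quaternions.
    dot = np.clip(abs(np.dot(obj_quat, target_quat)), 0.0, 1.0)
    rot_err = 2.0 * np.arccos(dot)

    # Bounded, dense shaping terms. No hand-specific quantities appear,
    # which is what makes the reward embodiment-agnostic.
    return pos_weight * np.exp(-5.0 * pos_err) + rot_weight * np.exp(-2.0 * rot_err)

Because the reward depends only on the object, the same signal applies whether the object is moved by the human hand in the video or the robot hand in simulation, which is the sense in which it crosses the embodiment gap.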

Real-World Robot Performance

All Human2Sim2Robot videos are played at 1x speed.

Grasping Tasks

  • Plate Lift Rack (success rate: 60%)

  • Pitcher Pour (success rate: 100%)

Non-Prehensile Tasks

  • Snackbox Push (success rate: 100%)

  • Snackbox Pivot (success rate: 100%)

  • Snackbox Push Pivot (success rate: 100%)

  • Plate Push (success rate: 100%)

Multi-Step Task

  • Plate Pivot Lift Rack (success rate: 86.6%)

Simulation Rollouts

Robustness & Failure Recovery

Out-of-Distribution Object Positions

Distractor Objects, Perturbations, and Background Changes

  • Background Changes

  • Lighting Changes

  • Distractors

  • Perturbations (Plate)

  • Human Interference

  • Perturbations (Snackbox)

  • Table Color / Friction

  • Obstructions / Friction

  • Obstructing Objects

Baselines

All baseline videos are played at 2x speed.

Replay

Object-Aware Replay

Behavior Cloning (Diffusion Policy)

Robustness Comparison

  • Baseline

  • Ours

Ablation Tests

Importance of Object Trajectory Tracking Rewards

  • Fixed Target

  • Interpolated Target

  • Downsampled Trajectory

  • Ours

Importance of Pre-Manipulation Pose Initialization

  • Default Initialization

  • Overhead Initialization

  • Pre-Manipulation Far

  • Ours
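The ablated component above can be illustrated with a short sketch of what resetting RL episodes to the retargeted pre-manipulation pose might look like. The environment API (set_joint_positions, randomize_object_pose) and the noise scale are hypothetical placeholders, not the paper's actual code.

import numpy as np

def reset_to_pre_manipulation(env, retargeted_qpos, noise_scale=0.01):
    """Reset an RL episode with the arm and hand at the pre-manipulation
    pose retargeted from the single human demonstration.

    env and its methods are hypothetical stand-ins for a simulator API.
    """
    # Small Gaussian noise on the initial joint positions keeps the policy
    # from overfitting to one exact starting configuration.
    qpos = retargeted_qpos + np.random.normal(0.0, noise_scale, size=retargeted_qpos.shape)
    env.set_joint_positions(qpos)
    # Randomizing the object's initial pose encourages the generalization
    # shown in the out-of-distribution experiments above.
    env.randomize_object_pose()
    return env.get_observation()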

Sufficiency of Single Pre-Manipulation Hand Pose

  • Hand-Trajectory Tracking Rewards

  • Residual Policy

  • Ours

Video

Acknowledgements

This work is supported by Stanford Human-Centered Artificial Intelligence (HAI), the National Science Foundation (NSF) under Grant Numbers 2153854 and 2342246, and the Natural Sciences and Engineering Research Council of Canada (NSERC) under Award Number 526541680.

BibTeX

@article{TODO,
  author  = {TODO},
  title   = {TODO},
  journal = {TODO},
  year    = {TODO},
}