EMMA: Scaling Mobile Manipulation via Egocentric Human Data

Georgia Institute of Technology

*,† equal contributions

Abstract

Scaling imitation learning for mobile manipulation is bottlenecked by expensive mobile robot teleoperation. We present Egocentric Mobile Manipulation (EMMA), an end-to-end framework that trains mobile manipulation policies from egocentric human data combined with static robot data, sidestepping mobile teleoperation entirely. To accomplish this, we co-train on human full-body motion data and static robot data. In our experiments across three real-world tasks, EMMA matches or exceeds the full-task success of baselines trained on teleoperated mobile robot data (Mobile ALOHA) while using significantly less data. We find that EMMA generalizes to new spatial configurations and scenes, and we observe positive performance scaling as we increase the hours of human data, opening new avenues for scalable robot learning in real-world environments.

Method

A unified co-training framework

EMMA: The model co-trains on human full-body motion data and static robot data, and outputs a unified policy that enables mobile manipulation deployment without mobile robot teleoperation.

    
Method figure
Left: Architecture of our joint human-robot policy learning framework. The model processes heterogeneous human and robot data through modality-specific stems and decodes actions through separate action heads. The navigation head is deployed on the robot during evaluation, demonstrating transfer without mobile robot supervision. Right: Our custom low-cost bimanual mobile manipulator.
    
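To make the stems-and-heads routing in the caption concrete, here is a minimal sketch of such a multi-head policy. All dimensions, layer shapes, and the linear stems/trunk/heads are illustrative assumptions, not the paper's actual architecture: the point is only that heterogeneous inputs share one latent space, while different heads receive supervision from different data sources.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    # Random weights standing in for learned parameters (illustrative only).
    return rng.standard_normal((in_dim, out_dim)) * 0.1

LATENT = 32  # hypothetical shared latent size

# Modality-specific "stems" map heterogeneous inputs to a shared latent.
stems = {
    "human": linear(48, LATENT),   # e.g. full-body pose features
    "robot": linear(14, LATENT),   # e.g. static-arm joint states
}

# Shared trunk, then separate action heads decoded from the same latent.
trunk = linear(LATENT, LATENT)
heads = {
    "manipulation": linear(LATENT, 14),  # arm joint targets
    "navigation":   linear(LATENT, 2),   # base (v, omega) commands
}

def forward(obs, modality, head):
    z = np.tanh(obs @ stems[modality])   # modality-specific stem
    z = np.tanh(z @ trunk)               # shared trunk
    return z @ heads[head]               # task-specific action head

# Human data can supervise the navigation head, robot data the manipulation
# head, yet both flow through the same trunk during co-training.
nav_action = forward(rng.standard_normal(48), "human", "navigation")
manip_action = forward(rng.standard_normal(14), "robot", "manipulation")
print(nav_action.shape, manip_action.shape)  # (2,) (14,)
```

Because both modalities pass through the shared trunk, gradients from abundant human data can shape the representation used by the robot-supervised heads.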

Retargeting

Retargeting figure
Given a 3D pose trajectory from a human demonstration, we compute a feasible 2D trajectory for a differential-drive robot. The resulting trajectory can be directly executed by the robot or used as input for policy learning.
    
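The retargeting step above can be sketched as follows. This is a minimal illustration, not the paper's retargeter: it drops the height coordinate, derives headings from consecutive waypoints, and integrates a unicycle model under assumed velocity limits (`v_max`, `w_max` are hypothetical values) so the output respects the differential-drive constraint of moving only along the current heading.

```python
import numpy as np

def retarget_to_diff_drive(poses_xyz, dt=0.1, v_max=0.5, w_max=1.0):
    """Project a 3D human trajectory onto a feasible (x, y, theta) path
    for a differential-drive base. Limits and timestep are illustrative."""
    xy = poses_xyz[:, :2]                       # drop the height axis
    deltas = np.diff(xy, axis=0)
    headings = np.arctan2(deltas[:, 1], deltas[:, 0])
    headings = np.concatenate([headings, headings[-1:]])

    traj = [np.array([xy[0, 0], xy[0, 1], headings[0]])]
    for i in range(1, len(xy)):
        x, y, th = traj[-1]
        # Desired forward/angular velocity to chase the next waypoint,
        # clipped to the base's (assumed) velocity limits.
        dx, dy = xy[i] - traj[-1][:2]
        v = np.clip(np.hypot(dx, dy) / dt, -v_max, v_max)
        dth = np.arctan2(np.sin(headings[i] - th), np.cos(headings[i] - th))
        w = np.clip(dth / dt, -w_max, w_max)
        # Unicycle integration: translation only along the current heading.
        traj.append(np.array([x + v * np.cos(th) * dt,
                              y + v * np.sin(th) * dt,
                              th + w * dt]))
    return np.stack(traj)

# A straight 1 m walk at fixed height: the base tracks it along x.
walk = np.stack([np.linspace(0, 1, 11), np.zeros(11), np.full(11, 1.6)], axis=1)
path = retarget_to_diff_drive(walk)
print(path.shape)  # (11, 3)
```

Note that clipping to `v_max` means the base may lag a fast-walking human; a real retargeter would also need to handle this time-scaling, which is omitted here.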

Results

(A) In-domain Performance

main results
EMMA (blue), trained without mobile teleoperation data, significantly outperforms Mobile ALOHA (orange) on Grocery Shopping and Handover Wine tasks (p < 0.05). Table Service variants show comparable performance. Error bars represent 95% confidence intervals from 50 trials.

(B) Scene Generalization

EMMA generalizes to new scenes by co-training on static robot data from the original scene and human full-body motion data from the new scene.
    

(C) Scaling

Scaling human full-body motion data vs. scaling mobile robot teleoperation data

main results

EMMA trained on 1 hour of human full-body motion data (blue) strongly outperforms Mobile ALOHA trained on 1 hour of robot data (orange) on the Handover Wine task. This shows that scaling mobile manipulation with human data outperforms scaling with mobile robot teleoperation data collected in an equivalent amount of time.

Conclusion

We presented EMMA, a novel framework that enables mobile manipulation without requiring expensive mobile teleoperation. We find that EMMA outperforms Mobile ALOHA-style teleoperation given equivalent data-collection time and demonstrates superior scaling properties. Further, we ablate our key design decisions and observe that both the motion retargeter and the phase-identification module are integral to downstream performance. Overall, we have demonstrated that mobile manipulation performance can be scaled with egocentric human data. We hope this result spurs new research and informs future researchers of potential opportunities and challenges.

BibTeX

@misc{zhu2025emmascalingmobilemanipulation,
  title={EMMA: Scaling Mobile Manipulation via Egocentric Human Data}, 
  author={Lawrence Y. Zhu and Pranav Kuppili and Ryan Punamiya and Patcharapong Aphiwetsa and Dhruv Patel and Simar Kareer and Sehoon Ha and Danfei Xu},
  year={2025},
  eprint={2509.04443},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2509.04443}, 
}