HUG: Human Universal Grasping

Human Universal Grasping

1 New York University    2 Tsinghua University    3 University of Michigan
‡ Equal advising.

Abstract

Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website.

HUG learns dexterous grasping from egocentric human grasp data and retargets predicted grasps to robot hands for zero-shot in-the-wild grasping

HUG learns dexterous grasping without any robot data. Trained solely on egocentric human grasp data, HUG generates diverse human grasps for real-world objects from a single RGB-D image captured by a stereo camera. These grasps can then be retargeted to robot hands for zero-shot, in-the-wild dexterous grasping.

HUG in-the-wild (autonomous, 4x). 30 objects with 10 uncut trials each, reaching 62.0% success in an unseen home with unseen objects. Success is counted as passing the object to the human.

Contributions

To our knowledge, HUG is the first grasping framework trained purely on human data and deployable across multiple robot embodiments. We open-source the following contributions:

1M-HUGs, 1M egocentric image-grasp pairs of natural human grasps across 6,707 recordings and 41 buildings, with MANO-fit hand poses and metric depth.

HUG, a point-conditioned flow-matching model that predicts MANO grasps from RGB-D and retargets to multiple embodiments without per-hand training.

HUG-Bench, 90 unseen objects with metric-scale 3D meshes for paired simulation and real-world evaluation.

The 1M-HUGs Dataset

Sample egocentric frames from 1M-HUGs with MANO hand meshes overlaid on grasped everyday objects

1M-HUGs dataset. Our training data comprises 1M egocentric frames of human grasps, spanning 6,707 object instances. Each entry provides synchronized RGB and grayscale views, metric depth, an object mask, and a MANO hand pose with wrist transformation in the camera frame.
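As a rough illustration of what one such entry might look like in code, the sketch below defines a record with the fields listed in the caption. The class name, field names, file-path storage, and the 4x4 homogeneous matrix for the wrist transformation are all assumptions made for illustration; the released 1M-HUGs format may differ. The only outside fact used is that the standard MANO model articulates the fingers with 15 joints of 3 axis-angle values each.

```python
from dataclasses import dataclass
from typing import List

# 15 finger joints x 3 axis-angle values in the standard MANO hand model.
MANO_POSE_DIM = 45


@dataclass
class GraspEntry:
    """Hypothetical layout of one dataset entry (not the official schema)."""
    rgb_path: str                    # RGB frame
    gray_path: str                   # synchronized grayscale frame
    depth_path: str                  # metric depth map
    mask_path: str                   # object mask
    mano_pose: List[float]           # MANO finger articulation (axis-angle)
    T_cam_wrist: List[List[float]]   # assumed 4x4 wrist pose in the camera frame

    def validate(self) -> None:
        """Check the shapes this sketch assumes; raises ValueError on mismatch."""
        if len(self.mano_pose) != MANO_POSE_DIM:
            raise ValueError(f"expected {MANO_POSE_DIM} MANO pose values")
        if len(self.T_cam_wrist) != 4 or any(len(r) != 4 for r in self.T_cam_wrist):
            raise ValueError("wrist transform must be 4x4")
        if list(self.T_cam_wrist[3]) != [0.0, 0.0, 0.0, 1.0]:
            raise ValueError("last row of a homogeneous transform must be [0, 0, 0, 1]")


def wrist_translation(entry: GraspEntry) -> List[float]:
    """Return the wrist position in the camera frame (last column, first 3 rows)."""
    return [row[3] for row in entry.T_cam_wrist[:3]]
```

Keeping the wrist pose as a single transform in the camera frame means the position and orientation can be read off directly, or composed with a camera-to-robot calibration when deploying on hardware.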

HUG Model Architecture

HUG pipeline: RGB-D plus a query point are fused by a transformer to condition a flow-matching model that outputs a MANO grasp, retargeted to robot hands

HUG architecture. Conditioned on an RGB-D image and a query point on the target object, HUG predicts MANO hand grasps via a flow-matching transformer over fused RGB and point cloud features. Predicted human grasps are then retargeted to robot hands.
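To make the sampling side of flow matching concrete, here is a minimal sketch of how a grasp vector could be generated by integrating a velocity field from noise (t = 0) to data (t = 1) with Euler steps. This is not the HUG implementation: the function names, the 51-dimensional grasp layout (3 translation + 3 axis-angle rotation + 45 MANO pose values), the step count, and the toy velocity field are assumptions. In HUG the velocity would come from the trained transformer conditioned on the fused RGB and point cloud features; the stand-in here ignores conditioning and uses the exact velocity field of a linear noise-to-data path whose data distribution is a single known grasp.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
VelocityField = Callable[[Sequence[float], float], Vector]

# Assumed layout: wrist translation (3) + axis-angle wrist rotation (3) + MANO pose (45).
GRASP_DIM = 3 + 3 + 45


def sample_grasp(velocity: VelocityField, dim: int = GRASP_DIM,
                 num_steps: int = 20, seed: int = 0) -> Vector:
    """Euler-integrate dx/dt = velocity(x, t) from Gaussian noise at t=0 to t=1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Sequence[float]) -> VelocityField:
    """Stand-in for a trained network.

    For the linear path x_t = (1 - t) * x0 + t * x1 with x1 fixed to `target`,
    the velocity at (x, t) is (target - x) / (1 - t).
    """
    def velocity(x: Sequence[float], t: float) -> Vector:
        return [(gi - xi) / (1.0 - t) for gi, xi in zip(target, x)]
    return velocity


if __name__ == "__main__":
    target = [0.01 * i for i in range(GRASP_DIM)]
    grasp = sample_grasp(make_toy_velocity(target))
    print(max(abs(g - t) for g, t in zip(grasp, target)))  # close to zero
```

With a learned velocity field, different noise seeds would generally land on different grasps, which is how a model of this kind can produce diverse outputs for the same observation; the toy field collapses every seed to the same target.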

HUG-Bench

HUG-Bench test split: 30 unseen objects across five geometry categories and three size bins, shown as real objects (top) and reconstructed simulation assets (bottom).
Real-to-sim grasping: HUG evaluated on HUG-Bench in simulation using real captured RGB-D inputs and a reconstructed point cloud.

HUG-Bench. 90 unseen everyday objects spanning five geometric categories and three size bins, each reconstructed into a metric-scale 3D mesh for paired evaluation in simulation and the real world. Left: the 30-object test split with real objects (top) and their simulation assets (bottom). Right: real-to-sim grasping, evaluating HUG on HUG-Bench in simulation using real captured inputs.

Real-World Failure Modes

Failure-mode breakdown: Sankey flow of 300 HUG-Bench test trials through pre-grasp, grasp, and lift stages for tabletop and in-the-wild settings

Failure modes. Grasp-outcome flow for the 300 HUG-Bench test trials in each real-world setting, tracing every attempt through the pre-grasp, grasp, and lift stages into success or a specific failure mode. The dominant cause of failures is the absence of motion planning in our minimal open-loop setup, where the hand often strikes the object or table as it closes. Because HUG predicts a static pose with no notion of contact force, some grasps also slip after closing; we expect force-aware grasping to reduce these post-grasp slips.
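A stage-wise breakdown like this can be tallied from per-trial logs with very little code. The sketch below is a hypothetical bookkeeping helper, not the authors' analysis script: the record format (None for success, otherwise the stage at which the trial failed) and the function names are assumptions, and the example trials are made-up toy values rather than results from the paper.

```python
from typing import Dict, Iterable, Optional

# Stage names taken from the caption; everything else here is illustrative.
STAGES = ("pre-grasp", "grasp", "lift")


def tally_outcomes(trials: Iterable[Optional[str]]) -> Dict[str, int]:
    """Count successes and failures per stage.

    Each trial is None on success, or the name of the stage where it failed.
    """
    counts = {"success": 0}
    counts.update({stage: 0 for stage in STAGES})
    for failed_stage in trials:
        if failed_stage is None:
            counts["success"] += 1
        elif failed_stage in STAGES:
            counts[failed_stage] += 1
        else:
            raise ValueError(f"unknown stage: {failed_stage!r}")
    return counts


def survivors_per_stage(counts: Dict[str, int]) -> Dict[str, int]:
    """Number of trials that passed each stage, i.e. the flow leaving that stage."""
    remaining = sum(counts.values())
    passed = {}
    for stage in STAGES:
        remaining -= counts[stage]
        passed[stage] = remaining
    return passed


if __name__ == "__main__":
    # Toy records for illustration only, NOT data from HUG-Bench.
    toy_trials = [None, "pre-grasp", None, "lift", "grasp", None]
    counts = tally_outcomes(toy_trials)
    print(counts)
    print(survivors_per_stage(counts))
```

The number of trials surviving the final stage equals the success count, which gives a quick consistency check on the logs.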

BibTeX

If you find our work useful, please consider citing our paper:

@article{wu2026hug,
  title={Human Universal Grasping},
  author={Kevin Yuanbo Wu and Tianxing Zhou and Isaac Tu and Billy Yan and Irmak Guzey and David Fouhey and Dandan Shan and Lerrel Pinto},
  journal={arXiv preprint arXiv:2606.17054},
  year={2026}
}