HUG: Human Universal Grasping

Wu, Kevin Yuanbo; Zhou, Tianxing; Tu, Isaac; Yan, Billy; Guzey, Irmak; Fouhey, David; Shan, Dandan; Pinto, Lerrel


Kevin Yuanbo Wu¹ Tianxing Zhou¹,² Isaac Tu¹ Billy Yan¹ Irmak Guzey¹
David Fouhey¹ Dandan Shan¹,³,‡ Lerrel Pinto¹,‡

¹New York University ²Tsinghua University ³University of Michigan

‡ Equal advising.

PDF | arXiv | Code | Weights | Data (coming soon) | Benchmark (coming soon)

Abstract

Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines Dex1B and CAP by 23 and 34 percentage points in success rate, respectively, on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website.

HUG learns dexterous grasping from egocentric human grasp data and retargets predicted grasps to robot hands for zero-shot in-the-wild grasping

HUG learns dexterous grasping without any robot data. Trained solely on egocentric human grasp data, HUG generates diverse human grasps for real-world objects from a single RGB-D image captured by a stereo camera. These grasps can then be retargeted to robot hands for zero-shot, in-the-wild dexterous grasping.

HUG in the wild (autonomous, 4x speed). 30 objects with 10 uncut trials each, reaching 62.0% success in an unseen home with unseen objects. A trial counts as a success when the robot passes the object to the human.

Contributions

To our knowledge, HUG is the first grasping framework trained purely on human data and deployable across multiple robot embodiments. We open-source the following contributions:

1 Dataset

1M-HUGs, 1M egocentric image-grasp pairs of natural human grasps across 6,707 recordings and 41 buildings, with MANO-fit hand poses and metric depth.

2 Method

HUG, a point-conditioned flow-matching model that predicts MANO grasps from RGB-D and retargets to multiple embodiments without per-hand training.

3 Benchmark

HUG-Bench, 90 unseen objects with metric-scale 3D meshes for paired simulation and real-world evaluation.
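
One way to picture the point conditioning named in the Method contribution: a pixel selected on the target object can be lifted to a metric 3D point using the depth map and pinhole camera intrinsics. The sketch below shows only that standard back-projection. The function name `backproject_pixel`, the intrinsics values, and the idea that the query is chosen as a pixel are assumptions for illustration, not details confirmed by this page.

```python
from typing import Tuple


def backproject_pixel(u: float, v: float, depth_m: float,
                      fx: float, fy: float,
                      cx: float, cy: float) -> Tuple[float, float, float]:
    """Lift pixel (u, v) with metric depth to a 3D camera-frame point (pinhole model)."""
    if depth_m <= 0.0:
        raise ValueError("depth must be positive")
    x = (u - cx) * depth_m / fx  # horizontal offset from the principal point, in meters
    y = (v - cy) * depth_m / fy  # vertical offset from the principal point, in meters
    return (x, y, depth_m)


# Hypothetical intrinsics for a 640x480 image; real values come from camera calibration.
query_point = backproject_pixel(u=400.0, v=240.0, depth_m=0.5,
                                fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

Because the depth is metric, the resulting point is in meters in the camera frame, the same frame in which the dataset expresses wrist transformations.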

The 1M-HUGs Dataset


Sample egocentric frames from 1M-HUGs with MANO hand meshes overlaid on grasped everyday objects

1M-HUGs dataset. Our training data comprises 1M egocentric frames of human grasps, spanning 6,707 object instances. Each entry provides synchronized RGB and grayscale views, metric depth, an object mask, and a MANO hand pose with wrist transformation in the camera frame.
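
As a rough illustration of the grasp label described above, the sketch below models a MANO hand pose plus a wrist transformation in the camera frame. The data has not been released yet, so the class name `GraspLabel`, every field name, and the 45-value axis-angle layout (the standard MANO articulation of 15 finger joints with 3 values each) are assumptions rather than the actual file format.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]


@dataclass
class GraspLabel:
    """Hypothetical per-frame grasp label; names and sizes are assumptions."""
    wrist_rotation: Matrix          # 3x3 rotation, wrist frame -> camera frame
    wrist_translation: List[float]  # 3 values in meters, camera frame
    hand_pose: List[float]          # 45 values: 15 finger joints x 3 axis-angle

    def __post_init__(self) -> None:
        if len(self.wrist_rotation) != 3 or any(len(r) != 3 for r in self.wrist_rotation):
            raise ValueError("wrist_rotation must be 3x3")
        if len(self.wrist_translation) != 3:
            raise ValueError("wrist_translation must have 3 values")
        if len(self.hand_pose) != 45:
            raise ValueError("hand_pose must have 45 values")

    def wrist_matrix(self) -> Matrix:
        """Return the 4x4 homogeneous wrist pose in the camera frame."""
        rows = [list(r) + [t] for r, t in zip(self.wrist_rotation, self.wrist_translation)]
        rows.append([0.0, 0.0, 0.0, 1.0])
        return rows

    def to_camera(self, point_in_wrist: List[float]) -> List[float]:
        """Map a 3D point from the wrist frame into the camera frame (R p + t)."""
        return [sum(r[j] * point_in_wrist[j] for j in range(3)) + t
                for r, t in zip(self.wrist_rotation, self.wrist_translation)]


# Toy label: wrist rotated 90 degrees about the camera z-axis, 0.5 m in front of the camera.
label = GraspLabel(
    wrist_rotation=[[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    wrist_translation=[0.1, 0.0, 0.5],
    hand_pose=[0.0] * 45,
)
fingertip_in_camera = label.to_camera([0.02, 0.0, 0.0])
```

Keeping the wrist pose separate from the finger articulation mirrors the grasp parameterization in the abstract (wrist translation, wrist rotation, and MANO hand pose).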

HUG Model Architecture


HUG pipeline: RGB-D plus a query point are fused by a transformer to condition a flow-matching model that outputs a MANO grasp, retargeted to robot hands

HUG architecture. Conditioned on an RGB-D image and a query point on the target object, HUG predicts MANO hand grasps via a flow-matching transformer over fused RGB and point cloud features. Predicted human grasps are then retargeted to robot hands.
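
To make the flow-matching step concrete, the sketch below shows how a sample is typically drawn from such a model: start from Gaussian noise and integrate a velocity field from t = 0 to t = 1 with Euler steps. This is a generic illustration, not the authors' network. The names `euler_sample` and `make_toy_velocity` are hypothetical, and the toy velocity field (which pulls every sample to one fixed target along a straight path) stands in for the learned transformer, which would also take the fused RGB and point cloud features as conditioning.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 dim: int, steps: int = 20, seed: int = 0) -> Vector:
    """Draw one sample by integrating dx/dt = velocity(x, t) from t=0 to t=1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # always below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Toy stand-in for a learned field: straight-line flow toward `target`."""
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Toy 6-D "grasp" vector (for example 3 translation + 3 rotation values); sizes are illustrative.
target = [0.1, 0.0, 0.5, 0.0, 1.57, 0.0]
sample = euler_sample(make_toy_velocity(target), dim=len(target), steps=20, seed=1)
```

With a learned, image-conditioned velocity field, different noise seeds would land on different plausible grasps, which is how a model of this kind can produce diverse grasps for a single object. The toy field here always converges to the same target.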

HUG-Bench



Size	Cylindrical	Spheroidal	Prismatic	Appendaged	Amorphous
Small	Glue Stick, Pepper Shaker	Strawberry, Hacky Sack	Eraser, Match Box	Nail Clipper, Lock	Rubber Duck, Tape Measure
Medium	Umbrella, Bowl	Pear, Softball	Card Deck, Sponge	Dustpan, Handbell	Tape Dispenser, Grapes
Large	Spray Bottle, Wine Bottle	Pineapple, Football	Wipe Dispenser, Storage Bin	Saucepan, Picnic Basket	Headphones, Easel

HUG-Bench test split: 30 unseen objects across five geometry categories and three size bins, shown as real objects (top) and reconstructed simulation assets (bottom)

Real-to-sim grasping: HUG evaluated on HUG-Bench in simulation using real captured RGB-D inputs and a reconstructed point cloud

HUG-Bench. 90 unseen everyday objects spanning five geometric categories and three size bins, each reconstructed into a metric-scale 3D mesh for paired evaluation in simulation and the real world. Left: the 30-object test split with real objects (top) and their simulation assets (bottom). Right: real-to-sim grasping, evaluating HUG on HUG-Bench in simulation using real captured inputs.
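
The structure of the test split can be checked with a small lookup table: five geometric categories, three size bins, and two objects per cell give the 30 test objects. The sketch below encodes the object names listed on this page; the dictionary layout, the helper names, and the two-objects-per-cell reading of the table are assumptions, not an official benchmark file format.

```python
from typing import Dict, List

# size bin -> geometric category -> object names (names taken from the test-split table)
TEST_SPLIT: Dict[str, Dict[str, List[str]]] = {
    "Small": {
        "Cylindrical": ["Glue Stick", "Pepper Shaker"],
        "Spheroidal": ["Strawberry", "Hacky Sack"],
        "Prismatic": ["Eraser", "Match Box"],
        "Appendaged": ["Nail Clipper", "Lock"],
        "Amorphous": ["Rubber Duck", "Tape Measure"],
    },
    "Medium": {
        "Cylindrical": ["Umbrella", "Bowl"],
        "Spheroidal": ["Pear", "Softball"],
        "Prismatic": ["Card Deck", "Sponge"],
        "Appendaged": ["Dustpan", "Handbell"],
        "Amorphous": ["Tape Dispenser", "Grapes"],
    },
    "Large": {
        "Cylindrical": ["Spray Bottle", "Wine Bottle"],
        "Spheroidal": ["Pineapple", "Football"],
        "Prismatic": ["Wipe Dispenser", "Storage Bin"],
        "Appendaged": ["Saucepan", "Picnic Basket"],
        "Amorphous": ["Headphones", "Easel"],
    },
}


def all_objects(split: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Flatten the size/category grid into one list of object names."""
    return [name for cells in split.values() for names in cells.values() for name in names]


def objects_in_category(split: Dict[str, Dict[str, List[str]]], category: str) -> List[str]:
    """Collect the objects of one geometric category across all size bins."""
    return [name for cells in split.values() for name in cells[category]]


test_objects = all_objects(TEST_SPLIT)
amorphous = objects_in_category(TEST_SPLIT, "Amorphous")
```

Grouping by category or size in this way makes it straightforward to report per-bin success rates alongside the overall number.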

Baseline Comparisons

Methods are compared side by side on each HUG-Bench test object. Each clip plays 10 rollouts consecutively (autonomous, 4x speed). Success is counted as lifting the object off the table. Overall success rate: HUG 66.7% · Dex1B 43.7% · CAP 32.7%


Note on Dex1B. Dex1B runs open-loop and can plan trajectories that collide with the table. Severe table collisions are aborted to protect the hardware and counted as failures, with the colliding trajectory shown in the right panel. Collisions that eject the hand from the safety mount are also counted as failures.
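
The overall success rates above can be turned into margins and approximate success counts with a few lines of arithmetic. The sketch below assumes every method was run for 30 test objects with 10 rollouts each (300 trials). That trial count is inferred from the clip description, so the per-method success counts are an estimate rather than reported figures.

```python
from typing import Dict

SUCCESS_RATE_PCT: Dict[str, float] = {"HUG": 66.7, "Dex1B": 43.7, "CAP": 32.7}
N_TRIALS = 30 * 10  # assumption: 30 test objects x 10 rollouts for every method


def margins_over_baselines(rates: Dict[str, float], ours: str = "HUG") -> Dict[str, float]:
    """Absolute gap in percentage points between `ours` and each other method."""
    return {name: round(rates[ours] - r, 1) for name, r in rates.items() if name != ours}


def implied_successes(rate_pct: float, n_trials: int) -> int:
    """Nearest whole number of successful trials consistent with a reported rate."""
    return round(rate_pct * n_trials / 100)


margins = margins_over_baselines(SUCCESS_RATE_PCT)
# margins == {"Dex1B": 23.0, "CAP": 34.0}

counts = {name: implied_successes(r, N_TRIALS) for name, r in SUCCESS_RATE_PCT.items()}
# counts == {"HUG": 200, "Dex1B": 131, "CAP": 98}
```

Under the 300-trial assumption, each implied count rounds back to the reported rate (for example 200/300 = 66.7%), and the margins match the +23% and +34% quoted in the abstract when read as percentage points.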

Real-World Failure Modes

Failure-mode breakdown: Sankey flow of 300 HUG-Bench test trials through pre-grasp, grasp, and lift stages for tabletop and in-the-wild settings

Failure modes. Grasp-outcome flow for the 300 HUG-Bench test trials in each real-world setting, tracing every attempt through the pre-grasp, grasp, and lift stages into success or a specific failure mode. The dominant cause of failure is the absence of motion planning in our minimal open-loop setup, where the hand often strikes the object or table as it closes. We also expect force-aware grasping to reduce post-grasp slips, since HUG currently predicts a static pose with no notion of contact force.

BibTeX

If you find our work useful, please consider citing our paper:

@article{wu2026hug,
  title={Human Universal Grasping},
  author={Kevin Yuanbo Wu and Tianxing Zhou and Isaac Tu and Billy Yan and Irmak Guzey and David Fouhey and Dandan Shan and Lerrel Pinto},
  journal={arXiv preprint arXiv:2606.17054},
  year={2026}
}