https://github.com/lifuguan/IGGT_official

arxiv:2510.22706

IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction

Published on Oct 26, 2025 · Submitted by Fangfu Liu on Oct 28, 2025
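Authors: Hao Li, Zhengyu Zou, Fangfu Liu, Xuanyang Zhang, Fangzhou Hong, Yukang Cao, Yushi Lan, Manyuan Zhang, Gang Yu, Dingwen Zhang, Ziwei Liu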

Abstract

AI-generated summary: Instance-Grounded Geometry Transformer (IGGT) unifies 3D reconstruction and instance-level understanding in a single transformer trained with 3D-Consistent Contrastive Learning, supported by a new dataset, InsScene-15K.

Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. However, most prior approaches prioritize training large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation, overlooking the crucial interplay between these two fundamental aspects of 3D-scene analysis, thereby limiting generalization and leading to poor performance in downstream 3D understanding tasks. Recent attempts have mitigated this issue by simply aligning 3D models with specific language models, thus restricting perception to the aligned model's capacity and limiting adaptability to downstream tasks. In this paper, we propose Instance-Grounded Geometry Transformer (IGGT), an end-to-end large unified transformer to unify the knowledge for both spatial reconstruction and instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation with geometric structures and instance-grounded clustering through only 2D visual inputs. This representation supports consistent lifting of 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset with high-quality RGB images, poses, depth maps, and 3D-consistent instance-level mask annotations with a novel data curation pipeline.
