Paper page - Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
Project page: https://aurora-perception.github.io. Read our paper here: https://arxiv.org/abs/2412.03548. The code release is coming soon. Let's push multimodal research forward together! 🚀
This is an automated message from the Librarian Bot (https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Learning to Ground VLMs without Forgetting (2024): https://huggingface.co/papers/2410.10491
* Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination (2024): https://huggingface.co/papers/2411.12591
* Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models (2024): https://huggingface.co/papers/2411.18142
* VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use (2024): https://huggingface.co/papers/2410.16400
* SJTU: Spatial judgments in multimodal models towards unified segmentation through coordinate detection (2024): https://huggingface.co/papers/2412.02565
* FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity (2024): https://huggingface.co/papers/2411.15411
* HyperSeg: Towards Universal Visual Segmentation with Large Language Model (2024): https://huggingface.co/papers/2411.17606

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
\n","updatedAt":"2024-12-12T01:35:47.167Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.720015287399292},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2412.03548","authors":[{"_id":"675880b250e4b1a8f1f54e78","user":{"_id":"636d316effbe479c979692e8","avatarUrl":"/avatars/f1004a8d788c3ee94f2cfd1abf314485.svg","isPro":false,"fullname":"Mahtab Bigverdi","user":"Mahtab","type":"user"},"name":"Mahtab Bigverdi","status":"admin_assigned","statusLastChangedAt":"2024-12-11T13:37:32.728Z","hidden":false},{"_id":"675880b250e4b1a8f1f54e79","name":"Zelun Luo","hidden":false},{"_id":"675880b250e4b1a8f1f54e7a","user":{"_id":"6408e8663461c51cf73480e4","avatarUrl":"/avatars/b0fecc524281931beee0d899bab011fb.svg","isPro":false,"fullname":"Cheng-Yu Hsieh","user":"cydhsieh01","type":"user"},"name":"Cheng-Yu Hsieh","status":"admin_assigned","statusLastChangedAt":"2024-12-11T13:38:32.254Z","hidden":false},{"_id":"675880b250e4b1a8f1f54e7b","user":{"_id":"65f7ee498fb2b15357cd6286","avatarUrl":"/avatars/a060fdd27b6a23346a8772d49df39018.svg","isPro":false,"fullname":"shen","user":"Ethanshen","type":"user"},"name":"Ethan Shen","status":"admin_assigned","statusLastChangedAt":"2024-12-11T13:38:09.784Z","hidden":false},{"_id":"675880b250e4b1a8f1f54e7c","user":{"_id":"65e2be1e630e2db23829ee8d","avatarUrl":"/avatars/294f9ba909037f03669dc0bb80cabfe3.svg","isPro":false,"fullname":"Dongping Chen","user":"fjchendp","type":"user"},"name":"Dongping Chen","status":"admin_assigned","statusLastChangedAt":"2024-12-11T13:37:53.324Z","hidden":false},{"_id":"675880b250e4b1a8f1f54e7d","name":"Linda G. Shapiro","hidden":false},{"_id":"675880b250e4b1a8f1f54e7e","user":{"_id":"66429868ab89e3a3a85668b0","avatarUrl":"/avatars/170e0daa454838deee2bf946f7118651.svg","isPro":false,"fullname":"Ranjay Krishna","user":"ranjaykrishna","type":"user"},"name":"Ranjay Krishna","status":"admin_assigned","statusLastChangedAt":"2024-12-11T13:37:18.312Z","hidden":false}],"publishedAt":"2024-12-04T18:45:35.000Z","submittedOnDailyAt":"2024-12-11T02:36:59.931Z","title":"Perception Tokens Enhance Visual Reasoning in Multimodal Language Models","submittedOnDailyBy":{"_id":"643be8879f5d314db2d9ed23","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/643be8879f5d314db2d9ed23/VrW2UtJ7ppOnGIYjTWd7b.png","isPro":false,"fullname":"Chen Dongping","user":"shuaishuaicdp","type":"user"},"summary":"Multimodal language models (MLMs) still face challenges in fundamental visual\nperception tasks where specialized models excel. Tasks requiring reasoning\nabout 3D structures benefit from depth estimation, and reasoning about 2D\nobject instances benefits from object detection. Yet, MLMs can not produce\nintermediate depth or boxes to reason over. Finetuning MLMs on relevant data\ndoesn't generalize well and outsourcing computation to specialized vision tools\nis too compute-intensive and memory-inefficient. 
To address this, we introduce\nPerception Tokens, intrinsic image representations designed to assist reasoning\ntasks where language is insufficient. Perception tokens act as auxiliary\nreasoning tokens, akin to chain-of-thought prompts in language models. For\nexample, in a depth-related task, an MLM augmented with perception tokens can\nreason by generating a depth map as tokens, enabling it to solve the problem\neffectively. We propose AURORA, a training method that augments MLMs with\nperception tokens for improved reasoning over visual inputs. AURORA leverages a\nVQVAE to transform intermediate image representations, such as depth maps into\na tokenized format and bounding box tokens, which is then used in a multi-task\ntraining framework. AURORA achieves notable improvements across counting\nbenchmarks: +10.8% on BLINK, +11.3% on CVBench, and +8.3% on SEED-Bench,\noutperforming finetuning approaches in generalization across datasets. It also\nimproves on relative depth: over +6% on BLINK. With perception tokens, AURORA\nexpands the scope of MLMs beyond language-based reasoning, paving the way for\nmore effective visual reasoning capabilities.","upvotes":17,"discussionId":"675880b450e4b1a8f1f54eca","ai_summary":"AURORA, a method that incorporates Perception Tokens into MLMs, improves visual reasoning by transforming intermediate image representations into tokenized formats, enhancing performance on tasks like counting and depth estimation.","ai_keywords":["multimodal language models","MLMs","depth estimation","object detection","perception tokens","chain-of-thought prompts","VQVAE","tokenized format","bounding box tokens","multi-task training","BLINK","CVBench","SEED-Bench","relative depth"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"643be8879f5d314db2d9ed23","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/643be8879f5d314db2d9ed23/VrW2UtJ7ppOnGIYjTWd7b.png","isPro":false,"fullname":"Chen Dongping","user":"shuaishuaicdp","type":"user"},{"_id":"663ba53b96c98cfb8ded1c4f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/6-ipDyZbMYj_lfVurLjvh.png","isPro":true,"fullname":"Shu Pu","user":"pudashi","type":"user"},{"_id":"639d94ab7145123e0d44e48a","avatarUrl":"/avatars/5bb6a65b306d1383c4a8bcd9334b470a.svg","isPro":false,"fullname":"Yue Huang","user":"HowieHwong","type":"user"},{"_id":"65c867cf1b1a5743b3d2fc58","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65c867cf1b1a5743b3d2fc58/1c8vQtwY7ZNl4e4U4pAix.jpeg","isPro":false,"fullname":"Yaochen Wang","user":"MisakiWang","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"6371ad82f0fe906bdc5b15f6","avatarUrl":"/avatars/ddc61e1edae5bd6b19530e1bc5e15d53.svg","isPro":false,"fullname":"Dotanoob7","user":"Dotanoob","type":"user"},{"_id":"66ee4ec36babd2a70556b8e4","avatarUrl":"/avatars/d7b4c3ce1367e5b4ff8eab5647abbe0b.svg","isPro":false,"fullname":"YanruWu","user":"YanruWu","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"65a13cb1c5770b27aef2a2bc","avatarUrl":"/avatars/88ec5b988f10ad9fd4d469ae2fa34680.svg","isPro":false,"fullname":"Chujie 
Gao","user":"Flossie","type":"user"},{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"646325085897b675c65aea0f","avatarUrl":"/avatars/28ce7388f9318b49bdd0a5594c0f6732.svg","isPro":false,"fullname":"Baichuan Zhou","user":"bczhou","type":"user"},{"_id":"668cd4bbe990292e5f6974d3","avatarUrl":"/avatars/d1747b2372e94500ecb5fb56809b482d.svg","isPro":false,"fullname":"Jinyeong Kim","user":"rubatoyeong","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary
AURORA, a method that incorporates Perception Tokens into MLMs, improves visual reasoning by transforming intermediate image representations into tokenized formats, enhancing performance on tasks like counting and depth estimation.
Multimodal language models (MLMs) still face challenges in fundamental visual perception tasks where specialized models excel. Tasks requiring reasoning about 3D structures benefit from depth estimation, and reasoning about 2D object instances benefits from object detection. Yet MLMs cannot produce intermediate depth maps or bounding boxes to reason over. Finetuning MLMs on relevant data does not generalize well, and outsourcing computation to specialized vision tools is too compute-intensive and memory-inefficient. To address this, we introduce Perception Tokens, intrinsic image representations designed to assist reasoning tasks where language is insufficient. Perception tokens act as auxiliary reasoning tokens, akin to chain-of-thought prompts in language models. For example, in a depth-related task, an MLM augmented with perception tokens can reason by generating a depth map as tokens, enabling it to solve the problem effectively. We propose AURORA, a training method that augments MLMs with perception tokens for improved reasoning over visual inputs. AURORA leverages a VQVAE to transform intermediate image representations, such as depth maps, into a tokenized format, along with bounding box tokens, which are then used in a multi-task training framework. AURORA achieves notable improvements across counting benchmarks: +10.8% on BLINK, +11.3% on CVBench, and +8.3% on SEED-Bench, outperforming finetuning approaches in generalization across datasets. It also improves relative depth by over +6% on BLINK. With perception tokens, AURORA expands the scope of MLMs beyond language-based reasoning, paving the way for more effective visual reasoning capabilities.
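To make the idea of perception tokens more concrete, here is a minimal, hypothetical Python sketch (not the authors' released code, which is still forthcoming): a toy VQ-VAE-style codebook quantizes a depth map into discrete indices, which are then rendered as special tokens and spliced into a chain-of-thought-style prompt. The encoder, codebook size, grid size, and the `<DEPTH_i>` / `<DEPTH_START>` / `<DEPTH_END>` token names are all illustrative assumptions, not AURORA's actual implementation.

```python
import torch

# Hypothetical illustration of perception tokens: a VQ-VAE-style codebook
# turns a depth map into discrete indices, which an MLM could emit as
# intermediate reasoning tokens. All sizes and names are assumptions.

CODEBOOK_SIZE = 256   # assumed number of VQ-VAE codebook entries
LATENT_DIM = 8        # assumed latent channel dimension
# Toy, untrained encoder + codebook standing in for a trained VQ-VAE.
encoder = torch.nn.Conv2d(1, LATENT_DIM, kernel_size=8, stride=8)
codebook = torch.nn.Embedding(CODEBOOK_SIZE, LATENT_DIM)

def depth_to_perception_tokens(depth_map: torch.Tensor) -> list[str]:
    """Quantize a (1, H, W) depth map into discrete perception-token strings."""
    z = encoder(depth_map.unsqueeze(0))                  # (1, D, h, w)
    z = z.permute(0, 2, 3, 1).reshape(-1, LATENT_DIM)    # (h*w, D)
    dists = torch.cdist(z, codebook.weight)              # nearest-code assignment
    indices = dists.argmin(dim=-1)                       # (h*w,)
    return [f"<DEPTH_{i.item()}>" for i in indices]

# Sketch of splicing the tokens into a chain-of-thought-style prompt.
depth_map = torch.rand(1, 128, 128)                      # placeholder depth estimate
tokens = depth_to_perception_tokens(depth_map)
prompt = (
    "Question: Which object is closer to the camera?\n"
    "Reasoning: <DEPTH_START> " + " ".join(tokens) + " <DEPTH_END>\n"
    "Answer:"
)
print(prompt[:200])
```

In AURORA itself, the perception tokens are learned within a multi-task training framework rather than bolted on at inference time, so this sketch only illustrates the data flow (image representation, to VQ-VAE indices, to a token sequence), not the training recipe.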