Visual Question Decomposition on Multimodal Large Language Models
Authors: Haowei Zhang, Jianzhe Liu, Zhen Han, Shuo Chen, Bailan He, Volker Tresp, Zhiqiang Xu, Jindong Gu
Published: 2024-09-28 (paper ID 2409.19339)
An evaluation framework and finetuning pipeline for enhancing the question decomposition capability of Multimodal Large Language Models are proposed, improving sub-question quality and VQA accuracy.
AI-generated summary
Question decomposition has emerged as an effective strategy for prompting
Large Language Models (LLMs) to answer complex questions. However, while
existing methods primarily focus on unimodal language models, the question
decomposition capability of Multimodal Large Language Models (MLLMs) has yet to
be explored. To this end, this paper explores visual question decomposition on
MLLMs. Specifically, we introduce a systematic evaluation framework including a
dataset and several evaluation criteria to assess the quality of the decomposed
sub-questions, revealing that existing MLLMs struggle to produce high-quality
sub-questions. To address this limitation, we propose a specific finetuning
dataset, DecoVQA+, for enhancing the model's question decomposition capability.
To enable models to perform appropriate selective decomposition, we propose
an efficient finetuning pipeline that combines our proposed dataset with a
training objective for selective decomposition.
Finetuned MLLMs show significant improvements in both the quality of
sub-questions and the policy of selective question decomposition. They also
achieve higher accuracy with selective decomposition on VQA benchmark
datasets.
TL;DR: This paper explores visual question decomposition in Multimodal Large Language Models (MLLMs), revealing that existing models struggle with producing high-quality sub-questions. To improve this, we introduce DecoVQA+, a finetuning dataset, and propose an efficient training pipeline for selective decomposition. The finetuned models show improved sub-question quality and decomposition policies, leading to higher accuracy on VQA benchmarks.
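To make "selective decomposition" concrete, here is a minimal Python sketch of the inference-time control flow the abstract implies: decide whether to decompose, and only then generate and answer sub-questions before answering the original question. It is not the authors' implementation. The `Model` interface, the prompt templates, the `Result` type, and `toy_model` are all hypothetical names introduced for illustration, and the paper's actual prompts and decision mechanism may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interface: any callable (image_path, prompt) -> generated text.
Model = Callable[[str, str], str]

# Illustrative prompt templates (assumptions, not taken from the paper).
DECIDE = "DECIDE: Should this question be split into sub-questions? Reply yes or no.\nQuestion: {q}"
DECOMPOSE = "DECOMPOSE: Write one sub-question per line.\nQuestion: {q}"
ANSWER = "ANSWER: {context}Question: {q}"


@dataclass
class Result:
    decomposed: bool
    answer: str
    sub_qa: List[Tuple[str, str]] = field(default_factory=list)


def selective_answer(model: Model, image: str, question: str) -> Result:
    # Step 1: let the model decide whether decomposition is worthwhile.
    decision = model(image, DECIDE.format(q=question))
    if not decision.strip().lower().startswith("yes"):
        # Answer directly, with no sub-question context.
        return Result(False, model(image, ANSWER.format(context="", q=question)))

    # Step 2: generate sub-questions, one per non-empty line.
    raw = model(image, DECOMPOSE.format(q=question))
    subs = [s.strip() for s in raw.splitlines() if s.strip()]

    # Step 3: answer each sub-question on its own.
    sub_qa = [(s, model(image, ANSWER.format(context="", q=s))) for s in subs]

    # Step 4: answer the original question with the sub-answers as context.
    context = "".join(f"Q: {s}\nA: {a}\n" for s, a in sub_qa)
    return Result(True, model(image, ANSWER.format(context=context, q=question)), sub_qa)


def toy_model(image: str, prompt: str) -> str:
    """Deterministic stand-in for an MLLM, so the sketch runs without weights."""
    if prompt.startswith("DECIDE:"):
        return "yes" if " and " in prompt else "no"
    if prompt.startswith("DECOMPOSE:"):
        return "What is the person holding?\nWhat is the weather like?"
    return "stub answer"
```

Calling `selective_answer(toy_model, "img.png", question)` with a real MLLM wrapper in place of `toy_model` would exercise the same four steps. Keeping the decision as its own model call makes the decomposition policy easy to inspect separately from sub-question quality, the two aspects the abstract reports as improved.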