Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
\n","updatedAt":"2024-03-21T01:23:02.334Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7359945774078369},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2403.12596","authors":[{"_id":"65fa55c8e483037d83fa56cb","user":{"_id":"656054efc40c3a6a0d9d7b3e","avatarUrl":"/avatars/6d93e933df1cd59b836f5f79b5e938bf.svg","isPro":false,"fullname":"Victor Carbune","user":"vcarbune","type":"user"},"name":"Victor Carbune","status":"admin_assigned","statusLastChangedAt":"2024-03-21T12:55:55.218Z","hidden":false},{"_id":"65fa55c8e483037d83fa56cc","user":{"_id":"629098fd5463575364e7697a","avatarUrl":"/avatars/e45f3ee28bff3571f76caf847d3c36db.svg","isPro":false,"fullname":"Hassan Mansoor","user":"HassanMansoor","type":"user"},"name":"Hassan Mansoor","status":"admin_assigned","statusLastChangedAt":"2024-03-21T12:55:48.884Z","hidden":false},{"_id":"65fa55c8e483037d83fa56cd","user":{"_id":"5f881856ee5616341bc51e67","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/5f881856ee5616341bc51e67/9UCZCuhBTpJC9tGPyGmMb.jpeg","isPro":false,"fullname":"Fangyu Liu","user":"fl399","type":"user"},"name":"Fangyu Liu","status":"admin_assigned","statusLastChangedAt":"2024-03-21T12:56:01.262Z","hidden":false},{"_id":"65fa55c8e483037d83fa56ce","user":{"_id":"60270a7c32856987162c641a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/60270a7c32856987162c641a/yNO2n0MMOiTqRaHFD2bv1.jpeg","isPro":false,"fullname":"Rahul","user":"rahular","type":"user"},"name":"Rahul Aralikatte","status":"admin_assigned","statusLastChangedAt":"2024-03-21T12:56:08.802Z","hidden":false},{"_id":"65fa55c8e483037d83fa56cf","user":{"_id":"633bf74e9a0fb78266be26cf","avatarUrl":"/avatars/360e37dff6841ee97779743e94e62213.svg","isPro":false,"fullname":"Gilles Baechler","user":"g-les","type":"user"},"name":"Gilles Baechler","status":"admin_assigned","statusLastChangedAt":"2024-03-21T12:56:15.562Z","hidden":false},{"_id":"65fa55c8e483037d83fa56d0","user":{"_id":"635242b8b3678a43742baec6","avatarUrl":"/avatars/fc9f6d3922c76ab902cb11ed23554c54.svg","isPro":false,"fullname":"chenjindong","user":"chenjindong","type":"user"},"name":"Jindong Chen","status":"admin_assigned","statusLastChangedAt":"2024-03-21T12:56:28.070Z","hidden":false},{"_id":"65fa55c8e483037d83fa56d1","name":"Abhanshu Sharma","hidden":false}],"publishedAt":"2024-03-19T10:03:07.000Z","submittedOnDailyAt":"2024-03-20T01:49:37.905Z","title":"Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"Vision-language models (VLMs) are achieving increasingly strong performance\non multimodal tasks. However, reasoning capabilities remain limited\nparticularly for smaller VLMs, while those of large-language models (LLMs) have\nseen numerous improvements. 
We propose a technique to transfer capabilities\nfrom LLMs to VLMs. On the recently introduced ChartQA, our method obtains\nstate-of-the-art performance when applied on the PaLI3-5B VLM by\nchen2023pali3, while also enabling much better performance on PlotQA\nand FigureQA.\n We first improve the chart representation by continuing the pre-training\nstage using an improved version of the chart-to-table translation task by\nliu2023deplot. We then propose constructing a 20x larger dataset than\nthe original training set. To improve general reasoning capabilities and\nimprove numerical operations, we synthesize reasoning traces using the table\nrepresentation of charts. Lastly, our model is fine-tuned using the multitask\nloss introduced by hsieh2023distilling.\n Our variant ChartPaLI-5B outperforms even 10x larger models such as PaLIX-55B\nwithout using an upstream OCR system, while keeping inference time constant\ncompared to the PaLI3-5B baseline. When rationales are further refined with a\nsimple program-of-thought prompt chen2023program, our model outperforms\nthe recently introduced Gemini Ultra and GPT-4V.","upvotes":11,"discussionId":"65fa55c9e483037d83fa5719","ai_summary":"A method transfers reasoning capabilities from large-language models to vision-language models, achieving state-of-the-art performance on ChartQA and superior performance on PlotQA and FigureQA by enhancing chart representation and training with a larger, synthesized dataset and multitask loss.","ai_keywords":["vision-language models","large-language models","ChartQA","PaLI3-5B","plotQA","FigureQA","chart-to-table translation","reasoning traces","multitask loss","ChartPaLI-5B","PaLIX-55B","OCR system","program-of-thought prompt","Gemini Ultra","GPT-4V"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"655ac762cb17ec19ef82719b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/655ac762cb17ec19ef82719b/1kDncYrGLYS_2SR8cNdAL.png","isPro":false,"fullname":"Welcome to matlok","user":"matlok","type":"user"},{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"6538119803519fddb4a17e10","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6538119803519fddb4a17e10/ffJMkdx-rM7VvLTCM6ri_.jpeg","isPro":false,"fullname":"samusenps","user":"samusenps","type":"user"},{"_id":"657152eb12f162153b50ec9d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/657152eb12f162153b50ec9d/qnldHP35PclV0pDz_05q8.jpeg","isPro":false,"fullname":"Byung-Kwan Lee","user":"BK-Lee","type":"user"},{"_id":"5ecea265968f6028e0559fa5","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1619623771844-5ecea265968f6028e0559fa5.jpeg","isPro":true,"fullname":"Victor Sanh","user":"VictorSanh","type":"user"},{"_id":"6555125a4f361968f0e3aad7","avatarUrl":"/avatars/e7692d82804338f21ecdc6e731f5c5ea.svg","isPro":false,"fullname":"marinaretikof","user":"marinaretik","type":"user"},{"_id":"630c2ddb86b8b9904c3860a6","avatarUrl":"/avatars/9b6cec2e9e269ccac1533eb7bf1ac2c5.svg","isPro":false,"fullname":"Igor 
Melnyk","user":"imelnyk","type":"user"},{"_id":"644f10d267a3dd3d072a2669","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/644f10d267a3dd3d072a2669/r7AA6gkm-AuQLHQ77G7d5.png","isPro":false,"fullname":"Neil Van","user":"nvhf","type":"user"},{"_id":"663ccbff3a74a20189d4aa2e","avatarUrl":"/avatars/83a54455e0157480f65c498cd9057cf2.svg","isPro":false,"fullname":"Nguyen Van Thanh","user":"NguyenVanThanhHust","type":"user"},{"_id":"67b8713377a173b77b11fc19","avatarUrl":"/avatars/065960d94c605fe599d60f74c2516e25.svg","isPro":false,"fullname":"Kasturi Gnanaguna Sagar","user":"kasturi-sarvam","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary

A method transfers reasoning capabilities from large language models to vision-language models, achieving state-of-the-art performance on ChartQA and much better performance on PlotQA and FigureQA by enhancing the chart representation and training on a larger, synthesized dataset with a multitask loss.
Vision-language models (VLMs) are achieving increasingly strong performance on multimodal tasks. However, their reasoning capabilities remain limited, particularly for smaller VLMs, while those of large language models (LLMs) have seen numerous improvements. We propose a technique to transfer capabilities from LLMs to VLMs. On the recently introduced ChartQA benchmark, our method obtains state-of-the-art performance when applied to the PaLI3-5B VLM of chen2023pali3, while also enabling much better performance on PlotQA and FigureQA.
We first improve the chart representation by continuing the pre-training stage with an improved version of the chart-to-table translation task of liu2023deplot. We then construct a dataset 20x larger than the original training set. To improve general reasoning capabilities and numerical operations, we synthesize reasoning traces using the table representation of charts. Lastly, our model is fine-tuned with the multitask loss introduced by hsieh2023distilling.
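The multitask loss of hsieh2023distilling trains the model on two prediction tasks at once: producing the short answer and producing the reasoning trace, with the two token-level cross-entropy terms summed. The following is a minimal PyTorch-style sketch of that objective; the task-prefix strings, the model interface, and the weighting factor lam are illustrative assumptions, not details taken from the paper.

import torch.nn.functional as F

def multitask_loss(model, image, question, answer_ids, rationale_ids, lam=1.0):
    # Sketch of a distilling-step-by-step style objective:
    #   L = L_answer + lam * L_rationale,
    # where each term is a token-level cross-entropy over its target sequence.
    # Task 1: predict the short answer, signalled by a task prefix in the text input.
    answer_logits = model(image, "[answer] " + question)        # (T_ans, vocab)
    l_answer = F.cross_entropy(answer_logits, answer_ids)

    # Task 2: predict the synthesized rationale for the same chart and question.
    rationale_logits = model(image, "[rationale] " + question)  # (T_rat, vocab)
    l_rationale = F.cross_entropy(rationale_logits, rationale_ids)

    return l_answer + lam * l_rationale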
Our variant, ChartPaLI-5B, outperforms even 10x larger models such as PaLIX-55B without using an upstream OCR system, while keeping inference time constant relative to the PaLI3-5B baseline. When rationales are further refined with a simple program-of-thought prompt from chen2023program, our model outperforms the recently introduced Gemini Ultra and GPT-4V.
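Program-of-thought refinement replaces free-form arithmetic in a rationale with a short program that is executed to obtain the final number. The sketch below shows one way such a refinement step could be wired up from a chart's table representation; the prompt wording and the llm_generate callable are assumptions for illustration, not the authors' implementation.

def refine_with_program_of_thought(llm_generate, table, question):
    # Ask a language model for executable Python that derives the answer from
    # the chart's table representation, then run it instead of trusting the
    # model's own arithmetic.
    prompt = (
        "Given the table below, write Python code that computes the answer "
        "to the question and stores it in a variable named `answer`.\n\n"
        f"Table:\n{table}\n\nQuestion: {question}\nCode:\n"
    )
    code = llm_generate(prompt)   # e.g. "answer = (48.3 - 42.1) / 42.1 * 100"
    scope = {}
    exec(code, {}, scope)         # untrusted code; sandbox this in practice
    return scope.get("answer")

Because the arithmetic is carried out by the interpreter rather than the model, numeric slips in the generated rationale are corrected at execution time.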