Paper page - CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

arxiv:2406.18521

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

Published on Jun 26, 2024 · Submitted by Zirui Wang on Jun 27, 2024

Authors: Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, Danqi Chen

Abstract

CharXiv evaluates the chart understanding capabilities of MLLMs using a diverse and challenging dataset, uncovering significant weaknesses in model performance compared to human capabilities.

AI-generated summary

Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project page and leaderboard: https://charxiv.github.io/
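
To make the two question tracks concrete, here is a minimal, illustrative Python sketch that scores a model's predictions separately on descriptive and reasoning questions. The record fields and the naive exact-match rule are assumptions made for demonstration only; the benchmark's actual data, prompts, and grading code are released by the authors (code: https://github.com/princeton-nlp/CharXiv, project page: https://charxiv.github.io/).

```python
# Illustrative sketch only. The field names ("question_type", "answer",
# "prediction") and the exact-match rule are assumptions for demonstration;
# they are NOT the official CharXiv grading protocol.
from collections import defaultdict


def accuracy_by_question_type(records):
    """Return per-type accuracy for CharXiv-style records.

    Each record is assumed to carry a question type ("descriptive" or
    "reasoning"), a gold answer, and a model prediction.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        qtype = r["question_type"]
        total[qtype] += 1
        # Naive normalization + exact match; the benchmark itself grades
        # responses with a more careful scheme than string comparison.
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[qtype] += 1
    return {t: correct[t] / total[t] for t in total}


if __name__ == "__main__":
    demo = [
        {"question_type": "descriptive", "answer": "0.8", "prediction": "0.8"},
        {"question_type": "reasoning", "answer": "Model B", "prediction": "Model A"},
    ]
    print(accuracy_by_question_type(demo))  # {'descriptive': 1.0, 'reasoning': 0.0}
```

In practice, the exact-match check would be replaced by the benchmark's own grading procedure before comparing any numbers to the leaderboard.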

Community

Zirui Wang (Paper author, Paper submitter)

🤨 Are Multimodal Large Language Models really as good at chart understanding as existing benchmarks such as ChartQA suggest?

🚫 Our CharXiv benchmark suggests NO!
🥇 Humans achieve over 80% correctness.
🥈 Claude 3.5 Sonnet outperforms GPT-4o by 10+ points, reaching 60% correctness.
🥉 Open-weight models are capped at just above 30% correctness.

🪜 Leaderboard: https://charxiv.github.io/#leaderboard
📜 Preprint: https://arxiv.org/abs/2406.18521

📊 CharXiv is 100% handcrafted with rigorous human validation, and it reveals substantial gaps between Multimodal Large Language Models and humans in chart understanding.

Sahar M

Kudos @zwcolin and team. I've featured this paper in my AI research newsletter: www.aitidbits.ai/p/july-4th-2024#:~:text=of%20human%20performance-,Princeton,-develops

Looking forward to more novel papers and methods.


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2406.18521 in a model README.md to link it from this page.

Datasets citing this paper 3

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2406.18521 in a Space README.md to link it from this page.

Collections including this paper 10