Guo","user":"optizer","type":"user"},{"_id":"6342796a0875f2c99cfd313b","avatarUrl":"/avatars/98575092404c4197b20c929a6499a015.svg","isPro":false,"fullname":"Yuseung \"Phillip\" Lee","user":"phillipinseoul","type":"user"},{"_id":"5f32b2367e583543386214d9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1635314457124-5f32b2367e583543386214d9.jpeg","isPro":false,"fullname":"Sergei Averkiev","user":"averoo","type":"user"},{"_id":"63b2a92e18e5cf2cdd333492","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63b2a92e18e5cf2cdd333492/GxnngJG0u7d0jYTEFOrfe.png","isPro":false,"fullname":"Jaehyun Jun","user":"btjhjeon","type":"user"},{"_id":"655ac762cb17ec19ef82719b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/655ac762cb17ec19ef82719b/1kDncYrGLYS_2SR8cNdAL.png","isPro":false,"fullname":"Welcome to matlok","user":"matlok","type":"user"},{"_id":"641b754d1911d3be6745cce9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/641b754d1911d3be6745cce9/Ydjcjd4VuNUGj5Cd4QHdB.png","isPro":false,"fullname":"atayloraerospace","user":"Taylor658","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
Abstract
DynaMath, a dynamic visual math benchmark, evaluates the robustness of Vision-Language Models in mathematical reasoning across various input conditions, revealing significant accuracy differences between average and worst-case scenarios.
The rapid advancements in Vision-Language Models (VLMs) have shown great potential in tackling mathematical reasoning tasks that involve visual context. Unlike humans who can reliably apply solution steps to similar problems with minor modifications, we found that SOTA VLMs like GPT-4o can consistently fail in these scenarios, revealing limitations in their mathematical reasoning capabilities. In this paper, we investigate the mathematical reasoning robustness in VLMs and evaluate how well these models perform under different variants of the same question, such as changes in visual numerical values or function graphs. While several vision-based math benchmarks have been developed to assess VLMs' problem-solving capabilities, these benchmarks contain only static sets of problems and cannot easily evaluate mathematical reasoning robustness. To fill this gap, we introduce DynaMath, a dynamic visual math benchmark designed for in-depth assessment of VLMs. DynaMath includes 501 high-quality, multi-topic seed questions, each represented as a Python program. Those programs are carefully designed and annotated to enable the automatic generation of a much larger set of concrete questions, including many different types of visual and textual variations. DynaMath allows us to evaluate the generalization ability of VLMs, by assessing their performance under varying input conditions of a seed question. We evaluated 14 SOTA VLMs with 5,010 generated concrete questions. Our results show that the worst-case model accuracy, defined as the percentage of correctly answered seed questions in all 10 variants, is significantly lower than the average-case accuracy. Our analysis emphasizes the need to study the robustness of VLMs' reasoning abilities, and DynaMath provides valuable insights to guide the development of more reliable models for mathematical reasoning.
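For concreteness, here is a minimal sketch (not from the paper; the result layout and names are assumptions) of how the average-case and worst-case accuracies described above can be computed from per-variant correctness results:

```python
# Minimal sketch of the two accuracy metrics described in the abstract.
# Assumes `results[seed_id]` is a list of booleans, one per generated variant
# (e.g., 10 variants per seed question); names and layout are illustrative only.

def average_case_accuracy(results: dict[str, list[bool]]) -> float:
    """Fraction of all generated concrete questions answered correctly."""
    total = sum(len(v) for v in results.values())
    correct = sum(sum(v) for v in results.values())
    return correct / total

def worst_case_accuracy(results: dict[str, list[bool]]) -> float:
    """Fraction of seed questions whose *every* variant is answered correctly."""
    return sum(all(v) for v in results.values()) / len(results)

# Toy example: 2 seed questions, 3 variants each.
results = {"seed_001": [True, True, True], "seed_002": [True, False, True]}
print(average_case_accuracy(results))  # 5/6 ≈ 0.83
print(worst_case_accuracy(results))    # 1/2 = 0.50
```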
Community
Check out DynaMath, a dynamic visual math benchmark for evaluating the reasoning robustness of VLMs. There are lots of interesting findings on the robustness of SOTA VLMs. GitHub: https://github.com/DynaMath/DynaMath; Hugging Face dataset: https://huggingface.co/datasets/DynaMath/DynaMath_Sample.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning (2024)
- TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions (2024)
- ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning (2024)
- Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning (2024)
- VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Perfect ideas!
One small question:
How do you generate ground-truth solutions for the program-based generated questions?
Thanks @julyai ! This is a great question. When we create programs, we ensure that both the questions and their corresponding ground-truth answers are generated together. For example, for the shifted absolute value question in Figure 1, the shift amount is a random variable in our program, and our program can easily calculate the ground-truth solution based on its value (e.g., when the shift is non-zero, the function is differentiable at x=0; it is non-differentiable otherwise). Other problems are created in a similar manner, where both questions and ground-truth solutions are generated by the program.
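For illustration, here is a rough sketch of what such a program might look like (variable names and prompt wording are ours, not the actual DynaMath code):

```python
import random

def generate_variant():
    """Sketch of a program-based seed question: shifted absolute value f(x) = |x - a|.
    The shift `a` is sampled at random, and the question and its ground-truth
    answer are produced together, so no separate annotation step is needed."""
    a = random.choice([-2, -1, 0, 1, 2])  # random shift; 0 puts the kink at x = 0
    question = (
        f"The figure shows the graph of f(x) = |x - {a}|. "
        "Is f differentiable at x = 0? Answer yes or no."
    )
    # Ground truth follows directly from the sampled parameter:
    # the kink of |x - a| sits at x = a, so f is differentiable at 0 iff a != 0.
    answer = "yes" if a != 0 else "no"
    # (The real seed programs also render the corresponding figure; omitted here.)
    return question, answer

q, gt = generate_variant()
print(q)
print("Ground truth:", gt)
```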
Thanks again for the question and feel free to let us know if there is anything else to clarify :)
Thanks! Got it!