MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs
\n","updatedAt":"2024-10-09T01:35:01.517Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2410.04698","authors":[{"_id":"6704d8f0e8d5909d35d32eea","user":{"_id":"646def60df618b303b419323","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/646def60df618b303b419323/JLJGYen4-5M8ivsLsSk0w.jpeg","isPro":false,"fullname":"Lei Wang","user":"demolei","type":"user"},"name":"Lei Wang","status":"claimed_verified","statusLastChangedAt":"2024-10-08T07:18:43.902Z","hidden":false},{"_id":"6704d8f0e8d5909d35d32eeb","name":"Shan Dong","hidden":false},{"_id":"6704d8f0e8d5909d35d32eec","user":{"_id":"6602869253a0518b2a98cafd","avatarUrl":"/avatars/c14b5953a716f42c83ad28147f8308ae.svg","isPro":false,"fullname":"Yuhui Xu","user":"yuhuixu","type":"user"},"name":"Yuhui Xu","status":"admin_assigned","statusLastChangedAt":"2024-10-08T09:34:06.566Z","hidden":false},{"_id":"6704d8f0e8d5909d35d32eed","user":{"_id":"63a3ff69f91ad3ea5703841d","avatarUrl":"/avatars/69227c4bce01d33747c1377b6f9672db.svg","isPro":false,"fullname":"Hanze Dong","user":"hendrydong","type":"user"},"name":"Hanze Dong","status":"admin_assigned","statusLastChangedAt":"2024-10-08T09:33:48.011Z","hidden":false},{"_id":"6704d8f0e8d5909d35d32eee","user":{"_id":"656741fca637eb616aff92ba","avatarUrl":"/avatars/6a0c7ac44948406b191d27fd2157d829.svg","isPro":false,"fullname":"Yalu Wang","user":"lunshi","type":"user"},"name":"Yalu Wang","status":"admin_assigned","statusLastChangedAt":"2024-10-08T09:33:41.725Z","hidden":false},{"_id":"6704d8f0e8d5909d35d32eef","name":"Amrita Saha","hidden":false},{"_id":"6704d8f0e8d5909d35d32ef0","name":"Ee-Peng Lim","hidden":false},{"_id":"6704d8f0e8d5909d35d32ef1","user":{"_id":"649dbcc4e0fff1ed099dc80a","avatarUrl":"/avatars/c87c273ca628dbcddccbf1ee19b2ce33.svg","isPro":false,"fullname":"Caiming Xiong","user":"cxiong","type":"user"},"name":"Caiming Xiong","status":"admin_assigned","statusLastChangedAt":"2024-10-08T09:33:30.216Z","hidden":false},{"_id":"6704d8f0e8d5909d35d32ef2","user":{"_id":"65f84fd980481173afd91233","avatarUrl":"/avatars/6ac7bd6beba24d1476c5179b88c9e3fa.svg","isPro":false,"fullname":"Doyen","user":"doyensahoo","type":"user"},"name":"Doyen Sahoo","status":"admin_assigned","statusLastChangedAt":"2024-10-08T09:33:24.218Z","hidden":false}],"publishedAt":"2024-10-07T02:30:07.000Z","submittedOnDailyAt":"2024-10-08T05:33:52.754Z","title":"MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning\n in LLMs","submittedOnDailyBy":{"_id":"646def60df618b303b419323","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/646def60df618b303b419323/JLJGYen4-5M8ivsLsSk0w.jpeg","isPro":false,"fullname":"Lei Wang","user":"demolei","type":"user"},"summary":"Recent large language models (LLMs) have demonstrated versatile capabilities\nin long-context scenarios. 
Although some recent benchmarks have been developed\nto evaluate the long-context capabilities of LLMs, there is a lack of\nbenchmarks evaluating the mathematical reasoning abilities of LLMs over long\ncontexts, which is crucial for LLMs' application in real-world scenarios. In\nthis paper, we introduce MathHay, an automated benchmark designed to assess the\nlong-context mathematical reasoning capabilities of LLMs. Unlike previous\nbenchmarks like Needle in a Haystack, which focus primarily on information\nretrieval within long texts, MathHay demands models with both\ninformation-seeking and complex mathematical reasoning abilities. We conduct\nextensive experiments on MathHay to assess the long-context mathematical\nreasoning abilities of eight top-performing LLMs. Even the best-performing\nmodel, Gemini-1.5-Pro-002, still struggles with mathematical reasoning over\nlong contexts, achieving only 51.26% accuracy at 128K tokens. This highlights\nthe significant room for improvement on the MathHay benchmark.","upvotes":13,"discussionId":"6704d8f2e8d5909d35d32f74","ai_summary":"MathHay is an automated benchmark assessing LLMs' long-context mathematical reasoning abilities, revealing significant areas for improvement.","ai_keywords":["large language models","long-context","mathematical reasoning","MathHay","information-seeking","complex mathematical reasoning","Gemini-1.5-Pro-002"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"646def60df618b303b419323","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/646def60df618b303b419323/JLJGYen4-5M8ivsLsSk0w.jpeg","isPro":false,"fullname":"Lei Wang","user":"demolei","type":"user"},{"_id":"66aa0c1788d5d0c59ee75b88","avatarUrl":"/avatars/62ea61f078c18853096a8f81dc2fe03d.svg","isPro":false,"fullname":"Shan Dong","user":"shansarah","type":"user"},{"_id":"619c6ffbb392787f0f3ead67","avatarUrl":"/avatars/912e8031e7c86ceabe71f66f9d89262d.svg","isPro":false,"fullname":"eason lai","user":"lai","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"641b754d1911d3be6745cce9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/641b754d1911d3be6745cce9/Ydjcjd4VuNUGj5Cd4QHdB.png","isPro":false,"fullname":"atayloraerospace","user":"Taylor658","type":"user"},{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"6270324ebecab9e2dcf245de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6270324ebecab9e2dcf245de/cMbtWSasyNlYc9hvsEEzt.jpeg","isPro":false,"fullname":"Kye Gomez","user":"kye","type":"user"},{"_id":"6424419a7ac80fc1eafb3920","avatarUrl":"/avatars/081981854b2360e8f7f5d12ea3d950e9.svg","isPro":false,"fullname":"Yihuai Lan","user":"lanyihuai","type":"user"},{"_id":"647e97a133493e1c433d9e0a","avatarUrl":"/avatars/654d95c4879133fc4ed528d90dcf65e6.svg","isPro":false,"fullname":"Wei Qin","user":"WeiAir","type":"user"},{"_id":"6602869253a0518b2a98cafd","avatarUrl":"/avatars/c14b5953a716f42c83ad28147f8308ae.svg","isPro":false,"fullname":"Yuhui 
Xu","user":"yuhuixu","type":"user"},{"_id":"64587be872b60ae7a3817858","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64587be872b60ae7a3817858/BbdOOxOCEzWTvEpkWp8MM.png","isPro":false,"fullname":"Minbyul Jeong","user":"Minbyul","type":"user"},{"_id":"62b0009c72043b05d29492b2","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62b0009c72043b05d29492b2/NqRkX2YLhlfOLvYysa7dD.png","isPro":false,"fullname":"Li Lyna Zhang","user":"lynazhang","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary
MathHay is an automated benchmark assessing LLMs' long-context mathematical reasoning abilities, revealing significant room for improvement.
Recent large language models (LLMs) have demonstrated versatile capabilities
in long-context scenarios. Although some recent benchmarks have been developed
to evaluate the long-context capabilities of LLMs, there is a lack of
benchmarks evaluating the mathematical reasoning abilities of LLMs over long
contexts, which is crucial for LLMs' application in real-world scenarios. In
this paper, we introduce MathHay, an automated benchmark designed to assess the
long-context mathematical reasoning capabilities of LLMs. Unlike previous
benchmarks like Needle in a Haystack, which focus primarily on information
retrieval within long texts, MathHay requires models to exhibit both
information-seeking and complex mathematical reasoning abilities. We conduct
extensive experiments on MathHay to assess the long-context mathematical
reasoning abilities of eight top-performing LLMs. Even the best-performing
model, Gemini-1.5-Pro-002, still struggles with mathematical reasoning over
long contexts, achieving only 51.26% accuracy at 128K tokens. This highlights
the significant room for improvement on the MathHay benchmark.
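To make the setup concrete, the sketch below illustrates a needle-in-a-haystack-style evaluation with a numeric answer check: relevant documents are planted among distractors in a long context, the model is asked a quantitative question, and its answer is scored against the gold value. This is purely illustrative and not the paper's actual pipeline; the task format (relevant_docs, distractors, question, gold), the model_fn callable, the token accounting, and the 1% tolerance are all assumptions made for the example.

```python
import random
import re


def build_haystack(relevant_docs, distractor_docs, target_tokens, tokens_per_doc=200):
    """Mix the documents needed to answer the question with distractors
    until the context reaches roughly target_tokens (hypothetical accounting)."""
    n_distractors = max(0, target_tokens // tokens_per_doc - len(relevant_docs))
    docs = relevant_docs + random.sample(distractor_docs, min(n_distractors, len(distractor_docs)))
    random.shuffle(docs)  # the relevant facts end up at arbitrary depths
    return "\n\n".join(docs)


def extract_number(text):
    """Pull the final number out of the model's free-form answer, if any."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def is_correct(answer_text, gold, rel_tol=1e-2):
    """Count a numeric answer as correct within a small relative tolerance
    (the tolerance is an assumption, not the paper's scoring rule)."""
    pred = extract_number(answer_text)
    return pred is not None and abs(pred - gold) <= rel_tol * max(1.0, abs(gold))


def evaluate(model_fn, tasks, target_tokens=128_000):
    """Accuracy of model_fn (prompt -> answer text) over MathHay-like tasks."""
    correct = 0
    for task in tasks:
        context = build_haystack(task["relevant_docs"], task["distractors"], target_tokens)
        prompt = f"{context}\n\nQuestion: {task['question']}\nAnswer with a single number."
        correct += is_correct(model_fn(prompt), task["gold"])
    return correct / len(tasks)
```

At a 128K-token target, a model must first locate the few relevant quantities among hundreds of distractor documents and then carry out the arithmetic correctly; this coupling of retrieval with computation is what the abstract reports even Gemini-1.5-Pro-002 solving only about half the time.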