StatEval: A Comprehensive Benchmark for Large Language Models in Statistics
Abstract
StatEval is a comprehensive benchmark for statistical reasoning that covers foundational and research-level problems and highlights the limitations of current LLMs in this domain.
Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. To address this gap, we introduce StatEval, the first comprehensive benchmark dedicated to statistics, spanning both breadth and depth across difficulty levels. StatEval consists of 13,817 foundational problems covering undergraduate and graduate curricula, together with 2,374 research-level proof tasks extracted from leading journals. To construct the benchmark, we design a scalable multi-agent pipeline with human-in-the-loop validation that automates large-scale problem extraction, rewriting, and quality control while ensuring academic rigor. We further propose a robust evaluation framework tailored to both computational and proof-based tasks, enabling fine-grained assessment of reasoning ability. Experimental results reveal that even closed-source models such as GPT5-mini achieve below 57% on research-level problems, with open-source models performing significantly lower. These findings highlight the unique challenges of statistical reasoning and the limitations of current LLMs. We expect StatEval to serve as a rigorous benchmark for advancing statistical intelligence in large language models. All data and code are available on our web platform: https://stateval.github.io/.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving (2025)
- Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers (2025)
- IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation (2025)
- FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory (2025)
- Arrows of Math Reasoning Data Synthesis for Large Language Models: Diversity, Complexity and Correctness (2025)
- THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning (2025)
- Saturation-Driven Dataset Generation for LLM Mathematical Reasoning in the TPTP Ecosystem (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
So where is the dataset?
Hi Ezez,
You should be able to find the datasets here: 0v01111/StatEval-Foundational-knowledge and 0v01111/StatEval-Statistical-Research. You can find the links to them on our project page.
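For anyone else looking for them, here is a minimal sketch of loading both repos with the Hugging Face `datasets` library. This assumes they are standard dataset repos with a default config; the split names and field layout are assumptions, so check the dataset cards on the project page.

```python
# Minimal loading sketch (assumption: default config; split names may differ).
from datasets import load_dataset

# Foundational problems (undergraduate/graduate curricula)
foundational = load_dataset("0v01111/StatEval-Foundational-knowledge")

# Research-level proof tasks extracted from journals
research = load_dataset("0v01111/StatEval-Statistical-Research")

# Inspect available splits and sizes
print(foundational)
print(research)

# Peek at the first example of whichever split comes back first
first_split = next(iter(foundational))
print(foundational[first_split][0])
```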
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper