

arxiv:2510.09517

StatEval: A Comprehensive Benchmark for Large Language Models in Statistics

Published on Oct 10, 2025
Submitted by Runpeng Dai on Oct 13, 2025
Authors: Yuchen Lu, Run Yang, Yichen Zhang, Shuguang Yu, Runpeng Dai, Ziwei Wang, Jiayi Xiang, Wenxin E, Siran Gao, Xinyao Ruan, Yirui Huang, Chenjing Xi, Haibo Hu, Yueming Fu, Qinglan Yu, Xiaobing Wei, Jiani Gu, Rui Sun, Jiaxuan Jia, Fan Zhou
Abstract

StatEval is a comprehensive benchmark for statistical reasoning, covering foundational and research-level problems, and highlights the limitations of current LLMs in this domain.

AI-generated summary

Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. To address this gap, we introduce StatEval, the first comprehensive benchmark dedicated to statistics, spanning both breadth and depth across difficulty levels. StatEval consists of 13,817 foundational problems covering undergraduate and graduate curricula, together with 2,374 research-level proof tasks extracted from leading journals. To construct the benchmark, we design a scalable multi-agent pipeline with human-in-the-loop validation that automates large-scale problem extraction, rewriting, and quality control, while ensuring academic rigor. We further propose a robust evaluation framework tailored to both computational and proof-based tasks, enabling fine-grained assessment of reasoning ability. Experimental results reveal that even closed-source models such as GPT5-mini score below 57% on research-level problems, with open-source models performing significantly lower. These findings highlight the unique challenges of statistical reasoning and the limitations of current LLMs. We expect StatEval to serve as a rigorous benchmark for advancing statistical intelligence in large language models. All data and code are available on our web platform: https://stateval.github.io/.
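The abstract mentions an evaluation framework for computational (numeric-answer) tasks but does not detail its scoring rule. As a rough illustration of what automated grading of a model's free-form numeric answer can look like, the sketch below extracts the last numeric literal from an answer string and compares it to a reference value within a relative tolerance. All function names and the tolerance are illustrative assumptions, not StatEval's actual criteria.

```python
import re
from fractions import Fraction

def extract_number(text: str):
    """Pull the last numeric literal (integer, decimal, or simple fraction)
    out of a model's free-form answer. Returns None if nothing matches."""
    matches = re.findall(r"-?\d+(?:\.\d+)?(?:/\d+)?", text.replace(",", ""))
    if not matches:
        return None
    token = matches[-1]
    if "/" in token:
        return float(Fraction(token))
    return float(token)

def grade_numeric(model_answer: str, reference: float, rel_tol: float = 1e-4) -> bool:
    """Score a computational task: correct if the extracted value is within
    a relative tolerance of the reference (an illustrative choice, not
    the benchmark's actual rule)."""
    value = extract_number(model_answer)
    if value is None:
        return False
    denom = max(abs(reference), 1.0)
    return abs(value - reference) / denom <= rel_tol

print(grade_numeric("The MLE is 3/4, i.e. 0.75", 0.75))  # True
print(grade_numeric("Approximately 0.8", 0.75))          # False
```

Answer extraction of this kind is fragile by design (it ignores units, scientific notation, and symbolic answers), which is part of why proof-based tasks typically need a separate, judgment-based evaluation track.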

Community

Paper author Paper submitter

Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. To address this gap, we introduce StatEval, the first comprehensive benchmark dedicated to statistics, spanning both breadth and depth across difficulty levels. StatEval consists of 13,817 foundational problems covering undergraduate and graduate curricula, together with 2,374 research-level proof tasks extracted from leading journals. To construct the benchmark, we design a scalable multi-agent pipeline with human-in-the-loop validation that automates large-scale problem extraction, rewriting, and quality control, while ensuring academic rigor. We further propose a robust evaluation framework tailored to both computational and proof-based tasks, enabling fine-grained assessment of reasoning ability. Experimental results reveal that even closed-source models such as GPT5-mini score below 57% on research-level problems, with open-source models performing significantly lower. These findings highlight the unique challenges of statistical reasoning and the limitations of current LLMs. We expect StatEval to serve as a rigorous benchmark for advancing statistical intelligence in large language models. All data and code are available on our web platform: https://stateval.github.io/.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

So where is the dataset?

Paper author Paper submitter

Hi Ezez,

You should be able to find the datasets here: 0v01111/StatEval-Foundational-knowledge and 0v01111/StatEval-Statistical-Research. You can find links to them on our project page.

