When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research
This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research](https://huggingface.co/papers/2503.22989) (2025)
* [YourBench: Easy Custom Evaluation Sets for Everyone](https://huggingface.co/papers/2504.01833) (2025)
* [CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers?](https://huggingface.co/papers/2503.21717) (2025)
* [ArxivBench: Can LLMs Assist Researchers in Conducting Research?](https://huggingface.co/papers/2504.10496) (2025)
* [LongCodeBench: Evaluating Coding LLMs at 1M Context Windows](https://huggingface.co/papers/2505.07897) (2025)
* [LLMs Outperform Experts on Challenging Biology Benchmarks](https://huggingface.co/papers/2505.06108) (2025)
* [SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models](https://huggingface.co/papers/2505.07247) (2025)
Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
\n","updatedAt":"2025-05-21T01:36:56.661Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.743320107460022},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2505.11855","authors":[{"_id":"682c11fe08d047591841ebf1","user":{"_id":"60d3e619b8448e1785bbda2a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/60d3e619b8448e1785bbda2a/q2re5u1HNwsCCyIMtid_I.jpeg","isPro":true,"fullname":"GUIJIN SON","user":"amphora","type":"user"},"name":"Guijin Son","status":"admin_assigned","statusLastChangedAt":"2025-05-20T08:40:55.774Z","hidden":false},{"_id":"682c11fe08d047591841ebf2","user":{"_id":"6415c043486c7c9a5d151583","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6415c043486c7c9a5d151583/fUdYFh6iVh57swCkBEy-y.jpeg","isPro":false,"fullname":"Jiwoo Hong","user":"JW17","type":"user"},"name":"Jiwoo Hong","status":"admin_assigned","statusLastChangedAt":"2025-05-20T08:41:14.541Z","hidden":false},{"_id":"682c11fe08d047591841ebf3","name":"Honglu Fan","hidden":false},{"_id":"682c11fe08d047591841ebf4","user":{"_id":"659f9445d5c4ea912705aa4d","avatarUrl":"/avatars/1d3297c3ccad48e5eb6c01e0640dc06d.svg","isPro":false,"fullname":"Heejeong Nam","user":"HazelNam","type":"user"},"name":"Heejeong Nam","status":"admin_assigned","statusLastChangedAt":"2025-05-20T08:41:30.411Z","hidden":false},{"_id":"682c11fe08d047591841ebf5","user":{"_id":"63e087b6a98d931aa90c1b9c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63e087b6a98d931aa90c1b9c/4ZnfL0U8rrj3cNhj7WTgo.jpeg","isPro":false,"fullname":"Hyunwoo Ko","user":"Cartinoe5930","type":"user"},"name":"Hyunwoo Ko","status":"claimed_verified","statusLastChangedAt":"2025-05-20T07:20:13.177Z","hidden":false},{"_id":"682c11fe08d047591841ebf6","user":{"_id":"63be1cd13b0665ad51d29c37","avatarUrl":"/avatars/5acc9b9bbecac3d567e927e2d8667b00.svg","isPro":false,"fullname":"Seungwon Lim","user":"sngwon","type":"user"},"name":"Seungwon Lim","status":"admin_assigned","statusLastChangedAt":"2025-05-20T08:41:49.745Z","hidden":false},{"_id":"682c11fe08d047591841ebf7","name":"Jinyeop Song","hidden":false},{"_id":"682c11fe08d047591841ebf8","name":"Jinha Choi","hidden":false},{"_id":"682c11fe08d047591841ebf9","name":"Gonçalo Paulo","hidden":false},{"_id":"682c11fe08d047591841ebfa","name":"Youngjae Yu","hidden":false},{"_id":"682c11fe08d047591841ebfb","user":{"_id":"60347d3660e3dd96631c9093","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/60347d3660e3dd96631c9093/B3fuZer5N04tZIAYrLnz4.jpeg","isPro":false,"fullname":"Stella Biderman","user":"stellaathena","type":"user"},"name":"Stella Biderman","status":"claimed_verified","statusLastChangedAt":"2025-06-07T05:49:57.112Z","hidden":false}],"publishedAt":"2025-05-17T05:45:16.000Z","submittedOnDailyAt":"2025-05-20T04:18:15.709Z","title":"When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification\n of Scientific 
Research","submittedOnDailyBy":{"_id":"60d3e619b8448e1785bbda2a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/60d3e619b8448e1785bbda2a/q2re5u1HNwsCCyIMtid_I.jpeg","isPro":true,"fullname":"GUIJIN SON","user":"amphora","type":"user"},"summary":"Recent advances in large language models (LLMs) have fueled the vision of\nautomated scientific discovery, often called AI Co-Scientists. To date, prior\nwork casts these systems as generative co-authors responsible for crafting\nhypotheses, synthesizing code, or drafting manuscripts. In this work, we\nexplore a complementary application: using LLMs as verifiers to automate the\nacademic verification of scientific manuscripts. To that end, we\nintroduce SPOT, a dataset of 83 published papers paired with 91 errors\nsignificant enough to prompt errata or retraction, cross-validated with actual\nauthors and human annotators. Evaluating state-of-the-art LLMs on SPOT, we find\nthat none surpasses 21.1\\% recall or 6.1\\% precision (o3 achieves the best\nscores, with all others near zero). Furthermore, confidence estimates are\nuniformly low, and across eight independent runs, models rarely rediscover the\nsame errors, undermining their reliability. Finally, qualitative analysis with\ndomain experts reveals that even the strongest models make mistakes resembling\nstudent-level misconceptions derived from misunderstandings. These findings\nhighlight the substantial gap between current LLM capabilities and the\nrequirements for dependable AI-assisted academic verification.","upvotes":10,"discussionId":"682c11ff08d047591841ec50","githubRepo":"https://github.com/guijinSON/SPOT","githubRepoAddedBy":"auto","ai_summary":"Evaluation of LLMs on an academic manuscript verification dataset (SPOT) shows poor recall, precision, and reliability, indicating significant limitations in current AI's ability to replace human verification in scientific research.","ai_keywords":["large language models","LLMs","AI Co-Scientists","generative co-authors","academic verification of scientific manuscripts","SPOT","confidence estimates","qualitative analysis"],"githubStars":9},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"60d3e619b8448e1785bbda2a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/60d3e619b8448e1785bbda2a/q2re5u1HNwsCCyIMtid_I.jpeg","isPro":true,"fullname":"GUIJIN SON","user":"amphora","type":"user"},{"_id":"63e087b6a98d931aa90c1b9c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63e087b6a98d931aa90c1b9c/4ZnfL0U8rrj3cNhj7WTgo.jpeg","isPro":false,"fullname":"Hyunwoo Ko","user":"Cartinoe5930","type":"user"},{"_id":"63be1cd13b0665ad51d29c37","avatarUrl":"/avatars/5acc9b9bbecac3d567e927e2d8667b00.svg","isPro":false,"fullname":"Seungwon Lim","user":"sngwon","type":"user"},{"_id":"674ea8343209f1fbb76fe046","avatarUrl":"/avatars/8ad608df5eab4595f2c777b030d1ca6f.svg","isPro":false,"fullname":"Jinha Choi","user":"danielc174","type":"user"},{"_id":"6415c043486c7c9a5d151583","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6415c043486c7c9a5d151583/fUdYFh6iVh57swCkBEy-y.jpeg","isPro":false,"fullname":"Jiwoo Hong","user":"JW17","type":"user"},{"_id":"659f9445d5c4ea912705aa4d","avatarUrl":"/avatars/1d3297c3ccad48e5eb6c01e0640dc06d.svg","isPro":false,"fullname":"Heejeong 
Nam","user":"HazelNam","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"665b133508d536a8ac804f7d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/Uwi0OnANdTbRbHHQvGqvR.png","isPro":false,"fullname":"Paulson","user":"Pnaomi","type":"user"},{"_id":"63c3e898c7d7f4c63a51594a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1673783426328-noauth.png","isPro":false,"fullname":"Suzie Oh","user":"ohsuz","type":"user"},{"_id":"67aa9e215716d8c0207eab19","avatarUrl":"/avatars/0484849011e3169051784317f3dc5a96.svg","isPro":false,"fullname":"Joonyong Park","user":"JoonYong-Park","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary

Evaluation of LLMs on an academic manuscript verification dataset (SPOT) shows poor recall, precision, and reliability, indicating significant limitations in current AI's ability to replace human verification in scientific research.

Abstract
Recent advances in large language models (LLMs) have fueled the vision of
automated scientific discovery, often called AI Co-Scientists. To date, prior
work casts these systems as generative co-authors responsible for crafting
hypotheses, synthesizing code, or drafting manuscripts. In this work, we
explore a complementary application: using LLMs as verifiers to automate the
academic verification of scientific manuscripts. To that end, we
introduce SPOT, a dataset of 83 published papers paired with 91 errors
significant enough to prompt errata or retraction, cross-validated with actual
authors and human annotators. Evaluating state-of-the-art LLMs on SPOT, we find
that none surpasses 21.1% recall or 6.1% precision (o3 achieves the best
scores, with all others near zero). Furthermore, confidence estimates are
uniformly low, and across eight independent runs, models rarely rediscover the
same errors, undermining their reliability. Finally, qualitative analysis with
domain experts reveals that even the strongest models make mistakes resembling
student-level misconceptions derived from misunderstandings. These findings
highlight the substantial gap between current LLM capabilities and the
requirements for dependable AI-assisted academic verification.
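
To make the abstract's headline numbers concrete, below is a minimal sketch of how set-level precision/recall and run-to-run rediscovery might be scored for an error-detection benchmark like SPOT. Everything in it is an assumption for illustration: the `AnnotatedError` schema, the `matches` rule (same paper and same location), and the "found in at least two runs" threshold are hypothetical, not the paper's actual protocol.

```python
# Hypothetical sketch of SPOT-style scoring; the schema and matching rule
# below are illustrative assumptions, not the paper's actual protocol.
from dataclasses import dataclass

@dataclass(frozen=True)
class AnnotatedError:
    paper_id: str
    location: str      # e.g. "Eq. 3" or "Table 2" (assumed granularity)
    description: str

def matches(pred: AnnotatedError, gold: AnnotatedError) -> bool:
    # Assumed matching rule: same paper and same location. A real evaluation
    # would likely use human or model-based judgment of the description too.
    return pred.paper_id == gold.paper_id and pred.location == gold.location

def precision_recall(preds: list, golds: list) -> tuple:
    # Precision: fraction of predicted errors that match some annotated error.
    # Recall: fraction of annotated errors matched by some prediction.
    tp = sum(any(matches(p, g) for g in golds) for p in preds)
    recalled = sum(any(matches(p, g) for p in preds) for g in golds)
    precision = tp / len(preds) if preds else 0.0
    recall = recalled / len(golds) if golds else 0.0
    return precision, recall

def rediscovery_rate(runs: list, golds: list) -> float:
    # Fraction of annotated errors found in at least two independent runs,
    # a rough proxy for the run-to-run consistency the abstract describes.
    if not golds:
        return 0.0
    hits = [sum(any(matches(p, g) for p in run) for run in runs) for g in golds]
    return sum(h >= 2 for h in hits) / len(golds)

# Toy usage with made-up data:
gold = [AnnotatedError("paper-07", "Eq. 3", "sign error in derivation")]
run_a = [AnnotatedError("paper-07", "Eq. 3", "flipped sign")]
run_b = []  # same model, second run, finds nothing
print(precision_recall(run_a, gold))           # (1.0, 1.0) for this single run
print(rediscovery_rate([run_a, run_b], gold))  # 0.0: error found only once
```

Under this kind of scoring, a model can look strong on a lucky single run yet score near zero on rediscovery, which is consistent with the unreliability the abstract reports across eight independent runs.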