Paper page - Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

Papers
arxiv:2502.13962

Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

Published on Feb 19, 2025 · Submitted by AK on Feb 20, 2025
Authors: William Jurayj, Jeffrey Cheng, Benjamin Van Durme

Abstract

Evaluating large language models includes confidence thresholding to manage response risk at test time, showing that increased compute improves both accuracy and confidence.

AI-generated summary

Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
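The abstract's core idea can be illustrated with a small sketch. This is not the paper's implementation; the function name, the scoring rule, and the example confidences are illustrative assumptions. It shows selective answering: responses whose confidence falls below a threshold become abstentions, and evaluation under non-zero response risk penalizes wrong answers while abstentions score zero.

```python
# Hypothetical sketch of selective question answering with a confidence
# threshold and a risk-aware score; not the authors' actual method.

def selective_score(predictions, threshold, wrong_penalty=1.0):
    """Score a list of (is_correct, confidence) pairs.

    Answers with confidence below `threshold` are abstentions (score 0);
    correct answers score +1; incorrect answers score -wrong_penalty.
    Returns the average over all questions.
    """
    total = 0.0
    for is_correct, confidence in predictions:
        if confidence < threshold:
            continue  # abstain: no reward, no penalty
        total += 1.0 if is_correct else -wrong_penalty
    return total / len(predictions)

# Four answers with varying confidence.
preds = [(True, 0.9), (False, 0.4), (True, 0.6), (False, 0.8)]
print(selective_score(preds, threshold=0.5))  # thresholding filters the low-confidence wrong answer
print(selective_score(preds, threshold=0.0))  # answer everything (the zero-abstention paradigm)
```

Setting `wrong_penalty` to zero recovers the standard accuracy-style evaluation the paper argues against; raising it models settings where a wrong answer is costlier than silence.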

Community

Paper author
This comment has been hidden
Paper submitter

[Image attachment: Screenshot 2025-02-19 at 11.34.31 PM.png]

Paper author

Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning (2025)
* Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong (2025)
* Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (2025)
* Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning (2025)
* ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning (2025)
* Benchmarking Large Language Models via Random Variables (2025)
* Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2502.13962 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2502.13962 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2502.13962 in a Space README.md to link it from this page.

Collections including this paper 4