https://github.com/UCSC-VLAA/m1 🤗🤗🤗

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking](https://huggingface.co/papers/2503.19855) (2025)
* [Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?](https://huggingface.co/papers/2502.12215) (2025)
* [Theorem Prover as a Judge for Synthetic Data Generation](https://huggingface.co/papers/2502.13137) (2025)
* [Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models](https://huggingface.co/papers/2503.16419) (2025)
* [MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task](https://huggingface.co/papers/2502.11684) (2025)
* [MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning](https://huggingface.co/papers/2503.07459) (2025)
* [InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models](https://huggingface.co/papers/2503.06692) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`

\n","updatedAt":"2025-04-03T01:35:54.977Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7593270540237427},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2504.00869","authors":[{"_id":"67ecb60276900f68cd1df503","user":{"_id":"63318b2349a9563915469f3b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63318b2349a9563915469f3b/zlbeB2997i8YkoyOTb9FL.jpeg","isPro":false,"fullname":"Xiaoke Huang","user":"xk-huang","type":"user"},"name":"Xiaoke Huang","status":"admin_assigned","statusLastChangedAt":"2025-04-02T12:27:31.825Z","hidden":false},{"_id":"67ecb60276900f68cd1df504","user":{"_id":"641a58d791e3376a0579f011","avatarUrl":"/avatars/bb4745bb83ac0ee094b43e1e5562d743.svg","isPro":false,"fullname":"Juncheng Wu","user":"JunchengWu","type":"user"},"name":"Juncheng Wu","status":"admin_assigned","statusLastChangedAt":"2025-04-02T12:27:39.042Z","hidden":false},{"_id":"67ecb60276900f68cd1df505","name":"Hui Liu","hidden":false},{"_id":"67ecb60276900f68cd1df506","user":{"_id":"6465f6467ff8fcbef7d22513","avatarUrl":"/avatars/07992835c235fbb07016a0ea4f1d61cb.svg","isPro":false,"fullname":"Xianfeng Tang","user":"xianft","type":"user"},"name":"Xianfeng Tang","status":"admin_assigned","statusLastChangedAt":"2025-04-02T12:27:48.508Z","hidden":false},{"_id":"67ecb60276900f68cd1df507","user":{"_id":"66c7fb4ce2c92fe5b132f314","avatarUrl":"/avatars/22d915fa339a70803c5c748255250256.svg","isPro":false,"fullname":"Yuyin Zhou","user":"RitaCoding","type":"user"},"name":"Yuyin Zhou","status":"admin_assigned","statusLastChangedAt":"2025-04-02T12:27:54.857Z","hidden":false}],"publishedAt":"2025-04-01T14:57:43.000Z","submittedOnDailyAt":"2025-04-02T02:30:28.553Z","title":"m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning\n with Large Language Models","submittedOnDailyBy":{"_id":"63318b2349a9563915469f3b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63318b2349a9563915469f3b/zlbeB2997i8YkoyOTb9FL.jpeg","isPro":false,"fullname":"Xiaoke Huang","user":"xk-huang","type":"user"},"summary":"Test-time scaling has emerged as a powerful technique for enhancing the\nreasoning capabilities of large language models. However, its effectiveness in\nmedical reasoning remains uncertain, as the medical domain fundamentally\ndiffers from mathematical tasks in terms of knowledge representation and\ndecision-making processes. In this paper, we provide the first comprehensive\ninvestigation of test-time scaling for medical reasoning and present m1, a\nsimple yet effective approach that increases a model's medical reasoning\ncapability at inference. Our evaluation across diverse medical tasks\ndemonstrates that test-time scaling consistently enhances medical reasoning,\nenabling lightweight fine-tuned models under 10B parameters to establish new\nstate-of-the-art performance, while our 32B model rivals previous 70B-scale\nmedical LLMs. 
However, we identify an optimal reasoning token budget of\napproximately 4K, beyond which performance may degrade due to overthinking.\nBudget forcing, which extends test-time computation through iterative prompts,\nhelps models double-check answers but does not necessarily improve the overall\nmedical QA performance and, in some cases, even introduces errors into\npreviously correct responses. Our case-by-case analysis identifies insufficient\nmedical knowledge as a key bottleneck that prevents further performance gains\nthrough test-time scaling. We find that increasing data scale, improving data\nquality, and expanding model capacity consistently enhance medical knowledge\ngrounding, enabling continued performance improvements, particularly on\nchallenging medical benchmarks where smaller models reach saturation. These\nfindings underscore fundamental differences between medical and mathematical\nreasoning in LLMs, highlighting that enriched medical knowledge, other than\nincreased reasoning depth alone, is essential for realizing the benefits of\ntest-time scaling.","upvotes":10,"discussionId":"67ecb60376900f68cd1df550","githubRepo":"https://github.com/UCSC-VLAA/m1","githubRepoAddedBy":"user","ai_summary":"Test-time scaling enhances medical reasoning in large language models, enabling smaller models to achieve state-of-the-art performance with an optimal reasoning token budget, while emphasizing the importance of medical knowledge grounding.","ai_keywords":["test-time scaling","medical reasoning","knowledge grounding","medical knowledge","reasoning token budget","iterative prompts","budget forcing","medical QA","medical benchmarks"],"githubStars":48},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"63318b2349a9563915469f3b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63318b2349a9563915469f3b/zlbeB2997i8YkoyOTb9FL.jpeg","isPro":false,"fullname":"Xiaoke Huang","user":"xk-huang","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"665b133508d536a8ac804f7d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/Uwi0OnANdTbRbHHQvGqvR.png","isPro":false,"fullname":"Paulson","user":"Pnaomi","type":"user"},{"_id":"64495e66df4e6cb7eaecbd27","avatarUrl":"/avatars/fae2d08d423e20583b32090db3a64ca1.svg","isPro":false,"fullname":"Praveen Kumar P S","user":"Pravi21","type":"user"},{"_id":"67ee275be7defc1b86506110","avatarUrl":"/avatars/0836ab7ebab1e2b4a6bc914ace440783.svg","isPro":false,"fullname":"Zhao Liyang","user":"lyzhao2000","type":"user"},{"_id":"651c80a26ba9ab9b9582c273","avatarUrl":"/avatars/e963452eafd21f517d800f2e58e0f918.svg","isPro":false,"fullname":"siyeng feng","user":"siyengfeng","type":"user"},{"_id":"645d63c0ce72244df7b36be8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/645d63c0ce72244df7b36be8/09vhYAzgv1svwvQM4eIE9.jpeg","isPro":false,"fullname":"MoRezaGH","user":"Moreza009","type":"user"},{"_id":"660026b7573abbcdb975a34f","avatarUrl":"/avatars/93defd0e6274cfe8f124220c59ec2bed.svg","isPro":false,"fullname":"Juncheng 
Wu","user":"Chtholly17","type":"user"},{"_id":"64bbe9b236eb058cd9d6a5b9","avatarUrl":"/avatars/c7c01a3fa8809e73800392679abff6d5.svg","isPro":false,"fullname":"Kai Zuberbühler","user":"kaizuberbuehler","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models

Abstract
Test-time scaling enhances medical reasoning in large language models, enabling smaller models to achieve state-of-the-art performance with an optimal reasoning token budget, while emphasizing the importance of medical knowledge grounding.

Test-time scaling has emerged as a powerful technique for enhancing the reasoning capabilities of large language models. However, its effectiveness in medical reasoning remains uncertain, as the medical domain fundamentally differs from mathematical tasks in terms of knowledge representation and decision-making processes. In this paper, we provide the first comprehensive investigation of test-time scaling for medical reasoning and present m1, a simple yet effective approach that increases a model's medical reasoning capability at inference. Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning, enabling lightweight fine-tuned models under 10B parameters to establish new state-of-the-art performance, while our 32B model rivals previous 70B-scale medical LLMs.
However, we identify an optimal reasoning token budget of approximately 4K, beyond which performance may degrade due to overthinking.
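
To make the inference-time setup concrete, here is a minimal sketch of capped-budget generation using the Hugging Face `transformers` API. The checkpoint name, prompt, and exact cap below are illustrative assumptions, not the paper's confirmed configuration:

```python
# Minimal sketch: answer a medical question with the reasoning budget
# capped near the ~4K-token sweet spot reported in the abstract.
# The checkpoint name and prompt are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "UCSC-VLAA/m1-7B-23K"  # assumed checkpoint name
THINKING_BUDGET = 4096            # ~4K tokens: the optimum identified in the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

question = (
    "A 54-year-old man presents with crushing chest pain radiating to the "
    "left arm. What is the most likely diagnosis?"
)
messages = [{"role": "user", "content": question}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Cap generation at the budget: per the abstract, longer traces risk overthinking.
output = model.generate(input_ids, max_new_tokens=THINKING_BUDGET, do_sample=False)
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```

Capping `max_new_tokens` near the ~4K sweet spot is one simple way to stay out of the overthinking regime described above.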
Budget forcing, which extends test-time computation through iterative prompts, helps models double-check answers but does not necessarily improve overall medical QA performance and, in some cases, even introduces errors into previously correct responses.
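
The budget-forcing loop itself can be sketched roughly as follows, following the s1-style recipe of truncating the end-of-thinking marker and appending a continuation cue such as "Wait". The marker string, cue, and checkpoint name are assumptions for illustration, not a confirmed m1 API:

```python
# Sketch of budget forcing: when the model emits its end-of-thinking
# marker, truncate there and append a cue so it keeps reasoning.
# Marker, cue, and checkpoint name are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "UCSC-VLAA/m1-7B-23K"  # assumed checkpoint name
END_THINK = "</think>"            # assumed end-of-thinking marker
FORCE_ROUNDS = 2                  # extra "double-check" rounds to force

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def continue_generation(text: str, max_new_tokens: int = 1024) -> str:
    """Feed the running trace back in and let the model extend it."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0], skip_special_tokens=False)

prompt = "Question: What is the first-line treatment for anaphylaxis?\n<think>"
trace = continue_generation(prompt)

for _ in range(FORCE_ROUNDS):
    if END_THINK not in trace:
        break  # the model is still thinking; nothing to force
    # Cut at the end-of-thinking marker and append a cue to keep reasoning.
    trace = trace.split(END_THINK)[0] + " Wait,"
    trace = continue_generation(trace)

print(trace)
```

As the abstract notes, forced extra rounds make the model re-examine its answer but can also flip previously correct responses, so the number of rounds is better treated as a tunable than as "more is better".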
Our case-by-case analysis identifies insufficient medical knowledge as a key bottleneck that prevents further performance gains through test-time scaling. We find that increasing data scale, improving data quality, and expanding model capacity consistently enhance medical knowledge grounding, enabling continued performance improvements, particularly on challenging medical benchmarks where smaller models reach saturation. These findings underscore fundamental differences between medical and mathematical reasoning in LLMs, highlighting that enriched medical knowledge, rather than increased reasoning depth alone, is essential for realizing the benefits of test-time scaling.