Paper page - Train Long, Think Short: Curriculum Learning for Efficient Reasoning
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* AALC: Large Language Model Efficient Reasoning via Adaptive Accuracy-Length Control (2025) - https://huggingface.co/papers/2506.20160
* Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs (2025) - https://huggingface.co/papers/2507.02076
* Logit Arithmetic Elicits Long Reasoning Capabilities Without Training (2025) - https://huggingface.co/papers/2507.12759
* Exploring and Exploiting the Inherent Efficiency within Large Reasoning Models for Self-Guided Efficiency Enhancement (2025) - https://huggingface.co/papers/2506.15647
* Enhancing Reasoning Capabilities in SLMs with Reward Guided Dataset Distillation (2025) - https://huggingface.co/papers/2507.00054
* MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs (2025) - https://huggingface.co/papers/2507.02851
* A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning (2025) - https://huggingface.co/papers/2507.14295
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
\n","updatedAt":"2025-08-14T01:38:32.989Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7235831618309021},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2508.08940","authors":[{"_id":"689c3f0cfab6fdd2e52ac93a","user":{"_id":"642b51385bf2355d02a23d15","avatarUrl":"/avatars/87985347643b2647555f2453fa4d94fb.svg","isPro":false,"fullname":"Hasan Abed Al Kader Hammoud","user":"hammh0a","type":"user"},"name":"Hasan Abed Al Kader Hammoud","status":"claimed_verified","statusLastChangedAt":"2025-08-14T13:38:43.096Z","hidden":false},{"_id":"689c3f0cfab6fdd2e52ac93b","name":"Kumail Alhamoud","hidden":false},{"_id":"689c3f0cfab6fdd2e52ac93c","name":"Abed Hammoud","hidden":false},{"_id":"689c3f0cfab6fdd2e52ac93d","name":"Elie Bou-Zeid","hidden":false},{"_id":"689c3f0cfab6fdd2e52ac93e","name":"Marzyeh Ghassemi","hidden":false},{"_id":"689c3f0cfab6fdd2e52ac93f","name":"Bernard Ghanem","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/642b51385bf2355d02a23d15/4b_TnX67UO9R8vBKO9NkT.png"],"publishedAt":"2025-08-12T13:48:03.000Z","submittedOnDailyAt":"2025-08-13T06:01:08.584Z","title":"Train Long, Think Short: Curriculum Learning for Efficient Reasoning","submittedOnDailyBy":{"_id":"642b51385bf2355d02a23d15","avatarUrl":"/avatars/87985347643b2647555f2453fa4d94fb.svg","isPro":false,"fullname":"Hasan Abed Al Kader Hammoud","user":"hammh0a","type":"user"},"summary":"Recent work on enhancing the reasoning abilities of large language models\n(LLMs) has introduced explicit length control as a means of constraining\ncomputational cost while preserving accuracy. However, existing approaches rely\non fixed-length training budgets, which do not take advantage of the natural\nprogression from exploration to compression during learning. In this work, we\npropose a curriculum learning strategy for length-controlled reasoning using\nGroup Relative Policy Optimization (GRPO). Our method starts with generous\ntoken budgets and gradually tightens them over training, encouraging models to\nfirst discover effective solution strategies and then distill them into more\nconcise reasoning traces. We augment GRPO with a reward function that balances\nthree signals: task correctness (via verifier feedback), length efficiency, and\nformatting adherence (via structural tags). Experiments on GSM8K, MATH500,\nSVAMP, College Math, and GSM+ demonstrate that curriculum-based training\nconsistently outperforms fixed-budget baselines at the same final budget,\nachieving higher accuracy and significantly improved token efficiency. We\nfurther ablate the impact of reward weighting and decay schedule design,\nshowing that progressive constraint serves as a powerful inductive bias for\ntraining efficient reasoning models. 
Our code and checkpoints are released at:\nhttps://github.com/hammoudhasan/curriculum_grpo.","upvotes":27,"discussionId":"689c3f0cfab6fdd2e52ac940","githubRepo":"https://github.com/hammoudhasan/curriculum_grpo","githubRepoAddedBy":"auto","ai_summary":"A curriculum learning strategy using Group Relative Policy Optimization (GRPO) enhances the reasoning abilities of large language models by progressively tightening token budgets, improving accuracy and token efficiency.","ai_keywords":["Group Relative Policy Optimization (GRPO)","curriculum learning","token budgets","verifier feedback","structural tags","GSM8K","MATH500","SVAMP","College Math","GSM+"],"githubStars":7},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"642b51385bf2355d02a23d15","avatarUrl":"/avatars/87985347643b2647555f2453fa4d94fb.svg","isPro":false,"fullname":"Hasan Abed Al Kader Hammoud","user":"hammh0a","type":"user"},{"_id":"65dbddf890eb0cec07dcb565","avatarUrl":"/avatars/0dcdf0925510711b90e24781bdea79e9.svg","isPro":true,"fullname":"Kumail Alhamoud","user":"m1k2zoo","type":"user"},{"_id":"672b17efcb9a8b2f215a8330","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/ROlQr69Tn-Q7aDG3c3gdB.png","isPro":false,"fullname":"Karen Sanchez","user":"ksanchez84","type":"user"},{"_id":"6684c10e42cbc4ce0b562348","avatarUrl":"/avatars/5c72d7c269f8977a42e55bfc15fc2f8b.svg","isPro":false,"fullname":"Kenny Yang","user":"kennyyang","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"6620e21d11561bf979229d9f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6620e21d11561bf979229d9f/qzMMJI4PJfEkJDYK4mWZM.jpeg","isPro":false,"fullname":"Carlos Hinojosa","user":"carlosh93","type":"user"},{"_id":"666ddb45c0f3d5afc27e85ba","avatarUrl":"/avatars/dce98fc77bd8cf9e348f2d91bc3c0225.svg","isPro":false,"fullname":"Bing Li","user":"bing-li-ai","type":"user"},{"_id":"668f485c54a9166eda657112","avatarUrl":"/avatars/66e69109f5223970832db65f47803ee4.svg","isPro":false,"fullname":"高喆","user":"tsaganshosg","type":"user"},{"_id":"66feb4a4d74232f132e9a4b0","avatarUrl":"/avatars/9c9e1c0c03ce24af8afd351574938844.svg","isPro":false,"fullname":"Aznaur Aliev","user":"Aznaur","type":"user"},{"_id":"66c72cbea575572fcb8942c2","avatarUrl":"/avatars/217b19ed9c17329b63e232c42010b648.svg","isPro":false,"fullname":"Hani Al Majed","user":"haniaa2","type":"user"},{"_id":"65f6eff66396309f02a18dab","avatarUrl":"/avatars/3b315ecbe2815859ca7fb16d277740c8.svg","isPro":false,"fullname":"Harethah Abu Shairah","user":"HarethahMo","type":"user"},{"_id":"64d4615cf8082bf19b916492","avatarUrl":"/avatars/8e1b59565ec5e4b31090cf1b911781b9.svg","isPro":false,"fullname":"wongyukim","user":"wongyukim","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary

A curriculum learning strategy using Group Relative Policy Optimization (GRPO) enhances the reasoning abilities of large language models by progressively tightening token budgets, improving accuracy and token efficiency.

Abstract
Recent work on enhancing the reasoning abilities of large language models
(LLMs) has introduced explicit length control as a means of constraining
computational cost while preserving accuracy. However, existing approaches rely
on fixed-length training budgets, which do not take advantage of the natural
progression from exploration to compression during learning. In this work, we
propose a curriculum learning strategy for length-controlled reasoning using
Group Relative Policy Optimization (GRPO). Our method starts with generous
token budgets and gradually tightens them over training, encouraging models to
first discover effective solution strategies and then distill them into more
concise reasoning traces. We augment GRPO with a reward function that balances
three signals: task correctness (via verifier feedback), length efficiency, and
formatting adherence (via structural tags). Experiments on GSM8K, MATH500,
SVAMP, College Math, and GSM+ demonstrate that curriculum-based training
consistently outperforms fixed-budget baselines at the same final budget,
achieving higher accuracy and significantly improved token efficiency. We
further ablate the impact of reward weighting and decay schedule design,
showing that progressive constraint serves as a powerful inductive bias for
training efficient reasoning models. Our code and checkpoints are released at:
https://github.com/hammoudhasan/curriculum_grpo.
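The abstract names the three reward signals and the shrinking token budget, but not their exact functional forms or weights. The sketch below is only a minimal illustration of the idea: the exponential decay schedule, the overflow-based length term, and the weights w_acc, w_len, and w_fmt are assumptions for exposition, not the paper's implementation.

```python
import math

def budget_at_step(step, total_steps, b_start=512, b_end=128):
    """Token budget that decays exponentially from b_start to b_end over training.
    Illustrative schedule only; the paper ablates several decay designs."""
    frac = min(step / max(total_steps, 1), 1.0)
    return b_start * (b_end / b_start) ** frac

def curriculum_reward(num_tokens, is_correct, has_valid_tags, budget,
                      w_acc=1.0, w_len=0.5, w_fmt=0.1):
    """Weighted sum of the three signals named in the abstract: correctness
    (verifier feedback), length efficiency vs. the current budget, and
    formatting adherence (structural tags). Weights and forms are assumed."""
    r_acc = 1.0 if is_correct else 0.0
    overflow = max(num_tokens - budget, 0.0)   # only over-budget tokens are penalized
    r_len = math.exp(-overflow / budget)       # 1.0 when the trace fits the budget
    r_fmt = 1.0 if has_valid_tags else 0.0
    return w_acc * r_acc + w_len * r_len + w_fmt * r_fmt

# A 400-token correct trace is fully rewarded early (budget 512) but is
# increasingly penalized as the budget decays toward 128.
for step in (0, 500, 1000):
    b = budget_at_step(step, total_steps=1000)
    print(f"step={step:4d}  budget={b:6.1f}  reward={curriculum_reward(400, True, True, b):.3f}")
```

In a GRPO-style setup, a scalar reward like this would typically be computed for each sampled completion in a group and normalized against the group mean to form relative advantages; the released code at https://github.com/hammoudhasan/curriculum_grpo contains the authors' actual formulation.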