arxiv:2507.00432

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Published on Jul 1, 2025 · Submitted by Xiang Yue on Jul 2, 2025
Authors: Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue

Abstract

Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.

AI-generated summary

Reinforcement learning-tuned models generalize better across domains compared to supervised fine-tuned models in reasoning tasks, indicating a need to reconsider standard training methods.
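
The abstract mentions a token-space distribution shift analysis between a base model and its math-tuned variants. The snippet below is a rough, self-contained sketch of that idea, not the authors' released code: it scores how far a tuned checkpoint's next-token distribution drifts from its base model on a general-domain prompt. The tuned model ID, the probe prompt, and the KL direction are illustrative assumptions.

```python
# Rough illustration only: the tuned checkpoint name, probe prompt, and KL
# direction are assumptions, not the paper's released analysis code.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "Qwen/Qwen3-14B"                # base checkpoint (assumed)
TUNED_ID = "your-org/qwen3-14b-math-sft"  # hypothetical math-SFT checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16)
tuned = AutoModelForCausalLM.from_pretrained(TUNED_ID, torch_dtype=torch.bfloat16)

@torch.no_grad()
def mean_next_token_kl(text: str) -> float:
    """Average KL(base || tuned) over next-token distributions for `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logp_base = F.log_softmax(base(ids).logits.float(), dim=-1)
    logp_tuned = F.log_softmax(tuned(ids).logits.float(), dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input),
    # so this measures how far the tuned model drifts from the base model.
    kl = F.kl_div(logp_tuned, logp_base, log_target=True, reduction="none")
    return kl.sum(dim=-1).mean().item()  # sum over vocab, mean over positions

# Probe with a general-domain (non-math) prompt: large values suggest the
# tuned model's output distribution has drifted on everyday text.
print(mean_next_token_kl("Write a polite email declining a meeting invite."))
```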

Community

Paper author · Paper submitter

Takeaways:

  • We evaluate over 20 open-weight reasoning-tuned models and surprisingly find that most models that succeed in math fail to transfer their gains to other domains.
  • We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities.
  • Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure (a minimal sketch of this kind of measurement follows below).
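
For the latent-space side, here is a minimal sketch of one way to quantify representation drift: compare the two checkpoints' final-layer hidden states on the same general-domain prompts. The tuned checkpoint name, the probe prompts, and the mean-pooling choice are assumptions for illustration, not the paper's exact pipeline.

```python
# Illustrative only: checkpoint names, prompts, and mean-pooling are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "Qwen/Qwen3-14B"               # base checkpoint (assumed)
TUNED_ID = "your-org/qwen3-14b-math-rl"  # hypothetical math-RL checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID, output_hidden_states=True)
tuned = AutoModelForCausalLM.from_pretrained(TUNED_ID, output_hidden_states=True)

@torch.no_grad()
def representation_similarity(prompt: str) -> float:
    """Cosine similarity of mean-pooled final-layer states (1.0 = no drift)."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    h_base = base(ids).hidden_states[-1].mean(dim=1)   # (1, hidden_dim)
    h_tuned = tuned(ids).hidden_states[-1].mean(dim=1)
    return F.cosine_similarity(h_base, h_tuned, dim=-1).item()

# General-domain probes: persistently low similarity would indicate the kind
# of representation drift the paper associates with SFT-style tuning.
for p in ["Explain photosynthesis to a ten-year-old.",
          "Summarize the pros and cons of remote work."]:
    print(f"{representation_similarity(p):.3f}  {p}")
```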


Models citing this paper 3

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2507.00432 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2507.00432 in a Space README.md to link it from this page.

Collections including this paper 24