Paper page - Thinking Augmented Pre-training
arxiv:2509.20186

Thinking Augmented Pre-training

Published on Sep 24, 2025 · Submitted by Liang Wang on Sep 26, 2025
Authors: Liang Wang, Nan Yang, Shaohan Huang, Li Dong, Furu Wei
Abstract

Thinking augmented pre-training improves data efficiency and performance of large language models by augmenting text with automatically generated thinking trajectories.

AI-generated summary

This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to 100B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of 3. For a 3B parameter model, it improves the post-training performance by over 10% on several challenging reasoning benchmarks.
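To make the core idea concrete, here is a minimal, hypothetical sketch of the data-construction step the abstract describes: each training document is paired with an automatically generated thinking trajectory, and the two are concatenated into one longer training sample. The function names, the `<think>` delimiters, and the placeholder generator are illustrative assumptions, not the paper's actual implementation; in a real pipeline `generate_thinking` would prompt a strong LLM.

```python
def generate_thinking(document: str) -> str:
    """Placeholder for an LLM call that produces a step-by-step rationale
    explaining the document. A real TPT pipeline would prompt a strong
    model here; this stub just fabricates a trajectory for illustration."""
    return f"Let me reason step by step about the claim: {document}"


def augment_with_thinking(document: str) -> str:
    """Concatenate the original text with its generated thinking trajectory.
    The <think> delimiters are a hypothetical convention for marking the
    appended reasoning so the augmented sample stays parseable."""
    thinking = generate_thinking(document)
    return f"{document}\n<think>\n{thinking}\n</think>"


# Augmenting a toy corpus: each sample becomes longer and, per the paper's
# argument, its hard-to-learn tokens are decomposed by the added reasoning.
corpus = [
    "The derivative of x^2 is 2x.",
    "Water boils at 100 degrees Celsius at sea level.",
]
augmented_corpus = [augment_with_thinking(doc) for doc in corpus]
for sample in augmented_corpus:
    print(sample)
```

The augmented corpus would then be tokenized and used for standard next-token-prediction pre-training; the paper reports applying this recipe at scales up to 100B tokens.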

Community

Paper author · Paper submitter
edited Sep 26, 2025

[Screenshot 2025-09-26 103516]

We introduce a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Reinforcement Learning on Pre-Training Data (2025)
* PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning (2025)
* Can Structured Templates Facilitate LLMs in Tackling Harder Tasks? An Exploration of Scaling Laws by Difficulty (2025)
* Large-Scale Diverse Synthesis for Mid-Training (2025)
* Apriel-Nemotron-15B-Thinker (2025)
* Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks (2025)
* CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out the librarian-bots Space.

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2509.20186 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2509.20186 in a Space README.md to link it from this page.

Collections including this paper 4