Paper page - Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding
Authors: Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, Huan Wang
Published: 2024-11-06 (paper ID 2411.04282)
LaTent Reasoning Optimization (LaTRO) enhances LLMs' reasoning capabilities through variational optimization of latent distributions, improving zero-shot accuracy on complex reasoning tasks.
AI-generated summary
Large language models (LLMs) have shown impressive capabilities, but still
struggle with complex reasoning tasks requiring multiple steps. While
prompt-based methods like Chain-of-Thought (CoT) can improve LLM reasoning at
inference time, optimizing reasoning capabilities during training remains
challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled
framework that formulates reasoning as sampling from a latent distribution and
optimizes it via variational approaches. LaTRO enables LLMs to concurrently
improve both their reasoning process and ability to evaluate reasoning quality,
without requiring external feedback or reward models. We validate LaTRO through
experiments on GSM8K and ARC-Challenge datasets using multiple model
architectures. On GSM8K, LaTRO improves zero-shot accuracy by an average of
12.5% over base models and 9.6% over supervised fine-tuning across
Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. Our findings suggest that
pre-trained LLMs possess latent reasoning capabilities that can be unlocked and
enhanced through our proposed optimization approach in a self-improvement
manner. The code of LaTRO is available at
https://github.com/SalesforceAIResearch/LaTRO.
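To make "sampling from a latent distribution" and "variational approaches" concrete, the standard evidence lower bound for a latent-variable model can be written down. This is a sketch under assumed notation, not necessarily the exact objective used in the paper: x is the question, y the ground-truth answer, z a reasoning trajectory treated as the latent variable, p_θ the language model, and q any proposal distribution over trajectories (all of these symbols are introduced here for illustration).

```latex
\log p_\theta(y \mid x)
  = \log \mathbb{E}_{z \sim p_\theta(\cdot \mid x)}\big[ p_\theta(y \mid x, z) \big]
  \;\ge\; \mathbb{E}_{z \sim q(\cdot \mid x)}\big[ \log p_\theta(y \mid x, z) \big]
  - \mathrm{KL}\big( q(z \mid x) \,\|\, p_\theta(z \mid x) \big)
```

In this reading, the first term on the right rewards trajectories that make the correct answer likely, and the KL term keeps the trajectory sampler close to the reference distribution over trajectories.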
Chain-of-thought (CoT) prompting has demonstrated the strong reasoning capabilities of LLMs, but how do we train them to reason? We introduce LaTent Reasoning Optimization (LaTRO): a principled framework that formulates the reasoning trajectory as a latent variable and optimizes the reasoning via variational approaches.
LaTRO performs well: on GSM8K, it improves zero-shot accuracy by an average of 12.5% over three base models (Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B).
LaTRO is reward-model-free: surprisingly but reasonably, the log probability of producing the correct answer after the reasoning trajectory serves as a natural reward function, which we call "self-rewarding". There is no need to train an additional reward model as in RLHF!
LaTRO shifts inference-time scaling back to training time by self-generating multiple reasoning trajectories and self-rewarding them with the ground truth during each training update.
Free side benefit: LaTRO can compress the length of reasoning trajectories. On GSM8K, a model with 200 reasoning tokens achieves 78% of the performance of a model with 500 reasoning tokens.
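The self-rewarding idea described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code from the LaTRO repository: the function names, the toy probabilities, and in particular the leave-one-out baseline used to center the rewards are assumptions made here. In practice the per-token probabilities would come from the language model scoring the ground-truth answer after each sampled reasoning trajectory.

```python
import math
from typing import List, Sequence


def self_reward(answer_token_probs: Sequence[float]) -> float:
    """Log-probability of the ground-truth answer after one reasoning trajectory.

    answer_token_probs[i] is the (hypothetical) probability the model assigns
    to the i-th ground-truth answer token, conditioned on the question, the
    sampled reasoning trajectory, and the preceding answer tokens.
    """
    return sum(math.log(p) for p in answer_token_probs)


def leave_one_out_advantages(rewards: Sequence[float]) -> List[float]:
    """Center each reward on the mean reward of the other trajectories.

    Assumption for illustration: the source does not say how rewards are
    centered; a leave-one-out baseline is one common variance-reduction choice.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two trajectories for a leave-one-out baseline")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Toy numbers (made up): three trajectories sampled for the same question,
# each followed by the same two-token ground-truth answer.
sampled_answer_probs = [
    [0.9, 0.8],  # trajectory that makes the correct answer likely
    [0.5, 0.4],
    [0.1, 0.2],  # trajectory that leads away from the correct answer
]

rewards = [self_reward(probs) for probs in sampled_answer_probs]
advantages = leave_one_out_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.3f}  advantage={a:+.3f}")
```

Trajectories with a positive advantage would be reinforced in a policy-gradient style update and those with a negative advantage discouraged, which is how multiple self-generated trajectories can be compared without a separately trained reward model.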