arxiv:2407.18248

Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning

Published on Jul 25, 2024 · Submitted by Tianduo Wang on Jul 30, 2024
Authors: Tianduo Wang, Shichen Li, Wei Lu

Abstract

Enhancing small-scale language models for mathematical reasoning through self-training and preference learning, resulting in improved performance and reduced costs compared to large proprietary models.

AI-generated summary

Effective training of language models (LMs) for mathematical reasoning tasks demands high-quality supervised fine-tuning data. Besides obtaining annotations from human experts, a common alternative is sampling from larger and more powerful LMs. However, this knowledge distillation approach can be costly and unstable, particularly when relying on closed-source, proprietary LMs like GPT-4, whose behaviors are often unpredictable. In this work, we demonstrate that the reasoning abilities of small-scale LMs can be enhanced through self-training, a process where models learn from their own outputs. We also show that the conventional self-training can be further augmented by a preference learning algorithm called Direct Preference Optimization (DPO). By integrating DPO into self-training, we leverage preference data to guide LMs towards more accurate and diverse chain-of-thought reasoning. We evaluate our method across various mathematical reasoning tasks using different base models. Our experiments show that this approach not only improves LMs' reasoning performance but also offers a more cost-effective and scalable solution compared to relying on large proprietary LMs.
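
To make the DPO component concrete, below is a minimal sketch of the standard DPO objective in PyTorch. This is an illustration only, not code from the DPO-ST repository; the function name, arguments, and the beta value are assumptions. In the self-training setting described above, the chosen/rejected pairs would plausibly be built from the model's own sampled chain-of-thought solutions, labeled by whether the final answer matches the reference.

```python
# Illustrative sketch only: the standard DPO objective, not code from the
# DPO-ST repository. Inputs are assumed to be summed per-sequence
# log-probabilities of the chosen (correct) and rejected (incorrect)
# chain-of-thought completions under the policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs."""
    # Implicit rewards: beta-scaled log-ratio of policy to reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to assign a higher implicit reward to the correct chain.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example usage with dummy log-probabilities (batch of 2 preference pairs).
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -11.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-14.5, -10.5]))
```

A self-training round could then alternate between sampling and verifying new solutions, a DPO step like the one above, and supervised fine-tuning on the verified rationales; see the paper and the linked repository for the actual procedure and hyperparameters.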

Community

Tianduo Wang (paper author and submitter)

Code and data are available here: https://github.com/TianduoWang/DPO-ST

nielsr (reply): cc @kashif

kashif (reply): thanks!

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs (2024) - https://huggingface.co/papers/2406.18629
* PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs (2024) - https://huggingface.co/papers/2406.02886
* Teaching Language Models to Self-Improve by Learning from Language Feedback (2024) - https://huggingface.co/papers/2406.07168
* Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning (2024) - https://huggingface.co/papers/2406.12050
* Teaching-Assistant-in-the-Loop: Improving Knowledge Distillation from Imperfect Teacher Models in Low-Budget Scenarios (2024) - https://huggingface.co/papers/2406.05322

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2407.18248 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2407.18248 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2407.18248 in a Space README.md to link it from this page.

Collections including this paper 10