Paper page - MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning
\n","updatedAt":"2025-06-03T05:21:59.126Z","author":{"_id":"64d45451c34a346181b130dd","avatarUrl":"/avatars/9bb8205b889337df5d321539c9b5d69d.svg","fullname":"Rui Yang","name":"Ray2333","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":15,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.311442106962204},"editors":["Ray2333"],"editorAvatarUrls":["/avatars/9bb8205b889337df5d321539c9b5d69d.svg"],"reactions":[],"isReport":false}},{"id":"683fa426e1ff7dc483dad236","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-06-04T01:40:54.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [LoRe: Personalizing LLMs via Low-Rank Reward Modeling](https://huggingface.co/papers/2504.14439) (2025)\n* [Energy-Based Reward Models for Robust Language Model Alignment](https://huggingface.co/papers/2504.13134) (2025)\n* [Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment](https://huggingface.co/papers/2504.12663) (2025)\n* [PARM: Multi-Objective Test-Time Alignment via Preference-Aware Autoregressive Reward Model](https://huggingface.co/papers/2505.06274) (2025)\n* [Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment](https://huggingface.co/papers/2505.10597) (2025)\n* [Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes](https://huggingface.co/papers/2505.04993) (2025)\n* [MOSLIM:Align with diverse preferences in prompts through reward classification](https://huggingface.co/papers/2505.20336) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
Please give a thumbs up to this comment if you found it helpful!
\n
If you want recommendations for any Paper on Hugging Face checkout this Space
\n
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend
\n","updatedAt":"2025-06-04T01:40:54.425Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6881287693977356},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2505.24846","authors":[{"_id":"683e82f2fa7ede4842f95214","name":"Jingyan Shen","hidden":false},{"_id":"683e82f2fa7ede4842f95215","user":{"_id":"66f8689725464a7989b75845","avatarUrl":"/avatars/43a61a528c5779103eaf5687ba44ee14.svg","isPro":false,"fullname":"Jiarui Yao","user":"FlippyDora","type":"user"},"name":"Jiarui Yao","status":"claimed_verified","statusLastChangedAt":"2025-06-03T08:40:58.100Z","hidden":false},{"_id":"683e82f2fa7ede4842f95216","user":{"_id":"64d45451c34a346181b130dd","avatarUrl":"/avatars/9bb8205b889337df5d321539c9b5d69d.svg","isPro":true,"fullname":"Rui Yang","user":"Ray2333","type":"user"},"name":"Rui Yang","status":"claimed_verified","statusLastChangedAt":"2025-06-03T08:40:16.572Z","hidden":true},{"_id":"683e82f2fa7ede4842f95217","name":"Yifan Sun","hidden":false},{"_id":"683e82f2fa7ede4842f95218","name":"Feng Luo","hidden":false},{"_id":"683e82f2fa7ede4842f95219","name":"Rui Pan","hidden":false},{"_id":"683e82f2fa7ede4842f9521a","name":"Tong Zhang","hidden":false},{"_id":"683e82f2fa7ede4842f9521b","name":"Han Zhao","hidden":false}],"publishedAt":"2025-05-30T17:44:28.000Z","submittedOnDailyAt":"2025-06-03T03:51:59.115Z","title":"MiCRo: Mixture Modeling and Context-aware Routing for Personalized\n Preference Learning","submittedOnDailyBy":{"_id":"64d45451c34a346181b130dd","avatarUrl":"/avatars/9bb8205b889337df5d321539c9b5d69d.svg","isPro":true,"fullname":"Rui Yang","user":"Ray2333","type":"user"},"summary":"Reward modeling is a key step in building safe foundation models when\napplying reinforcement learning from human feedback (RLHF) to align Large\nLanguage Models (LLMs). However, reward modeling based on the Bradley-Terry\n(BT) model assumes a global reward function, failing to capture the inherently\ndiverse and heterogeneous human preferences. Hence, such oversimplification\nlimits LLMs from supporting personalization and pluralistic alignment.\nTheoretically, we show that when human preferences follow a mixture\ndistribution of diverse subgroups, a single BT model has an irreducible error.\nWhile existing solutions, such as multi-objective learning with fine-grained\nannotations, help address this issue, they are costly and constrained by\npredefined attributes, failing to fully capture the richness of human values.\nIn this work, we introduce MiCRo, a two-stage framework that enhances\npersonalized preference learning by leveraging large-scale binary preference\ndatasets without requiring explicit fine-grained annotations. In the first\nstage, MiCRo introduces context-aware mixture modeling approach to capture\ndiverse human preferences. 
In the second stage, MiCRo integrates an online\nrouting strategy that dynamically adapts mixture weights based on specific\ncontext to resolve ambiguity, allowing for efficient and scalable preference\nadaptation with minimal additional supervision. Experiments on multiple\npreference datasets demonstrate that MiCRo effectively captures diverse human\npreferences and significantly improves downstream personalization.","upvotes":15,"discussionId":"683e82f3fa7ede4842f95246","ai_summary":"MiCRo, a two-stage framework, improves personalized preference learning for large language models by leveraging binary preference datasets and dynamically adapting mixture weights based on context, effectively capturing diverse human preferences.","ai_keywords":["Reward modeling","reinforcement learning from human feedback (RLHF)","Large Language Models (LLMs)","Bradley-Terry (BT) model","mixture distribution","personalization","pluralistic alignment","multi-objective learning","context-aware mixture modeling","online routing strategy"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"64d45451c34a346181b130dd","avatarUrl":"/avatars/9bb8205b889337df5d321539c9b5d69d.svg","isPro":true,"fullname":"Rui Yang","user":"Ray2333","type":"user"},{"_id":"66f9bb2dd5575ad6914756ce","avatarUrl":"/avatars/221d915a5386cbb11c007dc7c41d6b0a.svg","isPro":true,"fullname":"Feng Luo","user":"feng0929","type":"user"},{"_id":"6363a4f4ff4b318d1b775420","avatarUrl":"/avatars/c709a528db30fd81865de040710b4578.svg","isPro":false,"fullname":"Luo","user":"amandaa","type":"user"},{"_id":"64cb1ad1667f4f80852f6050","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64cb1ad1667f4f80852f6050/iOn5q_RyyBS99tObrO5Tc.png","isPro":false,"fullname":"Rui Pan","user":"research4pan","type":"user"},{"_id":"638828cd26952adc66f1bdbd","avatarUrl":"/avatars/0be4c4af8f3d1ed9529bc77839952dab.svg","isPro":false,"fullname":"Evangeline Shen","user":"Evangelinejy","type":"user"},{"_id":"66f8689725464a7989b75845","avatarUrl":"/avatars/43a61a528c5779103eaf5687ba44ee14.svg","isPro":false,"fullname":"Jiarui Yao","user":"FlippyDora","type":"user"},{"_id":"665e121c6007027038fd4005","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/sIVBJAGM-Kneq9KMf8aXb.png","isPro":false,"fullname":"Cheng Qian","user":"chengq9","type":"user"},{"_id":"6270ff726417aed8a7340c8b","avatarUrl":"/avatars/3f14913c55cc4fc78678ac43fb603e80.svg","isPro":false,"fullname":"Xiusi Chen","user":"XtremSup","type":"user"},{"_id":"65f906e5c3dbdcae83ff7aac","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65f906e5c3dbdcae83ff7aac/mdjiVkLDJgJcGLwv0rMe4.jpeg","isPro":false,"fullname":"Hongru Wang","user":"Merlin-Hongru","type":"user"},{"_id":"64eda4909e28bbb8996e4002","avatarUrl":"/avatars/02e1eef34a5bad6bd408bd9b2ba0dfb7.svg","isPro":false,"fullname":"Jiaxin Qin","user":"JiaxinQin-cc","type":"user"},{"_id":"68087b4f3f5cc7179ae959a7","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/l9skgMVKXJollx6BwNaWm.png","isPro":false,"fullname":"Xiaocheng Yang","user":"Xiaocheng-Yang","type":"user"},{"_id":"674088e62fbb98c431a3d3cb","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/nZDPIaQwejU__XY31OyCf.png","isPro":false,"fullname":"Serkan Can","user":"serkancancaglayan","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary

MiCRo, a two-stage framework, improves personalized preference learning for large language models by leveraging binary preference datasets and dynamically adapting mixture weights based on context, effectively capturing diverse human preferences.

Abstract
Reward modeling is a key step in building safe foundation models when
applying reinforcement learning from human feedback (RLHF) to align Large
Language Models (LLMs). However, reward modeling based on the Bradley-Terry
(BT) model assumes a global reward function, failing to capture the inherently
diverse and heterogeneous human preferences. Hence, such oversimplification
limits LLMs' ability to support personalization and pluralistic alignment.
Theoretically, we show that when human preferences follow a mixture
distribution of diverse subgroups, a single BT model has an irreducible error.
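To make this concrete, here is a minimal sketch (in our own notation, which may differ from the paper's) contrasting the single-reward Bradley-Terry likelihood with the mixture likelihood that arises when annotators come from K latent subgroups:

```latex
% Single Bradley--Terry model with one global reward r:
P(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)

% Preferences generated by K latent subgroups with weights \pi_k and rewards r_k:
P(y_w \succ y_l \mid x) = \sum_{k=1}^{K} \pi_k \, \sigma\big( r_k(x, y_w) - r_k(x, y_l) \big)
```

Since a convex combination of sigmoids is generally not the sigmoid of any single reward gap, no single reward function r can reproduce the mixture exactly; this is the intuition behind the irreducible error.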
While existing solutions, such as multi-objective learning with fine-grained
annotations, help address this issue, they are costly and constrained by
predefined attributes, failing to fully capture the richness of human values.
In this work, we introduce MiCRo, a two-stage framework that enhances
personalized preference learning by leveraging large-scale binary preference
datasets without requiring explicit fine-grained annotations. In the first
stage, MiCRo introduces a context-aware mixture modeling approach to capture
diverse human preferences. In the second stage, MiCRo integrates an online
routing strategy that dynamically adapts mixture weights based on the specific
context to resolve ambiguity, allowing for efficient and scalable preference
adaptation with minimal additional supervision. Experiments on multiple
preference datasets demonstrate that MiCRo effectively captures diverse human
preferences and significantly improves downstream personalization.
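As a rough illustration of how the two stages could fit together (a minimal sketch under our own assumptions, not the authors' implementation; all class and function names below are hypothetical), stage one fits K reward heads under a mixture Bradley-Terry likelihood, and stage two freezes those heads and adapts only a context-aware router on a small amount of user-specific preference feedback:

```python
# Minimal sketch (hypothetical names, not the MiCRo reference code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureRewardModel(nn.Module):
    def __init__(self, feat_dim: int, num_heads: int):
        super().__init__()
        self.heads = nn.Linear(feat_dim, num_heads)   # K scalar reward heads on shared features
        self.router = nn.Linear(feat_dim, num_heads)  # context-aware mixture weights from the prompt

    def rewards(self, resp_feats: torch.Tensor) -> torch.Tensor:
        # (batch, feat_dim) response features -> (batch, K) per-head rewards
        return self.heads(resp_feats)

    def weights(self, prompt_feats: torch.Tensor) -> torch.Tensor:
        # Mixture weights depend only on the prompt (context), not on the responses.
        return F.softmax(self.router(prompt_feats), dim=-1)

def mixture_bt_nll(model, prompt, chosen, rejected):
    """Stage 1 objective: negative log-likelihood of a mixture of Bradley-Terry models,
    P(chosen > rejected | prompt) = sum_k pi_k(prompt) * sigmoid(r_k(chosen) - r_k(rejected))."""
    pi = model.weights(prompt)                                 # (batch, K)
    margin = model.rewards(chosen) - model.rewards(rejected)   # (batch, K)
    prob = (pi * torch.sigmoid(margin)).sum(dim=-1).clamp_min(1e-8)
    return -torch.log(prob).mean()

def adapt_router_online(model, user_batches, lr=1e-3, steps=100):
    """Stage 2: keep the reward heads fixed and update only the router
    on a small amount of user- or context-specific preference feedback."""
    for param in model.heads.parameters():
        param.requires_grad_(False)
    opt = torch.optim.Adam(model.router.parameters(), lr=lr)
    for _ in range(steps):
        for prompt, chosen, rejected in user_batches:
            loss = mixture_bt_nll(model, prompt, chosen, rejected)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

At inference time, a personalized reward for a candidate response would then be the router-weighted combination of the head rewards, i.e. the dot product of the prompt's mixture weights with the per-head reward vector.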