Paper page - Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs

arXiv: https://arxiv.org/abs/2602.10388
  • Code: https://github.com/Zhongzhi660/FAC-Synthesis
  • Website: https://website-sigma-three-35.vercel.app/
  • Demo: https://huggingface.co/spaces/Zhongzhi1228/synthesis-demo (Work in Progress)





    Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs

    Published on Feb 11 · Submitted by ZhongzhiLi on Feb 16
    Authors: Zhongzhi Li, Xuansheng Wu, Yijiang Li, Lijie Hu, Ninghao Liu

    Abstract

    Feature Activation Coverage measures data diversity in an interpretable feature space and enables diversity-driven data synthesis that improves downstream performance across multiple language model architectures.

    AI-generated summary

    The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC), which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.

    Community

    Paper author Paper submitter
    edited 8 days ago

    Less is Enough shows that better data matters more than more data.

    Instead of generating massive amounts of synthetic data, we look inside the model’s hidden features to find what is truly missing. We introduce Feature Activation Coverage (FAC) to measure which important internal features are underrepresented, then generate new samples that specifically activate those features.
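    As a loose illustration of the idea above, FAC can be sketched as the fraction of sparse-autoencoder features that fire on at least one sample in a dataset. Everything below (function names, the zero activation threshold) is an illustrative assumption, not the paper's exact formulation.

```python
# Hypothetical sketch of Feature Activation Coverage (FAC): the share of
# SAE features activated by at least one sample. Names and the threshold
# are illustrative assumptions, not the paper's implementation.
import numpy as np

def feature_activation_coverage(activations: np.ndarray, threshold: float = 0.0) -> float:
    """activations: (num_samples, num_features) sparse-autoencoder feature
    activations for a dataset. A feature counts as covered if it exceeds
    `threshold` on at least one sample."""
    covered = (activations > threshold).any(axis=0)
    return float(covered.mean())

def missing_features(activations: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Indices of features that never activate: candidates to target with
    synthetic data."""
    covered = (activations > threshold).any(axis=0)
    return np.flatnonzero(~covered)
```

    Under this sketch, raising FAC means adding samples that light up previously silent features, rather than adding more samples that re-activate the same ones.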

    Result: FAC exhibits a strong correlation with downstream performance. Increasing FAC brings significantly larger gains than simply adding more samples. With only 2K synthetic samples, we match MAGPIE’s performance on AlpacaEval 2.0 (which uses 300K samples) and outperform strong baselines across instruction following, toxicity detection, reward modeling, and behavior steering.

    Interestingly, we further discover a shared, interpretable feature space across LLaMA, Mistral, and Qwen, which enables effective cross-model knowledge transfer between different model families.

    Paper author Paper submitter

    We introduce Feature Activation Coverage (FAC), an interpretable metric that measures data diversity in the feature space of LLMs rather than surface text variation.
    Building on FAC, we propose a FAC-guided data synthesis framework that identifies missing functional features and generates targeted synthetic data to fill coverage gaps.
    Experiments across reward modeling, toxicity detection, and controllable generation show that FAC-guided synthesis significantly improves downstream performance with much less data.
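    The FAC-guided loop described above (identify missing features, then fill the coverage gaps) can be sketched as follows. Here `encode`, `generate`, and `describe` are stand-ins for the SAE encoder, an LLM-based generator, and a feature-labeling step; none of these names come from the paper's code.

```python
# Minimal, hypothetical sketch of FAC-guided synthesis. All callables are
# stand-ins supplied by the caller, not the paper's actual components.
import numpy as np

def fac_guided_synthesis(seed_texts, encode, generate, describe,
                         target_fac=1.0, max_rounds=5, threshold=0.0):
    """encode: texts -> (n_samples, n_features) SAE activations;
    describe: feature index -> natural-language feature description;
    generate: feature description -> one targeted synthetic sample."""
    data = list(seed_texts)
    for _ in range(max_rounds):
        acts = encode(data)
        covered = (acts > threshold).any(axis=0)
        if covered.mean() >= target_fac:          # coverage gap closed
            break
        for f in np.flatnonzero(~covered):        # each missing feature
            data.append(generate(describe(f)))    # targeted synthetic sample
    return data
```

    The loop stops early once coverage reaches the target, which is consistent with the paper's headline claim that a small, well-targeted synthetic set can suffice.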

    This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

    The following papers were recommended by the Semantic Scholar API:

    • Finding the Translation Switch: Discovering and Exploiting the Task-Initiation Features in LLMs (2026): https://huggingface.co/papers/2601.11019
    • UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models (2025): https://huggingface.co/papers/2512.17385
    • Flatter Tokens are More Valuable for Speculative Draft Model Training (2026): https://huggingface.co/papers/2601.18902
    • Can abstract concepts from LLM improve SLM performance? (2025): https://huggingface.co/papers/2512.19069
    • Code Mixologist: A Practitioner's Guide to Building Code-Mixed LLMs (2026): https://huggingface.co/papers/2602.11181
    • Chunky Post-Training: Data Driven Failures of Generalization (2026): https://huggingface.co/papers/2602.05910
    • Steering Language Models Before They Speak: Logit-Level Interventions (2026): https://huggingface.co/papers/2601.10960

    Please give a thumbs up to this comment if you found it helpful!

    If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

    You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

    This is a fascinating approach! 'Less-is-Enough' just goes to show that it's not about brute-forcing quantity, and there is still so much room to explore with synthetic data in LLM fine-tuning. Great work!

    arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/less-is-enough-synthesizing-diverse-data-in-feature-space-of-llms-2195-a577695c

    • Executive Summary
    • Detailed Breakdown
    • Practical Applications


    Models citing this paper 3

    Datasets citing this paper 0

    No dataset linking this paper

    Cite arxiv.org/abs/2602.10388 in a dataset README.md to link it from this page.

    Spaces citing this paper 0

    No Space linking this paper

    Cite arxiv.org/abs/2602.10388 in a Space README.md to link it from this page.

    Collections including this paper 3