arxiv:2602.05400

OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Published on Feb 5 · Submitted by Xuan Ouyang on Feb 11
#1 Paper of the day
Authors: Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang

Abstract

OPUS is a dynamic data selection framework that improves pre-training efficiency by scoring data candidates based on optimizer-induced update projections in a stable proxy-derived target space, achieving superior performance with reduced computational overhead.

AI-generated summary

As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training of GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.

Community

Paper author · Paper submitter

In this paper, we argue that LLM pre-training is entering a “data-wall” regime where readily available high-quality public text is approaching exhaustion, so progress must shift from more tokens to better tokens chosen at the right time. While most existing pipelines either (i) apply static, training-agnostic quality filters or (ii) use dynamic selection criteria defined in raw gradient space, modern LLMs are actually trained with adaptive optimizers like AdamW or Muon whose preconditioning reshapes the effective update direction—creating a fundamental mismatch between “how we score data” and “how training truly updates the model.” To bridge this gap, we introduce OPUS (Optimizer-induced Projected Utility Selection), a dynamic selection framework that defines data utility directly in the optimizer-induced update space: a sample is valuable insofar as its optimizer-shaped effective update aligns with the descent direction of a stable, high-quality target distribution (our proxy).


Concretely, OPUS operationalizes this idea through a principled objective, a scalable estimator, and a diversity-preserving selection rule. Our key contributions are: (1) an optimizer-aware utility for dynamic selection, with closed-form approximations for effective update directions under AdamW and Muon, aligning scoring with real training geometry; (2) BENCH-PROXY, an in-distribution proxy construction method that retrieves benchmark-aligned samples from the pre-training corpus to stabilize the target direction; (3) scalable utility estimation using the Ghost technique + CountSketch projections to avoid per-sample gradient materialization; and (4) Boltzmann sampling with redundancy control to prevent diversity collapse under non-stationary streams. Empirically, OPUS delivers strong data/compute efficiency: it reports only ~4.7% additional compute overhead for selection while achieving large gains across datasets, optimizers, and scales—including improved accuracy (+2.2% average over 10 benchmarks and 8× compute reduction in one highlighted setting), outperforming industrial static/dynamic baselines and even matching or exceeding much longer-token training in several regimes.
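To make the optimizer-aware utility concrete, here is a minimal, hypothetical sketch (not the paper's implementation): per-sample gradients and the proxy gradient are assumed to be already compressed into a shared low-dimensional space, AdamW's second-moment estimate stands in for the diagonal preconditioner, and the score is the projection of each sample's preconditioned update onto the proxy's preconditioned descent direction. All names and shapes are illustrative assumptions.

```python
# Illustrative sketch of optimizer-aware projected utility (AdamW case).
# Assumes gradients are already compressed (e.g., via a sketching projection),
# so per_sample_grads is (n, d) while proxy_grad and second_moment are (d,).
import torch

def adamw_effective_direction(g: torch.Tensor, v: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Approximate AdamW's effective update direction by elementwise
    preconditioning with 1/(sqrt(v) + eps); bias correction and weight
    decay are omitted for brevity."""
    return g / (v.sqrt() + eps)

def projected_utility(per_sample_grads: torch.Tensor,
                      proxy_grad: torch.Tensor,
                      second_moment: torch.Tensor) -> torch.Tensor:
    """Score each candidate by how well its optimizer-shaped update aligns
    with the proxy's descent direction (larger score = better aligned)."""
    target = adamw_effective_direction(proxy_grad, second_moment)
    target = target / target.norm().clamp_min(1e-12)
    effective = per_sample_grads / (second_moment.sqrt() + 1e-8)  # broadcasts over n
    return effective @ target  # (n,) utility scores
```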

Main Results of OPUS

Overview

OPUS (Optimizer-induced Projected Utility Selection) is a dynamic data selection framework for LLM pre-training that aligns data selection with the optimizer's actual update geometry (supporting both AdamW and Muon optimizers). It achieves superior data efficiency with minimal computational overhead.
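For readers unfamiliar with Muon's "matrix preconditioning", the sketch below shows the kind of orthogonalization Muon applies to a gradient matrix. The Newton-Schulz coefficients follow the commonly published Muon reference values; this is background illustration only, not the paper's closed-form approximation of the effective update.

```python
# Background sketch: Muon-style orthogonalization of a 2D gradient matrix via
# a quintic Newton-Schulz iteration (illustrative, not the paper's code).
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Map a gradient matrix toward the nearest (semi-)orthogonal matrix,
    which is the shape of a Muon-style effective update."""
    a, b, c = 3.4445, -4.7750, 2.0315   # widely used quintic coefficients
    X = G / (G.norm() + 1e-7)           # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```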

Key Quantitative Results

1. Pre-training from Scratch (FineWeb, 30B tokens)

Figure 1 & Table 3: OPUS outperforms all compute-matched baselines across model scales and optimizers:

  • GPT-2 XL (Muon): Achieves 41.75% average accuracy vs. 40.29% for random selection (1.46 point improvement), outperforming even the 60B-token random baseline (41.29%) while using only half the tokens.
  • GPT-2 Large (AdamW): Achieves 41.43% vs. 39.29% for random (2.14 point improvement).
  • Cross-optimizer consistency: OPUS achieves best compute-matched performance under both Muon (matrix preconditioning) and AdamW (diagonal preconditioning), validating that optimizer-aware selection matters.

2. Robustness to Data Quality (FineWeb-Edu)

Table 4: OPUS demonstrates remarkable efficiency even with lower-quality data:

  • When selecting from mid-quality data (score 3), OPUS matches or exceeds static baselines trained on high-quality data (scores 4-5).
  • GPT-2 XL (Muon): OPUS achieves 44.99% average accuracy when selecting from score-3 data, outperforming all baselines trained on the superior score-4/5 partition (best baseline: 42.59%).

3. Continued Pre-training Efficiency

Figure 5 & Figure 6: On Qwen3-8B-Base continued pre-training with SciencePedia:

  • OPUS achieves 6× data efficiency: Using only 0.5B tokens, OPUS outperforms full training with 3B tokens on science benchmarks (OlympicArena and SciAssess).
  • Superior performance in specialized domains (physics, chemistry, biology, medicine, materials science).

4. Computational Efficiency

Figure 7:

  • Overhead: Only 4.7% additional compute cost compared to random selection.
  • Contrast with naive dynamic selection implementations that incur >3.5× slowdown.
  • Achieved through the Ghost technique with CountSketch projections (a toy projection sketch follows this list).
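
As a rough illustration of how a CountSketch projection compresses per-sample gradients without materializing them at full dimension, here is a toy, self-contained version; the Ghost technique itself (which avoids forming per-sample gradients layer by layer) is not reproduced, and the dimensions and seed are illustrative assumptions.

```python
# Toy CountSketch projection: each input coordinate is hashed to one output
# bucket with a random sign, so inner products are preserved in expectation.
import torch

class CountSketch:
    def __init__(self, in_dim: int, out_dim: int, seed: int = 0):
        g = torch.Generator().manual_seed(seed)
        self.buckets = torch.randint(0, out_dim, (in_dim,), generator=g)          # bucket per coordinate
        self.signs = torch.randint(0, 2, (in_dim,), generator=g).float() * 2 - 1  # random +/-1 per coordinate
        self.out_dim = out_dim

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        """Project a batch of vectors (n, in_dim) down to (n, out_dim)."""
        out = torch.zeros(x.shape[0], self.out_dim, dtype=x.dtype)
        out.index_add_(1, self.buckets, x * self.signs)
        return out

# Example: compress 4 gradient vectors of dimension 1000 down to 128.
# z = CountSketch(1000, 128)(torch.randn(4, 1000))
```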

Comparative Performance

vs. Static Methods

OPUS consistently outperforms industrial-level static filters:

  • QuRating, DSIR, DCLM-FastText, FineWeb-Edu, UltraFineweb (Table 3, Table 4)
  • Static methods suffer from training-agnostic heuristics; OPUS adapts to model state.

vs. Dynamic Methods

  • High-PPL (perplexity-based): OPUS outperforms it by ~2% average accuracy.
  • GREATS: OPUS outperforms while being more scalable (GREATS assumes SGD geometry; OPUS handles adaptive optimizers correctly).

Ablation Insights

Table 7 & Table 8:

  • Boltzmann sampling (temperature τ=0.9) outperforms greedy top-k selection (41.75% vs. 40.49%), preventing diversity collapse (see the sampling sketch after this list).
  • Bench-Proxy (benchmark-aligned proxy) improves over standard proxy (41.75% vs. 41.03%).
  • Robust to hyperparameters: Works across buffer sizes (16-64), projection dimensions (4096-16384).
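
As a companion to the Boltzmann-sampling ablation above, here is a minimal, hypothetical sketch of temperature-controlled sampling over utility scores; the paper's redundancy control is omitted, and τ = 0.9 simply follows the ablation setting reported in the list.

```python
# Minimal sketch of Boltzmann sampling over utility scores: candidates are
# drawn without replacement with probability proportional to exp(score / tau),
# so high-utility samples are preferred while lower-scored ones still have a
# chance, which helps avoid the diversity collapse of greedy top-k.
import torch

def boltzmann_select(scores: torch.Tensor, k: int, tau: float = 0.9) -> torch.Tensor:
    """Return the indices of k selected candidates from a 1D tensor of scores."""
    probs = torch.softmax(scores / tau, dim=0)
    return torch.multinomial(probs, num_samples=k, replacement=False)

# Example: pick 8 of 64 buffered candidates.
# idx = boltzmann_select(torch.randn(64), k=8)
```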

Qualitative Analysis

Appendix A: OPUS selects a more diverse mixture of documents (instructional content + general web text) compared to:

  • High-PPL: Concentrates on high-loss but potentially noisy samples.
  • QuRating: Extreme preference for "educational" patterns only.
  • Static filters: Fixed heuristics that don't adapt to training dynamics.

Summary

OPUS achieves 8× computation reduction on GPT-2 XL while improving accuracy by 2.2% over random selection. It is the first dynamic selection method that properly accounts for modern optimizer geometries (AdamW, Muon), enabling principled, scalable, and diverse data selection at every training iteration with only 4.7% overhead.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

  • IMU-1: Sample-Efficient Pre-training of Small Language Models (https://huggingface.co/papers/2602.02522) (2026)
  • Uncertainty-Aware Gradient Signal-to-Noise Data Selection for Instruction Tuning (https://huggingface.co/papers/2601.13697) (2026)
  • MergeMix: Optimizing Mid-Training Data Mixtures via Learnable Model Merging (https://huggingface.co/papers/2601.17858) (2026)
  • TADS: Task-Aware Data Selection for Multi-Task Multimodal Pre-Training (https://huggingface.co/papers/2602.05251) (2026)
  • ECO: Quantized Training without Full-Precision Master Weights (https://huggingface.co/papers/2601.22101) (2026)
  • SPICE: Submodular Penalized Information-Conflict Selection for Efficient Large Language Model Training (https://huggingface.co/papers/2601.23155) (2026)
  • Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training (https://huggingface.co/papers/2602.00747) (2026)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
