Paper page - Large-Scale Data Selection for Instruction Tuning

Code: https://github.com/hamishivi/automated-instruction-selection
Models & Data: https://huggingface.co/collections/hamishivi/large-scale-data-selection-for-instruction-tuning-677d7e8ca0295426c1915930

\n","updatedAt":"2025-03-04T04:44:06.112Z","author":{"_id":"62608fc2ffe8827cb1d89f9f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1654027835241-62608fc2ffe8827cb1d89f9f.png","fullname":"Hamish Ivison","name":"hamishivi","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":21,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.5967816114425659},"editors":["hamishivi"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1654027835241-62608fc2ffe8827cb1d89f9f.png"],"reactions":[],"isReport":false}},{"id":"67c8fc4192e6298c24dec7da","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2025-03-06T01:37:05.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [The Best Instruction-Tuning Data are Those That Fit](https://huggingface.co/papers/2502.04194) (2025)\n* [CrowdSelect: Synthetic Instruction Data Selection with Multi-LLM Wisdom](https://huggingface.co/papers/2503.01836) (2025)\n* [Diversity-driven Data Selection for Language Model Tuning through Sparse Autoencoder](https://huggingface.co/papers/2502.14050) (2025)\n* [Improving Influence-based Instruction Tuning Data Selection for Balanced Learning of Diverse Capabilities](https://huggingface.co/papers/2501.12147) (2025)\n* [Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm](https://huggingface.co/papers/2503.02359) (2025)\n* [Efficient Response Generation Method Selection for Fine-Tuning Large Language Models](https://huggingface.co/papers/2502.11779) (2025)\n* [Data Valuation using Neural Networks for Efficient Instruction Fine-Tuning](https://huggingface.co/papers/2502.09969) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2025-03-06T01:37:05.914Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7104262113571167},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2503.01807","authors":[{"_id":"67c67ff6dec55d10cb10fc9e","user":{"_id":"62608fc2ffe8827cb1d89f9f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1654027835241-62608fc2ffe8827cb1d89f9f.png","isPro":false,"fullname":"Hamish Ivison","user":"hamishivi","type":"user"},"name":"Hamish Ivison","status":"claimed_verified","statusLastChangedAt":"2025-03-04T08:40:13.649Z","hidden":false},{"_id":"67c67ff6dec55d10cb10fc9f","user":{"_id":"61cc2cf4dcb47bd5ed3cd3b8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/61cc2cf4dcb47bd5ed3cd3b8/O-tDNhSPZCZQSQG_BaEG9.png","isPro":false,"fullname":"Muru Zhang","user":"nanami","type":"user"},"name":"Muru Zhang","status":"admin_assigned","statusLastChangedAt":"2025-03-04T11:14:59.402Z","hidden":false},{"_id":"67c67ff6dec55d10cb10fca0","user":{"_id":"65282b8d578679aac7888aec","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65282b8d578679aac7888aec/dibBkhH-z1c70mJZZxJ7u.jpeg","isPro":false,"fullname":"Faeze Brahman","user":"faezeb","type":"user"},"name":"Faeze Brahman","status":"admin_assigned","statusLastChangedAt":"2025-03-04T11:15:05.562Z","hidden":false},{"_id":"67c67ff6dec55d10cb10fca1","user":{"_id":"641b4263abfce26bcf7b27de","avatarUrl":"/avatars/e91b4205e4f74b0dd8c333c23203a924.svg","isPro":false,"fullname":"Pang Wei Koh","user":"pangwei","type":"user"},"name":"Pang Wei Koh","status":"admin_assigned","statusLastChangedAt":"2025-03-04T11:15:14.558Z","hidden":false},{"_id":"67c67ff6dec55d10cb10fca2","user":{"_id":"6408fcc93461c51cf735a61e","avatarUrl":"/avatars/619f3653911d111f046a5a6c30fc8319.svg","isPro":false,"fullname":"Pradeep Dasigi","user":"pradeepd","type":"user"},"name":"Pradeep Dasigi","status":"admin_assigned","statusLastChangedAt":"2025-03-04T11:15:20.400Z","hidden":false}],"publishedAt":"2025-03-03T18:37:26.000Z","submittedOnDailyAt":"2025-03-04T02:14:06.105Z","title":"Large-Scale Data Selection for Instruction Tuning","submittedOnDailyBy":{"_id":"62608fc2ffe8827cb1d89f9f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1654027835241-62608fc2ffe8827cb1d89f9f.png","isPro":false,"fullname":"Hamish Ivison","user":"hamishivi","type":"user"},"summary":"Selecting high-quality training data from a larger pool is a crucial step\nwhen instruction-tuning language models, as carefully curated datasets often\nproduce models that outperform those trained on much larger, noisier datasets.\nAutomated data selection approaches for instruction-tuning are typically tested\nby selecting small datasets (roughly 10k samples) from small pools (100-200k\nsamples). However, popular deployed instruction-tuned models often train on\nhundreds of thousands to millions of samples, subsampled from even larger data\npools. 
We present a systematic study of how well data selection methods scale\nto these settings, selecting up to 2.5M samples from pools of up to 5.8M\nsamples and evaluating across 7 diverse tasks. We show that many recently\nproposed methods fall short of random selection in this setting (while using\nmore compute), and even decline in performance when given access to larger\npools of data to select over. However, we find that a variant of\nrepresentation-based data selection (RDS+), which uses weighted mean pooling of\npretrained LM hidden states, consistently outperforms more complex methods\nacross all settings tested -- all whilst being more compute-efficient. Our\nfindings highlight that the scaling properties of proposed automated selection\nmethods should be more closely examined. We release our code, data, and models\nat https://github.com/hamishivi/automated-instruction-selection.","upvotes":14,"discussionId":"67c67ff9dec55d10cb10fcef","projectPage":"https://huggingface.co/collections/hamishivi/large-scale-data-selection-for-instruction-tuning-677d7e8ca0295426c1915930","githubRepo":"https://github.com/hamishivi/automated-instruction-selection","githubRepoAddedBy":"auto","ai_summary":"A study on the scalability of data selection methods in instruction-tuning large language models reveals that a simpler, compute-efficient variant of representation-based data selection outperforms more complex methods.","ai_keywords":["instruction-tuning","language models","automated data selection","representation-based data selection","pretrained LM hidden states","weighted mean pooling","compute-efficiency"],"githubStars":52},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"66f612b934b8ac9ffa44f084","avatarUrl":"/avatars/6836c122e19c66c90f1673f28b30d7f0.svg","isPro":false,"fullname":"Tang","user":"tommysally","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"65b976fdf69f4d0377aef3fe","avatarUrl":"/avatars/1201194e2956c56b50098cc465a04c11.svg","isPro":false,"fullname":"Chau Minh Pham","user":"chtmp223","type":"user"},{"_id":"665b133508d536a8ac804f7d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/Uwi0OnANdTbRbHHQvGqvR.png","isPro":false,"fullname":"Paulson","user":"Pnaomi","type":"user"},{"_id":"61cc2cf4dcb47bd5ed3cd3b8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/61cc2cf4dcb47bd5ed3cd3b8/O-tDNhSPZCZQSQG_BaEG9.png","isPro":false,"fullname":"Muru Zhang","user":"nanami","type":"user"},{"_id":"63082bb7bc0a2a5ee2253523","avatarUrl":"/avatars/6cf8d12d16d15db1070fbea89b5b3967.svg","isPro":false,"fullname":"Kuo-Hsin Tu","user":"dapumptu","type":"user"},{"_id":"64d4615cf8082bf19b916492","avatarUrl":"/avatars/8e1b59565ec5e4b31090cf1b911781b9.svg","isPro":false,"fullname":"wongyukim","user":"wongyukim","type":"user"},{"_id":"64d89d015900b6d11116dab0","avatarUrl":"/avatars/fc80ff8df515b9cb4de48ec894539ed1.svg","isPro":false,"fullname":"Zhiyuan Ning","user":"nzynzy","type":"user"},{"_id":"645523ce1a543cf97b1dbdcd","avatarUrl":"/avatars/67af5c6846e4ddfdefa2bb344336185d.svg","isPro":false,"fullname":"Tim 
Dingman","user":"tdingman","type":"user"},{"_id":"63f0760ef1a47aaea5be1e6b","avatarUrl":"/avatars/795178a2d7f92150a6d1796f288c2f05.svg","isPro":false,"fullname":"Ji-Xiang","user":"Ji-Xiang","type":"user"},{"_id":"659fd7520183046e16c26a36","avatarUrl":"/avatars/90c9c231ef5279ea9f687e15887ed25d.svg","isPro":false,"fullname":"aaa","user":"qwertyuiopasdfg","type":"user"},{"_id":"64bbe9b236eb058cd9d6a5b9","avatarUrl":"/avatars/c7c01a3fa8809e73800392679abff6d5.svg","isPro":false,"fullname":"Kai Zuberbühler","user":"kaizuberbuehler","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
arxiv:2503.01807

Large-Scale Data Selection for Instruction Tuning

Published on Mar 3, 2025 · Submitted by Hamish Ivison on Mar 4, 2025

AI-generated summary

A study on the scalability of data selection methods in instruction-tuning large language models reveals that a simpler, compute-efficient variant of representation-based data selection (RDS+) outperforms more complex methods.

Abstract

Selecting high-quality training data from a larger pool is a crucial step when instruction-tuning language models, as carefully curated datasets often produce models that outperform those trained on much larger, noisier datasets. Automated data selection approaches for instruction-tuning are typically tested by selecting small datasets (roughly 10k samples) from small pools (100-200k samples). However, popular deployed instruction-tuned models often train on hundreds of thousands to millions of samples, subsampled from even larger data pools. We present a systematic study of how well data selection methods scale to these settings, selecting up to 2.5M samples from pools of up to 5.8M samples and evaluating across 7 diverse tasks. We show that many recently proposed methods fall short of random selection in this setting (while using more compute), and even decline in performance when given access to larger pools of data to select over. However, we find that a variant of representation-based data selection (RDS+), which uses weighted mean pooling of pretrained LM hidden states, consistently outperforms more complex methods across all settings tested -- all whilst being more compute-efficient. Our findings highlight that the scaling properties of proposed automated selection methods should be more closely examined. We release our code, data, and models at https://github.com/hamishivi/automated-instruction-selection.
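To make the abstract's description concrete, below is a minimal, hypothetical sketch of what representation-based selection with weighted mean pooling of pretrained LM hidden states can look like. The model name ("gpt2"), the SGPT-style position weights, and the max-similarity scoring rule are illustrative assumptions, not the authors' implementation; see the linked repository for the actual RDS+ code.

```python
# Hedged sketch of RDS+-style representation-based data selection.
# Assumptions (not from the paper's code): position-weighted mean pooling,
# a small placeholder model, and max-cosine-similarity scoring.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state               # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()    # (B, T, 1)
    # Position-weighted mean pooling: token t gets weight proportional to its
    # position, so later tokens (which condition on more context) contribute
    # more. This is one plausible instantiation of "weighted mean pooling".
    pos = torch.arange(1, hidden.size(1) + 1, dtype=hidden.dtype).view(1, -1, 1)
    weights = pos * mask
    pooled = (hidden * weights).sum(dim=1) / weights.sum(dim=1)
    return F.normalize(pooled, dim=-1)

def select_top_k(pool_texts: list[str], query_texts: list[str], k: int) -> list[int]:
    """Score each candidate by its max cosine similarity to any query
    example, then keep the k highest-scoring candidates."""
    pool_emb = embed(pool_texts)    # candidate instruction-tuning data
    query_emb = embed(query_texts)  # examples representative of target tasks
    scores = (pool_emb @ query_emb.T).max(dim=1).values
    return scores.topk(min(k, len(pool_texts))).indices.tolist()
```

At the scales studied in the paper (pools of up to 5.8M samples), the embedding pass would need to be batched and cached, but the selection step itself remains a simple nearest-neighbor search over normalized embeddings, which is part of why this approach is more compute-efficient than the more complex methods it outperforms.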

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* The Best Instruction-Tuning Data are Those That Fit (2025): https://huggingface.co/papers/2502.04194
* CrowdSelect: Synthetic Instruction Data Selection with Multi-LLM Wisdom (2025): https://huggingface.co/papers/2503.01836
* Diversity-driven Data Selection for Language Model Tuning through Sparse Autoencoder (2025): https://huggingface.co/papers/2502.14050
* Improving Influence-based Instruction Tuning Data Selection for Balanced Learning of Diverse Capabilities (2025): https://huggingface.co/papers/2501.12147
* Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm (2025): https://huggingface.co/papers/2503.02359
* Efficient Response Generation Method Selection for Fine-Tuning Large Language Models (2025): https://huggingface.co/papers/2502.11779
* Data Valuation using Neural Networks for Efficient Instruction Fine-Tuning (2025): https://huggingface.co/papers/2502.09969

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper: 5


Datasets citing this paper: 16


Spaces citing this paper: 0

No Spaces currently link to this paper.

Cite arxiv.org/abs/2503.01807 in a Space README.md to link it from this page.

Collections including this paper: 7