
Code: https://github.com/sail-sg/scaling-with-vocab
Collection: https://huggingface.co/collections/sail/scaling-laws-with-vocabulary-6699e0cbd77a8b2870859bfe

\n","updatedAt":"2024-07-19T04:00:05.074Z","author":{"_id":"612ee6a7b960e78c6d2319d4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/612ee6a7b960e78c6d2319d4/2Hu9BaAyXbyh1vt0v1Qui.jpeg","fullname":"Qian Liu","name":"SivilTaram","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":107,"isUserFollowing":false}},"numEdits":1,"identifiedLanguage":{"language":"en","probability":0.8068556189537048},"editors":["SivilTaram"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/612ee6a7b960e78c6d2319d4/2Hu9BaAyXbyh1vt0v1Qui.jpeg"],"reactions":[],"isReport":false}},{"id":"669b13704ac5ad0250b41234","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2024-07-20T01:31:28.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Are Protein Language Models Compute Optimal?](https://huggingface.co/papers/2406.07249) (2024)\n* [Resolving Discrepancies in Compute-Optimal Scaling of Language Models](https://huggingface.co/papers/2406.19146) (2024)\n* [Scaling Laws for Linear Complexity Language Models](https://huggingface.co/papers/2406.16690) (2024)\n* [Large Vocabulary Size Improves Large Language Models](https://huggingface.co/papers/2406.16508) (2024)\n* [Repurposing Language Models into Embedding Models: Finding the Compute-Optimal Recipe](https://huggingface.co/papers/2406.04165) (2024)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2024-07-20T01:31:28.816Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7441766262054443},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}},{"id":"669d70ff6582d2ef70404892","author":{"_id":"60bccec062080d33f875cd0c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/60bccec062080d33f875cd0c/KvEhYxx9-Tff_Qb7PsjAL.png","fullname":"Peter Szemraj","name":"pszemraj","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":145,"isUserFollowing":false},"createdAt":"2024-07-21T20:35:11.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Maybe I'm missing something, but how/on what were the tokenizers trained? I see that they are BPE tokenizers with varying vocab size but I don't see much more detail than that. any insights on what was the diff/added to the vocab as the vocab size increased?","html":"

Maybe I'm missing something, but how/on what were the tokenizers trained? I see that they are BPE tokenizers with varying vocab size but I don't see much more detail than that. any insights on what was the diff/added to the vocab as the vocab size increased?

\n","updatedAt":"2024-07-21T20:35:11.734Z","author":{"_id":"60bccec062080d33f875cd0c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/60bccec062080d33f875cd0c/KvEhYxx9-Tff_Qb7PsjAL.png","fullname":"Peter Szemraj","name":"pszemraj","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":145,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9939596652984619},"editors":["pszemraj"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/60bccec062080d33f875cd0c/KvEhYxx9-Tff_Qb7PsjAL.png"],"reactions":[],"isReport":false},"replies":[{"id":"669daa8d5bc23a0628320bc4","author":{"_id":"612ee6a7b960e78c6d2319d4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/612ee6a7b960e78c6d2319d4/2Hu9BaAyXbyh1vt0v1Qui.jpeg","fullname":"Qian Liu","name":"SivilTaram","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":107,"isUserFollowing":false},"createdAt":"2024-07-22T00:40:45.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Hi @pszemraj , thanks for your insightful comments! We train the tokenizer on the SlimPajama corpus (the same corpus as our training). We will upload different tokenizers later for others' reference. Thanks for the reminder!","html":"

Hi \n\n@pszemraj\n\t , thanks for your insightful comments! We train the tokenizer on the SlimPajama corpus (the same corpus as our training). We will upload different tokenizers later for others' reference. Thanks for the reminder!

\n","updatedAt":"2024-07-22T00:40:45.947Z","author":{"_id":"612ee6a7b960e78c6d2319d4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/612ee6a7b960e78c6d2319d4/2Hu9BaAyXbyh1vt0v1Qui.jpeg","fullname":"Qian Liu","name":"SivilTaram","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":107,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.8715248703956604},"editors":["SivilTaram"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/612ee6a7b960e78c6d2319d4/2Hu9BaAyXbyh1vt0v1Qui.jpeg"],"reactions":[],"isReport":false,"parentCommentId":"669d70ff6582d2ef70404892"}},{"id":"66a29d5120c45f485c9d6e65","author":{"_id":"60bccec062080d33f875cd0c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/60bccec062080d33f875cd0c/KvEhYxx9-Tff_Qb7PsjAL.png","fullname":"Peter Szemraj","name":"pszemraj","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":145,"isUserFollowing":false},"createdAt":"2024-07-25T18:45:37.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"understood, thank you. Makes sense and thanks for the great work!","html":"

understood, thank you. Makes sense and thanks for the great work!

\n","updatedAt":"2024-07-25T18:45:37.061Z","author":{"_id":"60bccec062080d33f875cd0c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/60bccec062080d33f875cd0c/KvEhYxx9-Tff_Qb7PsjAL.png","fullname":"Peter Szemraj","name":"pszemraj","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":145,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9523531198501587},"editors":["pszemraj"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/60bccec062080d33f875cd0c/KvEhYxx9-Tff_Qb7PsjAL.png"],"reactions":[],"isReport":false,"parentCommentId":"669d70ff6582d2ef70404892"}},{"id":"66b1965fc3904fa63a30491e","author":{"_id":"612ee6a7b960e78c6d2319d4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/612ee6a7b960e78c6d2319d4/2Hu9BaAyXbyh1vt0v1Qui.jpeg","fullname":"Qian Liu","name":"SivilTaram","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":107,"isUserFollowing":false},"createdAt":"2024-08-06T03:19:59.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Hi @pszemraj , we have released all the trained tokenizers at https://huggingface.co/sail/scaling-with-vocab-trained-tokenizers. Let me know if you have follow-up questions!","html":"

Hi \n\n@pszemraj\n\t , we have released all the trained tokenizers at https://huggingface.co/sail/scaling-with-vocab-trained-tokenizers. Let me know if you have follow-up questions!

\n","updatedAt":"2024-08-06T03:19:59.877Z","author":{"_id":"612ee6a7b960e78c6d2319d4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/612ee6a7b960e78c6d2319d4/2Hu9BaAyXbyh1vt0v1Qui.jpeg","fullname":"Qian Liu","name":"SivilTaram","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":107,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.8751820921897888},"editors":["SivilTaram"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/612ee6a7b960e78c6d2319d4/2Hu9BaAyXbyh1vt0v1Qui.jpeg"],"reactions":[],"isReport":false,"parentCommentId":"669d70ff6582d2ef70404892"}}]}],"primaryEmailConfirmed":false,"paper":{"id":"2407.13623","authors":[{"_id":"6699cc6534b724c13d591358","user":{"_id":"60fc2fcca6bdebbe52dfdaf4","avatarUrl":"/avatars/1d59a7f33cb0df04678516f337e6b881.svg","isPro":false,"fullname":"Chaofan Tao","user":"tcftrees","type":"user"},"name":"Chaofan Tao","status":"claimed_verified","statusLastChangedAt":"2025-11-24T08:01:27.063Z","hidden":false},{"_id":"6699cc6534b724c13d591359","user":{"_id":"612ee6a7b960e78c6d2319d4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/612ee6a7b960e78c6d2319d4/2Hu9BaAyXbyh1vt0v1Qui.jpeg","isPro":false,"fullname":"Qian Liu","user":"SivilTaram","type":"user"},"name":"Qian Liu","status":"claimed_verified","statusLastChangedAt":"2024-07-19T11:23:17.971Z","hidden":false},{"_id":"6699cc6534b724c13d59135a","user":{"_id":"6214e4ee1e35c843d42d1f88","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6214e4ee1e35c843d42d1f88/fj-9wuIdPhvogh3BrcXTB.jpeg","isPro":false,"fullname":"Longxu Dou","user":"dreamerdeo","type":"user"},"name":"Longxu Dou","status":"claimed_verified","statusLastChangedAt":"2024-07-19T11:23:15.966Z","hidden":false},{"_id":"6699cc6534b724c13d59135b","name":"Niklas Muennighoff","hidden":false},{"_id":"6699cc6534b724c13d59135c","name":"Zhongwei Wan","hidden":false},{"_id":"6699cc6534b724c13d59135d","name":"Ping Luo","hidden":false},{"_id":"6699cc6534b724c13d59135e","name":"Min Lin","hidden":false},{"_id":"6699cc6534b724c13d59135f","name":"Ngai Wong","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/612ee6a7b960e78c6d2319d4/sKyhyYNp_gGThmbtb3GED.png"],"publishedAt":"2024-07-18T15:58:54.000Z","submittedOnDailyAt":"2024-07-19T01:08:51.333Z","title":"Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies","submittedOnDailyBy":{"_id":"612ee6a7b960e78c6d2319d4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/612ee6a7b960e78c6d2319d4/2Hu9BaAyXbyh1vt0v1Qui.jpeg","isPro":false,"fullname":"Qian Liu","user":"SivilTaram","type":"user"},"summary":"Research on scaling large language models (LLMs) has primarily focused on\nmodel parameters and training data size, overlooking the role of vocabulary\nsize. % Intuitively, larger vocabularies enable more efficient tokenization by\nrepresenting sentences with fewer tokens, but they also increase the risk of\nunder-fitting representations for rare tokens. We investigate how vocabulary\nsize impacts LLM scaling laws by training models ranging from 33M to 3B\nparameters on up to 500B characters with various vocabulary configurations. We\npropose three complementary approaches for predicting the compute-optimal\nvocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit\nof the loss function. 
Our approaches converge on the same result that the\noptimal vocabulary size depends on the available compute budget and that larger\nmodels deserve larger vocabularies. However, most LLMs use too small vocabulary\nsizes. For example, we predict that the optimal vocabulary size of Llama2-70B\nshould have been at least 216K, 7 times larger than its vocabulary of 32K. We\nvalidate our predictions empirically by training models with 3B parameters\nacross different FLOPs budgets. Adopting our predicted optimal vocabulary size\nconsistently improves downstream performance over commonly used vocabulary\nsizes. By increasing the vocabulary size from the conventional 32K to 43K, we\nimprove performance on ARC-Challenge from 29.1 to 32.0 with the same 2.3e21\nFLOPs. Our work emphasizes the necessity of jointly considering model\nparameters and vocabulary size for efficient scaling.","upvotes":56,"discussionId":"6699cc6834b724c13d59144a","githubRepo":"https://github.com/sail-sg/scaling-with-vocab","githubRepoAddedBy":"auto","ai_summary":"Investigating the impact of vocabulary size on the scaling of large language models reveals that larger vocabularies improve performance when considering compute budgets, and current models often use suboptimal vocabulary sizes.","ai_keywords":["large language models","LLMs","tokenization","vocabulary size","IsoFLOPs analysis","derivative estimation","parametric fit","loss function","compute-optimal vocabulary size","ARC-Challenge"],"githubStars":89},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6621cea88850e38ffbb1854f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6621cea88850e38ffbb1854f/LeytEEjSwnnqB-zFN1Tgt.jpeg","isPro":false,"fullname":"Taki WU","user":"taki555","type":"user"},{"_id":"64587be872b60ae7a3817858","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64587be872b60ae7a3817858/BbdOOxOCEzWTvEpkWp8MM.png","isPro":false,"fullname":"Minbyul Jeong","user":"Minbyul","type":"user"},{"_id":"64bb4f6265b648b2df793ded","avatarUrl":"/avatars/fb384a47a4a2fa0c2c6319d0a8ade0b8.svg","isPro":false,"fullname":"Jiwon Kang","user":"ji1kang","type":"user"},{"_id":"64ca7c04710645aa7bdbbfff","avatarUrl":"/avatars/c12f4cb6dc1ff0010edb3ef4cfcccd7c.svg","isPro":false,"fullname":"Lize Pirenne","user":"Inversta","type":"user"},{"_id":"62cbeb2d72dfd24b86bdf977","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62cbeb2d72dfd24b86bdf977/UcGYYSBNrCvPM5K9v-sro.png","isPro":false,"fullname":"Zengzhi Wang","user":"SinclairWang","type":"user"},{"_id":"62c0a2e8564b51e080d64af8","avatarUrl":"/avatars/7ffed6712ead59919832ec71c0e3f5d1.svg","isPro":true,"fullname":"Ziqian Zhong","user":"fjzzq2002","type":"user"},{"_id":"64db5f5dd68a6ddcc7bd89e9","avatarUrl":"/avatars/69375ec915927b855813df8a6d486837.svg","isPro":false,"fullname":"Shengnan An","user":"ShengnanAn","type":"user"},{"_id":"6419d46b9a27800807c43fe3","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6419d46b9a27800807c43fe3/H99LfQaSRU3c6uHHoGWPj.jpeg","isPro":false,"fullname":"MoonRide","user":"MoonRide","type":"user"},{"_id":"63107b18e87051f3e3e0f598","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63107b18e87051f3e3e0f598/R9onir4Y0MZuq1jEWCZ2-.jpeg","isPro":false,"fullname":"Unchun 
Yang","user":"ucyang","type":"user"},{"_id":"6274a2315d12b3a734adebc9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6274a2315d12b3a734adebc9/zLQDbszAvWh0F2BjdKock.jpeg","isPro":false,"fullname":"Xiaosen Zheng","user":"xszheng2020","type":"user"},{"_id":"612ee6a7b960e78c6d2319d4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/612ee6a7b960e78c6d2319d4/2Hu9BaAyXbyh1vt0v1Qui.jpeg","isPro":false,"fullname":"Qian Liu","user":"SivilTaram","type":"user"},{"_id":"60bccec062080d33f875cd0c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/60bccec062080d33f875cd0c/KvEhYxx9-Tff_Qb7PsjAL.png","isPro":true,"fullname":"Peter Szemraj","user":"pszemraj","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":1}">
arXiv: 2407.13623

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

Published on Jul 18, 2024 · Submitted by Qian Liu on Jul 19, 2024 · #1 Paper of the day
Authors: Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong

Abstract

AI-generated summary: Investigating the impact of vocabulary size on the scaling of large language models reveals that larger vocabularies improve performance when considering compute budgets, and current models often use suboptimal vocabulary sizes.

Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. Intuitively, larger vocabularies enable more efficient tokenization by representing sentences with fewer tokens, but they also increase the risk of under-fitting representations for rare tokens. We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We propose three complementary approaches for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. Our approaches converge on the same result that the optimal vocabulary size depends on the available compute budget and that larger models deserve larger vocabularies. However, most LLMs use too small vocabulary sizes. For example, we predict that the optimal vocabulary size of Llama2-70B should have been at least 216K, 7 times larger than its vocabulary of 32K. We validate our predictions empirically by training models with 3B parameters across different FLOPs budgets. Adopting our predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes. By increasing the vocabulary size from the conventional 32K to 43K, we improve performance on ARC-Challenge from 29.1 to 32.0 with the same 2.3e21 FLOPs. Our work emphasizes the necessity of jointly considering model parameters and vocabulary size for efficient scaling.
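As a quick illustration of the IsoFLOPs analysis mentioned in the abstract, the sketch below fits a quadratic in log vocabulary size to hypothetical (vocabulary size, loss) measurements taken at a single fixed FLOPs budget and reads off the loss-minimizing vocabulary. The numbers are placeholders and this is not the authors' fitting code; see the repository linked at the top of the page for the actual implementation.

```python
# Illustrative sketch only: estimate the loss-minimizing vocabulary size at a
# fixed FLOPs budget by fitting a quadratic in log(vocab size) to measured
# losses. The (vocab_size, loss) pairs below are made-up placeholders, not
# numbers from the paper.
import numpy as np

# Hypothetical IsoFLOPs measurements: models trained with the same FLOPs
# budget but different vocabulary sizes, each yielding a validation loss.
vocab_sizes = np.array([4096, 8192, 16384, 32768, 65536, 131072])
losses = np.array([2.95, 2.88, 2.84, 2.83, 2.85, 2.90])  # placeholder values

# Fit loss as a quadratic function of log(V); the vertex of the parabola
# gives the estimated compute-optimal vocabulary size for this budget.
log_v = np.log(vocab_sizes)
a, b, c = np.polyfit(log_v, losses, deg=2)
optimal_log_v = -b / (2 * a)  # vertex of a*x^2 + b*x + c
optimal_vocab = float(np.exp(optimal_log_v))

print(f"Estimated compute-optimal vocabulary size for this budget: {optimal_vocab:,.0f}")
```

Repeating this for several FLOPs budgets and tracking how the optimum shifts is, in spirit, how the "larger compute budgets favor larger vocabularies" trend is extracted.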

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- Are Protein Language Models Compute Optimal? (2024): https://huggingface.co/papers/2406.07249
- Resolving Discrepancies in Compute-Optimal Scaling of Language Models (2024): https://huggingface.co/papers/2406.19146
- Scaling Laws for Linear Complexity Language Models (2024): https://huggingface.co/papers/2406.16690
- Large Vocabulary Size Improves Large Language Models (2024): https://huggingface.co/papers/2406.16508
- Repurposing Language Models into Embedding Models: Finding the Compute-Optimal Recipe (2024): https://huggingface.co/papers/2406.04165

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Maybe I'm missing something, but how/on what were the tokenizers trained? I see that they are BPE tokenizers with varying vocabulary sizes, but I don't see much more detail than that. Any insights on what was added to the vocabulary as the vocabulary size increased?

Paper author

Hi @pszemraj, thanks for your insightful comments! We train the tokenizers on the SlimPajama corpus (the same corpus we use for model training). We will upload the different tokenizers later for others' reference. Thanks for the reminder!

Understood, thank you. Makes sense, and thanks for the great work!

Paper author

Hi @pszemraj, we have released all the trained tokenizers at https://huggingface.co/sail/scaling-with-vocab-trained-tokenizers. Let me know if you have follow-up questions!
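For readers curious about the recipe discussed in this thread, here is a rough sketch of training byte-level BPE tokenizers at several vocabulary sizes with the Hugging Face tokenizers library. It is a generic illustration under assumed settings (placeholder corpus path, vocabulary sizes, and special tokens), not the authors' exact training script.

```python
# Illustrative sketch: train byte-level BPE tokenizers with several vocabulary
# sizes on the same corpus. This is NOT the authors' exact pipeline; the file
# path, vocabulary sizes, and special tokens below are placeholders.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers


def train_bpe_tokenizer(files, vocab_size):
    # Byte-level BPE tokenizer trained from scratch on the given text files.
    tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<unk>", "<s>", "</s>"],
    )
    tokenizer.train(files, trainer)
    return tokenizer


if __name__ == "__main__":
    corpus_files = ["slimpajama_sample.txt"]  # placeholder corpus path
    for vocab_size in [4096, 16384, 32768, 65536]:  # placeholder sizes
        tok = train_bpe_tokenizer(corpus_files, vocab_size)
        tok.save(f"bpe_tokenizer_{vocab_size}.json")
```

In general, giving BPE a larger vocabulary budget simply lets it keep merging, so the extra entries are longer and rarer subword units learned from the same corpus.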


Models citing this paper: 5

Datasets citing this paper: 0

No datasets link to this paper. Cite arxiv.org/abs/2407.13623 in a dataset README.md to link it from this page.

Spaces citing this paper: 1

Collections including this paper: 8