Papers
arxiv:2512.21580

Gamayun's Path to Multilingual Mastery: Cost-Efficient Training of a 1.5B-Parameter LLM

Published on Dec 25, 2025
Authors: Alexander Podolskiy, Semen Molokov, Timofey Gerasin, Maksim Titov, Alexey Rukhovich, Artem Khrapov, Kirill Morozov, Evgeny Tetin, Constantine Korikov, Pavel Efimov, Polina Lazukova, Yuliya Skripkar, Nikita Okhotnikov, Irina Piontkovskaya, Meng Xiaojun, Zou Xueyi, Zhang Zhenhe

Abstract

We present Gamayun, a 1.5B-parameter multilingual language model trained entirely from scratch on 2.5T tokens. Designed for efficiency and deployment in resource-constrained environments, Gamayun addresses the lack of research on small non-English-centric LLMs by adopting a novel two-stage pre-training strategy: balanced multilingual training for cross-lingual alignment, followed by high-quality English enrichment to transfer performance gains across languages. Our model supports 12 languages, with special focus on Russian. Despite a significantly smaller training budget than comparable models, Gamayun outperforms LLaMA3.2-1B (9T tokens) on all considered benchmarks, and surpasses Qwen2.5-1.5B (18T tokens) on a wide range of English and multilingual tasks. It matches or exceeds Qwen3 (36T tokens) on most tasks outside advanced STEM, achieving state-of-the-art results in Russian, including the MERA benchmark, among the models of comparable size (1-2B parameters).
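The two-stage recipe in the abstract can be pictured as a token-budget schedule over data mixtures: a roughly balanced multilingual stage, then an English-enrichment stage with some multilingual replay. The sketch below is only an illustration of that idea; the language list, mixture weights, and per-stage token budgets are placeholder assumptions, not values reported in the paper.

```python
# Minimal sketch of a two-stage pre-training data-mixture schedule.
# All language lists, weights, and token budgets are illustrative placeholders.

from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    token_budget: float          # tokens allocated to this stage
    mixture: dict[str, float]    # data source -> sampling weight (sums to 1.0)


# Stage 1: roughly balanced multilingual mix to encourage cross-lingual alignment.
LANGS = ["ru", "en", "de", "fr", "es", "pt", "it", "zh", "ar", "tr", "kk", "uz"]
stage1 = Stage(
    name="balanced_multilingual",
    token_budget=2.0e12,  # e.g. most of a ~2.5T-token total
    mixture={lang: 1.0 / len(LANGS) for lang in LANGS},
)

# Stage 2: enrich with high-quality English data, keeping multilingual replay
# so the gains transfer without forgetting the other languages.
stage2 = Stage(
    name="english_enrichment",
    token_budget=0.5e12,
    mixture={"en_high_quality": 0.6, **{lang: 0.4 / len(LANGS) for lang in LANGS}},
)


def tokens_per_source(stage: Stage) -> dict[str, float]:
    """Translate a stage's sampling weights into absolute token counts."""
    return {src: w * stage.token_budget for src, w in stage.mixture.items()}


if __name__ == "__main__":
    for stage in (stage1, stage2):
        print(f"--- {stage.name} ({stage.token_budget:.1e} tokens) ---")
        for src, n in sorted(tokens_per_source(stage).items()):
            print(f"{src:>16}: {n:.2e} tokens")
```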

Community

Hello! Great work, the new Russian pretrain is inspiring! I wanted to know if you plan to release: 1) the model, and 2) the RuBIN dataset?

Paper author
edited Jan 2

Hello, @RefalMachine! Thanks for your interest!

Unfortunately, our supervisors didn't allow us to publish the weights of the current model: it needs to better fit our publishing standards and have fewer copyright issues.

Good news: a second version of the model, trained once more from scratch, is underway. Although it did require >1T additional tokens to recover, it should now be free of these issues (at least the critical ones). We have also scaled the compute by more than 16x, and it already yields better results both on the benchmarks and in subjective conversations.

As for the RuBIN benchmark, at the moment you can access it on request. We are planning to publish it for everyone to test on.

Stay tuned! :3

Paper author

@librarian-bot recommend

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages (https://huggingface.co/papers/2601.06395) (2026)
* Dicta-LM 3.0: Advancing The Frontier of Hebrew Sovereign LLMs (https://huggingface.co/papers/2602.02104) (2026)
* UberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset (https://huggingface.co/papers/2602.15210) (2026)
* Bielik 11B v3: Multilingual Large Language Model for European Languages (https://huggingface.co/papers/2601.11579) (2025)
* BYOL: Bring Your Own Language Into LLMs (https://huggingface.co/papers/2601.10804) (2026)
* TabiBERT: A Large-Scale ModernBERT Foundation Model and A Unified Benchmark for Turkish (https://huggingface.co/papers/2512.23065) (2025)
* Kakugo: Distillation of Low-Resource Languages into Small Language Models (https://huggingface.co/papers/2601.14051) (2026)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper (0)

No model linking this paper

Cite arxiv.org/abs/2512.21580 in a model README.md to link it from this page.
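To illustrate the linking mechanism, here is a minimal sketch of a model card README.md that mentions the paper's arXiv ID. The model name, license, and description are placeholder assumptions; only the arXiv ID comes from this page.

```markdown
---
license: apache-2.0
---

# my-gamayun-style-model

A small multilingual model inspired by the recipe in
"Gamayun's Path to Multilingual Mastery" (arxiv.org/abs/2512.21580).
```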

Datasets citing this paper (0)

No dataset linking this paper

Cite arxiv.org/abs/2512.21580 in a dataset README.md to link it from this page.

Spaces citing this paper (0)

No Space linking this paper

Cite arxiv.org/abs/2512.21580 in a Space README.md to link it from this page.

Collections including this paper (0)

No Collection including this paper

Add this paper to a collection to link it from this page.