sailor2 (Sailor2)
sailor2-pretrain-data-stage1: A comprehensive dataset comprising 450B tokens of high-quality data for continual pre-training, including English (from ProX), Chinese (from Chinese-Fineweb-Edu), Vietnamese, Indonesian, Thai, Malay, Burmese, Tagalog, and Khmer, organized by chunks
sailor2-pretrain-data-stage2: An additional 60B tokens of exceptionally high-quality data for model annealing, including the above languages and Cebuano, Lao, Javanese, Waray, Sundanese, and Ilocano, organized by chunks
community-dataset: Clean South-East Asia datasets contributed by community members, including Indonesian, Thai, and Vietnamese content in fields like news, finance, law, books, poetry, social media, and TED Talks, organized by source
sea-commoncrawl: Clean South-East Asia related web corpora from 89 CommonCrawl snapshots, organized by languages
sea-pdf-text: Clean PDF data, with the PDF links sourced from partner information, organized by languages
sea-synthetic: Translation dataset from Cosmopedia across multiple languages, which is used to retrieve the high-quality tokens for stage 2, organized by languages
sea-commoncrawl-high-quality: the high-quality CommonCrawl subset, which is used in stage 2 of Sailor2 pre-training, organized by languages
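As a rough illustration of how the datasets listed above might be inspected, the sketch below streams a few records from one of them with the Hugging Face `datasets` library. It assumes that each dataset lives under the `sailor2` organization on the Hub and that a `train` split exists; the helper names `repo_id` and `peek` are hypothetical and not part of any Sailor2 tooling. The token counts are the ones stated in the list.

```python
"""Sketch: peek at a Sailor2 dataset without downloading it in full."""
from itertools import islice

ORG = "sailor2"  # assumption: datasets are hosted under this Hub organization

# Token counts as stated in the dataset list, in billions of tokens.
STAGE_TOKENS_B = {
    "sailor2-pretrain-data-stage1": 450,
    "sailor2-pretrain-data-stage2": 60,
}


def repo_id(name: str) -> str:
    """Build a Hub repository id such as 'sailor2/sea-commoncrawl' (hypothetical helper)."""
    return f"{ORG}/{name}"


def peek(name: str, n: int = 3, split: str = "train"):
    """Return the first n records of a dataset using streaming mode.

    The split name "train" is an assumption; check the dataset card.
    """
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset  # requires `pip install datasets`

    stream = load_dataset(repo_id(name), split=split, streaming=True)
    return list(islice(stream, n))


# Combined pre-training budget across both stages: 450 + 60 billion tokens.
total_tokens_b = sum(STAGE_TOKENS_B.values())
```

Calling `peek("sea-pdf-text")` needs network access. Streaming is used here because the corpora are large, so only the requested records are fetched.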
If you find Sailor2 useful, please cite our work as follows:
@article{sailor2report,
  title = {Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLM},
  author = {Longxu Dou and Qian Liu and Fan Zhou and Changyu Chen and Zili Wang and Ziqi Jin and Zichen Liu and Tongyao Zhu and Cunxiao Du and Penghui Yang and Haonan Wang and Jiaheng Liu and Yongchi Zhao and Xiachong Feng and Xin Mao and Man Tsung Yeung and Kunat Pipatanakul and Fajri Koto and Min Si Thu and Hynek Kydl{\'\i}{\v{c}}ek and Zeyi Liu and Qunshu Lin and Sittipong Sripaisarnmongkol and Kridtaphad Sae-Khow and Nirattisai Thongchim and Taechawat Konkaew and Narong Borijindargoon and Anh Dao and Matichon Maneegard and Phakphum Artkaew and Zheng-Xin Yong and Quan Nguyen and Wannaphong Phatthiyaphaibun and Hoang H. Tran and Mike Zhang and Shiqi Chen and Tianyu Pang and Chao Du and Xinyi Wan and Wei Lu and Min Lin},
  journal = {arXiv preprint arXiv:2502.12982},
  year = {2025}
}
The Sailor2 community aims to build open large language models optimized for multiple South-East Asian languages, such as Cebuano, Indonesian, Khmer, Lao, Minangkabau, Malay, Burmese, Sundanese, Javanese, Thai, and Vietnamese. The model will be continually pre-trained from a base model proficient in both Chinese and English, and its performance is expected to be comparable to the most advanced commercial models for the above South-East Asian languages.
GitHub: All you need to know about using or fine-tuning Sailor2.
Sailor2-1B: 1B base model continually pre-trained on 500B tokens from Qwen2.5-0.5B with model expansion.
Sailor2-8B: 8B base model continually pre-trained on 500B tokens from Qwen2.5-7B with model expansion.
Sailor2-20B: 20B base model continually pre-trained on 500B tokens from Qwen2.5-14B with model expansion.
Sailor2-1B-Chat: 1B chat model after post-training on the 1B base model.
Sailor2-8B-Chat: 8B chat model after post-training on the 8B base model.
Sailor2-20B-Chat: 20B chat model after post-training on the 20B base model.
📚 Sailor2 Pre-training Dataset
sailor2-pretrain-data-stage1: A comprehensive dataset comprising 450B tokens of high-quality data for continual pre-training, including English (from ProX), Chinese (from Chinese-Fineweb-Edu), Vietnamese, Indonesian, Thai, Malay, Burmese, Tagalog, and Khmer, organized by chunks
sailor2-pretrain-data-stage2: An additional 60B tokens of exceptionally high-quality data for model annealing, including the above languages and Cebuano, Lao, Javanese, Waray, Sundanese, and Ilocano, organized by chunks
community-dataset: Clean South-East Asian datasets contributed by community members, including Indonesian, Thai, and Vietnamese content in fields like news, finance, law, books, poetry, social media, and TED Talks, organized by source
sea-commoncrawl: Clean South-East Asia related web corpora from 89 CommonCrawl snapshots, organized by language
sea-pdf-text: Clean PDF text data, with the PDF links sourced from partner information, organized by language
sea-synthetic: Translation dataset from Cosmopedia across multiple languages, which is used to retrieve the high-quality tokens for stage 2, organized by language
sea-commoncrawl-high-quality: The high-quality CommonCrawl subset, which is used in stage 2 of Sailor2 pre-training, organized by language
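Because these corpora are large, it may be more practical to stream a few records than to download a whole repository. The following is a minimal sketch, not an official loading recipe. It assumes the Hugging Face `datasets` library is installed, that the repository `sailor2/sea-commoncrawl` exposes a default configuration with a `train` split, and that each record stores its content under a `text` field. The helper names `stream_records` and `preview` are hypothetical.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

# Repository id of one of the datasets listed above.
REPO_ID = "sailor2/sea-commoncrawl"


def stream_records(repo_id: str = REPO_ID, split: str = "train") -> Iterator[Dict[str, Any]]:
    """Lazily iterate over records without downloading the full corpus.

    Assumption: a default configuration with a "train" split exists; a
    per-language layout may instead need `name=` or `data_dir=`.
    """
    # Imported lazily so that `preview` works without the library installed.
    from datasets import load_dataset

    return iter(load_dataset(repo_id, split=split, streaming=True))


def preview(
    records: Iterable[Dict[str, Any]],
    n: int = 3,
    text_key: str = "text",  # assumed field name; check the dataset viewer
    width: int = 80,
) -> List[str]:
    """Return the text of the first `n` records, truncated to `width` characters."""
    return [str(record.get(text_key, ""))[:width] for record in islice(records, n)]
```

With network access, `preview(stream_records())` would return the first three truncated documents. If the split or field name differs from the assumptions above, adjust `split` and `text_key` accordingly.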
📑 Sailor2 Post-training Dataset
sailor2-sft-stage1: 4M Medium-Quality Instruction tuning dataset, supports English, Chinese and 16 SEA languages.
sailor2-sft-stage2: 400K High-Quality Instruction tuning dataset, supports English, Chinese and 16 SEA languages.
sea-ultrafeedback: Preference optimization dataset, supports English, Chinese and 17 SEA languages.
🧐 Sailor2 Evaluation Benchmark
sea-wildbench: Chat model evaluation, supports 8 SEA languages.
If you find Sailor2 useful, please cite our work as follows:
@article{sailor2report,
title = {Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLM},
author = {Longxu Dou and Qian Liu and Fan Zhou and Changyu Chen and Zili Wang and Ziqi Jin and Zichen Liu and Tongyao Zhu and Cunxiao Du and Penghui Yang and Haonan Wang and Jiaheng Liu and Yongchi Zhao and Xiachong Feng and Xin Mao and Man Tsung Yeung and Kunat Pipatanakul and Fajri Koto and Min Si Thu and Hynek Kydl{\'\i}{\v{c}}ek and Zeyi Liu and Qunshu Lin and Sittipong Sripaisarnmongkol and Kridtaphad Sae-Khow and Nirattisai Thongchim and Taechawat Konkaew and Narong Borijindargoon and Anh Dao and Matichon Maneegard and Phakphum Artkaew and Zheng-Xin Yong and Quan Nguyen and Wannaphong Phatthiyaphaibun and Hoang H. Tran and Mike Zhang and Shiqi Chen and Tianyu Pang and Chao Du and Xinyi Wan and Wei Lu and Min Lin},
journal={arXiv preprint arXiv:2502.12982},
year = {2025}
}