Paper page - SAIL-VL2 Technical Report

arxiv:2509.14033

SAIL-VL2 Technical Report

Published on Sep 17, 2025 · Submitted by taesiri on Sep 18, 2025
Authors: Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang, Hongyuan Dong, Haiyang Yu, Dingkang Yang, Jiacong Wang, Han Wang, Wenzhuo Liu, Xiao Liang, Shuicheng Yan, Chao Feng

Abstract

We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.

AI-generated summary

SAIL-VL2, a vision-language foundation model, achieves state-of-the-art performance across diverse benchmarks through data curation, progressive training, and sparse MoE architecture.
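The third contribution above concerns sparse Mixture-of-Experts (MoE) designs, where a router activates only a few expert feed-forward networks per token instead of one large dense block. The report does not reproduce its routing code here, so the following is only a generic, minimal PyTorch-style sketch of top-k expert routing; the class name, dimensions, expert count, and top_k value are illustrative assumptions, not SAIL-VL2's actual configuration.

```python
# Minimal, illustrative top-k sparse MoE block (NOT the SAIL-VL2 implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy mixture-of-experts feed-forward layer with top-k token routing."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # token -> expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):
        # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)              # (num_tokens, n_experts)
        weights, idx = torch.topk(probs, self.top_k, dim=-1)   # keep top_k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize kept weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if token_ids.numel() > 0:
                out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

if __name__ == "__main__":
    tokens = torch.randn(16, 512)        # 16 token embeddings
    print(SparseMoE()(tokens).shape)     # torch.Size([16, 512])
```

The general design point such layers exploit is that per-token compute scales with top_k rather than with the total number of experts, so capacity can grow without a proportional increase in inference cost.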

Community


This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space.

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Thanks!


Models citing this paper: 4

Datasets citing this paper: 0

No dataset linking this paper

Cite arxiv.org/abs/2509.14033 in a dataset README.md to link it from this page.
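For example (an illustrative line, not an official template), the dataset's README.md could simply contain a sentence such as: "This dataset accompanies the SAIL-VL2 Technical Report (https://arxiv.org/abs/2509.14033)." Once the arXiv URL appears in the README, the dataset can be linked from this page.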

Spaces citing this paper: 0

No Space linking this paper

Cite arxiv.org/abs/2509.14033 in a Space README.md to link it from this page.

Collections including this paper: 5