Papers
arxiv:2504.00595

Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources

Published on Apr 1, 2025
Submitted by Weizhi Wang on Apr 2, 2025
Authors: Weizhi Wang, Yu Tian, Linjie Yang, Heng Wang, Xifeng Yan

Abstract

AI-generated summary

Open-Qwen2VL is a 2B-parameter multimodal LLM pre-trained efficiently on image-text pairs with improved data filtering and sequence packing, outperforming state-of-the-art models on various benchmarks.

The reproduction of state-of-the-art multimodal LLM pre-training faces barriers at every stage of the pipeline, including high-quality data filtering, multimodal data mixture strategies, sequence packing techniques, and training frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter Multimodal Large Language Model pre-trained efficiently on 29M image-text pairs using only 442 A100-40G GPU hours. Our approach employs low-to-high dynamic image resolution and multimodal sequence packing to significantly enhance pre-training efficiency. The training dataset was carefully curated using both MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based filtering methods, substantially improving data quality and training efficiency. The Open-Qwen2VL pre-training is conducted on academic level 8xA100-40G GPUs at UCSB on 5B packed multimodal tokens, which is 0.36% of the 1.4T multimodal pre-training tokens of Qwen2-VL. The final instruction-tuned Open-Qwen2VL outperforms the partially-open state-of-the-art MLLM Qwen2-VL-2B on various multimodal benchmarks of MMBench, SEEDBench, MMstar, and MathVista, indicating the remarkable training efficiency of Open-Qwen2VL. We open-source all aspects of our work, including compute-efficient and data-efficient training details, data filtering methods, sequence packing scripts, pre-training data in WebDataset format, the FSDP-based training codebase, and both base and instruction-tuned model checkpoints. We redefine "fully open" for multimodal LLMs as the complete release of: 1) the training codebase, 2) detailed data filtering techniques, and 3) all pre-training and supervised fine-tuning data used to develop the model.

Community

Weizhi Wang
Paper author Paper submitter
edited Apr 2, 2025

https://huggingface.co/weizhiwang/Open-Qwen2VL

Avishai Elmakies

This looks great! We need to push more open-source and compute-efficient methods to train good models. We released a similar paper recently on training speech language models in a compute-constrained setting: https://huggingface.co/papers/2502.15814

ben burtenshaw

Open-Qwen2VL is interesting, but is Table 1 correct in relation to Idefics? The models and datasets are available on the Hub, and much of the code for the datasets and models is on GitHub.

[Screenshot: Table 1 from the paper]

Here's a non-exhaustive list of open resources:

- data creation https://github.com/huggingface/OBELICS
- datasets https://huggingface.co/collections/HuggingFaceM4/obelics-6509a2ef647a3ea442ce2fbd
- data paper https://huggingface.co/papers/2306.16527
- models https://huggingface.co/collections/HuggingFaceM4/idefics-6509a1aaabdde5290e80b855

Weizhi Wang
Paper author

Hi, we apologize for the wrong information in Table 1. We have updated Table 1 in both the arXiv paper and the project website.

Andres Marafioti

All of the pre-training codebase is open as well: https://github.com/huggingface/smollm/tree/main/vision/m4

So, sequence packing scripts and data filtering techniques should also be open for Idefics.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- Should VLMs be Pre-trained with Image Data? (2025) https://huggingface.co/papers/2503.07603
- FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression (2025) https://huggingface.co/papers/2502.18512
- BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries (2025) https://huggingface.co/papers/2503.12446
- LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning (2025) https://huggingface.co/papers/2503.15621
- M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance (2025) https://huggingface.co/papers/2502.18778
- OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models (2025) https://huggingface.co/papers/2503.08686
- HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding (2025) https://huggingface.co/papers/2503.14694

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 3

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2504.00595 in a Space README.md to link it from this page.

Collections including this paper 9