Pixtral 12B
Published on Oct 9, 2024
Authors: Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Devendra Chaplot, Jessica Chudnovsky, Saurabh Garg, Theophile Gervet, Soham Ghosh, Amélie Héliou, Paul Jacob, Albert Q. Jiang, Timothée Lacroix, Guillaume Lample, Diego Las Casas, Thibaut Lavril, Teven Le Scao, Andy Lo, William Marshall, Louis Martin, Arthur Mensch, Pavankumar Muddireddy, Valera Nemychnikova, Marie Pellat, Patrick Von Platen, Nikhil Raghuraman, Baptiste Rozière, Alexandre Sablayrolles, Lucile Saulnier, Romain Sauvestre, Wendy Shang, Roman Soletskyi, Lawrence Stewart, Pierre Stock, Joachim Studnia, Sandeep Subramanian, Sagar Vaze, Thomas Wang
Code: https://github.com/mistralai/mistral-inference
Abstract
Pixtral-12B, a 12-billion-parameter multimodal language model, excels in both natural language and image understanding, surpassing larger models and introducing an open-source benchmark for evaluation.
We introduce Pixtral-12B, a 12-billion-parameter multimodal language model. Pixtral-12B is trained to understand both natural images and documents, achieving leading performance on various multimodal benchmarks and surpassing a number of larger models. Unlike many open-source models, Pixtral is also a cutting-edge text model for its size, and does not compromise on natural language performance to excel in multimodal tasks. Pixtral uses a new vision encoder trained from scratch, which allows it to ingest images at their natural resolution and aspect ratio. This gives users flexibility in the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens. Pixtral 12B substantially outperforms other open models of similar sizes (Llama-3.2 11B and Qwen-2-VL 7B). It also outperforms much larger open models like Llama-3.2 90B while being 7x smaller. We further contribute an open-source benchmark, MM-MT-Bench, for evaluating vision-language models in practical scenarios, and provide detailed analysis and code for standardized evaluation protocols for multimodal LLMs. Pixtral-12B is released under the Apache 2.0 license.
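To make the link between native resolution and token count concrete, here is a rough sketch of the arithmetic. It assumes the image is split into square, non-overlapping patches with one token per patch. The patch size of 16 pixels, the reading of "128K" as 128,000 tokens, and both function names are illustrative assumptions rather than details stated in the abstract. Any extra separator or end-of-image tokens are ignored.

```python
import math


def image_token_count(width: int, height: int, patch_size: int = 16) -> int:
    """Estimate the number of patch tokens for an image kept at its native size.

    patch_size is a hypothetical value chosen for illustration only.
    """
    if width <= 0 or height <= 0 or patch_size <= 0:
        raise ValueError("width, height and patch_size must be positive")
    # Partial patches at the right and bottom edges still cost one token each.
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    return rows * cols


def images_that_fit(
    width: int,
    height: int,
    context_tokens: int = 128_000,  # assumed reading of "128K tokens"
    reserved_text_tokens: int = 0,
    patch_size: int = 16,
) -> int:
    """Estimate how many same-sized images fit in the remaining context budget."""
    per_image = image_token_count(width, height, patch_size)
    budget = max(0, context_tokens - reserved_text_tokens)
    return budget // per_image


if __name__ == "__main__":
    for size in [(512, 512), (1024, 768), (500, 300)]:
        print(size, image_token_count(*size), images_that_fit(*size))
```

Under these assumptions, downscaling an image before encoding reduces its token cost roughly in proportion to its area, which is the flexibility the abstract describes.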
Community
Pixtral 12B
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens (2024)
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (2024)
- NVLM: Open Frontier-Class Multimodal LLMs (2024)
- AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding (2024)
- Emu3: Next-Token Prediction is All You Need (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
That's a lot of parameters >.<
Is a brain that large allowed?
I'm dating ChatGPT already