
arxiv:2508.09945

VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

Published on Aug 13, 2025
Submitted by Jack on Aug 14, 2025
Authors: Lingjie Jiang, Shaohan Huang, Xun Wu, Yixia Li, Dongdong Zhang, Furu Wei
Project page: https://aka.ms/GeneralAI
GitHub: https://github.com/JackLingjie/VisCodex

Abstract

VisCodex integrates vision and coding models to enhance multimodal code generation, achieving top performance using a novel dataset and benchmark.

AI-generated summary

Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a unified framework that seamlessly merges vision and coding language models to empower MLLMs with strong multimodal code generation abilities. Leveraging a task vector-based model merging technique, we integrate a state-of-the-art coding LLM into a strong vision-language backbone, while preserving both visual comprehension and advanced coding skills. To support training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a large-scale and diverse collection of 598k samples, including high-quality HTML code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic problems. Furthermore, we propose InfiBench-V, a novel and challenging benchmark specifically designed to assess models on visually-rich, real-world programming questions that demand a nuanced understanding of both textual and visual contexts. Extensive experiments show that VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models like GPT-4o, highlighting the effectiveness of our model merging strategy and new datasets.
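For intuition, the task vector-based merging described above can be sketched as simple weight arithmetic: subtract the base LLM's weights from the coding LLM's weights to obtain a coding task vector, then add a scaled copy of that vector into the language backbone of the vision-language model. The Python snippet below is a minimal illustrative sketch under that assumption; the merge_task_vector helper, the scaling factor alpha, and the shared-architecture assumption are hypothetical and not taken from the paper.

import torch

def merge_task_vector(vlm_lm_state, base_lm_state, coder_lm_state, alpha=0.5):
    # Hypothetical sketch of task-vector merging, not the authors' exact recipe.
    # Assumes the VLM's language backbone, the base LLM, and the coding LLM
    # share the same architecture and parameter names.
    merged = {}
    for name, w_vlm in vlm_lm_state.items():
        if name in base_lm_state and name in coder_lm_state:
            task_vector = coder_lm_state[name] - base_lm_state[name]  # coding-skill delta
            merged[name] = w_vlm + alpha * task_vector                # blend delta into the VLM backbone
        else:
            merged[name] = w_vlm  # vision-specific or unmatched parameters are left untouched
    return merged

In practice the state dicts would come from torch.load or a Hugging Face checkpoint, and alpha would control how strongly the coding behavior is injected relative to the preserved visual capabilities.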

Community


This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2508.09945 in a model README.md to link it from this page.

Datasets citing this paper 3

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2508.09945 in a Space README.md to link it from this page.

Collections including this paper 1