
arxiv:2508.09945

VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

Published on Aug 13, 2025
Submitted by Jack on Aug 14, 2025
Authors: Lingjie Jiang, Shaohan Huang, Xun Wu, Yixia Li, Dongdong Zhang, Furu Wei
Project page: https://aka.ms/GeneralAI
GitHub: https://github.com/JackLingjie/VisCodex

Abstract

VisCodex integrates vision and coding models to enhance multimodal code generation, achieving top performance using a novel dataset and benchmark.

AI-generated summary

Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a unified framework that seamlessly merges vision and coding language models to empower MLLMs with strong multimodal code generation abilities. Leveraging a task vector-based model merging technique, we integrate a state-of-the-art coding LLM into a strong vision-language backbone, while preserving both visual comprehension and advanced coding skills. To support training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a large-scale and diverse collection of 598k samples, including high-quality HTML code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic problems. Furthermore, we propose InfiBench-V, a novel and challenging benchmark specifically designed to assess models on visually-rich, real-world programming questions that demand a nuanced understanding of both textual and visual contexts. Extensive experiments show that VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models like GPT-4o, highlighting the effectiveness of our model merging strategy and new datasets.
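For intuition, the task vector-based merging described above can be sketched as simple weight arithmetic: subtract the base LLM's weights from the coding LLM's weights to obtain a coding task vector, then add a scaled copy of that vector into the language backbone of the vision-language model. The Python snippet below is a minimal illustrative sketch under that assumption; the merge_task_vector helper, the scaling factor alpha, and the shared-architecture assumption are hypothetical and not taken from the paper.

import torch

def merge_task_vector(vlm_lm_state, base_lm_state, coder_lm_state, alpha=0.5):
    # Hypothetical sketch of task-vector merging, not the authors' exact recipe.
    # Assumes the VLM's language backbone, the base LLM, and the coding LLM
    # share the same architecture and parameter names.
    merged = {}
    for name, w_vlm in vlm_lm_state.items():
        if name in base_lm_state and name in coder_lm_state:
            task_vector = coder_lm_state[name] - base_lm_state[name]  # coding-skill delta
            merged[name] = w_vlm + alpha * task_vector                # blend delta into the VLM backbone
        else:
            merged[name] = w_vlm  # vision-specific or unmatched parameters are left untouched
    return merged

In practice the state dicts would come from torch.load or a Hugging Face checkpoint, and alpha would control how strongly the coding behavior is injected relative to the preserved visual capabilities.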

Community


This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2508.09945 in a model README.md to link it from this page.

Datasets citing this paper 3

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2508.09945 in a Space README.md to link it from this page.

Collections including this paper 1