Paper page - Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
AI-generated summary
A medical-specialized multimodal large language model, Lingshu, is introduced with enhanced data curation and reinforcement learning to address limitations in medical applications.
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited because of the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, and (3) a lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data, and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and progressively enhance its task-solving capabilities. In addition, we preliminarily explore the potential of applying the reinforcement learning with verifiable rewards (RLVR) paradigm to enhance Lingshu's medical reasoning ability. We also develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate Lingshu on three fundamental medical tasks: multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms existing open-source multimodal models on most tasks ...
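The abstract's mention of the reinforcement learning with verifiable rewards (RLVR) paradigm refers to training on tasks whose answers can be checked programmatically. The snippet below is a minimal sketch of such a verifiable reward for multiple-choice medical QA; the answer-extraction pattern, function names, and example trace are illustrative assumptions, not the reward implementation used in the paper.

```python
import re
from typing import Optional

def extract_choice(response: str) -> Optional[str]:
    """Pull the final answer letter (A-E) from a model's reasoning trace.

    Prefers an answer wrapped in \\boxed{...}, then falls back to the last
    standalone option letter. Purely illustrative; the paper's parser may differ.
    """
    boxed = re.findall(r"\\boxed\{([A-Ea-e])\}", response)
    if boxed:
        return boxed[-1].upper()
    letters = re.findall(r"\b([A-E])\b", response)
    return letters[-1] if letters else None

def verifiable_reward(response: str, gold: str) -> float:
    """Binary reward: 1.0 if the extracted choice matches the reference answer."""
    pred = extract_choice(response)
    return 1.0 if pred is not None and pred == gold.strip().upper() else 0.0

# Hypothetical reasoning trace for a VQA item whose gold answer is "B".
trace = "The lesion shows irregular borders and multiple colors, so the answer is \\boxed{B}."
print(verifiable_reward(trace, "B"))  # -> 1.0
```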
Lingshu supports more than 12 medical imaging modalities, including X-Ray, CT Scan, MRI, Microscopy, Ultrasound, Histopathology, Dermoscopy, Fundus, OCT, Digital Photography, Endoscopy, and PET.
Lingshu models achieve state-of-the-art results on most medical multimodal/textual QA and report generation tasks at both the 7B and 32B model sizes.
Lingshu-32B outperforms GPT-4.1 and Claude Sonnet 4 in most multimodal QA and report generation tasks.
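For readers who want to try the released checkpoints, the sketch below shows how a Lingshu checkpoint could be loaded for multimodal medical QA with Hugging Face transformers, assuming it exposes a standard image-text-to-text interface with a Qwen2.5-VL-style chat template. The repository ID, image path, and question are illustrative assumptions; consult the project page and model card for the official names and recommended usage.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Hypothetical repository ID; check the model card for the official checkpoint name.
MODEL_ID = "lingshu-medical-mllm/Lingshu-7B"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# One chest X-ray VQA turn (image file and question are illustrative).
image = Image.open("chest_xray.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Is there evidence of pleural effusion in this X-ray?"},
        ],
    }
]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

This only illustrates single-example inference; for benchmark-style evaluation across the multimodal QA, text-based QA, and report generation tasks, the paper's MedEvalKit framework consolidates the relevant datasets.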