Paper page - Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
AI-generated summary
A medical-specialized multimodal large language model, Lingshu, is introduced with enhanced data curation and reinforcement learning to address limitations in medical applications.
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited because of the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, and (3) a lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data, and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and progressively enhance its task-solving capabilities. In addition, we preliminarily explore the potential of applying the reinforcement learning with verifiable rewards (RLVR) paradigm to enhance Lingshu's medical reasoning ability. We also develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate Lingshu on three fundamental medical tasks: multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms existing open-source multimodal models on most tasks ...
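The abstract's mention of the reinforcement learning with verifiable rewards (RLVR) paradigm refers to training on tasks whose answers can be checked programmatically. The snippet below is a minimal sketch of such a verifiable reward for multiple-choice medical QA; the answer-extraction pattern, function names, and example trace are illustrative assumptions, not the reward implementation used in the paper.

```python
import re
from typing import Optional

def extract_choice(response: str) -> Optional[str]:
    """Pull the final answer letter (A-E) from a model's reasoning trace.

    Prefers an answer wrapped in \\boxed{...}, then falls back to the last
    standalone option letter. Purely illustrative; the paper's parser may differ.
    """
    boxed = re.findall(r"\\boxed\{([A-Ea-e])\}", response)
    if boxed:
        return boxed[-1].upper()
    letters = re.findall(r"\b([A-E])\b", response)
    return letters[-1] if letters else None

def verifiable_reward(response: str, gold: str) -> float:
    """Binary reward: 1.0 if the extracted choice matches the reference answer."""
    pred = extract_choice(response)
    return 1.0 if pred is not None and pred == gold.strip().upper() else 0.0

# Hypothetical reasoning trace for a VQA item whose gold answer is "B".
trace = "The lesion shows irregular borders and multiple colors, so the answer is \\boxed{B}."
print(verifiable_reward(trace, "B"))  # -> 1.0
```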
Lingshu supports more than 12 medical imaging modalities, including X-Ray, CT Scan, MRI, Microscopy, Ultrasound, Histopathology, Dermoscopy, Fundus, OCT, Digital Photography, Endoscopy, and PET.
Lingshu models achieve state-of-the-art results on most medical multimodal/textual QA and report generation tasks at both the 7B and 32B model sizes.
Lingshu-32B outperforms GPT-4.1 and Claude Sonnet 4 in most multimodal QA and report generation tasks.
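For readers who want to try the released checkpoints, the sketch below shows how a Lingshu checkpoint could be loaded for multimodal medical QA with Hugging Face transformers, assuming it exposes a standard image-text-to-text interface with a Qwen2.5-VL-style chat template. The repository ID, image path, and question are illustrative assumptions; consult the project page and model card for the official names and recommended usage.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Hypothetical repository ID; check the model card for the official checkpoint name.
MODEL_ID = "lingshu-medical-mllm/Lingshu-7B"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# One chest X-ray VQA turn (image file and question are illustrative).
image = Image.open("chest_xray.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Is there evidence of pleural effusion in this X-ray?"},
        ],
    }
]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

This only illustrates single-example inference; for benchmark-style evaluation across the multimodal QA, text-based QA, and report generation tasks, the paper's MedEvalKit framework consolidates the relevant datasets.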