Paper page - MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications

https://huggingface.co/docs/hub/en/spaces-overview.

For the dataset, here's a guide on making it available on the Hub: https://huggingface.co/docs/datasets/loading.

Let me know whether you need any help!

Cheers,

Niels from HF

Paper author

Hi, the MEDIC leaderboard is now live in HF Spaces. https://huggingface.co/spaces/m42-health/MEDIC-Benchmark

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Towards Reliable Medical Question Answering: Techniques and Challenges in Mitigating Hallucinations in Language Models (2024): https://huggingface.co/papers/2408.13808
* Med42-v2: A Suite of Clinical LLMs (2024): https://huggingface.co/papers/2408.06142
* CollectiveSFT: Scaling Large Language Models for Chinese Medical Benchmark with Collective Instructions in Healthcare (2024): https://huggingface.co/papers/2407.19705
* GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI (2024): https://huggingface.co/papers/2408.03361
* Towards Evaluating and Building Versatile Large Language Models for Medicine (2024): https://huggingface.co/papers/2408.12547

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
arxiv:2409.07314

MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications

Published on Sep 11, 2024 · Submitted by Praveenkumar on Sep 12, 2024
#2 Paper of the day

Abstract

The rapid development of Large Language Models (LLMs) for healthcare applications has spurred calls for holistic evaluation beyond frequently cited benchmarks like USMLE, to better reflect real-world performance. While real-world assessments are valuable indicators of utility, they often lag behind the pace of LLM evolution, likely rendering findings obsolete upon deployment. This temporal disconnect necessitates a comprehensive upfront evaluation that can guide model selection for specific clinical applications. We introduce MEDIC, a framework assessing LLMs across five critical dimensions of clinical competence: medical reasoning, ethics and bias, data and language understanding, in-context learning, and clinical safety. MEDIC features a novel cross-examination framework quantifying LLM performance across areas like coverage and hallucination detection, without requiring reference outputs. We apply MEDIC to evaluate LLMs on medical question-answering, safety, summarization, note generation, and other tasks. Our results show performance disparities across model sizes and between baseline and medically fine-tuned models, with implications for model selection in applications requiring specific model strengths, such as low hallucination or lower cost of inference. MEDIC's multifaceted evaluation reveals these performance trade-offs, bridging the gap between theoretical capabilities and practical implementation in healthcare settings, ensuring that the most promising models are identified and adapted for diverse healthcare applications.

AI-generated summary

The MEDIC framework evaluates Large Language Models across five clinical dimensions to guide model selection in healthcare applications, identifying performance trade-offs and ensuring practical implementation.

Community

Paper author Paper submitter

Unlike traditional MCQ benchmarks, the MEDIC framework is designed to evaluate LLMs across five key clinical dimensions, providing a more comprehensive assessment of their real-world applicability and effectiveness.

Thanks for the paper, are you sharing code?

Paper author

Hi, we plan to offer an open leaderboard for everyone to participate in. Additionally, we'll be sharing a subset of our evaluation datasets.



Models citing this paper 1

Datasets citing this paper 0

No dataset links this paper.

Cite arxiv.org/abs/2409.07314 in a dataset README.md to link it from this page.
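As a minimal sketch, a dataset card (README.md) can establish the link simply by mentioning the arXiv URL anywhere in its body; the dataset name and description below are placeholders, not a real repository:

```markdown
# Example clinical evaluation dataset

This dataset was evaluated with the MEDIC framework
(https://arxiv.org/abs/2409.07314), which covers medical reasoning,
ethics and bias, data and language understanding, in-context learning,
and clinical safety.
```

Once a README containing the arXiv link is pushed to a dataset repository on the Hub, the dataset should appear under "Datasets citing this paper" on this page.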

Spaces citing this paper 1

Collections including this paper 8