LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation
\n","updatedAt":"2026-02-13T01:41:57.022Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7324521541595459},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2602.10367","authors":[{"_id":"698e1234cace060ff123ab63","user":{"_id":"65aae63fc3fa44c7109559bb","avatarUrl":"/avatars/b3f3e5d09b410f717c07b6aea997d595.svg","isPro":false,"fullname":"Zhiling Yan","user":"JuelieYann","type":"user"},"name":"Zhiling Yan","status":"claimed_verified","statusLastChangedAt":"2026-02-13T09:37:17.425Z","hidden":false},{"_id":"698e1234cace060ff123ab64","user":{"_id":"619f01b8cc04eadf54fa5d5d","avatarUrl":"/avatars/928f3d1a6146e2e1ae4860445d929d5c.svg","isPro":false,"fullname":"Song Dingjie","user":"songdj","type":"user"},"name":"Dingjie Song","status":"claimed_verified","statusLastChangedAt":"2026-02-13T09:37:19.177Z","hidden":false},{"_id":"698e1234cace060ff123ab65","name":"Zhe Fang","hidden":false},{"_id":"698e1234cace060ff123ab66","name":"Yisheng Ji","hidden":false},{"_id":"698e1234cace060ff123ab67","name":"Xiang Li","hidden":false},{"_id":"698e1234cace060ff123ab68","name":"Quanzheng Li","hidden":false},{"_id":"698e1234cace060ff123ab69","name":"Lichao Sun","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/65aae63fc3fa44c7109559bb/Np4_KL-euIgN8XivhLJdU.png","https://cdn-uploads.huggingface.co/production/uploads/65aae63fc3fa44c7109559bb/nodaBYULwdaPbpiuBJTDU.png"],"publishedAt":"2026-02-10T23:38:25.000Z","submittedOnDailyAt":"2026-02-12T19:46:29.491Z","title":"LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation","submittedOnDailyBy":{"_id":"65aae63fc3fa44c7109559bb","avatarUrl":"/avatars/b3f3e5d09b410f717c07b6aea997d595.svg","isPro":false,"fullname":"Zhiling Yan","user":"JuelieYann","type":"user"},"summary":"The deployment of Large Language Models (LLMs) in high-stakes clinical settings demands rigorous and reliable evaluation. However, existing medical benchmarks remain static, suffering from two critical limitations: (1) data contamination, where test sets inadvertently leak into training corpora, leading to inflated performance estimates; and (2) temporal misalignment, failing to capture the rapid evolution of medical knowledge. Furthermore, current evaluation metrics for open-ended clinical reasoning often rely on either shallow lexical overlap (e.g., ROUGE) or subjective LLM-as-a-Judge scoring, both inadequate for verifying clinical correctness. To bridge these gaps, we introduce LiveMedBench, a continuously updated, contamination-free, and rubric-based benchmark that weekly harvests real-world clinical cases from online medical communities, ensuring strict temporal separation from model training data. We propose a Multi-Agent Clinical Curation Framework that filters raw data noise and validates clinical integrity against evidence-based medical principles. 
For evaluation, we develop an Automated Rubric-based Evaluation Framework that decomposes physician responses into granular, case-specific criteria, achieving substantially stronger alignment with expert physicians than LLM-as-a-Judge. To date, LiveMedBench comprises 2,756 real-world cases spanning 38 medical specialties and multiple languages, paired with 16,702 unique evaluation criteria. Extensive evaluation of 38 LLMs reveals that even the best-performing model achieves only 39.2%, and 84% of models exhibit performance degradation on post-cutoff cases, confirming pervasive data contamination risks. Error analysis further identifies contextual application-not factual knowledge-as the dominant bottleneck, with 35-48% of failures stemming from the inability to tailor medical knowledge to patient-specific constraints.","upvotes":13,"discussionId":"698e1234cace060ff123ab6a","projectPage":"https://zhilingyan.github.io/LiveMedBench/","githubRepo":"https://github.com/ZhilingYan/LiveMedBench","githubRepoAddedBy":"user","ai_summary":"LiveMedBench addresses limitations in medical LLM evaluation by providing a continuously updated, contamination-free benchmark with rubric-based evaluation that better aligns with expert clinical reasoning.","ai_keywords":["Large Language Models","medical benchmarks","data contamination","temporal misalignment","clinical reasoning","automated rubric-based evaluation","multi-agent clinical curation framework"],"githubStars":2},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"65aae63fc3fa44c7109559bb","avatarUrl":"/avatars/b3f3e5d09b410f717c07b6aea997d595.svg","isPro":false,"fullname":"Zhiling Yan","user":"JuelieYann","type":"user"},{"_id":"65a52766215aabac489e3468","avatarUrl":"/avatars/fe05e22cd7e12e961296426434e17c76.svg","isPro":false,"fullname":"Lichao Sun","user":"sunlichao137","type":"user"},{"_id":"62b88694bd3b8cc946078cb5","avatarUrl":"/avatars/c7251e4426890095f7567e26e9262580.svg","isPro":false,"fullname":"liu","user":"yixin","type":"user"},{"_id":"619f01b8cc04eadf54fa5d5d","avatarUrl":"/avatars/928f3d1a6146e2e1ae4860445d929d5c.svg","isPro":false,"fullname":"Song Dingjie","user":"songdj","type":"user"},{"_id":"689fa36e4b11fd9a48c140b5","avatarUrl":"/avatars/74be328d96c9b1b3ed67076477192f94.svg","isPro":false,"fullname":"kerr","user":"john20012001","type":"user"},{"_id":"683b31eea4edab5a2943475c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/M55ts6HeGaE_kSMXrTH95.png","isPro":false,"fullname":"Dawei Liu","user":"DLPenn","type":"user"},{"_id":"66821006f8fd8217ccae1562","avatarUrl":"/avatars/abc5a07d7f764b9a5f64b3be89edf03c.svg","isPro":false,"fullname":"Xingjian Hu","user":"XingjianHu","type":"user"},{"_id":"6867470f365e2117c72ffdf2","avatarUrl":"/avatars/1bf75650dfd721ca101f55593b5e8f2c.svg","isPro":false,"fullname":"Amber Yan","user":"yanyuchu","type":"user"},{"_id":"62e376a0c4a0c0b9a0cfb773","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62e376a0c4a0c0b9a0cfb773/8PKJ2YXq7sgo6lwyrQfTg.png","isPro":false,"fullname":"Kai Zhang","user":"PanaceaAI","type":"user"},{"_id":"64574d8e182c64e989846ba2","avatarUrl":"/avatars/db4bc496a745e1d7de48215b30f6fd3e.svg","isPro":false,"fullname":"Tyrannosaurus","user":"Tyrannosaurus","type":"user"},{"_id":"67ff0da83ea0149dbf3b038f","avatarUrl":"/avatars/41ac7ff5c84a64431538597554f1f8fb.svg","isPro":false,"fullname":"yh 
he","user":"xiaomaolv233","type":"user"},{"_id":"68b995c30b1b9b7236e1b1b3","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/68b995c30b1b9b7236e1b1b3/pcsSz6xSFiVza_s2CJtss.jpeg","isPro":false,"fullname":"OpenSourceHealth","user":"Khyatimirani","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary
LiveMedBench addresses limitations in medical LLM evaluation by providing a continuously updated, contamination-free benchmark with rubric-based evaluation that better aligns with expert clinical reasoning.

Abstract
The deployment of Large Language Models (LLMs) in high-stakes clinical settings demands rigorous and reliable evaluation. However, existing medical benchmarks remain static, suffering from two critical limitations: (1) data contamination, where test sets inadvertently leak into training corpora, leading to inflated performance estimates; and (2) temporal misalignment, failing to capture the rapid evolution of medical knowledge. Furthermore, current evaluation metrics for open-ended clinical reasoning often rely on either shallow lexical overlap (e.g., ROUGE) or subjective LLM-as-a-Judge scoring, both inadequate for verifying clinical correctness. To bridge these gaps, we introduce LiveMedBench, a continuously updated, contamination-free, and rubric-based benchmark that weekly harvests real-world clinical cases from online medical communities, ensuring strict temporal separation from model training data. We propose a Multi-Agent Clinical Curation Framework that filters raw data noise and validates clinical integrity against evidence-based medical principles. For evaluation, we develop an Automated Rubric-based Evaluation Framework that decomposes physician responses into granular, case-specific criteria, achieving substantially stronger alignment with expert physicians than LLM-as-a-Judge. To date, LiveMedBench comprises 2,756 real-world cases spanning 38 medical specialties and multiple languages, paired with 16,702 unique evaluation criteria. Extensive evaluation of 38 LLMs reveals that even the best-performing model achieves only 39.2%, and 84% of models exhibit performance degradation on post-cutoff cases, confirming pervasive data contamination risks. Error analysis further identifies contextual application, not factual knowledge, as the dominant bottleneck, with 35-48% of failures stemming from the inability to tailor medical knowledge to patient-specific constraints.
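The rubric-based evaluation described in the abstract has a simple core: each case carries granular, case-specific criteria, and a model's score is the fraction of criteria its answer satisfies, as checked by an automated verifier. The Python sketch below illustrates that idea only; the names (`Case`, `check_criterion`, `score_case`, `benchmark_score`) are hypothetical and do not reflect the paper's actual code or API, and the per-criterion verifier is assumed to be an LLM-backed yes/no check supplied by the caller.

```python
# Minimal sketch of per-criterion rubric scoring (illustrative names, not the paper's API).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """One clinical case with its case-specific rubric."""
    question: str         # patient presentation / query harvested from the community
    reference: str        # physician response used to derive the rubric
    criteria: List[str]   # granular, case-specific evaluation criteria


def score_case(case: Case, model_answer: str,
               check_criterion: Callable[[str, str, str], bool]) -> float:
    """Score = fraction of rubric criteria the model answer satisfies.

    `check_criterion(question, answer, criterion)` stands in for the automated
    verifier (e.g., an LLM prompted to return a yes/no judgment).
    """
    if not case.criteria:
        return 0.0
    satisfied = sum(
        check_criterion(case.question, model_answer, c) for c in case.criteria
    )
    return satisfied / len(case.criteria)


def benchmark_score(cases: List[Case], answers: List[str],
                    check_criterion: Callable[[str, str, str], bool]) -> float:
    """Macro-average over cases, expressed as a percentage (one headline number)."""
    per_case = [score_case(c, a, check_criterion) for c, a in zip(cases, answers)]
    return 100.0 * sum(per_case) / len(per_case)
```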
LiveMedBench is a continuously updated, contamination-free, and rubric-based benchmark for evaluating LLMs on real-world medical cases. It is designed to measure not only overall medical quality, but also robustness over time and alignment with physician judgment.
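The post-cutoff degradation finding (84% of models score worse on cases published after their training cutoff) can be reproduced in spirit with a simple temporal split: score each case, partition by the model's cutoff date, and compare means. A minimal sketch follows, assuming per-case publication dates and per-case scores are already available; the function names and inputs are illustrative, not part of the released benchmark code.

```python
# Minimal sketch of a pre- vs post-cutoff comparison used to surface contamination.
from datetime import date
from statistics import mean
from typing import List, Tuple


def split_by_cutoff(case_dates: List[date], scores: List[float],
                    cutoff: date) -> Tuple[List[float], List[float]]:
    """Partition per-case scores into pre-cutoff and post-cutoff groups."""
    pre = [s for d, s in zip(case_dates, scores) if d <= cutoff]
    post = [s for d, s in zip(case_dates, scores) if d > cutoff]
    return pre, post


def degradation(case_dates: List[date], scores: List[float], cutoff: date) -> float:
    """Positive value = the model scores worse on cases published after its cutoff,
    the pattern the abstract reports for most evaluated models."""
    pre, post = split_by_cutoff(case_dates, scores, cutoff)
    if not pre or not post:
        return 0.0
    return mean(pre) - mean(post)
```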