Paper page - LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

Code: https://github.com/EvolvingLMMs-Lab/lmms-eval
LiveBench Dataset: https://huggingface.co/datasets/lmms-lab/LiveBench
LiveBench Leaderboard: https://huggingface.co/spaces/lmms-lab/LiveBench
LMMs-Eval Lite: https://huggingface.co/datasets/lmms-lab/LMMs-Eval-Lite

\n","updatedAt":"2024-07-18T02:33:20.130Z","author":{"_id":"64bb77e786e7fb5b8a317a43","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64bb77e786e7fb5b8a317a43/J0jOrlZJ9gazdYaeSH2Bo.png","fullname":"kcz","name":"kcz358","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":21,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.5485765933990479},"editors":["kcz358"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/64bb77e786e7fb5b8a317a43/J0jOrlZJ9gazdYaeSH2Bo.png"],"reactions":[],"isReport":false},"replies":[{"id":"669b6f179f0576abc61da18d","author":{"_id":"5f1158120c833276f61f1a84","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1608042047613-5f1158120c833276f61f1a84.jpeg","fullname":"Niels Rogge","name":"nielsr","type":"user","isPro":false,"isHf":true,"isHfAdmin":false,"isMod":false,"followerCount":1096,"isUserFollowing":false},"createdAt":"2024-07-20T08:02:31.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Hi @kcz358 congrats on your work! Thanks for releasing artifacts on the hub.\n\nWould you be able to link them to this paper page?\n\nSee here on how to do that: https://huggingface.co/docs/hub/en/paper-pages#linking-a-paper-to-a-model-dataset-or-space\n\nCheers,\n\nNiels","html":"

Hi \n\n@kcz358\n\t congrats on your work! Thanks for releasing artifacts on the hub.

\n

Would you be able to link them to this paper page?

\n

See here on how to do that: https://huggingface.co/docs/hub/en/paper-pages#linking-a-paper-to-a-model-dataset-or-space

\n

Cheers,

\n

Niels

\n","updatedAt":"2024-07-20T08:02:31.735Z","author":{"_id":"5f1158120c833276f61f1a84","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1608042047613-5f1158120c833276f61f1a84.jpeg","fullname":"Niels Rogge","name":"nielsr","type":"user","isPro":false,"isHf":true,"isHfAdmin":false,"isMod":false,"followerCount":1096,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.8104231357574463},"editors":["nielsr"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1608042047613-5f1158120c833276f61f1a84.jpeg"],"reactions":[],"isReport":false,"parentCommentId":"66987ef02f9cd07f44a83dad"}},{"id":"669b71ce61278f96d8fe9765","author":{"_id":"64bb77e786e7fb5b8a317a43","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64bb77e786e7fb5b8a317a43/J0jOrlZJ9gazdYaeSH2Bo.png","fullname":"kcz","name":"kcz358","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":21,"isUserFollowing":false},"createdAt":"2024-07-20T08:14:06.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Hi @nielsr , thank you for your suggestions. We have link our paper to the [LiveBench](https://huggingface.co/datasets/lmms-lab/LiveBench) and include it in our collection [here](https://huggingface.co/collections/lmms-lab/lmms-eval-661d51f70a9d678b6f43f272)\n\nCheers,\n\nKaichen","html":"

Hi \n\n@nielsr\n\t , thank you for your suggestions. We have link our paper to the LiveBench and include it in our collection here

\n

Cheers,

\n

Kaichen

\n","updatedAt":"2024-07-20T08:14:06.241Z","author":{"_id":"64bb77e786e7fb5b8a317a43","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64bb77e786e7fb5b8a317a43/J0jOrlZJ9gazdYaeSH2Bo.png","fullname":"kcz","name":"kcz358","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":21,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7923917174339294},"editors":["kcz358"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/64bb77e786e7fb5b8a317a43/J0jOrlZJ9gazdYaeSH2Bo.png"],"reactions":[],"isReport":false,"parentCommentId":"66987ef02f9cd07f44a83dad"}}]},{"id":"6699c2569a4bf63e089cf60e","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2024-07-19T01:33:10.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation](https://huggingface.co/papers/2407.00468) (2024)\n* [VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models](https://huggingface.co/papers/2407.11691) (2024)\n* [Imp: Highly Capable Large Multimodal Models for Mobile Devices](https://huggingface.co/papers/2405.12107) (2024)\n* [MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs](https://huggingface.co/papers/2407.01509) (2024)\n* [MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs](https://huggingface.co/papers/2406.11833) (2024)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2024-07-19T01:33:10.245Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7341183423995972},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[{"reaction":"👍","users":["shwubham"],"count":1}],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2407.12772","authors":[{"_id":"66987d94a029d7f9e39da94f","user":{"_id":"64bb77e786e7fb5b8a317a43","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64bb77e786e7fb5b8a317a43/J0jOrlZJ9gazdYaeSH2Bo.png","isPro":false,"fullname":"kcz","user":"kcz358","type":"user"},"name":"Kaichen Zhang","status":"claimed_verified","statusLastChangedAt":"2024-07-18T09:05:28.420Z","hidden":false},{"_id":"66987d94a029d7f9e39da950","user":{"_id":"62d3f7d84b0933c48f3cdd9c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62d3f7d84b0933c48f3cdd9c/Tab1vxtxLatWzXS8NVIyo.png","isPro":true,"fullname":"Bo Li","user":"luodian","type":"user"},"name":"Bo Li","status":"claimed_verified","statusLastChangedAt":"2024-08-09T07:48:40.807Z","hidden":false},{"_id":"66987d94a029d7f9e39da951","user":{"_id":"63565cc56d7fcf1bedb7d347","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63565cc56d7fcf1bedb7d347/XGcHP4VkO_oieA1gZ4IAX.jpeg","isPro":false,"fullname":"Zhang Peiyuan","user":"PY007","type":"user"},"name":"Peiyuan Zhang","status":"claimed_verified","statusLastChangedAt":"2024-07-18T09:05:25.903Z","hidden":false},{"_id":"66987d94a029d7f9e39da952","user":{"_id":"646e1ef5075bbcc48ddf21e8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/_vJC0zeVOIvaNV2R6toqg.jpeg","isPro":false,"fullname":"Pu Fanyi","user":"pufanyi","type":"user"},"name":"Fanyi Pu","status":"claimed_verified","statusLastChangedAt":"2024-07-22T07:10:47.248Z","hidden":false},{"_id":"66987d94a029d7f9e39da953","name":"Joshua Adrian Cahyono","hidden":false},{"_id":"66987d94a029d7f9e39da954","user":{"_id":"6400ba2b261cfa61f3a00555","avatarUrl":"/avatars/1311e0b5e21b1c94d73fcaf455d3c7f7.svg","isPro":false,"fullname":"Kairui","user":"KairuiHu","type":"user"},"name":"Kairui Hu","status":"claimed_verified","statusLastChangedAt":"2025-01-24T09:08:32.574Z","hidden":false},{"_id":"66987d94a029d7f9e39da955","user":{"_id":"64f7f5b54101c731ca84ae05","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64f7f5b54101c731ca84ae05/13DwdxOo3tWbxKDLd44B9.jpeg","isPro":false,"fullname":"Shuai Liu","user":"Choiszt","type":"user"},"name":"Shuai Liu","status":"claimed_verified","statusLastChangedAt":"2025-03-07T13:38:17.038Z","hidden":false},{"_id":"66987d94a029d7f9e39da956","user":{"_id":"62a993d80472c0b7f94027df","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62a993d80472c0b7f94027df/j5vp-IwLA2YBexylUHiQU.png","isPro":false,"fullname":"Zhang Yuanhan","user":"ZhangYuanhan","type":"user"},"name":"Yuanhan Zhang","status":"claimed_verified","statusLastChangedAt":"2024-07-19T11:23:46.860Z","hidden":false},{"_id":"66987d94a029d7f9e39da957","name":"Jingkang Yang","hidden":false},{"_id":"66987d94a029d7f9e39da958","name":"Chunyuan 
Li","hidden":false},{"_id":"66987d94a029d7f9e39da959","name":"Ziwei Liu","hidden":false}],"publishedAt":"2024-07-17T17:51:53.000Z","submittedOnDailyAt":"2024-07-18T01:03:20.124Z","title":"LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models","submittedOnDailyBy":{"_id":"64bb77e786e7fb5b8a317a43","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64bb77e786e7fb5b8a317a43/J0jOrlZJ9gazdYaeSH2Bo.png","isPro":false,"fullname":"kcz","user":"kcz358","type":"user"},"summary":"The advances of large foundation models necessitate wide-coverage, low-cost,\nand zero-contamination benchmarks. Despite continuous exploration of language\nmodel evaluations, comprehensive studies on the evaluation of Large Multi-modal\nModels (LMMs) remain limited. In this work, we introduce LMMS-EVAL, a unified\nand standardized multimodal benchmark framework with over 50 tasks and more\nthan 10 models to promote transparent and reproducible evaluations. Although\nLMMS-EVAL offers comprehensive coverage, we find it still falls short in\nachieving low cost and zero contamination. To approach this evaluation\ntrilemma, we further introduce LMMS-EVAL LITE, a pruned evaluation toolkit that\nemphasizes both coverage and efficiency. Additionally, we present Multimodal\nLIVEBENCH that utilizes continuously updating news and online forums to assess\nmodels' generalization abilities in the wild, featuring a low-cost and\nzero-contamination evaluation approach. In summary, our work highlights the\nimportance of considering the evaluation trilemma and provides practical\nsolutions to navigate the trade-offs in evaluating large multi-modal models,\npaving the way for more effective and reliable benchmarking of LMMs. We\nopensource our codebase and maintain leaderboard of LIVEBENCH at\nhttps://github.com/EvolvingLMMs-Lab/lmms-eval and\nhttps://huggingface.co/spaces/lmms-lab/LiveBench.","upvotes":35,"discussionId":"66987d98a029d7f9e39daac2","githubRepo":"https://github.com/evolvinglmms-lab/lmms-eval","githubRepoAddedBy":"auto","ai_summary":"LMMS-EVAL and LMMS-EVAL LITE provide frameworks for evaluating large multi-modal models with comprehensive coverage, while Multimodal LIVEBENCH assesses models' generalization using real-world data.","ai_keywords":["Large Multi-modal Models","LMMS-EVAL","LMMS-EVAL LITE","Multimodal LIVEBENCH","multimodal benchmark"],"githubStars":3701},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"64587be872b60ae7a3817858","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64587be872b60ae7a3817858/BbdOOxOCEzWTvEpkWp8MM.png","isPro":false,"fullname":"Minbyul Jeong","user":"Minbyul","type":"user"},{"_id":"668cd4bbe990292e5f6974d3","avatarUrl":"/avatars/d1747b2372e94500ecb5fb56809b482d.svg","isPro":false,"fullname":"Jinyeong Kim","user":"rubatoyeong","type":"user"},{"_id":"64bb77e786e7fb5b8a317a43","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64bb77e786e7fb5b8a317a43/J0jOrlZJ9gazdYaeSH2Bo.png","isPro":false,"fullname":"kcz","user":"kcz358","type":"user"},{"_id":"63565cc56d7fcf1bedb7d347","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63565cc56d7fcf1bedb7d347/XGcHP4VkO_oieA1gZ4IAX.jpeg","isPro":false,"fullname":"Zhang Peiyuan","user":"PY007","type":"user"},{"_id":"62d3f7d84b0933c48f3cdd9c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62d3f7d84b0933c48f3cdd9c/Tab1vxtxLatWzXS8NVIyo.png","isPro":true,"fullname":"Bo 
Li","user":"luodian","type":"user"},{"_id":"62ab1ac1d48b4d8b048a3473","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1656826685333-62ab1ac1d48b4d8b048a3473.png","isPro":false,"fullname":"Ziwei Liu","user":"liuziwei7","type":"user"},{"_id":"646e1ef5075bbcc48ddf21e8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/_vJC0zeVOIvaNV2R6toqg.jpeg","isPro":false,"fullname":"Pu Fanyi","user":"pufanyi","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"644b78959e85a62bf07655f2","avatarUrl":"/avatars/518660b7743715af57629e863a038165.svg","isPro":false,"fullname":"Dmitri Iourovitski","user":"IoDmitri","type":"user"},{"_id":"616fb788e2ad27af26561b1a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1675485317568-616fb788e2ad27af26561b1a.jpeg","isPro":false,"fullname":"Xiao Xu","user":"LooperXX","type":"user"},{"_id":"6350c89759bfa9a85d434138","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1666238674117-6350c89759bfa9a85d434138.jpeg","isPro":false,"fullname":"Yang Lee","user":"innovation64","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
Papers
arxiv:2407.12772

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

Published on Jul 17, 2024 · Submitted by kcz (kcz358) on Jul 18, 2024
Authors: Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, Ziwei Liu

Abstract

AI-generated summary: LMMS-EVAL and LMMS-EVAL LITE provide frameworks for evaluating large multimodal models with comprehensive coverage, while Multimodal LIVEBENCH assesses models' generalization using real-world data.

The advances of large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations, comprehensive studies on the evaluation of Large Multimodal Models (LMMs) remain limited. In this work, we introduce LMMS-EVAL, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models to promote transparent and reproducible evaluations. Although LMMS-EVAL offers comprehensive coverage, we find it still falls short in achieving low cost and zero contamination. To approach this evaluation trilemma, we further introduce LMMS-EVAL LITE, a pruned evaluation toolkit that emphasizes both coverage and efficiency. Additionally, we present Multimodal LIVEBENCH, which utilizes continuously updating news and online forums to assess models' generalization abilities in the wild, featuring a low-cost and zero-contamination evaluation approach. In summary, our work highlights the importance of considering the evaluation trilemma and provides practical solutions to navigate the trade-offs in evaluating large multimodal models, paving the way for more effective and reliable benchmarking of LMMs. We open-source our codebase and maintain the LIVEBENCH leaderboard at https://github.com/EvolvingLMMs-Lab/lmms-eval and https://huggingface.co/spaces/lmms-lab/LiveBench.
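
The abstract describes LMMS-EVAL as a unified, standardized framework covering 50+ tasks and 10+ models. As a rough illustration of how such a toolkit is typically driven, the sketch below launches one evaluation run through a command-line entry point; the model identifier, checkpoint, task name, and flags shown here are assumptions for illustration only, so consult the repository README for the exact interface.

```python
# Minimal sketch of driving an LMMs-Eval-style evaluation run from Python.
# The CLI flags, model name ("llava"), checkpoint, and task name ("mme") are
# assumptions for illustration; check the lmms-eval README for the interface
# the toolkit actually exposes.
import subprocess

cmd = [
    "python", "-m", "lmms_eval",
    "--model", "llava",                                      # hypothetical model identifier
    "--model_args", "pretrained=liuhaotian/llava-v1.5-7b",   # hypothetical checkpoint
    "--tasks", "mme",                                        # hypothetical task name
    "--batch_size", "1",
    "--output_path", "./logs/",                              # where result JSONs would be written
]

# Run the evaluation and raise immediately if it fails.
subprocess.run(cmd, check=True)
```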

Community

Niels Rogge (nielsr)

Hi @kcz358 congrats on your work! Thanks for releasing artifacts on the hub.

Would you be able to link them to this paper page?

See here on how to do that: https://huggingface.co/docs/hub/en/paper-pages#linking-a-paper-to-a-model-dataset-or-space

Cheers,

Niels

kcz358 (Paper author, Paper submitter)

Hi @nielsr, thank you for your suggestions. We have linked our paper to the LiveBench dataset (https://huggingface.co/datasets/lmms-lab/LiveBench) and included it in our collection: https://huggingface.co/collections/lmms-lab/lmms-eval-661d51f70a9d678b6f43f272

Cheers,

Kaichen

Librarian Bot (librarian-bot)

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation (2024) - https://huggingface.co/papers/2407.00468
* VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models (2024) - https://huggingface.co/papers/2407.11691
* Imp: Highly Capable Large Multimodal Models for Mobile Devices (2024) - https://huggingface.co/papers/2405.12107
* MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs (2024) - https://huggingface.co/papers/2407.01509
* MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs (2024) - https://huggingface.co/papers/2406.11833

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2407.12772 in a model README.md to link it from this page.
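
The Hub associates a repository with this paper page once the paper's arXiv URL appears in the repository's README.md. A minimal sketch of doing that programmatically is below; the local README path and the wording of the citation line are assumptions for illustration.

```python
# Minimal sketch: add the paper's arXiv link to a model card so the Hub can
# associate the repository with this paper page. The path to README.md and the
# citation wording are assumptions for illustration.
from pathlib import Path

readme = Path("README.md")  # model card of the repository to be linked
citation_line = "\nThis model was evaluated with LMMs-Eval (https://arxiv.org/abs/2407.12772).\n"

# Append the arXiv URL only if the card does not already contain it.
text = readme.read_text(encoding="utf-8") if readme.exists() else ""
if "arxiv.org/abs/2407.12772" not in text:
    readme.write_text(text + citation_line, encoding="utf-8")
```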

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2407.12772 in a Space README.md to link it from this page.

Collections including this paper 7