VHELM: A Holistic Evaluation of Vision Language Models
Paper: arXiv:2410.07112 · Published on 2024-10-09

Authors: Tony Lee, Haoqin Tu, Chi Heem Wong, Wenhao Zheng, Yiyang Zhou, Yifan Mai, Josselin Somerville Roberts, Michihiro Yasunaga, Huaxiu Yao, Cihang Xie, Percy Liang
AI-generated summary

VHELM extends the HELM framework to evaluate vision-language models across multiple critical aspects, including fairness, multilinguality, bias, and toxicity, using standardized inference parameters and evaluation metrics.
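Since the summary's key mechanism is standardization, here is a minimal sketch of what standardized inference looks like in spirit: every model is queried with the same prompt template and the same decoding parameters, so score differences reflect the models rather than the harness. The interface below is hypothetical, not HELM's or VHELM's actual API.

```python
# Minimal sketch of standardized inference across VLMs. The client interface
# is hypothetical (not the real HELM/VHELM API); the point is that every model
# receives the same prompt template and the same decoding parameters.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class InferenceParams:
    temperature: float = 0.0  # deterministic decoding for reproducibility
    max_tokens: int = 100     # identical output budget for every model

PROMPT_TEMPLATE = "Question: {question}\nAnswer:"

# A "model" here is any callable taking (prompt, image_bytes, params) -> text.
VLM = Callable[[str, bytes, InferenceParams], str]

def exact_match_accuracy(model: VLM,
                         instances: list[tuple[str, bytes, str]],
                         params: InferenceParams = InferenceParams()) -> float:
    """Score one model with a shared prompt template and decoding config."""
    correct = 0
    for question, image, reference in instances:
        prediction = model(PROMPT_TEMPLATE.format(question=question), image, params)
        correct += prediction.strip().lower() == reference.strip().lower()
    return correct / len(instances)
```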
Abstract

Current benchmarks for assessing vision-language models (VLMs) often focus on
their perception or problem-solving capabilities and neglect other critical
aspects such as fairness, multilinguality, or toxicity. Furthermore, they
differ in their evaluation procedures and the scope of the evaluation, making
it difficult to compare models. To address these issues, we extend the HELM
framework to VLMs to present the Holistic Evaluation of Vision Language Models
(VHELM). VHELM aggregates various datasets to cover one or more of the 9
aspects: visual perception, knowledge, reasoning, bias, fairness,
multilinguality, robustness, toxicity, and safety. In doing so, we produce a
comprehensive, multi-dimensional view of the capabilities of the VLMs across
these important factors. In addition, we standardize the inference
parameters, prompting methods, and evaluation metrics to enable fair
comparisons across models. Our framework is designed to be lightweight and
automatic so that evaluation runs are cheap and fast. Our initial run evaluates
22 VLMs on 21 existing datasets to provide a holistic snapshot of the models.
We uncover new key findings, such as the fact that efficiency-focused models
(e.g., Claude 3 Haiku or Gemini 1.5 Flash) perform significantly worse than
their full counterparts (e.g., Claude 3 Opus or Gemini 1.5 Pro) on the bias benchmark
but not when evaluated on the other aspects. For transparency, we release the
raw model generations and complete results on our website
(https://crfm.stanford.edu/helm/vhelm/v2.0.1). VHELM is intended to be a living
benchmark, and we hope to continue adding new datasets and models over time.
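As a rough illustration of how "datasets covering one or more of the nine aspects" can yield a per-aspect scorecard, the sketch below averages per-dataset scores into aspect scores. The dataset-to-aspect mapping and the numbers are invented placeholders, not VHELM's actual configuration; in practice, HELM-based evaluations are run and summarized with the framework's own tooling.

```python
# Hypothetical sketch: rolling per-dataset scores up into the nine aspect
# scores. The mapping and the scores are illustrative placeholders only.
from collections import defaultdict
from statistics import mean

# Each dataset is tagged with the aspect(s) it covers (invented mapping).
DATASET_ASPECTS = {
    "viz_qa": ["visual perception"],
    "knowledge_qa": ["knowledge", "reasoning"],
    "multilingual_captions": ["multilinguality"],
    "bias_pairs": ["bias"],
}

# Per-dataset scores for one model (invented numbers).
scores = {"viz_qa": 0.81, "knowledge_qa": 0.64,
          "multilingual_captions": 0.52, "bias_pairs": 0.70}

def aspect_scores(per_dataset: dict[str, float]) -> dict[str, float]:
    """Mean score per aspect, averaged over every dataset tagged with it."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for dataset, value in per_dataset.items():
        for aspect in DATASET_ASPECTS[dataset]:
            buckets[aspect].append(value)
    return {aspect: mean(values) for aspect, values in buckets.items()}

print(aspect_scores(scores))
# -> {'visual perception': 0.81, 'knowledge': 0.64, 'reasoning': 0.64,
#     'multilinguality': 0.52, 'bias': 0.7}
```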
Comments

Haoqin Tu (PahaII) commented on 2024-10-10:

The VHELM benchmark stands out from other benchmarks by explicitly offering evaluation aspects across a wide range of scenarios, while ensuring full transparency with model outputs, code, and data. Check our website: https://crfm.stanford.edu/helm/vhelm/v2.0.1/

Librarian Bot (librarian-bot) commented on 2024-10-12:

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* DARE: Diverse Visual Question Answering with Robustness Evaluation (https://huggingface.co/papers/2409.18023) (2024)
* A Survey on Multimodal Benchmarks: In the Era of Large AI Models (https://huggingface.co/papers/2409.18142) (2024)
* A Survey on Benchmarks of Multimodal Large Language Models (https://huggingface.co/papers/2408.08632) (2024)
* A Survey on Evaluation of Multimodal Large Language Models (https://huggingface.co/papers/2408.15769) (2024)
* JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images (https://huggingface.co/papers/2409.12953) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`