Paper page - UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation
\n","updatedAt":"2025-04-02T01:36:24.877Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6818743348121643},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2503.14941","authors":[{"_id":"67eb932522a341478ae86cb6","user":{"_id":"67a99d1fef1439e285c4cbec","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/VrwUmrY2wsg4sVSIMc--K.png","isPro":false,"fullname":"Qihui Zhang","user":"77Hui","type":"user"},"name":"Qihui Zhang","status":"claimed_verified","statusLastChangedAt":"2025-04-01T07:46:00.373Z","hidden":false},{"_id":"67eb932522a341478ae86cb7","user":{"_id":"65e14c28b1a6de8a71e70172","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65e14c28b1a6de8a71e70172/D097SILGsqoufpp3sG8tV.jpeg","isPro":false,"fullname":"Munan Ning","user":"MunanNing","type":"user"},"name":"Munan Ning","status":"admin_assigned","statusLastChangedAt":"2025-04-01T08:18:39.336Z","hidden":false},{"_id":"67eb932522a341478ae86cb8","name":"Zheyuan Liu","hidden":false},{"_id":"67eb932522a341478ae86cb9","name":"Yanbo Wang","hidden":false},{"_id":"67eb932522a341478ae86cba","name":"Jiayi Ye","hidden":false},{"_id":"67eb932522a341478ae86cbb","user":{"_id":"6637443ecd9097ac3c996d3c","avatarUrl":"/avatars/d1c38bf03c2517ba0a7004b2f9f9bc96.svg","isPro":false,"fullname":"yue","user":"yuehuang","type":"user"},"name":"Yue Huang","status":"admin_assigned","statusLastChangedAt":"2025-04-01T08:19:26.627Z","hidden":false},{"_id":"67eb932522a341478ae86cbc","name":"Shuo Yang","hidden":false},{"_id":"67eb932522a341478ae86cbd","name":"Xiao Chen","hidden":false},{"_id":"67eb932522a341478ae86cbe","user":{"_id":"62c51800cb7033fd49b8efb7","avatarUrl":"/avatars/06c2be0015f8022f9912f2279f2b3597.svg","isPro":false,"fullname":"Song","user":"Yibing","type":"user"},"name":"Yibing Song","status":"admin_assigned","statusLastChangedAt":"2025-04-01T08:19:07.616Z","hidden":false},{"_id":"67eb932522a341478ae86cbf","user":{"_id":"66135a5e50350afe76beebce","avatarUrl":"/avatars/370a4b83949355feb050c2cb0425c264.svg","isPro":false,"fullname":"yl2488","user":"yl2488","type":"user"},"name":"Li Yuan","status":"claimed_verified","statusLastChangedAt":"2025-04-04T07:09:54.399Z","hidden":false}],"publishedAt":"2025-03-19T07:15:41.000Z","submittedOnDailyAt":"2025-04-01T05:48:16.581Z","title":"UPME: An Unsupervised Peer Review Framework for Multimodal Large\n Language Model Evaluation","submittedOnDailyBy":{"_id":"67a99d1fef1439e285c4cbec","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/VrwUmrY2wsg4sVSIMc--K.png","isPro":false,"fullname":"Qihui Zhang","user":"77Hui","type":"user"},"summary":"Multimodal Large Language Models (MLLMs) have emerged to tackle the\nchallenges of Visual Question Answering (VQA), sparking a new research focus on\nconducting objective evaluations of these models. 
Existing evaluation methods\nface limitations due to the significant human workload required to design Q&A\npairs for visual images, which inherently restricts the scale and scope of\nevaluations. Although automated MLLM-as-judge approaches attempt to reduce the\nhuman workload through automatic evaluations, they often introduce biases. To\naddress these problems, we propose an Unsupervised Peer review MLLM Evaluation\nframework. It utilizes only image data, allowing models to automatically\ngenerate questions and conduct peer review assessments of answers from other\nmodels, effectively alleviating the reliance on human workload. Additionally,\nwe introduce the vision-language scoring system to mitigate the bias issues,\nwhich focuses on three aspects: (i) response correctness; (ii) visual\nunderstanding and reasoning; and (iii) image-text correlation. Experimental\nresults demonstrate that UPME achieves a Pearson correlation of 0.944 with\nhuman evaluations on the MMstar dataset and 0.814 on the ScienceQA dataset,\nindicating that our framework closely aligns with human-designed benchmarks and\ninherent human preferences.","upvotes":5,"discussionId":"67eb932622a341478ae86d15","ai_summary":"Unsupervised Peer Review MLLM Evaluation framework uses image data to automatically generate questions and assess answers, reducing human workload and mitigating bias.","ai_keywords":["Multimodal Large Language Models","Visual Question Answering","MLLM-as-judge","Unsupervised Peer review","vision-language scoring system","response correctness","visual understanding","reasoning","image-text correlation","Pearson correlation"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"67a99d1fef1439e285c4cbec","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/VrwUmrY2wsg4sVSIMc--K.png","isPro":false,"fullname":"Qihui Zhang","user":"77Hui","type":"user"},{"_id":"64049ae20ab5e22719f35103","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1678023295407-noauth.jpeg","isPro":false,"fullname":"Dongyu Yan","user":"StarYDY","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"64b4edbe25882acb62f6d812","avatarUrl":"/avatars/99f72671586a08b1c72fc402424c668a.svg","isPro":false,"fullname":"刘哲源","user":"lzy2233","type":"user"},{"_id":"67ee275be7defc1b86506110","avatarUrl":"/avatars/0836ab7ebab1e2b4a6bc914ace440783.svg","isPro":false,"fullname":"Zhao Liyang","user":"lyzhao2000","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary
Unsupervised Peer Review MLLM Evaluation framework uses image data to automatically generate questions and assess answers, reducing human workload and mitigating bias.

Abstract
Multimodal Large Language Models (MLLMs) have emerged to tackle the
challenges of Visual Question Answering (VQA), sparking a new research focus on
conducting objective evaluations of these models. Existing evaluation methods
face limitations due to the significant human workload required to design Q&A
pairs for visual images, which inherently restricts the scale and scope of
evaluations. Although automated MLLM-as-judge approaches attempt to reduce the
human workload through automatic evaluations, they often introduce biases. To
address these problems, we propose UPME, an Unsupervised Peer review MLLM
Evaluation framework. It utilizes only image data, allowing models to
automatically generate questions and conduct peer-review assessments of
answers from other models, effectively reducing the reliance on human
effort. Additionally, we introduce a vision-language scoring system that
mitigates these biases by focusing on three aspects: (i) response
correctness; (ii) visual
understanding and reasoning; and (iii) image-text correlation. Experimental
results demonstrate that UPME achieves a Pearson correlation of 0.944 with
human evaluations on the MMStar dataset and 0.814 on the ScienceQA dataset,
indicating that our framework closely aligns with human-designed benchmarks and
inherent human preferences.
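
The abstract describes the evaluation pipeline only in prose. The sketch below, in Python, illustrates how an unsupervised peer-review loop of this kind could be structured; it is a minimal illustration, not the paper's implementation. The helper `query_model`, the score parser `parse_three_scores`, and the 0.4/0.3/0.3 aggregation weights are assumptions introduced here for illustration; only the overall flow (models generate questions from raw images, peer-review one another's answers on the three stated aspects, and the resulting scores are compared to human evaluations via Pearson correlation) follows the abstract.

```python
import re
from itertools import permutations
from scipy.stats import pearsonr


def parse_three_scores(text):
    """Pull the first three numeric ratings out of a reviewer's free-form reply."""
    nums = [float(x) for x in re.findall(r"\d*\.?\d+", text)][:3]
    return nums if len(nums) == 3 else [0.0, 0.0, 0.0]


def peer_review_round(models, image, query_model):
    """One image, one round: every model reviews every other model's answer."""
    scores = {name: [] for name in models}
    for reviewer, candidate in permutations(models, 2):
        # The reviewer generates a question grounded only in the image
        # (no human-designed Q&A pairs are needed).
        question = query_model(reviewer, image,
                               "Ask one question that can be answered from this image.")
        answer = query_model(candidate, image, question)
        # Vision-language scoring on the three aspects named in the abstract.
        rubric = ("Rate the answer from 0 to 1 on: (i) response correctness, "
                  "(ii) visual understanding and reasoning, "
                  "(iii) image-text correlation.\n"
                  f"Question: {question}\nAnswer: {answer}")
        c, v, t = parse_three_scores(query_model(reviewer, image, rubric))
        scores[candidate].append(0.4 * c + 0.3 * v + 0.3 * t)  # illustrative weights
    return scores


def evaluate(models, images, query_model):
    """Average peer-review score per model over a pool of unlabeled images."""
    totals = {name: 0.0 for name in models}
    for image in images:
        for name, vals in peer_review_round(models, image, query_model).items():
            totals[name] += sum(vals) / len(vals)
    return {name: total / len(images) for name, total in totals.items()}


def alignment(framework_scores, human_scores, model_names):
    """Pearson correlation between framework scores and human evaluation scores."""
    r, _ = pearsonr([framework_scores[m] for m in model_names],
                    [human_scores[m] for m in model_names])
    return r
```

Letting every model act as both reviewer and candidate, rather than fixing a single judge, is what allows averaging out individual judge biases; the final `alignment` check mirrors the Pearson correlations against human evaluations (0.944 on MMStar, 0.814 on ScienceQA) that the abstract reports.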