

arxiv:2406.01574

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Published on Jun 3, 2024 · Submitted by AK on Jun 4, 2024 · #1 Paper of the day

Abstract

In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy of 16% to 33% compared to MMLU, but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark to better track progress in the field.

AI-generated summary

MMLU-Pro extends the MMLU benchmark with more challenging reasoning questions, eliminates trivial ones, and demonstrates better stability and discriminative power for language models.

Community

Great work on MMLU-Pro! And congrats on the release of the paper and dataset πŸŽ‰

There's a plain English summary of the paper here - feedback is welcome! https://www.aimodels.fyi/papers/arxiv/mmlu-pro-more-robust-challenging-multi-task

https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro


Models citing this paper: 36
Datasets citing this paper: 15
Spaces citing this paper: 59
Collections including this paper: 7