
Papers
arxiv:2407.18961

MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains

Published on Jul 18, 2024
Submitted by AK on Jul 30, 2024
Authors:
Guoli Yin, Haoping Bai, Shuang Ma, Feng Nan, Yanchao Sun, Zhaoyang Xu, Shen Ma, Jiarui Lu, Xiang Kong, Aonan Zhang, Dian Ang Yap, Yizhe Zhang, Karsten Ahnert, Vik Kamath, Mathias Berglund, Dominic Walsh, Tobias Gindele, Juergen Wiest, Zhengfeng Lai, Xiaoming Wang, Jiulong Shan, Meng Cao, Ruoming Pang, Zirui Wang

Abstract

AI-generated summary

MMAU benchmark evaluates large language models' capabilities across multiple domains and essential skills using comprehensive offline tasks to enhance interpretability and reliability.
Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to deeply discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics, and covers five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 18 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance. Datasets and evaluation scripts of MMAU are released at https://github.com/apple/axlearn/docs/research/mmau.

Community

Paper author
•
edited Jul 30, 2024

Updated link for the paper: https://github.com/apple/axlearn/tree/main/docs/research/mmau

Niels Rogge
•
Aug 3, 2024

Hi @gyin94 congrats on your work!

Great to see you're planning to host the dataset on HF.

Would be great to link it to this paper, see here on how to do that: https://huggingface.co/docs/hub/en/datasets-cards#linking-a-paper

Cheers,
Niels
Open-source @ HF

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* CIBench: Evaluating Your LLMs with a Code Interpreter Plugin (https://huggingface.co/papers/2407.10499) (2024)
* Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning (https://huggingface.co/papers/2406.06469) (2024)
* PyBench: Evaluating LLM Agent on various real-world coding tasks (https://huggingface.co/papers/2407.16732) (2024)
* MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark (https://huggingface.co/papers/2406.01574) (2024)
* OpenDevin: An Open Platform for AI Software Developers as Generalist Agents (https://huggingface.co/papers/2407.16741) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2407.18961 in a model README.md to link it from this page.
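As a minimal sketch, linking only requires the arXiv ID to appear somewhere in the model card; the model name and description below are made up for illustration:

```markdown
# my-example-model (hypothetical)

This model was evaluated on the MMAU benchmark
([arxiv.org/abs/2407.18961](https://arxiv.org/abs/2407.18961)).
```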

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2407.18961 in a Space README.md to link it from this page.

Collections including this paper 9