Papers
arxiv:2401.13178

AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents

Published on Jan 24, 2024
Authors: Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, Junxian He
Abstract

AgentBoard is an open-source evaluation framework for LLM agents that provides detailed, multi-faceted analysis and interactive visualization to understand capabilities and limitations in partially-observable environments.

AI-generated summary

Evaluating large language models (LLMs) as general-purpose agents is essential for understanding their capabilities and facilitating their integration into practical applications. However, the evaluation process presents substantial challenges. A primary obstacle is benchmarking agent performance across diverse scenarios within a unified framework, especially in maintaining partially observable environments and ensuring multi-round interactions. Moreover, current evaluation frameworks mostly focus on the final success rate, revealing few insights during the process and failing to provide a deep understanding of model abilities. To address these challenges, we introduce AgentBoard, a pioneering comprehensive benchmark and accompanying open-source evaluation framework tailored to the analytical evaluation of LLM agents. AgentBoard offers a fine-grained progress rate metric that captures incremental advancements, as well as a comprehensive evaluation toolkit that enables easy, multi-faceted assessment of agents through interactive visualization. This not only sheds light on the capabilities and limitations of LLM agents but also brings the interpretability of their performance to the forefront. Ultimately, AgentBoard serves as a significant step towards demystifying agent behaviors and accelerating the development of stronger LLM agents.
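The contrast the abstract draws between a final success rate and a fine-grained progress rate can be sketched as follows. This is an illustrative assumption, not AgentBoard's actual definition: the subgoal bookkeeping, field names (`success`, `subgoals`, `completed_subgoals`), and simple averaging are all hypothetical.

```python
# Illustrative sketch (not AgentBoard's implementation): success rate is
# all-or-nothing per episode, while a progress rate credits partially
# completed episodes by the fraction of subgoals achieved.

def success_rate(episodes):
    """Fraction of episodes whose final goal was reached."""
    return sum(ep["success"] for ep in episodes) / len(episodes)

def progress_rate(episodes):
    """Average fraction of subgoals completed per episode."""
    return sum(
        len(ep["completed_subgoals"]) / len(ep["subgoals"])
        for ep in episodes
    ) / len(episodes)

episodes = [
    {"success": True,  "subgoals": ["a", "b", "c"], "completed_subgoals": ["a", "b", "c"]},
    {"success": False, "subgoals": ["a", "b", "c"], "completed_subgoals": ["a", "b"]},
]
print(success_rate(episodes))   # 0.5
print(progress_rate(episodes))  # ~0.83
```

Under this toy scheme, two agents with identical success rates can differ sharply in progress rate, which is the kind of incremental signal the paper argues a final-success metric hides.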

Community

@librarian-bot recommend

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- Small LLMs Are Weak Tool Learners: A Multi-LLM Agent (2024): https://huggingface.co/papers/2401.07324
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models (2024): https://huggingface.co/papers/2401.13919
- T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step (2023): https://huggingface.co/papers/2312.14033
- AUTOACT: Automatic Agent Learning from Scratch via Self-Planning (2024): https://huggingface.co/papers/2401.05268
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (2024): https://huggingface.co/papers/2401.10019

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space.


Models citing this paper 0

No model links this paper yet.

Cite arxiv.org/abs/2401.13178 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 1

Collections including this paper 1