arxiv:2408.04682

ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

Published on Aug 8, 2024

Submitted by AK on Aug 12, 2024

Authors:

Jiarui Lu, Yizhe Zhang, Felix Bai, Mengyu Li, Guoli Yin, Ruoming Pang

Abstract

ToolSandbox evaluates large language models with tool-use capabilities through stateful execution, implicit state dependencies, and conversational simulation, revealing performance gaps and challenges with complex tasks.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Recent advancements in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused either on evaluating over stateless web services (RESTful APIs), based on a single-turn user prompt, or on an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open source and proprietary models have a significant performance gap, and that complex tasks like State Dependency, Canonicalization, and Insufficient Information defined in ToolSandbox challenge even the most capable SOTA LLMs, providing brand-new insights into tool-use LLM capabilities. The ToolSandbox evaluation framework is released at https://github.com/apple/ToolSandbox
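
To make "stateful tool execution", "implicit state dependencies", and "milestones over an arbitrary trajectory" concrete, here is a minimal sketch in Python. It is not the ToolSandbox API: `WorldState`, `set_cellular_service`, `send_message`, `run_trajectory`, and `milestones_reached` are hypothetical names invented for illustration, and the real framework's interfaces and milestone matching may differ.

```python
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Mutable state shared by all tools (hypothetical, for illustration only)."""
    cellular_on: bool = False
    sent_messages: list = field(default_factory=list)


class ToolError(Exception):
    """Raised when a tool's precondition on the world state is not met."""


def set_cellular_service(state: WorldState, on: bool) -> str:
    state.cellular_on = on
    return f"cellular service {'on' if on else 'off'}"


def send_message(state: WorldState, to: str, text: str) -> str:
    # Implicit state dependency: the signature does not mention it, but this
    # tool only succeeds after cellular service has been switched on.
    if not state.cellular_on:
        raise ToolError("cellular service is off")
    state.sent_messages.append((to, text))
    return "message sent"


def run_trajectory(state: WorldState, calls):
    """Run (tool, kwargs) pairs in order; log (tool name, result, state snapshot)."""
    log = []
    for tool, kwargs in calls:
        try:
            result = tool(state, **kwargs)
        except ToolError as err:
            result = f"error: {err}"
        snapshot = WorldState(state.cellular_on, list(state.sent_messages))
        log.append((tool.__name__, result, snapshot))
    return log


def milestones_reached(log, milestones) -> int:
    """Count milestones (predicates on a snapshot) satisfied in order."""
    reached = 0
    for _, _, snapshot in log:
        while reached < len(milestones) and milestones[reached](snapshot):
            reached += 1
    return reached


# Assumed task: "send a message to Alice". The milestones are illustrative.
milestones = [
    lambda s: s.cellular_on,              # intermediate milestone
    lambda s: len(s.sent_messages) == 1,  # final milestone
]
message = {"to": "Alice", "text": "hi"}

# A model that ignores the hidden dependency fails on its only call.
naive = run_trajectory(WorldState(), [(send_message, message)])

# A model that first enables cellular service reaches both milestones.
good = run_trajectory(
    WorldState(),
    [(set_cellular_service, {"on": True}), (send_message, message)],
)

print(naive[0][1], milestones_reached(naive, milestones))
print(good[-1][1], milestones_reached(good, milestones))
```

Because the milestones are checked against state snapshots rather than against a fixed sequence of calls, any trajectory that reaches the same states in the same order earns the same credit, which is the intuition behind evaluating over an arbitrary trajectory.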