arxiv:2602.01675

TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios

Published on Feb 2, 2026 · Submitted by Yuanzhe Shen on Feb 3, 2026
Authors: Yuanzhe Shen, Zisu Huang, Zhengyuan Wang, Muzhao Tian, Zhengkang Guo, Chenyang Zhang, Shuaiyu Zhou, Zengjie Hu, Dailin Li, Jingwen Xu, Kaimin Wang, Wenhao Liu, Tianlong Li, Fengpeng Yue, Feng Hong, Cao Liu, Ke Zeng

AI-generated summary

TRIP-Bench presents a comprehensive long-horizon benchmark for travel planning that evaluates LLM agents on complex multi-turn interactions, while GTPO offers an online reinforcement learning approach to enhance constraint satisfaction and robustness in extended dialogues.

Abstract

As LLM-based agents are deployed in increasingly complex real-world settings, existing benchmarks underrepresent key challenges such as enforcing global constraints, coordinating multi-tool reasoning, and adapting to evolving user behavior over long, multi-turn interactions. To bridge this gap, we introduce TRIP-Bench, a long-horizon benchmark grounded in realistic travel-planning scenarios. TRIP-Bench leverages real-world data, offers 18 curated tools and 40+ travel requirements, and supports automated evaluation. It includes splits of varying difficulty; the hard split emphasizes long and ambiguous interactions, style shifts, feasibility changes, and iterative version revision. Dialogues span up to 15 user turns, can involve 150+ tool calls, and may exceed 200k tokens of context. Experiments show that even advanced models achieve at most 50% success on the easy split, with performance dropping below 10% on hard subsets. We further propose GTPO, an online multi-turn reinforcement learning method with specialized reward normalization and reward differencing. Applied to Qwen2.5-32B-Instruct, GTPO improves constraint satisfaction and interaction robustness, outperforming Gemini-3-Pro in our evaluation. We expect TRIP-Bench to advance practical long-horizon interactive agents, and GTPO to provide an effective online RL recipe for robust long-horizon training.
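
The abstract names two GTPO ingredients, reward normalization and reward differencing, without defining them. The sketch below is one plausible reading, not the paper's actual formulation: per-turn reward differencing inside each rollout, followed by group-wise standardization across rollouts in the style of group-relative policy optimization. Every name and the exact arithmetic here are illustrative assumptions.

```python
import numpy as np

def gtpo_style_advantages(turn_rewards: list[list[float]]) -> list[np.ndarray]:
    """Illustrative multi-turn advantage computation (NOT the paper's GTPO).

    turn_rewards[i][t] is the cumulative scalar reward of rollout i after
    user turn t. Two ideas hinted at in the abstract are sketched:
      * reward differencing: credit each turn with the *change* in reward,
        so late turns are not dominated by progress accumulated earlier;
      * reward normalization: standardize differenced rewards across the
        rollout group at each turn index, GRPO-style.
    """
    # 1) Differencing within each rollout: d_t = r_t - r_{t-1}, with r_{-1} = 0.
    diffs = [np.diff(np.asarray(r, dtype=float), prepend=0.0) for r in turn_rewards]

    # 2) Group-wise normalization at each turn index across rollouts.
    advantages = [d.copy() for d in diffs]
    for t in range(max(len(d) for d in diffs)):
        col = np.array([d[t] for d in diffs if len(d) > t])
        mu, sigma = col.mean(), col.std() + 1e-8  # epsilon avoids divide-by-zero
        for d in advantages:
            if len(d) > t:
                d[t] = (d[t] - mu) / sigma
    return advantages

if __name__ == "__main__":
    # Three rollouts of a 3-turn dialogue with cumulative task rewards.
    rollouts = [[0.2, 0.5, 0.9], [0.1, 0.1, 0.4], [0.3, 0.6, 0.6]]
    for i, adv in enumerate(gtpo_style_advantages(rollouts)):
        print(f"rollout {i}: {np.round(adv, 3)}")
```

Under this reading, differencing keeps late-turn credit from being swamped by reward already earned earlier in the dialogue, which matters when episodes run to 15 user turns and 150+ tool calls.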

Community

Paper author · Paper submitter

We are motivated by the gap between existing LLM-agent benchmarks and real deployment needs, where agents must handle long, multi-turn interactions, satisfy global constraints, and coordinate tools under frequent user revisions. We introduce TRIP-Bench, a realistic travel-planning benchmark with 18 tools, 40+ constraint types, and automated evaluation across difficulty splits, and show that even strong models degrade sharply on harder long-horizon dialogues. Finally, we propose GTPO, an online multi-turn RL method that improves constraint satisfaction and robustness on TRIP-Bench.
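
To make "automated evaluation against global constraints" concrete, here is a minimal, hypothetical checker in the spirit of the description above. TRIP-Bench's real constraint types, itinerary schema, and scoring are not shown on this page, so every name below (Activity, within_budget, and so on) is an illustrative assumption.

```python
from dataclasses import dataclass

@dataclass
class Activity:
    day: int
    city: str
    cost: float

# Hypothetical constraint checkers; the benchmark's actual 40+ constraint
# types and itinerary schema may look quite different.
def within_budget(plan: list[Activity], cap: float) -> bool:
    return sum(a.cost for a in plan) <= cap

def visits_city(plan: list[Activity], city: str) -> bool:
    return any(a.city == city for a in plan)

def max_activities_per_day(plan: list[Activity], limit: int) -> bool:
    per_day: dict[int, int] = {}
    for a in plan:
        per_day[a.day] = per_day.get(a.day, 0) + 1
    return all(n <= limit for n in per_day.values())

def evaluate(plan, constraints):
    """Return per-constraint pass/fail plus overall success."""
    results = {name: check(plan) for name, check in constraints.items()}
    return results, all(results.values())

if __name__ == "__main__":
    plan = [Activity(1, "Kyoto", 120.0), Activity(1, "Kyoto", 40.0),
            Activity(2, "Osaka", 80.0)]
    constraints = {
        "budget<=300": lambda p: within_budget(p, 300.0),
        "must_visit_Kyoto": lambda p: visits_city(p, "Kyoto"),
        "max_2_activities_per_day": lambda p: max_activities_per_day(p, 2),
    }
    print(evaluate(plan, constraints))
```

A benchmark-style success metric then reduces to "the plan passes only if every declared constraint passes," which is one way a single violated global constraint can sink an otherwise fluent long dialogue.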


Models citing this paper: 0

No model linking this paper

Cite arxiv.org/abs/2602.01675 in a model README.md to link it from this page.

Datasets citing this paper: 0

No dataset linking this paper

Cite arxiv.org/abs/2602.01675 in a dataset README.md to link it from this page.

Spaces citing this paper: 0

No Space linking this paper

Cite arxiv.org/abs/2602.01675 in a Space README.md to link it from this page.

Collections including this paper: 0

No Collection including this paper

Add this paper to a collection to link it from this page.