Paper page - AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts
arxiv:2601.20730

AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts

Published on Jan 28, 2026 · Submitted by Xiaoran Liu (SII) on Jan 30, 2026 · OpenMOSS-Team (OpenMOSS)
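Authors: Shicheng Fang, Yuxin Wang, XiaoRan Liu, Jiahao Lu, Chuanyuan Tan, Xinchi Chen, Yining Zheng, Xuanjing Huang, Xipeng Qiu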

Abstract

AgentLongBench evaluates large language models as autonomous agents through dynamic environment interactions, revealing challenges in handling high-information-density tool responses compared to memory fragmentation in long conversations.

AI-generated summary

The evolution of Large Language Models (LLMs) into autonomous agents necessitates the management of extensive, dynamic contexts. Current benchmarks, however, remain largely static, relying on passive retrieval tasks that fail to simulate the complexities of agent-environment interaction, such as non-linear reasoning and iterative feedback. To address this, we introduce AgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios. Experiments with state-of-the-art models and memory systems (32K to 4M tokens) expose a critical weakness: while adept at static retrieval, agents struggle with the dynamic information synthesis essential for workflows. Our analysis indicates that this degradation is driven by the minimum number of tokens required to resolve a query. This factor explains why the high information density inherent in massive tool responses poses a significantly greater challenge than the memory fragmentation typical of long-turn dialogues.

Community

Paper author Paper submitter

The code is available at https://github.com/euReKa025/AgentLongBench.

arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/agentlongbench-a-controllable-long-benchmark-for-long-contexts-agents-via-environment-rollouts-5957-1c8a02a2

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

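The following papers were recommended by the Semantic Scholar API:

  • RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction (2026), https://huggingface.co/papers/2601.06966
  • Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents (2026), https://huggingface.co/papers/2601.19935
  • AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts (2026), https://huggingface.co/papers/2601.11044
  • NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents (2025), https://huggingface.co/papers/2512.12730
  • IDRBench: Interactive Deep Research Benchmark (2026), https://huggingface.co/papers/2601.06676
  • ToolGym: an Open-world Tool-using Environment for Scalable Agent Testing and Data Curation (2026), https://huggingface.co/papers/2601.06328
  • Jenius Agent: Towards Experience-Driven Accuracy Optimization in Real-World Scenarios (2026), https://huggingface.co/papers/2601.01857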

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/agentlongbench-a-controllable-long-benchmark-for-long-contexts-agents-via-environment-rollouts


Models citing this paper 0


Datasets citing this paper 1

Spaces citing this paper 0


Collections including this paper 0
