arxiv:2512.18470

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

Published on Dec 20, 2025 · Submitted by Nghi Bui on Dec 25, 2025

Authors: Minh V. T. Thai, Tue Le, Dung Nguyen Manh, Huy Phan Nhat, Nghi D. Q. Bui

Abstract

AI-generated summary

The SWE-EVO benchmark evaluates AI coding agents on complex, multi-step software evolution tasks across multiple files, highlighting a significant gap in current models' capabilities.

Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or implementing a small feature. However, real-world software engineering is fundamentally a long-horizon endeavor: developers must interpret high-level requirements, plan coordinated changes across many files, and evolve codebases over multiple iterations while preserving existing functionality. We introduce SWE-EVO, a benchmark that evaluates agents on this long-horizon software evolution challenge. Constructed from release notes and version histories of seven mature open-source Python projects, SWE-EVO comprises 48 evolution tasks that require agents to implement multi-step modifications spanning an average of 21 files, validated against comprehensive test suites averaging 874 tests per instance. Experiments with state-of-the-art models reveal a striking capability gap: even GPT-5 with OpenHands achieves only a 21 percent resolution rate on SWE-EVO, compared to 65 percent on the single-issue SWE-Bench Verified. This demonstrates that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a fine-grained metric that captures partial progress toward solving these complex, long-horizon tasks.
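
For context on the two headline numbers: resolution rate is an all-or-nothing per-instance score, while Fix Rate is proposed as a fine-grained measure of partial progress on long-horizon tasks. The page does not give the exact formula, so the sketch below only illustrates one plausible reading, with Fix Rate taken to be the fraction of evolution-related target tests that an agent's patch makes pass. All class, field, and function names are illustrative assumptions, not the paper's actual evaluation harness.

```python
# Hypothetical sketch of how a partial-progress metric like Fix Rate might be
# computed from per-test outcomes. The paper's exact definition may differ;
# every name and field below is illustrative only.

from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Test outcomes for one evolution task after applying an agent's patch."""
    target_tests_passed: int   # tests tied to the required evolution that now pass
    target_tests_total: int    # all tests tied to the required evolution
    regressions: int           # previously passing tests broken by the patch


def fix_rate(result: InstanceResult) -> float:
    """Fraction of target tests passing: partial credit for incomplete solutions."""
    if result.target_tests_total == 0:
        return 0.0
    return result.target_tests_passed / result.target_tests_total


def resolved(result: InstanceResult) -> bool:
    """Binary resolution: every target test passes and nothing regresses."""
    return (
        result.target_tests_passed == result.target_tests_total
        and result.regressions == 0
    )


def benchmark_scores(results: list[InstanceResult]) -> dict[str, float]:
    """Aggregate across instances; a low binary resolution rate can coexist
    with a noticeably higher average Fix Rate."""
    n = len(results)
    return {
        "resolution_rate": sum(resolved(r) for r in results) / n,
        "mean_fix_rate": sum(fix_rate(r) for r in results) / n,
    }


if __name__ == "__main__":
    demo = [
        InstanceResult(target_tests_passed=40, target_tests_total=40, regressions=0),
        InstanceResult(target_tests_passed=25, target_tests_total=40, regressions=2),
        InstanceResult(target_tests_passed=0, target_tests_total=40, regressions=0),
    ]
    print(benchmark_scores(demo))
    # {'resolution_rate': 0.333..., 'mean_fix_rate': 0.541...}
```

Scoring along these lines gives partial credit to an agent that completes most but not all of an evolution task, which is what makes a fine-grained metric more informative than the binary resolution rate on multi-file, multi-step tasks.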

Community

Paper submitter

arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/swe-evo-benchmarking-coding-agents-in-long-horizon-software-evolution-scenarios-3642-2ce677ee

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications

Models citing this paper 0

Datasets citing this paper 1

Spaces citing this paper 0

Collections including this paper 4