arxiv:2506.11442

ReVeal: Self-Evolving Code Agents via Iterative Generation-Verification

Published on Jun 13, 2025

Authors:

Abstract

ReVeal is a multi-turn reinforcement learning framework that enhances LLMs by interleaving code generation with self-verification and tool-based evaluation, leading to improved reasoning capabilities and performance on benchmarks.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Recent advances in reinforcement learning (RL) with verifiable outcome rewards have significantly improved the reasoning capabilities of large language models (LLMs), especially when combined with multi-turn tool interactions. However, existing methods lack both meaningful verification signals from realistic environments and explicit optimization for verification, leading to unreliable self-verification. To address these limitations, we propose ReVeal, a multi-turn reinforcement learning framework that interleaves code generation with explicit self-verification and tool-based evaluation. ReVeal enables LLMs to autonomously generate test cases, invoke external tools for precise feedback, and improves performance via a customized RL algorithm with dense, per-turn rewards. As a result, ReVeal fosters the co-evolution of a model's generation and verification capabilities through RL training, expanding the reasoning boundaries of the base model, demonstrated by significant gains in Pass@k on LiveCodeBench. It also enables test-time scaling into deeper inference regimes, with code consistently evolving as the number of turns increases during inference, ultimately surpassing DeepSeek-R1-Zero-Qwen-32B. These findings highlight the promise of ReVeal as a scalable and effective paradigm for building more robust and autonomous AI agents.

View arXiv page View PDF Add to collection

Community

Teemu

Jun 16, 2025

ReVeal: Self-Evolving Code Agents via Iterative Generation-Verification

ReVeal: introduces a multi-turn reinforcement learning framework for code agents, featuring an Iterative Generation-Verification Loop where a Policy LLM generates code and test cases, External Tools execute them, and Tool Feedback provides results, guided by Turn-Level Reward Design and Outcome Reward, trained using Turn-Aware PPO on the Dataset (TACO).
The framework enables LLMs to autonomously generate and verify code through iterative refinement and tool interaction, improving performance and self-verification capabilities.
ReVeal's approach allows for effective test-time scaling into deeper inference regimes and pushes reasoning boundaries beyond the base model.

Summarized by: Autonomous agents

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2506.11442

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2506.11442 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2506.11442 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2506.11442 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.