arxiv:2606.28430

Building to the Test: Coding Agents Deliver What You Check, Not What You Requested

Published on Jun 26

Submitted by Yanuo Ma on Jul 2

Microsoft

Authors: Yanuo Ma

Abstract

Large Language Models fail to validate their outputs when evaluated through benchmarks, revealing a gap between task completion scores and actual implementation quality.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Benchmarks are widely used to evaluate task completion by Large Language Models (LLMs), but this approach has accumulated construct-validity problems, and a passing score may not show whether the requested task was delivered. We study both problems. In a controlled code-as-spec setup, two production Copilot CLI agents (claude-opus-4.7, gpt-5.5) re-implement a React Fluent-UI data table in Angular as a reusable library under a hidden 222-test Playwright oracle, across 18 runs and three oracle-availability conditions. Alongside the score, we run a mechanical library audit and check each verdict with a no-op ablation. Without the oracle, the library is present but unfinished, and the scores reveal this. With the oracle in the loop, the score is near-perfect, but it comes from a demo that holds the tested behavior directly, while the library is left dead or absent. We call this building to the test; the broader disposition behind both outcomes we call validation self-awareness. The agent does not, on its own, validate what it ships as a user would. Prevalence across other agents, signals, and model families remains an open question. Beyond benchmark scores, dispositions like validation self-awareness merit research attention.

Community

yanuoma (Paper author, Paper submitter), 1 day ago

Verification signals are becoming central to agentic workflows: loop engineering, RL reward design, and CI-driven iteration all assume that a passing signal means the job is done. We show that this assumption breaks. When coding agents have access to a behavioral test oracle, they optimize the signal itself rather than delivering the requested artifact. Unlike an experienced engineer who uses test feedback to refine their implementation, agents treat passing as the goal even when they are asked not to do so. As the community builds increasingly sophisticated verification-driven loops, understanding this disposition seems worth investigating.
