QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

Evaluate a reward signal before you train.

Long-horizon LLM agents take hundreds of actions, yet an outcome-only reward tells them nothing about the steps along the way. QVal measures whether a dense supervision signal is Q-aligned, so signal quality can be judged on common ground, cheaply, and apart from the engineering of any training pipeline.


Sergio Hernández-Gutiérrez¹ · Matteo Merler²* · Ilze Amanda Auzina¹* · Joschka Strüber¹ · Ameya Prabhu¹† · Matthias Bethge¹†

¹Tübingen AI Center, University of Tübingen · ²Fondazione Bruno Kessler
*equal contribution · †equal advising

qval: evaluate a candidate signal

The tool

A testbed for the post-training community

Dense supervision methods all try to score intermediate steps, yet they are evaluated by their downstream post-training performance.

That evaluation is expensive, conflates the quality of the signal with the engineering choices around it, and makes methods that require different training setups impossible to compare.

“Can we evaluate dense supervision signals in isolation, before expensive post-training runs?”

A dense supervision method assigns a scalar score k(s, a) to each action. QVal measures one property: whether that score ranks decisions the way their eventual return does.

Q-alignment: k(s, a) = Φ(Q^π(s, a)), where Φ is strictly increasing.

The reference Q-value is the expected return of continuing from (s, a) under a strong reference policy (e.g., an optimal policy or a frontier model). Predictions live on incompatible scales, so we compare the ordering they induce, reporting Spearman ρ against the reference labels. Same inputs, same targets, fixed backbones.
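
Because Q-alignment only asks for a strictly increasing Φ, any monotone rescaling of the reference Q-values should score ρ = 1, while a swap in the ordering lowers it. The sketch below illustrates this with a small pure-Python Spearman ρ (average ranks for ties). The numbers are made up, the helper names `average_ranks` and `spearman_rho` are not part of qval, and in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return math.nan  # a constant signal induces no ordering
    return cov / math.sqrt(var_x * var_y)


# Made-up reference Q-values for five (s, a) pairs.
q_ref = [0.10, 0.35, 0.20, 0.90, 0.60]
# A signal on a different scale that is a strictly increasing function of q_ref.
aligned = [math.exp(5 * q) - 3 for q in q_ref]
# A signal that swaps the order of the two best actions.
swapped = [0.10, 0.35, 0.20, 0.60, 0.90]

rho_aligned = spearman_rho(aligned, q_ref)
rho_swapped = spearman_rho(swapped, q_ref)
```

Here `rho_aligned` is 1.0 even though `aligned` lives on a very different scale, and `rho_swapped` drops to 0.9 after the two best actions trade places.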

How it works

The same five-stage recipe runs for every environment. You can skip the first three and use our pre-collected datasets.

Collect trajectories. Roll agents out in multi-turn environments.
Sample state–action pairs. Draw (s, a) pairs along the trajectories as evaluation points.
Label with reference Q-values. Restore each state, force the action, follow the reference policy, and record the discounted return. Use Monte-Carlo estimates for stochastic policies/environments.
Collect method predictions. Run the method of interest to score every pair.
Evaluate by rank correlation. Report Spearman ρ between the method's scores and the reference labels.
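
The labelling stage can be sketched on a toy problem. The snippet below is an illustration under stated assumptions, not the QVal labelling code: the corridor environment, the `step` signature, the always-move-right reference policy, the discount of 0.95, and every function name are invented for the example. It shows the restore, force, follow pattern: since `step` is pure, a rollout can start from any stored state, the first action is forced, and the reference policy takes over afterwards.

```python
import random
from statistics import mean

GOAL = 4          # right end of a 5-cell corridor (toy environment, not a QVal one)
MAX_STEPS = 20


def step(state: int, action: int, rng: random.Random, slip: float = 0.2):
    """Pure transition: the result depends only on the arguments passed in."""
    move = -action if rng.random() < slip else action
    next_state = min(max(state + move, 0), GOAL)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def reference_policy(state: int) -> int:
    """Stand-in for a strong reference policy: always walk toward the goal."""
    return +1


def rollout_return(state, action, rng, gamma=0.95, slip=0.2):
    """Force `action` in `state`, then follow the reference policy to the end."""
    total, discount = 0.0, 1.0
    for _ in range(MAX_STEPS):
        state, reward, done = step(state, action, rng, slip)
        total += discount * reward
        discount *= gamma
        if done:
            break
        action = reference_policy(state)
    return total


def label(state, action, k=64, seed=0):
    """Return the k discounted returns collected from the same (state, action)."""
    rng = random.Random(seed)
    return [rollout_return(state, action, rng) for _ in range(k)]


toward = label(state=2, action=+1)   # forced action moves toward the goal
away = label(state=2, action=-1)     # forced action moves away from it
q_toward, q_away = mean(toward), mean(away)
```

How the k returns are reduced to a single label is a separate choice. This sketch averages them, while the MVMC label described on this page takes the max return over k rollouts.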

For one sampled point:

Collect & sample. A state–action pair from a trajectory → (s, a)
Label (MVMC). Max return over k rollouts → Q*(s, a)
Predict. The method's scalar score → k(s, a)
Measure Q-alignment. Spearman ρ

Quickstart

The methods, environments, and their datasets are all included. To score a built-in method on a built-in environment, install once, then generate the configs, predict, and evaluate.

install

# clone llenvs next to qval (path dependency)
git clone https://github.com/serhez/llenvs
git clone https://github.com/bethgelab/qval
cd qval
uv pip install -e ".[all]"

generate

python scripts/generate_configs.py

predict

python scripts/pipeline/predict.py --config \
  catalogs/qval_benchmark/configs/prediction/frozen_lake/100pt_8x8_q35-27-or_text.yaml

evaluate

# discovers every prediction, writes the correlations
python scripts/pipeline/evaluate.py

Models are served through the backends declared in shared/configs/backends.yaml.
For full runs, the wrappers in scripts/slurm/ submit each step as a cluster job.

Extending QVal

QVal runs on your own methods and environments. Register them in a catalog file (catalogs/<your_experiment>/catalog.py), then rerun generate_configs.py to build their configs.

add-method

A method turns a (state, action, next_state) step into a scalar. Subclass DenseSignalMethod and implement evaluate (override evaluate_batch if it batches), register its type in method_factory.py, then append a MethodSpec to the catalog and regenerate.

my_method.py

class MyMethod(DenseSignalMethod):
    def evaluate(self, point: EvaluationPoint) -> float:
        # point.state / action / next_state / history;
        # self.context: task & reward text, signal_type …
        return float(...)        # math.nan on failure
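
To make the contract concrete, here is a self-contained toy version. `EvaluationPoint` and `DenseSignalMethod` below are local stand-ins that only mimic the shape described above (the real classes live in qval and carry more, such as `history` and `self.context`), and both the "row,col" state format and `GoalDistanceDelta` are invented for illustration. The point of the sketch is the failure convention: return `math.nan` instead of raising.

```python
import math
from dataclasses import dataclass


@dataclass
class EvaluationPoint:
    """Local stand-in; the real class lives in qval and carries more fields."""
    state: str
    action: str
    next_state: str


class DenseSignalMethod:
    """Local stand-in for the qval base class, reduced to the two hooks."""

    def evaluate(self, point: EvaluationPoint) -> float:
        raise NotImplementedError

    def evaluate_batch(self, points: list[EvaluationPoint]) -> list[float]:
        # Assumed default: score one point at a time.
        return [self.evaluate(p) for p in points]


class GoalDistanceDelta(DenseSignalMethod):
    """Toy signal: how much closer to the goal the action moved the agent.

    States are assumed to be "row,col" strings; this format is made up.
    """

    def __init__(self, goal: tuple[int, int]):
        self.goal = goal

    def _distance(self, state: str) -> int:
        row, col = (int(part) for part in state.split(","))
        return abs(row - self.goal[0]) + abs(col - self.goal[1])

    def evaluate(self, point: EvaluationPoint) -> float:
        try:
            return float(self._distance(point.state) - self._distance(point.next_state))
        except ValueError:
            return math.nan  # unparseable state: report failure, do not raise


method = GoalDistanceDelta(goal=(3, 3))
scores = method.evaluate_batch([
    EvaluationPoint(state="0,0", action="right", next_state="0,1"),
    EvaluationPoint(state="0,1", action="left", next_state="0,0"),
    EvaluationPoint(state="???", action="right", next_state="0,1"),
])
```

The three scores are 1.0 (one cell closer), -1.0 (one cell farther), and NaN for the point whose state cannot be parsed.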

add-environment

Environments come through llenvs, a stateless interface (env.step(state, action) is pure) that makes Monte-Carlo rollouts from arbitrary states possible. Write an environment-context YAML, add an EnvironmentSpec to the catalog, collect a dataset once, and regenerate.

shared/configs/environments/my_env.yaml

adapter: my_env               # an llenvs adapter
env_name: my_env:eval
max_steps: 40
reward_signal_name: task_completion
task_description: >-
  The agent acts in …
evaluator_extractor:          # method text → number
  type: numeric
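
The `evaluator_extractor` entry says how a method's text output becomes a number. As a rough illustration only, a numeric extractor could look like the sketch below. `extract_numeric` is a hypothetical helper, and taking the last number in the text is an assumption made here, not a documented rule of qval's `numeric` extractor.

```python
import math
import re

# Optional sign, digits with an optional decimal part, optional exponent.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def extract_numeric(text: str) -> float:
    """Return the last number found in `text`, or NaN when there is none."""
    matches = _NUMBER.findall(text)
    return float(matches[-1]) if matches else math.nan


value = extract_numeric("The action looks promising. Score: 0.8")
missing = extract_numeric("I cannot tell.")
```

Returning NaN for text without a number mirrors the `math.nan` on failure convention used by methods, so a failed parse stays visible downstream.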

The benchmark

QVal-v1.0

QVal-v1.0 is the first instance of the QVal methodology: the exact methods, environments, and datasets scored below.

21 methods · 7 families · 6 backbones · 4 environments · 1.2K+ experiments · 0 training runs

Environments

Spanning text and vision, from closed-action navigation to open-ended shell use.

FrozenLake: goal-directed navigation
ALFWorld: embodied reasoning
OpenApps: computer & application use
TerminalBench: programming & agentic terminal

Leaderboard

Explore every method's Q-alignment below.

Findings

Main results

Simple prompting is the strongest baseline.

Ranking and direct value prediction reach the highest Q-alignment in every environment and backbone, outperforming newer and more specialized dense-supervision methods.

Performance clusters by family.

Correlations fall into tight bands within each of the seven families: the family predicts a method's behavior better than its individual design. Code methods are the exception, with by far the widest spread, since their effectiveness depends on how easily a state and action space can be written in code.

Added complexity rarely helps.

Elaborate variants don't reliably beat the simplest one in their family. Multi-estimate and batched/sequential direct prompting don't significantly outperform direct-single; privileged self-distillation (sdpo-gt) doesn't improve on plain sdpo; averaging generated functions (codegen-avg) lifts the mean over codegen only slightly.

Difficulty doesn't predict alignment.

From closed-action FrozenLake to open-ended TerminalBench, Q-alignment does not fall monotonically with difficulty. Direct prompting stays positive everywhere; code and ranking weaken in open-ended settings, while self-distillation grows stronger where richer intermediate feedback is available.

Robustness

Text beats vision.

On environments that provide both, methods recover reference values more reliably from text than from images.

Rankings survive the choice of target.

Relabelling with state-values V(s) instead of Q-values largely preserves the method ordering. Absolute scores shift (code and pre-trained methods align better with state-values, direct prompting with Q-values), but not the order.

Implications

Signal quality is conflated with training recipes.

Because plain prompting is competitive when measured directly, much of the reported progress of complex methods may come from changes in data, compute, exploration, prompting, or optimization rather than from a better dense signal. Measuring Q-alignment first separates the two.

Score interpretation.

QVal measures signal quality on its own and leaves integration as a separate downstream question. It is a cheap diagnostic that filters candidate signals before expensive training runs, not a replacement for them.

