MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
MLS-Bench evaluates whether AI systems can invent generalizable and scalable ML methods. It spans 140 tasks across 12 domains, including language models, vision and generation, reinforcement learning, robotics, ML systems, AI for science, optimization, time series, and causal reasoning.
Methods that stood the test of time and scale.
Modern AI progress is built on a small set of reusable ideas — convolutions, residual connections, attention, normalization — that generalize across architectures and survive every order-of-magnitude jump in scale.
Weight-shared receptive fields that scaled vision models.
Distributed embeddings transferable across NLP tasks.
Adversarial generator–discriminator game for sample synthesis.
Adaptive moment estimation that became the default optimizer.
Encoder–decoder with skip links — vision and diffusion staple.
Clipped policy ratio that made deep RL stable to scale.
Mean-free normalization, faster and surprisingly sufficient.
Rotary position encoding that scales with context length.
IO-aware exact attention that scaled context length.
Gated recurrence enabling long-range sequence learning.
Random unit masking that became the standard regularizer.
Normalizing activations across the batch to stabilize training.
Residual connections enabling 100+ layer training.
Self-attention as the universal sequence operator.
Input–label interpolation that improved generalization.
Denoising diffusion: learn to invert a noise process.
Low-rank adapters for parameter-efficient finetuning.
MLS-Bench tests whether AI agents can invent the next ones.
Each task isolates a well-defined research question and asks the agent to propose a single modular improvement — a new loss, an attention variant, a sampler, a routing rule — then measures whether the change transfers across models, datasets, and seeds.
140 executable tasks across 12 domains, each built around a targeted ML component, a controlled edit surface, and multi-setting evidence for transfer.
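As a rough illustration of what "multi-setting evidence for transfer" could look like in practice, the sketch below compares a baseline and a candidate method over a shared grid of (model, dataset, seed) settings. It is not the benchmark's actual scoring code: the function name `transfer_summary`, the setting labels, and all scores are hypothetical, and the real benchmark uses its own normalized task metric.

```python
from statistics import mean


def transfer_summary(baseline, candidate, higher_is_better=True):
    """Summarize how a candidate method compares to a baseline across settings.

    Both arguments map a (model, dataset, seed) tuple to a scalar score.
    Only settings present in both mappings are compared.
    """
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("no shared settings to compare")

    # Flip the sign for metrics where lower is better (e.g. loss).
    sign = 1.0 if higher_is_better else -1.0
    deltas = [sign * (candidate[s] - baseline[s]) for s in shared]

    return {
        "n_settings": len(shared),
        "mean_delta": mean(deltas),
        # Fraction of settings where the candidate strictly improves.
        "win_rate": sum(d > 0 for d in deltas) / len(deltas),
    }


# Hypothetical scores, made up purely for illustration.
baseline = {
    ("small", "A", 0): 0.70,
    ("small", "B", 0): 0.60,
    ("large", "A", 0): 0.80,
    ("large", "B", 0): 0.75,
}
candidate = {
    ("small", "A", 0): 0.72,
    ("small", "B", 0): 0.63,
    ("large", "A", 0): 0.79,
    ("large", "B", 0): 0.78,
}

summary = transfer_summary(baseline, candidate)
print(summary)
```

Reporting a win rate alongside the mean difference is one simple way to distinguish a change that helps broadly from one that wins big in a single setting and regresses elsewhere.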
Leaderboard
Score on the official 30-task MLS-Bench-Lite subset.
| # | Model | Harness | Performance |
|---|---|---|---|
| 1 | Claude Opus 4.8 (Closed) | Claude Code, max effort (Closed) | 42.8 |
| 2 | GPT-5.5 (Closed) | Codex, xhigh (Open) | 35.5 |
| 3 | Kimi K2.7 Code (Open) | Kimi-Code (Open) | 35.1 |
| 4 | Kimi K2.6 (Open) | Kimi-Code (Open) | 26.7 |
Results are taken from the Kimi K2.7 Code model card, and we have verified them. The evaluation runs on Harbor with a 5-hour exploration budget per agent. This is not the native harness used for our main paper results, but evaluating with it is highly encouraged.
Model Performance by Category
Each model's bar shows Vanilla as the darker lower portion and Agent as the lighter overlay, against a translucent grey Human SOTA reference computed from the reproduced human baselines. Scores use the paper's normalized task metric.
Task Categories
140 tasks across 12 flat categories.