sergiopaniego (Sergio Paniego)

Building on HF

Sergio Paniego PRO

sergiopaniego
huggingface

AI & ML interests

None yet

Recent Activity

updated a dataset about 21 hours ago
agents-course/final-certificates
updated a dataset about 21 hours ago
agents-course/course-certificates-of-excellence
posted an update 1 day ago
Frontier agents are this good partly because the model was trained inside the very harness it ships with. NVIDIA's new paper "Polar: Agentic RL on Any Harness at Scale" brings that recipe to the open: it turns coding harnesses like Codex, Claude Code, Qwen Code or Pi into RL training environments without touching their internals.

The core idea: every agent, however complex or closed, talks to a model through an API, so they put a proxy there. The harness runs exactly like in production while the proxy records prompts, sampled token ids and logprobs. Trajectories get rebuilt outside, token-faithful, so gradients hit the exact tokens the policy sampled.

The gains are consistent across all four harnesses. Same Qwen3.5-4B, plain GRPO, evaluated on SWE-Bench Verified:

Codex 3.8 → 26.4 (+22.6)
Claude Code 29.8 → 34.6 (+4.8)
Qwen Code 34.6 → 35.2 (+0.6)
Pi 34.2 → 40.4 (+6.2)

The biggest gains appear on unfamiliar execution paths, Codex being the clearest case. The takeaway: you are not just training a model, you are training the model + harness system.

Two engineering pieces make it work at scale. Async worker pools isolate container boots (CPU), agent execution (GPU) and long-tail test runs, so slow runtimes never block the GPUs. And prefix merging stitches hundreds of captured API calls back into contiguous traces: 5.4x faster trainer updates and rollout GPUs at 88% utilization.

It also doubles as an SFT data factory: 504 test-verified agent traces from a 122B teacher, multi-turn conversations averaging 104 messages each, coming to the Hub under Apache 2.0 (release pending review).

Paper authors: Binfeng Xu, Hao Zhang, Shaokun Zhang, Songyang Han, Mingjie Liu, Jian Hu, Shizhe Diao, Zhenghui Jin, Yunheng Zou, Michael Demoret, Jan Kautz and Yi Dong.

> Paper: https://huggingface.co/papers/2605.24220
> Code: https://github.com/NVIDIA-NeMo/ProRL-Agent-Server
> Training data: https://huggingface.co/datasets/NovaSky-AI/SkyRL-v0-293-data
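To make the prefix merging idea from the post more concrete, here is a minimal sketch in Python. It is not the paper's implementation: the names `Call`, `Trace` and `merge_calls` are hypothetical, and it assumes each captured API call exposes its prompt token ids, the sampled token ids and their logprobs. A call whose prompt begins with everything accumulated so far extends the current trace; anything else starts a new one. The loss mask marks only the tokens the policy actually sampled.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    """One captured API call (hypothetical record format)."""
    prompt_ids: list[int]
    sampled_ids: list[int]
    logprobs: list[float]


@dataclass
class Trace:
    """A contiguous training sequence rebuilt from one or more calls."""
    token_ids: list[int] = field(default_factory=list)
    loss_mask: list[int] = field(default_factory=list)   # 1 = sampled by the policy
    logprobs: list[float] = field(default_factory=list)  # 0.0 placeholder on context tokens


def merge_calls(calls: list[Call]) -> list[Trace]:
    """Stitch calls into traces whenever a prompt extends the previous trace."""
    traces: list[Trace] = []
    for call in calls:
        if len(call.sampled_ids) != len(call.logprobs):
            raise ValueError("each sampled token needs exactly one logprob")

        current = traces[-1] if traces else None
        seen = len(current.token_ids) if current is not None else 0

        if current is not None and call.prompt_ids[:seen] == current.token_ids:
            # The prompt repeats the whole trace so far: keep only the new context
            # (tool output, user turn, ...) that the harness appended.
            new_context = call.prompt_ids[seen:]
        else:
            # Prefix mismatch (e.g. the harness rewrote its context): new trace.
            current = Trace()
            traces.append(current)
            new_context = call.prompt_ids

        # Context tokens were not sampled by the policy, so they are masked out.
        current.token_ids += new_context
        current.loss_mask += [0] * len(new_context)
        current.logprobs += [0.0] * len(new_context)

        # Sampled tokens carry the recorded logprobs and contribute to the loss.
        current.token_ids += call.sampled_ids
        current.loss_mask += [1] * len(call.sampled_ids)
        current.logprobs += call.logprobs
    return traces
```

Under these assumptions, three calls where the second prompt extends the first and the third does not would collapse into two traces, so the trainer processes the shared prefix once instead of once per call.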

Organizations

Hugging Face, The LLM Course, trl internal testing, TRL, Huggingface Projects, Hugging Face H4, Blog-explorers, ZeroGPU Explorers, Hugging Face Discord Community, H company, Cookbook Authors, open/ acc, RoboticsLabURJC, Hugging Face Agents Course, llrehf, gg-hf-gm, Hugging Face Context Course, all things vision LMs, a smol course, nanochat students, OpenEnv: Agentic Execution Environments, gg-hf-gg, ML intern explorers