After scores and checkmarks, the next reward is a sentence. An opinionated tour through the arc of feedback in RL for LLMs — scalar RLHF, verifiable RLVR, and the rise of verbal feedback — culminating in Ditto.
An opinionated tour through the algorithm tree of modern LLM RL — PPO, GRPO, REINFORCE, REINFORCE++, DPO, and the theoretical ideas that tie them together.
Exploring what makes AI agents truly effective for users, beyond benchmark performance.
Stop using outdated bad-word lists. Use ToxicTrig instead for better toxic-language analysis.