PPO
Proximal Policy Optimization
Definition
PPO is a reinforcement learning algorithm used in RLHF (reinforcement learning from human feedback) to fine-tune language models on scores from a reward model. Its clipped surrogate objective discourages each update from moving the policy far from the previous policy, which helps keep training stable.
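The clipping idea can be illustrated with a minimal sketch in plain Python. The function name `ppo_clip_loss`, its parameter names, and the default `clip_eps=0.2` are assumptions chosen for illustration, not taken from any particular library. The sketch also assumes the advantages are already computed, and it leaves out the value-function loss and KL penalty that full RLHF pipelines typically add.

```python
import math

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Mean clipped surrogate loss (the negated objective) over a batch.

    Hypothetical helper for illustration only.
    new_logprobs / old_logprobs: log-probabilities of the sampled actions
    (tokens) under the current policy and the previous policy.
    advantages: precomputed advantage estimates, one per sample.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio between the current and the previous policy.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to the interval [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the two terms, so there is no
        # extra reward for pushing the ratio outside the interval.
        total += min(ratio * adv, clipped_ratio * adv)
    # Negate because optimizers minimize, while the objective is maximized.
    return -total / len(advantages)
```

When the two policies assign identical log-probabilities, every ratio is 1 and the loss reduces to the negative mean advantage. When the ratio leaves the interval in the direction that would increase the objective, the clipped term takes over and the gradient through that sample vanishes, which is what keeps a single update from moving the policy too far.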
PPO-based RLHF was used to train InstructGPT and ChatGPT, though its complexity has led to the adoption of simpler alternatives such as DPO (Direct Preference Optimization).