What happens after LLM pre-training?
This map explains the post-training pipeline: supervised fine-tuning, preference data, reward modelling, RLHF, DPO, PPO, GRPO, RLVR, evaluation, and where the extra compute is being spent.
- Alignment methods: SFT, RLHF, DPO, and reward-model-based optimisation (a minimal DPO loss sketch follows this list).
- Reasoning methods: RLVR, verifiable rewards, math, code, tool use, and evaluation loops (see the verifiable-reward sketch below).
- System pressure: post-training increasingly changes compute demand, data workflows, and inference behaviour.
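
To make the alignment bullet concrete, here is a minimal sketch of the DPO objective. It assumes per-sequence log-probabilities have already been computed for both the policy and a frozen reference model; the function name, signature, and default `beta` are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over summed sequence log-probs."""
    # Implicit rewards: how far the policy's preference for each response
    # has shifted relative to the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry objective: maximise the log-probability that the chosen
    # response beats the rejected one under these implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The design point worth noticing: no separate reward model is trained; the policy-to-reference log-ratio acts as the reward, which is what lets DPO replace the RLHF reward-modelling and PPO stages with a single supervised-style loss.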
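And for the reasoning bullet: the defining feature of RLVR is that the reward comes from a programmatic check rather than a learned reward model. A minimal sketch, assuming a hypothetical `Answer:` delimiter convention for extracting a model's final answer (real pipelines use task-specific parsers and normalisation):

```python
import re

def verifiable_math_reward(completion: str, reference_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the model's final answer matches
    the reference exactly after whitespace normalisation, else 0.0."""
    # Assumes completions end with a line like "Answer: 42"; this delimiter
    # is an illustrative convention, not a standard.
    match = re.search(r"Answer:\s*(.+)\s*$", completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0
```

Because the check is deterministic, the same function can score rollouts for PPO- or GRPO-style training and double as an evaluation metric, which is why math and code tasks dominate this part of the pipeline.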