What happens after LLM pre-training?
This map explains the post-training pipeline: supervised fine-tuning, preference data, reward modelling, RLHF, DPO, PPO, GRPO, RLVR, evaluation, and where the extra compute is being spent.
- Alignment methods: SFT, RLHF, DPO, and reward-model-based optimisation (a minimal DPO loss sketch follows this list).
- Reasoning methods: RLVR, verifiable rewards, math, code, tool use, and evaluation loops (see the verifiable-reward sketch below).
- System pressure: post-training increasingly changes compute demand, data workflows, and inference behaviour.
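
To make the alignment bullet concrete, here is a minimal sketch of the DPO objective. It assumes per-sequence log-probabilities have already been computed for both the policy and a frozen reference model; the function name, signature, and default `beta` are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over summed sequence log-probs."""
    # Implicit rewards: how far the policy's preference for each response
    # has shifted relative to the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry objective: maximise the log-probability that the chosen
    # response beats the rejected one under these implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The design point worth noticing: no separate reward model is trained; the policy-to-reference log-ratio acts as the reward, which is what lets DPO replace the RLHF reward-modelling and PPO stages with a single supervised-style loss.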
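And for the reasoning bullet: the defining feature of RLVR is that the reward comes from a programmatic check rather than a learned reward model. A minimal sketch, assuming a hypothetical `Answer:` delimiter convention for extracting a model's final answer (real pipelines use task-specific parsers and normalisation):

```python
import re

def verifiable_math_reward(completion: str, reference_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the model's final answer matches
    the reference exactly after whitespace normalisation, else 0.0."""
    # Assumes completions end with a line like "Answer: 42"; this delimiter
    # is an illustrative convention, not a standard.
    match = re.search(r"Answer:\s*(.+)\s*$", completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0
```

Because the check is deterministic, the same function can score rollouts for PPO- or GRPO-style training and double as an evaluation metric, which is why math and code tasks dominate this part of the pipeline.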