OpenAI Fine-tuning: From Data to DPO · Lesson 3
Direct Preference Optimization: Learning from Preferences
DPO aligns a model to subjective criteria (tone, style, brand voice) using preferred/rejected response pairs — no reward model or RL required.
DPO aligns a model to subjective criteria (tone, style, brand voice) using preferred/rejected response pairs — no reward model or RL required.