Reinforcement Fine-Tuning: Training with a Verifiable Reward — OpenAI Fine-tuning: From Data to DPO