OpenAI Fine-tuning: From Data to DPO · Lesson 1

Data Preparation for Chat Model Fine-Tuning

Complete pipeline for checking and analyzing a JSONL dataset before fine-tuning: format errors, token statistics, and training cost estimation.

35 min read · 3 questions in quiz · Ready prompt included

Data Format for Fine-Tuning

Each line of a JSONL file is one JSON object with a messages key, matching the Chat Completions format. The example below is indented for readability; in the file it occupies a single line:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"}
  ]
}
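
The one-object-per-line layout can be produced with the standard json module. This is a minimal sketch: the helper name to_jsonl and the two sample conversations are illustrative assumptions, not part of the lesson's dataset.

```python
import json

def to_jsonl(examples):
    """Serialize each example as one compact JSON object per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples) + "\n"

# Hypothetical two-example dataset in the messages format.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "4"},
    ]},
    {"messages": [
        {"role": "user", "content": "Name a prime number."},
        {"role": "assistant", "content": "7"},
    ]},
]

jsonl_text = to_jsonl(examples)

# Round-trip check: every non-empty line parses back to the original object.
parsed = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
```

Because json.dumps escapes newlines inside strings, multi-line message content still ends up on a single physical line.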

Loading and Basic Validation

import json
from collections import defaultdict

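# Read one JSON object per line; skip blank lines so a trailing newline does not crash json.loads.
with open("data/toy_chat_fine_tuning.jsonl", "r", encoding="utf-8") as f:
    dataset = [json.loads(line) for line in f if line.strip()]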

format_errors = defaultdict(int)
for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
    messages = ex.get("messages")
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
    for msg in messages:
        if "role" not in msg or "content" not in msg:
            format_errors["message_missing_key"] += 1
        if msg.get("role") not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1
    # .get avoids a KeyError when a message has no "role" key (already counted above).
    if not any(m.get("role") == "assistant" for m in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")
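
Counts say how many errors exist but not where they are. To fix a dataset it helps to know which lines are affected, so the sketch below applies the same checks and records line numbers. The function names validate_example and errors_by_line and the sample data are assumptions made for illustration.

```python
from collections import defaultdict

VALID_ROLES = ("system", "user", "assistant", "function")

def validate_example(ex):
    """Return the list of error names for one example (empty if valid)."""
    if not isinstance(ex, dict):
        return ["data_type"]
    messages = ex.get("messages")
    if not messages:
        return ["missing_messages_list"]
    errors = []
    for msg in messages:
        if "role" not in msg or "content" not in msg:
            errors.append("message_missing_key")
        if msg.get("role") not in VALID_ROLES:
            errors.append("unrecognized_role")
    if not any(m.get("role") == "assistant" for m in messages):
        errors.append("example_missing_assistant_message")
    return errors

def errors_by_line(dataset):
    """Map each error name to the 1-based line numbers where it occurs."""
    report = defaultdict(list)
    for line_no, ex in enumerate(dataset, start=1):
        for name in set(validate_example(ex)):  # report each error once per line
            report[name].append(line_no)
    return dict(report)

# Hypothetical dataset: one valid example followed by three broken ones.
sample = [
    {"messages": [{"role": "user", "content": "Hi"},
                  {"role": "assistant", "content": "Hello"}]},
    {"messages": [{"role": "user", "content": "Hi"}]},     # no assistant reply
    {"messages": [{"role": "robot", "content": "Beep"}]},  # unknown role
    ["not", "a", "dict"],                                  # wrong top-level type
]
report = errors_by_line(sample)
for name, lines in sorted(report.items()):
    print(f"{name}: lines {lines}")
```

Line numbers here are positions in the loaded list, so they match the file only if no blank lines were skipped during loading.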

Token Counting and Cost Estimation

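import numpy as np
import tiktoken

# cl100k_base matches gpt-3.5-turbo and gpt-4; other models may use a different encoding.
enc = tiktoken.get_encoding("cl100k_base")

def num_tokens(messages):
    """Approximate the token count of one conversation."""
    total = 3  # fixed overhead per conversation (approximation)
    for m in messages:
        total += 3  # fixed overhead per message (approximation)
        for v in m.values():
            # Both the role and the content strings are counted.
            total += len(enc.encode(str(v)))
    return total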

convo_lens = [num_tokens(ex["messages"]) for ex in dataset]
print(f"Mean tokens/example: {np.mean(convo_lens):.0f}")
n_too_long = sum(n > 16385 for n in convo_lens)
print(f"Examples over token limit: {n_too_long}")

# Cost estimate (3 epochs by default).
# Each example is capped at the limit because longer examples are truncated.
total_billing = sum(min(16385, n) for n in convo_lens)
print(f"~{total_billing * 3} billing tokens for 3 epochs")
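
The billing-token figure can be turned into a fuller summary and, given a price, a dollar estimate. The sketch below works on a plain list of token counts so it runs without tiktoken. The function name training_summary, the sample counts, and the price of 8.00 USD per million tokens are placeholders, not real OpenAI pricing; look up the current rate for your model.

```python
MAX_TOKENS = 16385  # per-example limit used in this lesson

def training_summary(convo_lens, n_epochs=3, usd_per_million_tokens=None):
    """Summarize token counts and, if a price is supplied, the training cost."""
    # Examples over the limit are billed only up to the limit.
    billed = sum(min(MAX_TOKENS, n) for n in convo_lens)
    summary = {
        "examples": len(convo_lens),
        "min": min(convo_lens),
        "max": max(convo_lens),
        "mean": sum(convo_lens) / len(convo_lens),
        "over_limit": sum(n > MAX_TOKENS for n in convo_lens),
        "billing_tokens": billed * n_epochs,
    }
    if usd_per_million_tokens is not None:
        summary["cost_usd"] = (
            summary["billing_tokens"] * usd_per_million_tokens / 1_000_000
        )
    return summary

# Hypothetical token counts and a placeholder price.
stats = training_summary([120, 450, 20000], usd_per_million_tokens=8.00)
for key, value in stats.items():
    print(f"{key}: {value}")
```

In practice the list passed in would be convo_lens from the token-counting step.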

A rough rule of thumb: 50–100 well-curated examples produce a noticeable effect; more than 1,000 is usually sufficient for most tasks. Always run your dataset through this validation script before uploading to the API.

Practical exercise

What to do after this lesson

Take any of your own dialogue JSONL files, run the validation and token-counting script, fix all errors found, and calculate the approximate training cost for 3 epochs.

Ready-to-use prompt

Template for this lesson

Copy and adapt to your context. Text in angle brackets should be replaced.

You are validating a JSONL dataset for OpenAI fine-tuning.
Run the full validation script from the ft-data-prep lesson.
Print a summary: example count, errors (if any), min/max/mean tokens,
number of examples exceeding the 16,385-token limit,
and estimated training cost for 3 epochs at current OpenAI pricing.

Common mistakes

What people get wrong

Forgetting to verify that every example contains at least one assistant message; not accounting for the fact that examples longer than 16,385 tokens will be truncated, not discarded.
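
Since over-limit examples are truncated rather than dropped, it is worth knowing how much of each one is lost. A minimal sketch, assuming the 16,385-token limit stated in this lesson and a hypothetical helper name truncation_report:

```python
MAX_TOKENS = 16385  # limit assumed throughout this lesson

def truncation_report(convo_lens, max_tokens=MAX_TOKENS):
    """Return (index, tokens_lost) for every example over the limit."""
    return [(i, n - max_tokens) for i, n in enumerate(convo_lens) if n > max_tokens]

# Hypothetical token counts; an example exactly at the limit loses nothing.
report = truncation_report([900, 16385, 17000, 30000])
print(report)  # → [(2, 615), (3, 13615)]
```

Examples with a large loss are candidates for shortening or splitting before upload, since the tokens cut off may include the assistant reply the model is meant to learn.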

Pro tips

What works but no one documents

Run tiktoken before uploading: it lets you estimate cost without a single API call. Around 50 examples is a practical starting point; for tone or style adjustments, as few as 20 can work.

When to use

Before every new fine-tuning task to verify data and pre-calculate budget.

When not to use

Not needed for inference or prompt engineering without fine-tuning.

Official sources

OpenAI Cookbook

Quiz: 3 questions

1. What happens to a dataset example that exceeds the 16,385-token limit?

2. Which key is required in every example of a JSONL file for chat fine-tuning?

3. What are billing tokens in the context of fine-tuning cost calculation?

