Data Preparation for Chat Model Fine-Tuning
A complete pipeline for checking and analyzing a JSONL dataset before fine-tuning: format validation, token statistics, and training cost estimation.
Data Format for Fine-Tuning
Each line of the JSONL file is one JSON object with a messages key, matching the Chat Completions format. The example is pretty-printed here for readability; in the file it occupies a single line:
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"}
  ]
}
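Since JSONL stores one JSON object per line, an example like the one above has to be serialized onto a single line when written to disk. A minimal sketch of that step follows; the helper name write_jsonl and the temporary output path are illustrative assumptions, not part of the lesson's script.

```python
import json
import os
import tempfile


def write_jsonl(examples, path):
    """Write one compact JSON object per line (the JSONL layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            # json.dumps without indent keeps each example on a single line.
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")


examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is 2+2?"},
            {"role": "assistant", "content": "4"},
        ]
    }
]

# Illustrative location only; use your own dataset path.
path = os.path.join(tempfile.gettempdir(), "toy_chat_fine_tuning.jsonl")
write_jsonl(examples, path)

# Read the file back to confirm the one-object-per-line layout.
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

Reading the file back and parsing each line with json.loads should reproduce the original objects.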
Loading and Basic Validation
import json
from collections import defaultdict

# Load the dataset: one JSON object per non-blank line.
with open("data/toy_chat_fine_tuning.jsonl", "r", encoding="utf-8") as f:
    dataset = [json.loads(line) for line in f if line.strip()]

# Count each kind of format error across the whole dataset.
format_errors = defaultdict(int)
for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
    messages = ex.get("messages")
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
    for msg in messages:
        if "role" not in msg or "content" not in msg:
            format_errors["message_missing_key"] += 1
        if msg.get("role") not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1
    # Use .get() so a message without a "role" key does not raise KeyError.
    if not any(m.get("role") == "assistant" for m in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")
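Note that json.loads raises json.JSONDecodeError on a malformed line, which would stop the script before any format checks run. One possible way around this, sketched here with a hypothetical helper named load_jsonl_lines, is to collect the offending line numbers instead of failing on the first one:

```python
import json


def load_jsonl_lines(lines):
    """Parse JSONL lines, skipping blanks and collecting decode errors.

    Returns (records, bad_lines), where bad_lines holds 1-based line numbers.
    """
    records, bad_lines = [], []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are common at the end of a file
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines.append(lineno)
    return records, bad_lines


sample = [
    '{"messages": [{"role": "user", "content": "Hi"}]}',
    "",
    '{"messages": [',  # truncated line, not valid JSON
]
records, bad_lines = load_jsonl_lines(sample)
print(len(records), bad_lines)
```

Because an open file object iterates over its lines, the same function can be called as load_jsonl_lines(f) on a real dataset file.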
Token Counting and Cost Estimation
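import numpy as np
import tiktoken

MAX_TOKENS_PER_EXAMPLE = 16385  # longer examples are truncated during training
enc = tiktoken.get_encoding("cl100k_base")

def num_tokens(messages):
    total = 3  # approximate fixed overhead per conversation
    for m in messages:
        total += 3  # approximate overhead per message
        for v in m.values():
            total += len(enc.encode(str(v)))
    return total

# Assumes every example passed the validation above and has a "messages" list.
convo_lens = [num_tokens(ex["messages"]) for ex in dataset]
print(f"Mean tokens/example: {np.mean(convo_lens):.0f}")

n_too_long = sum(n > MAX_TOKENS_PER_EXAMPLE for n in convo_lens)
print(f"Examples over token limit: {n_too_long}")

# Cost estimate (3 epochs by default); each example is billed up to the limit.
total_billing = sum(min(MAX_TOKENS_PER_EXAMPLE, n) for n in convo_lens)
print(f"~{total_billing * 3} billing tokens for 3 epochs")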
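To turn the billing-token count into a budget figure, multiply it by the training price per token. The sketch below works on a plain list of per-example token counts (such as convo_lens), so it runs without tiktoken. The function name estimate_cost and the price value are placeholders, not actual OpenAI pricing; substitute the current rate for your model.

```python
MAX_TOKENS_PER_EXAMPLE = 16385  # limit used in this section


def estimate_cost(token_counts, price_per_million, n_epochs=3):
    """Return (billing_tokens, cost) for the given per-example token counts."""
    # Over-limit examples are truncated, so each contributes at most the limit.
    per_epoch = sum(min(MAX_TOKENS_PER_EXAMPLE, n) for n in token_counts)
    billing_tokens = per_epoch * n_epochs
    cost = billing_tokens / 1_000_000 * price_per_million
    return billing_tokens, cost


token_counts = [120, 450, 20000]  # made-up counts for illustration
PRICE_PER_MILLION = 8.00  # hypothetical price in USD; check current pricing

billing_tokens, cost = estimate_cost(token_counts, PRICE_PER_MILLION)
print(f"min/max/mean: {min(token_counts)}/{max(token_counts)}/"
      f"{sum(token_counts) / len(token_counts):.0f}")
print(f"{billing_tokens} billing tokens, about ${cost:.2f}")
```

Keeping the price as an explicit argument makes it easy to rerun the estimate when pricing or the target model changes.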
A rough rule of thumb: 50–100 well-curated examples produce a noticeable effect; more than 1,000 is usually sufficient for most tasks. Always run your dataset through this validation script before uploading to the API.
Take any of your own dialogue JSONL files, run the validation and token-counting script, fix all errors found, and calculate the approximate training cost for 3 epochs.
Copy and adapt to your context. Text in angle brackets should be replaced.
You are validating a JSONL dataset for OpenAI fine-tuning. Run the full validation script from the ft-data-prep lesson. Print a summary: example count, errors (if any), min/max/mean tokens, number of examples exceeding the 16,385-token limit, and estimated training cost for 3 epochs at current OpenAI pricing.
Common mistakes: forgetting to verify that every example contains at least one assistant message, and overlooking that examples longer than 16,385 tokens are truncated, not discarded.
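Because over-limit examples are truncated rather than dropped, it helps to know which ones they are and how many tokens would be cut, so they can be shortened or split by hand. A minimal sketch over precomputed token counts follows; find_truncated is a hypothetical helper name and the counts are made up.

```python
MAX_TOKENS_PER_EXAMPLE = 16385


def find_truncated(token_counts, limit=MAX_TOKENS_PER_EXAMPLE):
    """Return (index, tokens_over_limit) for every example above the limit."""
    return [(i, n - limit) for i, n in enumerate(token_counts) if n > limit]


token_counts = [900, 16385, 17000]  # made-up counts for illustration
for idx, excess in find_truncated(token_counts):
    print(f"example {idx}: {excess} tokens would be cut")
```

An example exactly at the limit is not flagged, since only tokens beyond the limit are at risk.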
Run tiktoken before uploading: it lets you estimate cost without a single API call. About 50 examples is a practical minimum for most tasks; for tone or style adjustments, as few as 20 can work.
Use this pipeline before every new fine-tuning task to verify the data and pre-calculate the budget.
It is not needed for inference or for prompt engineering without fine-tuning.