Fine-Tuning Chat Models: End-to-End
End-to-end walkthrough: preparing JSONL, uploading files via the Files API, launching a fine-tuning job, monitoring progress, and running inference on the fine-tuned model.
Preparing Training Data
import json
import pandas as pd
from openai import OpenAI
client = OpenAI()
recipe_df = pd.read_csv("data/cookbook_recipes_nlg_10k.csv")
system_msg = "You are a helpful recipe assistant. Extract generic ingredients."
def prepare_example(row):
    # Build one chat-format training example from a recipe row.
    return {
        "messages": [
            {"role": "system", "content": system_msg},
            {"role": "user",
             "content": f"Title: {row['title']}\n\nIngredients: {row['ingredients']}\n\nGeneric ingredients: "},
            {"role": "assistant", "content": row["NER"]},
        ]
    }
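# .loc slices by label and includes both endpoints: with the default integer index
# this gives rows 0-100 (101 examples) for training and 101-200 (100 examples) for validation.
training_data = recipe_df.loc[0:100].apply(prepare_example, axis=1).tolist()
validation_data = recipe_df.loc[101:200].apply(prepare_example, axis=1).tolist()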
def write_jsonl(data, path):
    # One JSON object per line, as the fine-tuning endpoint expects.
    with open(path, "w") as f:
        for d in data:
            f.write(json.dumps(d) + "\n")
write_jsonl(training_data, "train.jsonl")
write_jsonl(validation_data, "val.jsonl")
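Before uploading, it can help to check the files locally, since a malformed line only surfaces after the upload. The following is a minimal sketch, not an official validator: the function name validate_chat_jsonl is hypothetical, and it only checks the three roles used in this walkthrough (real datasets may also contain tool messages, which this check would flag).

```python
import json

# Roles used in this walkthrough; a simplification of what the API accepts.
VALID_ROLES = {"system", "user", "assistant"}


def validate_chat_jsonl(path):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((lineno, "blank line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            messages = record.get("messages") if isinstance(record, dict) else None
            if not isinstance(messages, list) or not messages:
                problems.append((lineno, "missing or empty 'messages' list"))
                continue
            for msg in messages:
                if not isinstance(msg, dict) or msg.get("role") not in VALID_ROLES:
                    problems.append((lineno, "message with unknown or missing role"))
                    break
                if not isinstance(msg.get("content"), str):
                    problems.append((lineno, "message content is not a string"))
                    break
            else:
                # Only reached when every message passed the checks above.
                if messages[-1]["role"] != "assistant":
                    problems.append((lineno, "last message is not from the assistant"))
    return problems

# Example usage (assumes train.jsonl was written by write_jsonl above):
# for lineno, problem in validate_chat_jsonl("train.jsonl"):
#     print(f"line {lineno}: {problem}")
```

An empty result does not guarantee the job will be accepted; it only rules out the structural mistakes listed in the function.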
Uploading Files and Launching the Job
def upload_file(path):
    # Upload a JSONL file for fine-tuning and return its file ID.
    with open(path, "rb") as f:
        return client.files.create(file=f, purpose="fine-tune").id
train_id = upload_file("train.jsonl")
val_id = upload_file("val.jsonl")
job = client.fine_tuning.jobs.create(
    training_file=train_id,
    validation_file=val_id,
    model="gpt-4o-mini-2024-07-18",
    suffix="recipe-ner",
)
print("Job ID:", job.id, "| Status:", job.status)
Monitoring and Inference
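import time
# Poll once a minute until the job reaches a terminal state.
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job.status}")
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)
# fine_tuned_model is only populated when the job succeeds.
if job.status != "succeeded":
    raise RuntimeError(f"Fine-tuning job ended with status {job.status}")
ft_model = job.fine_tuned_model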
# Inference: the user prompt uses the same layout (blank lines between fields) as the training examples.
resp = client.chat.completions.create(
    model=ft_model,
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user", "content": "Title: Pasta\n\nIngredients: [...]\n\nGeneric ingredients: "},
    ],
    temperature=0,
    max_tokens=200,
)
print(resp.choices[0].message.content)
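The monitoring loop above polls indefinitely. One possible refinement, sketched here under stated assumptions, is a reusable helper that adds a timeout and raises when the job ends in any state other than succeeded. The name wait_for_job and its parameters are hypothetical; the retrieve argument is expected to be a callable such as client.fine_tuning.jobs.retrieve that returns an object with a status attribute.

```python
import time

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=60, timeout_seconds=3 * 60 * 60,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll retrieve(job_id) until the job succeeds; raise on failure, cancellation, or timeout.

    sleep and clock are injectable so the helper can be exercised without real waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            if job.status != "succeeded":
                raise RuntimeError(f"Job {job_id} ended with status {job.status}")
            return job
        if clock() >= deadline:
            raise TimeoutError(f"Job {job_id} still {job.status} after {timeout_seconds} s")
        sleep(poll_seconds)

# Example usage (assumes the client and job objects from this walkthrough):
# finished = wait_for_job(client.fine_tuning.jobs.retrieve, job.id)
# ft_model = finished.fine_tuned_model
```

The three-hour default timeout is an arbitrary placeholder; adjust it to the dataset size and your own tolerance for waiting.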
Start with 50–100 examples. Quality tends to improve by a similar amount each time the dataset doubles rather than with each added example, so grow the dataset iteratively until you reach your target accuracy.
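One way to grow the dataset iteratively is to train on subsets whose size doubles each round and stop once the target accuracy is reached. The sketch below only computes the subset sizes; the doubling rule and the name doubling_schedule are assumptions made for illustration, not something the API prescribes.

```python
def doubling_schedule(start, total):
    """Return subset sizes start, 2*start, 4*start, ... ending with total itself."""
    if start <= 0 or total <= 0:
        raise ValueError("start and total must be positive")
    sizes = []
    n = start
    while n < total:
        sizes.append(n)
        n *= 2
    # Always finish with the full dataset, even if it is not a power-of-two multiple.
    sizes.append(total)
    return sizes
```

For example, doubling_schedule(50, 1000) gives [50, 100, 200, 400, 800, 1000]; each size would correspond to one fine-tuning job on the first n examples.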
Create a dataset of 50+ examples for a named entity extraction task in your domain, launch a fine-tuning job on gpt-4o-mini, and compare results against the base model on 20 test examples.
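For the comparison against the base model, some numeric score is needed. The sketch below uses a set-based F1 over extracted entities, which is one reasonable choice rather than the only one. It assumes that both the model output and the reference answer are JSON lists of strings; the function names are hypothetical.

```python
import json


def parse_entities(text):
    """Parse a JSON list into a set of lowercased strings; return an empty set if parsing fails."""
    try:
        value = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return set()
    if not isinstance(value, list):
        return set()
    return {str(v).strip().lower() for v in value}


def entity_f1(predicted, expected):
    """Set-based F1 between predicted and expected entity lists (both JSON strings)."""
    pred, gold = parse_entities(predicted), parse_entities(expected)
    if not pred and not gold:
        return 1.0  # nothing to find and nothing predicted
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, references):
    """Average entity_f1 over paired predictions and references."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and the same length")
    return sum(entity_f1(p, r) for p, r in zip(predictions, references)) / len(references)
```

Running mean_f1 once on the base model's outputs and once on the fine-tuned model's outputs for the same 20 test examples gives two directly comparable numbers. Unparseable output counts as an empty prediction, which penalizes a model that ignores the expected format.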
Copy and adapt to your context. Text in angle brackets should be replaced.
You are launching a fine-tuning job via the OpenAI API. Prepare train.jsonl and val.jsonl (minimum 50 examples in train), upload them via the Files API, create a job with model gpt-4o-mini-2024-07-18, wait for the succeeded status, and run inference on 5 test examples to evaluate quality.
Common pitfalls: a dataset that is too small (fewer than 10 examples), running inference before the job reaches the succeeded status, and using outdated model IDs.
The suffix parameter makes the model ID human-readable (ft:gpt-4o-mini:your-org:recipe-ner:...). The validation file lets you track overfitting via training_loss vs. validation_loss directly in the dashboard.
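If you export the two loss curves as plain lists, the overfitting check can also be done in code. The following sketch flags the case where validation loss has risen above its minimum while training loss kept falling. The function name, the min_gap threshold, and the assumption that both curves are sampled at the same steps are illustrative choices, not part of the API.

```python
def detect_overfitting(train_loss, valid_loss, min_gap=0.0):
    """Return the index of the best validation step if the run looks overfit, else None.

    A run is flagged when validation loss at the last step exceeds its minimum by more
    than min_gap while training loss at the last step is lower than at that minimum.
    """
    if len(train_loss) != len(valid_loss) or not train_loss:
        raise ValueError("loss lists must be non-empty and the same length")
    best = min(range(len(valid_loss)), key=valid_loss.__getitem__)
    valid_rose = valid_loss[-1] - valid_loss[best] > min_gap
    train_fell = train_loss[-1] < train_loss[best]
    return best if (valid_rose and train_fell) else None
```

A returned index suggests that training past that step mainly fit the training set; None means the heuristic found no such divergence, which is not proof that the model generalizes well.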
Use fine-tuning when you need to teach the model a specific output format, style, or domain and prompt engineering is no longer sufficient.
Do not use fine-tuning to add factual knowledge; use RAG instead. For tasks where quality is assessed subjectively, consider DPO.