Prompt Evaluations: Anthropic's Official Course · Lesson 1

Evaluations 101: Why Measure Prompts

The difference between benchmarks and customer evals. The four eval components: input, golden answer, output, score. Three grading approaches: human-based, code-based, model-based.
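As a rough illustration of the four components listed above, one eval row could be represented as below. This is a minimal sketch, not the course's own code: the `EvalRecord` class, its field names, and the `grade_exact_match` helper are hypothetical, and the model output is hard-coded rather than produced by a real model call. Only the code-based grading approach is shown; human-based and model-based grading would fill the same `score` field by other means.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalRecord:
    """One eval row holding the four components (hypothetical structure)."""
    input: str                      # what is sent to the prompt
    golden_answer: str              # the reference answer
    output: Optional[str] = None    # what the model actually returned
    score: Optional[float] = None   # result of grading output against golden_answer


def grade_exact_match(record: EvalRecord) -> float:
    """Code-based grading: 1.0 if the normalized output equals the golden answer."""
    if record.output is None:
        return 0.0
    same = record.output.strip().lower() == record.golden_answer.strip().lower()
    return 1.0 if same else 0.0


record = EvalRecord(
    input="Is 'I love this!' positive or negative?",
    golden_answer="positive",
)
record.output = " Positive\n"  # hard-coded stand-in for a model response
record.score = grade_exact_match(record)
print(record.score)
```

Exact match only suits tasks with a single short correct answer, such as classification. Free-form text generation usually needs a human or model grader instead.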

30 min read · 3 questions in quiz · Ready-to-use prompt included

Practical exercise

What to do after this lesson

Take a real task (classification, data extraction, or text generation). Create 10 test cases with golden answers. Run a baseline prompt and record its accuracy as the baseline that later prompt versions must beat.
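The exercise above could be sketched as follows for a sentiment classification task. Everything here is assumed for illustration: the ten test cases are invented, and `baseline_model` is a keyword-rule stand-in for a real model call with your baseline prompt, so the script runs without an API key. Swap in your own data and model call.

```python
# Ten hypothetical test cases, each with an input and a golden answer.
eval_data = [
    {"input": "I love this product", "golden_answer": "positive"},
    {"input": "Terrible experience, never again", "golden_answer": "negative"},
    {"input": "Works exactly as described", "golden_answer": "positive"},
    {"input": "It arrived broken", "golden_answer": "negative"},
    {"input": "Best purchase this year", "golden_answer": "positive"},
    {"input": "I want a refund", "golden_answer": "negative"},
    {"input": "Fast shipping and great support", "golden_answer": "positive"},
    {"input": "I hate the new update", "golden_answer": "negative"},
    {"input": "Not the worst I've used, quite good actually", "golden_answer": "positive"},
    {"input": "I can't say I'm happy with it", "golden_answer": "negative"},
]


def baseline_model(text: str) -> str:
    """Stand-in for a model call using the baseline prompt (assumption)."""
    negative_words = ("hate", "terrible", "broken", "refund", "worst")
    if any(word in text.lower() for word in negative_words):
        return "negative"
    return "positive"


def run_eval(data, model_fn) -> float:
    """Run every test case, grade by exact match, and return the accuracy."""
    correct = 0
    for case in data:
        output = model_fn(case["input"])
        if output.strip().lower() == case["golden_answer"]:
            correct += 1
    return correct / len(data)


baseline_accuracy = run_eval(eval_data, baseline_model)
print(f"Baseline accuracy: {baseline_accuracy:.0%}")  # prints "Baseline accuracy: 80%"
```

The stand-in gets the last two cases wrong because both rely on negation. The recorded number is only meaningful if the same ten cases are reused unchanged when you test a revised prompt.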

Ready-to-use prompt

Template for this lesson

Copy the template and adapt it to your context. Replace the text in angle brackets.

Help me design an eval dataset.

Prompt task: <describe what the prompt should do>
Sample inputs: <3-5 examples>
Criteria for a correct answer: <how to define success>

Suggest:
1. eval_data structure (Python dict)
2. Grading method (code/model/human)
3. Minimum accuracy threshold for production

Common mistakes

What people get wrong

Test cases are too simple, so the eval never surfaces edge cases.
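One way to see this mistake in practice is to tag each test case and report accuracy per tag. The sketch below is illustrative only: the data extraction task, the `tag` field, and the regex-based `extract_total` function (a stand-in for a model call) are all assumptions, not part of the lesson.

```python
import re
from collections import defaultdict

# Hypothetical cases for a "pull the total out of the text" extraction task.
eval_data = [
    {"input": "Total: 42", "golden_answer": "42", "tag": "simple"},
    {"input": "Total: 7", "golden_answer": "7", "tag": "simple"},
    {"input": "Total: 130", "golden_answer": "130", "tag": "simple"},
    {"input": "Total: 1,250", "golden_answer": "1250", "tag": "edge"},
    {"input": "Order 7, total: 30", "golden_answer": "30", "tag": "edge"},
    {"input": "No total listed", "golden_answer": "", "tag": "edge"},
]


def extract_total(text: str) -> str:
    """Stand-in for a model call: returns the first run of digits, if any."""
    match = re.search(r"\d+", text)
    return match.group() if match else ""


def accuracy_by_tag(data, model_fn) -> dict:
    """Return a mapping of tag to accuracy, graded by exact match."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for case in data:
        total[case["tag"]] += 1
        if model_fn(case["input"]) == case["golden_answer"]:
            correct[case["tag"]] += 1
    return {tag: correct[tag] / total[tag] for tag in total}


results = accuracy_by_tag(eval_data, extract_total)
for tag, accuracy in results.items():
    print(f"{tag}: {accuracy:.0%}")
```

With only the three simple cases, this stand-in would score perfectly. The thousands separator and the extra number in the edge cases are what expose its failures.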

Why Evals Are the Foundation of Production LLMs

The Four Eval Components

Three Grading Approaches

The Eval Loop