Evaluations 101: Why Measure Prompts
The difference between benchmarks and customer evals. The four eval components: input, golden answer, output, score. Three grading approaches: human-based, code-based, model-based.
Take a real task (classification, data extraction, text generation). Create 10 test cases with golden answers. Run a baseline prompt and record the accuracy as your benchmark.
Task grader
Copy and adapt to your context. Text in angle brackets should be replaced.
Help me design an eval dataset. Prompt task: <describe what the prompt should do> Sample inputs: <3-5 examples> Criteria for a correct answer: <how to define success> Suggest: 1. eval_data structure (Python dict) 2. Grading method (code/model/human) 3. Minimum accuracy threshold for production
Prompt sandbox
- Test cases are too simple — the eval fails to surface edge cases.