Workbench Evals: Rapid Prototyping
Use Anthropic Workbench for manual prompt testing: run evals across multiple test cases, compare prompt versions (v1 vs v2), and score outputs by hand on a 1-5 scale.
Open Anthropic Workbench. Create a prompt with two variables. Add 3 test cases, score results (1-5), improve the prompt, and compare versions via Add Comparison.
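Copy and adapt to your context. The variables in double curly braces, {{SOURCE_CODE}} and {{SOURCE_LANGUAGE}}, are filled in from each test case; the angle-bracket XML tags are part of the prompt and should stay as written.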
You are a skilled programmer translating code to Python.
<source_code>
{{SOURCE_CODE}}
</source_code>
Source language: {{SOURCE_LANGUAGE}}
Translate to Python. Format:
<python_code>
[translation here]
</python_code>
Only output the <python_code> tags, no preamble or explanation.

- Evaluating only one test case: not representative.
- Not versioning prompts: forgetting exactly what changed.
- Subjective scoring without criteria: different people score differently.
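
The three pitfalls above can also be guarded against outside the Workbench UI. The following is a minimal offline sketch, not a Workbench API: it keeps the prompt as a named version, fills the two variables for several test cases, pulls the translation out of the <python_code> tags, and averages hand-assigned 1-5 scores per version. The names `PROMPT_V1`, `TEST_CASES`, `RUBRIC`, `fill_prompt`, `extract_python`, and `compare_versions` are hypothetical, and the test cases and rubric wording are illustrative assumptions to replace with your own.

```python
import re
from statistics import mean
from typing import Optional

# The prompt from this page, stored as an explicit version so changes are traceable.
PROMPT_V1 = """You are a skilled programmer translating code to Python.
<source_code>
{{SOURCE_CODE}}
</source_code>
Source language: {{SOURCE_LANGUAGE}}
Translate to Python. Format:
<python_code>
[translation here]
</python_code>
Only output the <python_code> tags, no preamble or explanation."""

# Three illustrative test cases (assumed examples), one value per variable.
TEST_CASES = [
    {"SOURCE_LANGUAGE": "JavaScript", "SOURCE_CODE": "console.log('hi');"},
    {"SOURCE_LANGUAGE": "C", "SOURCE_CODE": 'printf("%d\\n", 1 + 2);'},
    {"SOURCE_LANGUAGE": "Ruby", "SOURCE_CODE": "puts [1, 2, 3].sum"},
]

# Written criteria so that different reviewers score consistently (assumed wording).
RUBRIC = {
    1: "Not valid Python, or does not run",
    3: "Runs, but behavior differs from the source in some cases",
    5: "Runs, matches the source behavior, idiomatic Python",
}


def fill_prompt(template: str, variables: dict) -> str:
    """Replace each {{NAME}} placeholder; raises KeyError if a variable is missing."""
    # A function replacement inserts the value literally, so backslashes
    # in source code are not interpreted as regex escapes.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: variables[m.group(1)], template)


def extract_python(output: str) -> Optional[str]:
    """Return the text inside <python_code> tags, or None if the tags are absent."""
    match = re.search(r"<python_code>\s*(.*?)\s*</python_code>", output, re.DOTALL)
    return match.group(1) if match else None


def compare_versions(scores: dict) -> dict:
    """Average the 1-5 human scores for each prompt version."""
    for version, values in scores.items():
        if not values or any(not 1 <= v <= 5 for v in values):
            raise ValueError(f"{version}: scores must be non-empty and within 1-5")
    return {version: mean(values) for version, values in scores.items()}


if __name__ == "__main__":
    for case in TEST_CASES:
        print(fill_prompt(PROMPT_V1, case))
        print("-" * 40)
```

To compare versions, record one score per test case for each version, for example `compare_versions({"v1": [...], "v2": [...]})` with your own scores, and keep the rubric next to the scores so that a later reviewer applies the same criteria.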