Read-50-traces eval starter
The most valuable thing you can do before building automated evals costs nothing but time: read fifty real outputs by hand. The failure modes you find in these fifty will tell you what to measure — and they will be failure modes the dashboard was not showing you.
Instructions
- Pull fifty real interactions from your AI feature — from production, not the demo. Real inputs, real outputs.
- Read each one as a manager reviewing an employee's work. Do not ask "did it follow instructions." Ask: was this output actually good for the person who received it?
- For each output that falls short, write down the specific reason — not a category. The actual reason. ("It invented a policy we don't have." "It answered only one of two questions." "Tone would have made the customer angrier.")
- Who reads them: the domain expert who knows what good looks like for this work — the senior AP clerk, the experienced paralegal, the support lead. Not the engineer who built the system, and not the executive. The person who does the work.
- After all fifty: tally the failure reasons. They will cluster into patterns. Those patterns are your eval criteria.
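If your production logs are already exported somewhere, pulling the fifty can be as simple as a random draw. The sketch below is one way to do that, not a prescribed method: random sampling is an assumption here (it keeps you from hand-picking the interesting cases), and `production_log`, its field names, and `sample_traces` are hypothetical stand-ins for whatever your system actually stores.

```python
import random

def sample_traces(traces, k=50, seed=0):
    """Return k traces chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the same batch can be pulled again
    if len(traces) <= k:
        return list(traces)    # fewer than k available: review them all
    return rng.sample(traces, k)

# Hypothetical stand-in for a production export: 1,000 logged interactions.
production_log = [
    {"id": i, "input": f"input {i}", "output": f"output {i}"}
    for i in range(1000)
]

batch = sample_traces(production_log)
print(len(batch))
```

Hand the resulting batch to the domain expert as-is; the point of the draw is that nobody curates it first.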
Trace review log
| # | Input (brief summary) | What the AI did | Pass / Fail | Failure mode (if fail — be specific) |
|---|---|---|---|---|
| 1 | | | | |
| 2 | | | | |
| 3 | | | | |
| 4 | | | | |
| 5 | | | | |
| 6 | | | | |
| 7 | | | | |
| 8 | | | | |
| 9 | | | | |
| 10 | | | | |
| 11–50 | (continue on additional rows or a separate sheet) | | | |
Tally: top failure modes
After reading all fifty, list the failure modes that appeared most often, then mark which of them will become automated eval criteria.
| Failure mode (in plain language) | Count | Example trace # | Becomes an automated eval? |
|---|---|---|---|
| 1. | | | ☐ Yes ☐ No |
| 2. | | | ☐ Yes ☐ No |
| 3. | | | ☐ Yes ☐ No |
| 4. | | | ☐ Yes ☐ No |
| 5. | | | ☐ Yes ☐ No |
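If the review log lives in a spreadsheet or a script rather than on paper, the tally can be counted instead of eyeballed. This is a minimal sketch under assumptions: the `review_log` entries and their field names are hypothetical, and it presumes the reviewer has already grouped similar reasons under one shared wording, since the count matches strings exactly.

```python
from collections import Counter

# Hypothetical review log: one entry per trace, mirroring the log table's columns.
review_log = [
    {"trace": 1, "passed": True, "failure_mode": None},
    {"trace": 2, "passed": False, "failure_mode": "invented a policy we don't have"},
    {"trace": 3, "passed": False, "failure_mode": "answered only one of two questions"},
    {"trace": 4, "passed": False, "failure_mode": "invented a policy we don't have"},
]

def tally_failures(log):
    """Return (failure mode, count, first example trace #), most frequent first."""
    counts = Counter()
    first_example = {}
    for row in log:
        if row["passed"]:
            continue
        mode = row["failure_mode"]
        counts[mode] += 1
        first_example.setdefault(mode, row["trace"])  # keep the earliest example
    return [(mode, n, first_example[mode]) for mode, n in counts.most_common()]

tally = tally_failures(review_log)
for mode, count, example in tally:
    print(f"{count:>3}  {mode}  (e.g. trace {example})")
```

Each tuple maps onto one row of the tally table: failure mode, count, and an example trace number.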
Summary box
| Metric | Your result |
|---|---|
| Total traces reviewed | |
| Pass count | |
| Fail count | |
| Pass rate | |
| Top failure mode | |
| Second failure mode | |
| Reviewed by (domain expert) | |
| Date reviewed | |
| Next step | |
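The count rows of the summary box are simple arithmetic on the pass/fail column. A small sketch, assuming the grades are recorded as booleans; the 38/12 split below is purely illustrative, not a result from any real review.

```python
def summarize(grades):
    """grades: list of booleans, True = pass. Returns the summary-box counts."""
    total = len(grades)
    passed = sum(grades)
    return {
        "total": total,
        "pass_count": passed,
        "fail_count": total - passed,
        "pass_rate": passed / total if total else 0.0,
    }

# Illustrative numbers only: 38 passes and 12 fails out of fifty.
summary = summarize([True] * 38 + [False] * 12)
print(f"Pass rate: {summary['pass_rate']:.0%}")
```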
What comes after the fifty
- ☐ Use the failure modes above to write specific automated eval criteria — not abstract ("is this good") but concrete ("does it invent a policy not in our documentation").
- ☐ If using an LLM-as-judge to grade at scale: calibrate the judge against your human grades on a sample before trusting its scores.
- ☐ Add evals to your deployment pipeline so every prompt change, model swap, or retrieval adjustment gets checked before it reaches a customer.
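For the judge-calibration step, one simple check is to line up the judge's grades against the human grades on the same traces and see where they disagree. The sketch below assumes both sets of grades are plain pass/fail booleans; the function name, the grade lists, and the choice of metrics are all illustrative, and what level of agreement counts as good enough is a judgment the source leaves to you.

```python
def calibrate(human, judge):
    """Compare judge grades to human grades on the same traces (True = pass)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty grade lists of equal length")
    pairs = list(zip(human, judge))
    agree = sum(h == j for h, j in pairs)
    # Judge passed an output the human failed: the costly kind of disagreement.
    missed_failures = sum((not h) and j for h, j in pairs)
    # Judge failed an output the human passed.
    false_alarms = sum(h and (not j) for h, j in pairs)
    return {
        "agreement": agree / len(pairs),
        "missed_failures": missed_failures,
        "false_alarms": false_alarms,
    }

# Hypothetical grades for ten traces graded by both the expert and the judge.
human_grades = [True, True, False, False, True, False, True, True, False, True]
judge_grades = [True, True, False, True, True, False, True, False, False, True]

report = calibrate(human_grades, judge_grades)
print(report)
```

Missed failures are worth reading one by one: each is a trace where the judge would have let through something your domain expert rejected.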