Chapter 16 · companion worksheet

Read-50-traces eval starter

The most valuable thing you can do before building automated evals costs nothing but time: read fifty real outputs by hand. The failure modes you find in these fifty will tell you what to measure, and they will be failure modes your dashboard is not showing you.

Instructions

Pull fifty real interactions from your AI feature — from production, not the demo. Real inputs, real outputs.
Read each one as a manager reviewing an employee's work. Do not ask "did it follow instructions." Ask: was this output actually good for the person who received it?
For each output that falls short, write down the specific reason — not a category. The actual reason. ("It invented a policy we don't have." "It answered only one of two questions." "Tone would have made the customer angrier.")
Who reads them: the domain expert who knows what good looks like for this work — the senior AP clerk, the experienced paralegal, the support lead. Not the engineer who built the system, and not the executive. The person who does the work.
After all fifty: tally the failure reasons. They will cluster into patterns. Those patterns are your eval criteria.
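
If the review log ends up in a spreadsheet or script rather than on paper, the tally step can be sketched in a few lines of Python. This is only an illustration: the row layout, the function names, and the sample rows are assumptions made up for this example, and the failure-mode labels are whatever the reviewer settled on after grouping the written reasons.

```python
from collections import Counter

# Hypothetical review log: (trace number, verdict, failure mode) per trace.
# These rows are invented for illustration; use your own fifty.
review_log = [
    (1, "pass", None),
    (2, "fail", "invented a policy"),
    (3, "fail", "answered only one of two questions"),
    (4, "pass", None),
    (5, "fail", "invented a policy"),
    (6, "fail", "tone would escalate the customer"),
]


def tally_failure_modes(rows):
    """Count how often each failure mode appears among failed traces."""
    return Counter(mode for _, verdict, mode in rows if verdict == "fail")


def example_traces(rows, mode):
    """Return the trace numbers that failed for the given failure mode."""
    return [n for n, verdict, m in rows if verdict == "fail" and m == mode]


tally = tally_failure_modes(review_log)
for mode, count in tally.most_common():
    first_example = example_traces(review_log, mode)[0]
    print(f"{count:>3}  {mode}  (e.g. trace #{first_example})")
```

The counting is the easy part. Deciding that two differently worded reasons are the same failure mode is still the domain expert's call.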

Trace review log

#	Input (brief summary)	What the AI did	Pass / Fail	Failure mode (if fail — be specific)
1
2
3
4
5
6
7
8
9
10
11–50	(continue on additional rows or a separate sheet)

Tally: top failure modes

After reading all fifty, list the failure modes that appeared most often. These are the candidates for your automated eval criteria; use the last column to decide which ones you will actually automate.

Failure mode (in plain language)	Count	Example trace #	Becomes an automated eval?
1.			☐ Yes ☐ No
2.			☐ Yes ☐ No
3.			☐ Yes ☐ No
4.			☐ Yes ☐ No
5.			☐ Yes ☐ No

Summary box

Metric	Your result
Total traces reviewed
Pass count
Fail count
Pass rate
Top failure mode
Second failure mode
Reviewed by (domain expert)
Date reviewed
Next step
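
The arithmetic behind the summary box is simple, but it is easy to get wrong by hand at the end of a long review. A minimal sketch, assuming each trace was graded as the string "pass" or "fail" (the function name and the sample numbers are made up for illustration):

```python
def summarize(verdicts):
    """Compute the summary-box counts from a list of "pass"/"fail" grades."""
    total = len(verdicts)
    passes = sum(1 for v in verdicts if v == "pass")
    fails = total - passes
    # Guard against an empty list so the function never divides by zero.
    pass_rate = passes / total if total else 0.0
    return {
        "total": total,
        "pass_count": passes,
        "fail_count": fails,
        "pass_rate": pass_rate,
    }


# Invented example: 38 of 50 traces judged good by the reviewer.
summary = summarize(["pass"] * 38 + ["fail"] * 12)
print(f"Pass rate: {summary['pass_rate']:.0%}")
```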

What comes after the fifty

☐ Use the failure modes above to write specific automated eval criteria — not abstract ("is this good") but concrete ("does it invent a policy not in our documentation").
☐ If you use an LLM-as-judge to grade at scale, calibrate the judge against your human grades on a sample before trusting its scores.
☐ Add evals to your deployment pipeline so every prompt change, model swap, or retrieval adjustment gets checked before it reaches a customer.
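
One way to approach the calibration step is to run the judge over the same traces the domain expert already graded and compare the two sets of verdicts. The sketch below is an assumption-laden illustration, not a prescribed method: the function name, the sample grades, and the 0.9 agreement threshold are all hypothetical, and the right threshold depends on how costly a missed failure is in your work.

```python
def calibrate_judge(human, judge):
    """Compare judge verdicts to human verdicts on the same traces."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty grade lists of equal length")
    pairs = list(zip(human, judge))
    agree = sum(h == j for h, j in pairs)
    # A false pass is the costly error: the judge approves an output
    # the human reviewer rejected.
    false_pass = sum(h == "fail" and j == "pass" for h, j in pairs)
    false_fail = sum(h == "pass" and j == "fail" for h, j in pairs)
    return {
        "agreement": agree / len(pairs),
        "false_pass": false_pass,
        "false_fail": false_fail,
    }


# Invented grades for ten traces; the human column comes from the review log.
human = ["pass", "fail", "fail", "pass", "pass", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "fail", "pass", "pass", "pass", "fail", "pass", "fail", "fail", "pass"]

MIN_AGREEMENT = 0.9  # hypothetical threshold; choose your own

result = calibrate_judge(human, judge)
trusted = result["agreement"] >= MIN_AGREEMENT
print(result, "trust judge:", trusted)
```

Looking at false passes separately from overall agreement matters because a judge can agree with the reviewer most of the time and still wave through the failures you care about most.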
