Chapter 16 · companion worksheet

Read-50-traces eval starter

The most valuable thing you can do before building automated evals costs nothing but time: read fifty real outputs by hand. The failure modes you find in these fifty will tell you what to measure — and they will be failure modes the dashboard was not showing you.

Instructions

  1. Pull fifty real interactions from your AI feature — from production, not the demo. Real inputs, real outputs.
  2. Read each one as a manager reviewing an employee's work. Do not ask "did it follow instructions." Ask: was this output actually good for the person who received it?
  3. For each output that falls short, write down the specific reason — not a category. The actual reason. ("It invented a policy we don't have." "It answered only one of two questions." "Tone would have made the customer angrier.")
  4. Choose the right reader: the domain expert who knows what good looks like for this work, such as the senior AP clerk, the experienced paralegal, or the support lead. Not the engineer who built the system, and not the executive. The person who does the work.
  5. After all fifty: tally the failure reasons. They will cluster into patterns. Those patterns are your eval criteria.
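Step 1 can be sketched in a few lines. The worksheet only says the traces must come from production; drawing them at random is an assumption added here to avoid cherry-picking. The `production_log` list and the `sample_traces` helper are hypothetical stand-ins for however your system exports its logs.

```python
import random

def sample_traces(traces, k=50, seed=0):
    """Return k traces chosen uniformly at random, without replacement."""
    if len(traces) <= k:
        # Fewer than k traces available: review all of them.
        return list(traces)
    rng = random.Random(seed)  # fixed seed so the same batch can be re-pulled later
    return rng.sample(traces, k)

# Hypothetical stand-in for a production export: 400 logged interactions.
production_log = [
    {"id": i, "input": f"request {i}", "output": f"reply {i}"}
    for i in range(400)
]

batch = sample_traces(production_log)
print(len(batch))
```

Fixing the seed is a design choice rather than a requirement: it lets a second reviewer pull exactly the same fifty.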

Trace review log

# | Input (brief summary) | What the AI did | Pass / Fail | Failure mode (if fail, be specific)
1 | | | |
2 | | | |
3 | | | |
4 | | | |
5 | | | |
6 | | | |
7 | | | |
8 | | | |
9 | | | |
10 | | | |
11–50 | (continue on additional rows or a separate sheet)

Tally: top failure modes

After reading all fifty, list the failure modes that appeared most often. These are the candidates for your automated eval criteria; mark each one Yes or No in the last column.

Failure mode (in plain language) | Count | Example trace # | Becomes an automated eval?
1. | | | ☐ Yes   ☐ No
2. | | | ☐ Yes   ☐ No
3. | | | ☐ Yes   ☐ No
4. | | | ☐ Yes   ☐ No
5. | | | ☐ Yes   ☐ No
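If the review log lives in a spreadsheet or script rather than on paper, the tally can be roughed out as below. This is a sketch under assumptions: `review_log` and `tally_failures` are hypothetical names, the rows are made-up placeholders that reuse the example reasons from the instructions, and matching on exact wording only groups reasons the reviewer phrased identically. Deciding which differently worded reasons belong to the same pattern is still a human judgment.

```python
from collections import Counter

# Hypothetical review log rows: (trace number, passed?, failure reason as written)
review_log = [
    (1, True, ""),
    (2, False, "Invented a policy we don't have"),
    (3, False, "Answered only one of two questions"),
    (4, False, "invented a policy we don't have"),
    (5, True, ""),
]

def tally_failures(rows):
    """Count failure reasons and remember the first trace that showed each one."""
    counts = Counter()
    first_example = {}
    for trace_no, passed, reason in rows:
        if passed:
            continue
        key = reason.strip().lower()  # light normalization only
        counts[key] += 1
        first_example.setdefault(key, trace_no)
    # Most frequent first: (failure mode, count, example trace #)
    return [(mode, n, first_example[mode]) for mode, n in counts.most_common()]

top_modes = tally_failures(review_log)
for mode, n, example in top_modes:
    print(f"{n:>3}  trace #{example:<3} {mode}")
```

The three values in each tuple map onto the first three columns of the tally table; the Yes/No column stays a decision for the reviewer.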

Summary box

Metric | Your result
Total traces reviewed |
Pass count |
Fail count |
Pass rate |
Top failure mode |
Second failure mode |
Reviewed by (domain expert) |
Date reviewed |
Next step |
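The first four rows of the summary box are simple arithmetic: pass rate is pass count divided by total traces reviewed. A minimal sketch, assuming each verdict is recorded as a boolean; the `summarize` helper is hypothetical and the 38/12 split is an illustrative placeholder, not a benchmark.

```python
def summarize(results):
    """results: list of booleans, True for a pass and False for a fail."""
    total = len(results)
    passes = sum(results)  # True counts as 1
    return {
        "total": total,
        "pass": passes,
        "fail": total - passes,
        "pass_rate": passes / total if total else 0.0,
    }

# Illustrative numbers only: 38 passes and 12 fails out of fifty.
summary = summarize([True] * 38 + [False] * 12)
print(f"Pass rate: {summary['pass_rate']:.0%}")
```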

What comes after the fifty

Want a second set of eyes on this in your firm? The no-sell promise applies — if it isn't a fit, I'll tell you in the first ten minutes.
