FIELD NOTE /003 · Feb 22, 2026 · 9 min read

Evals, not vibes: how we ship AI that does not embarrass us

A walkthrough of the eval harness we attach to every production AI workflow. What we test, what we ignore, what wakes us up at 3 a.m.

TL;DR — Key takeaways
4 bullets · 30-second read
  • The rule: every production AI workflow ships with an eval harness that runs on every deploy. No exceptions.
  • What we test: classifier accuracy, drafted-output fidelity, refusal correctness, and adversarial cases pulled from real client incidents.
  • How it plugs in: evals run pre-deploy, post-deploy, and on a nightly schedule. A failed eval blocks the release.
  • When to stop: when the marginal eval costs more to maintain than the bug it would catch. Most teams stop at ~80 cases per agent.
FIG 03 · The three eval buckets, 280 cases total: 50 golden (curated, easy), 200 regression (real client data), 30 adversarial (past 3 a.m. pages). The set grows with every incident. Runs on every PR; merges are blocked if accuracy drops 2pp; the full set finishes in ≤ 6 min. The adversarial set is the institutional memory.

One Tuesday afternoon in 2024, a system we had shipped two weeks earlier started misclassifying invoices. Quietly. The client did not notice for nine days. By then, three months of general-ledger data needed to be cleaned up. That is the day we started caring about evals more than vibes.

An eval is just a test that runs against your AI behavior. The phrase has gotten slightly loaded. The work has not. If you have written a unit test, you understand 80% of the job. The other 20% is figuring out what the right answer even is, which turns out to be the hard part.
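To make that concrete, here is a minimal sketch of a single eval written like a unit test. The classify_invoice function is a hypothetical stand-in for the model call under test (here a trivial keyword rule so the example runs end to end), and the GL code is made up.

```python
# A single eval case is just: an input, an expected answer, and an assertion.

def classify_invoice(text: str) -> str:
    # Stand-in for the real prompt / model call being evaluated.
    if "hosting" in text.lower():
        return "6100-software-and-hosting"
    return "6999-uncategorized"

def test_invoice_routed_to_correct_gl_bucket():
    invoice = "ACME Corp · cloud hosting, March · $1,240.00"
    assert classify_invoice(invoice) == "6100-software-and-hosting"

if __name__ == "__main__":
    test_invoice_routed_to_correct_gl_bucket()
    print("1 eval case passed")
```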

This note is the rough version of the eval pattern we now wire into every production AI workflow we ship. It is not academic. It is what we do because we got burned and never want to get burned the same way again.

The day we shipped without evals

We had a small invoice classifier running in production. Accuracy on our internal sample was 94%. The client tested it for a day and signed off. We turned it on and went to sleep.

What we missed was that the client had a small set of vendors who had recently switched their billing format. Our training set did not include the new format. The model started routing those invoices into the wrong GL bucket. Nobody noticed, because the dollar amounts looked plausible.

By the time we caught it, the GL needed 11 days of cleanup. The client was patient. We were not. We rebuilt the deploy pipeline that week, and we have shipped behind a regression eval ever since.

What we test

For each AI step, we want answers to four questions:

  • Does it beat a deterministic rule? If a regex would do the job at 90% accuracy, ship the regex. Models are expensive and noisy.
  • Does it beat the previous version of itself? Catching regressions is the entire point of CI for AI.
  • Does it handle the long tail? Not the cute examples. The 5% of cases that have caused real human pain in the past.
  • Does it fail loudly when it is uncertain? A model returning a confident wrong answer is worse than no model at all.

You will notice none of these are about absolute accuracy. We do not chase the 99.9% number. We chase "measurably better than the alternative, and behaves like an adult under stress."
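Here is a rough sketch of how the first two questions become assertions in a harness. The predictors (current_model, previous_model, regex_baseline) are illustrative placeholders, the two-point tolerance mirrors our merge gate, and the long-tail and fail-loudly questions are enforced by the adversarial bucket and the human-routing threshold further down, not by an accuracy number.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str]  # (input text, expected label)

def accuracy(predict: Callable[[str], str], cases: List[Case]) -> float:
    return sum(predict(text) == expected for text, expected in cases) / len(cases)

def check_worth_shipping(current_model, previous_model, regex_baseline,
                         cases: List[Case]) -> None:
    curr = accuracy(current_model, cases)
    # 1. Does it beat a deterministic rule? If not, ship the regex.
    assert curr > accuracy(regex_baseline, cases), "a regex does this job; ship the regex"
    # 2. Does it beat the previous version of itself? Allow a 2pp tolerance.
    assert curr >= accuracy(previous_model, cases) - 0.02, "regression vs previous version"
```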

Eval set anatomy · what we ship in every repo: 50 golden cases (curated, easy) · 200 regression cases (real client data) · 30 adversarial cases (the ones that bit us).

Anatomy of an eval set

We split every eval set into three buckets. The first is golden cases. About 50 examples a human could solve in five minutes. These exist to catch the "we broke the obvious thing" failure. Most teams stop here. Most teams ship things that break.

The second is regression cases. About 200 examples sampled from real client data, intentionally kept messy. These exist to catch the "model used to handle this and now does not" failure. We grow this set every week.

The third bucket is the most important: adversarial cases. About 30 examples, every one of which represents a past production incident or a known-hard input. The badge of honor for an AI engineer on our team is adding a new adversarial case after a 3 a.m. page.

The adversarial set is the institutional memory of every dumb mistake the system has ever made. It only grows.
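On disk this can be as simple as one JSONL file per bucket, with each case carrying a pointer back to the incident or ticket that created it. The layout and field names below are illustrative, not a prescribed format.

```python
import json
from pathlib import Path

# One file per bucket; the counts are targets, not hard limits.
BUCKETS = {
    "golden": "evals/golden.jsonl",           # ~50 curated, easy cases
    "regression": "evals/regression.jsonl",   # ~200 messy cases from real client data
    "adversarial": "evals/adversarial.jsonl", # ~30 cases, one per past incident
}

def load_bucket(path: str) -> list[dict]:
    # Each line: {"input": "...", "expected": "...", "source": "<incident or ticket that created the case>"}
    return [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]
```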

How evals plug into deploy

We run evals in three places. On every PR, on every deploy, and as a scheduled probe in production. The PR run is fast, runs the golden + regression sets, and blocks merge if accuracy drops by more than two percentage points. The deploy run is stricter, runs the adversarial set, and requires a human signoff if any adversarial case fails.
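A hedged sketch of that gate, reusing the bucket layout above. The paths, the baseline file, and the stand-in predictor are all illustrative; the real thing is whatever your CI runs, as long as a breach makes the job exit nonzero.

```python
import json
import sys
from pathlib import Path

MAX_ACCURACY_DROP = 0.02  # two percentage points

def predict(text: str) -> str:
    # Stand-in for the model call under test.
    return "6100-software-and-hosting" if "hosting" in text.lower() else "6999-uncategorized"

def run_bucket(path: str) -> float:
    cases = [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]
    return sum(predict(c["input"]) == c["expected"] for c in cases) / len(cases)

def main(stage: str) -> int:
    baseline = json.loads(Path("evals/baseline_accuracy.json").read_text())
    failures = []
    for bucket in ("golden", "regression"):
        acc = run_bucket(f"evals/{bucket}.jsonl")
        if acc < baseline[bucket] - MAX_ACCURACY_DROP:
            failures.append(f"{bucket}: {acc:.3f} vs baseline {baseline[bucket]:.3f}")
    if stage == "deploy" and run_bucket("evals/adversarial.jsonl") < 1.0:
        # Deploy run is stricter: any adversarial failure needs a human signoff.
        failures.append("adversarial: at least one past-incident case failed")
    for failure in failures:
        print(f"EVAL GATE FAILED -> {failure}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "pr"))
```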

The production probe runs every six hours. It samples a few hundred recent decisions, replays them through the current model, and alerts us if the answer distribution drifts. This is the cheapest piece of safety we ship and the one we trust the most.
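The probe itself can be small. A sketch under assumptions: pull a sample of recent served decisions, replay the inputs through the current model, and page someone if the label distributions disagree by more than a threshold. The distance measure and the 10-point threshold are illustrative, and fetch_recent_decisions / alert are hypothetical hooks into your own logging and paging stack.

```python
from collections import Counter

DRIFT_THRESHOLD = 0.10  # illustrative; tune against normal day-to-day variance

def total_variation(p: Counter, q: Counter) -> float:
    labels = set(p) | set(q)
    n_p, n_q = sum(p.values()) or 1, sum(q.values()) or 1
    return 0.5 * sum(abs(p[label] / n_p - q[label] / n_q) for label in labels)

def run_probe(fetch_recent_decisions, current_model, alert) -> None:
    decisions = fetch_recent_decisions(limit=300)        # [(input_text, label_served), ...]
    served = Counter(label for _, label in decisions)
    replayed = Counter(current_model(text) for text, _ in decisions)
    drift = total_variation(served, replayed)
    if drift > DRIFT_THRESHOLD:
        alert(f"answer distribution drifted by {drift:.2f} across {len(decisions)} recent decisions")
```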

◆ Builder note

Treat eval failures the way you would treat a failing unit test. If a teammate disabled a unit test to ship, you would say something. Same rule.

What wakes us up at 3 a.m.

The classic 3 a.m. failure is silent drift. The model still answers. The answers look plausible. They are quietly wrong. This is the failure that cost us 11 days of GL cleanup. It is also the one most teams never plan for, because their evals only run pre-deploy.

The probe in production catches this. So does our second-favorite mechanism: a confidence threshold below which the system routes the case to a human. We hate hearing that an AI workflow is "fully autonomous." It should not be. Every system we ship has a place for a human to step in when the model is unsure.
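A minimal sketch of that escape hatch, assuming the model call can hand back some confidence signal you trust (a log-prob, a calibrated score, a self-report); the names and the 0.8 floor are illustrative.

```python
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.80  # illustrative; tune per workflow

@dataclass
class Decision:
    label: str
    confidence: float
    needs_human: bool

def route(text: str, model) -> Decision:
    label, confidence = model(text)  # model returns (label, score in [0, 1])
    # Fail loudly: below the floor, queue for review instead of writing a
    # plausible-looking wrong answer into someone's general ledger.
    return Decision(label=label, confidence=confidence,
                    needs_human=confidence < CONFIDENCE_FLOOR)
```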

When to stop adding evals

Eval bloat is real. We have seen teams write so many evals that the test suite takes 40 minutes and nobody runs it. The fix is the same as for any test suite. Tier the evals. Keep the fast ones on every PR. Move the slow ones to nightly. Delete the ones that have not failed in six months.
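If the harness is pytest-based, tiering can be as light as custom markers, so the PR job runs the fast set and the nightly job runs the rest. The marker names below are our own convention, nothing standard.

```python
import pytest
# Register "pr" and "nightly" in pytest.ini so --strict-markers does not reject them.

@pytest.mark.pr        # fast tier: golden + regression, runs on every PR
def test_golden_and_regression_buckets():
    ...  # body elided: load the buckets and assert, as in the gate script above

@pytest.mark.nightly   # slow tier: full adversarial replay and anything expensive
def test_adversarial_bucket():
    ...

# PR job:      pytest -m pr       (blocks merge)
# Nightly job: pytest -m nightly  (pages on failure)
```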

The point of evals is not to feel safe. The point is to ship faster, with the safety baked in. If your eval suite is slowing your team down without catching anything, prune it. The good ones earn their place.


If you are shipping AI without an eval harness, the cheapest thing we can do for you is set one up. It is part of every Build engagement. Book a diagnostic if you want to talk through it.

Written by
Varun R.
Founder. Writing from inside the engagement, not from the sidelines.
