One Tuesday afternoon in 2024, a system we had shipped two weeks earlier started misclassifying invoices. Quietly. The client did not notice for nine days. By then, three months of general-ledger data needed to be cleaned up. That is the day we started caring about evals more than vibes.
An eval is just a test that runs against your AI behavior. The phrase has gotten slightly loaded. The work has not. If you have written a unit test, you understand 80% of the job. The other 20% is figuring out what the right answer even is, which turns out to be the hard part.
This note is the rough version of the eval pattern we now wire into every production AI workflow we ship. It is not academic. It is what we do because we got burned and never want to get burned the same way again.
The day we shipped without evals
We had a small invoice classifier running in production. Accuracy on our internal sample was 94%. The client tested it for a day and signed off. We turned it on and went to sleep.
What we missed was that the client had a small set of vendors who had recently switched their billing format. Our training set did not include the new format. The model started routing those invoices into the wrong GL bucket. Nobody noticed, because the dollar amounts looked plausible.
By the time we caught it, the GL needed 11 days of cleanup. The client was patient. We were not. We rebuilt the deploy pipeline that week, and we have shipped behind a regression eval ever since.
What we test
For each AI step, we want answers to four questions:
- Does it beat a deterministic rule? If a regex would do the job at 90% accuracy, ship the regex. Models are expensive and noisy.
- Does it beat the previous version of itself? Catching regressions is the entire point of CI for AI.
- Does it handle the long tail? Not the cute examples. The 5% of cases that have caused real human pain in the past.
- Does it fail loudly when it is uncertain? A model returning a confident wrong answer is worse than no model at all.
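The first two questions reduce to a gate you can put in code. Here is a minimal sketch, with hypothetical accuracy numbers measured on the same eval set; the function name and thresholds are ours, not a library's:

```python
def should_ship(model_acc: float, baseline_acc: float, previous_acc: float) -> bool:
    """Ship only if the model beats the deterministic rule
    and does not regress against the previous version."""
    beats_baseline = model_acc > baseline_acc
    no_regression = model_acc >= previous_acc
    return beats_baseline and no_regression

# If a regex hits 0.90 and the model hits 0.89, ship the regex.
print(should_ship(model_acc=0.89, baseline_acc=0.90, previous_acc=0.88))  # False
```

The long-tail and fail-loudly questions do not fit in a one-liner; they live in the eval set itself and in the routing logic, covered below.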
You will notice none of these are about absolute accuracy. We do not chase the 99.9% number. We chase "measurably better than the alternative, and behaves like an adult under stress."
Anatomy of an eval set
We split every eval set into three buckets. The first is golden cases. About 50 examples a human could solve in five minutes. These exist to catch the "we broke the obvious thing" failure. Most teams stop here. Most teams ship things that break.
The second is regression cases. About 200 examples sampled from real client data, intentionally kept messy. These exist to catch the "model used to handle this and now does not" failure. We grow this set every week.
The third bucket is the most important: adversarial cases. About 30 examples, every one of which represents a past production incident or a known-hard input. The badge of honor for an AI engineer on our team is adding a new adversarial case after a 3 a.m. page.
The adversarial set is the institutional memory of every dumb mistake the system has ever made. It only grows.
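The three buckets can be as simple as a dataclass. This is a sketch with invented case names, not our actual harness; the one rule encoded here is that the adversarial bucket only ever grows:

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    input_id: str
    expected: str
    note: str = ""  # for adversarial cases: the incident that produced it

@dataclass
class EvalSet:
    golden: list = field(default_factory=list)       # ~50 obvious cases
    regression: list = field(default_factory=list)   # ~200 messy real cases, grows weekly
    adversarial: list = field(default_factory=list)  # past incidents, never pruned

    def add_adversarial(self, case: EvalCase) -> None:
        # Append-only: this bucket is the institutional memory.
        self.adversarial.append(case)

evals = EvalSet()
evals.add_adversarial(
    EvalCase("INV-2024-0317", "GL-4200", note="new vendor billing format")
)
print(len(evals.adversarial))  # 1
```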
How evals plug into deploy
We run evals in three places. On every PR, on every deploy, and as a scheduled probe in production. The PR run is fast, runs the golden + regression sets, and blocks merge if accuracy drops by more than two percentage points. The deploy run is stricter, runs the adversarial set, and requires a human signoff if any adversarial case fails.
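The two gates are cheap to state in code. A sketch, using the thresholds from the paragraph above (a drop of more than two points blocks merge; any adversarial failure requires explicit signoff); the function names are illustrative:

```python
def pr_gate(new_acc: float, previous_acc: float) -> bool:
    """Merge is blocked if accuracy drops by more than 2 percentage points."""
    return (previous_acc - new_acc) <= 0.02

def deploy_gate(adversarial_failures: int, human_signoff: bool) -> bool:
    """Any failing adversarial case requires an explicit human signoff."""
    return adversarial_failures == 0 or human_signoff

print(pr_gate(new_acc=0.91, previous_acc=0.94))  # False: a 3-point drop blocks merge
```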
The production probe runs every six hours. It samples a few hundred recent decisions, replays them through the current model, and alerts us if the answer distribution drifts. This is the cheapest piece of safety we ship and the one we trust the most.
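One simple way to detect that kind of drift is to compare the answer distributions with total variation distance. This is a sketch, not our exact probe; the 0.1 alert threshold and the GL labels are illustrative:

```python
from collections import Counter

def answer_distribution(decisions):
    total = len(decisions)
    return {label: n / total for label, n in Counter(decisions).items()}

def drifted(past, replayed, threshold=0.1):
    """Alert when the total variation distance between the old answers
    and the replayed answers exceeds the threshold."""
    p, q = answer_distribution(past), answer_distribution(replayed)
    labels = set(p) | set(q)
    tvd = 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)
    return tvd > threshold

past = ["GL-4200"] * 90 + ["GL-5100"] * 10
replayed = ["GL-4200"] * 60 + ["GL-5100"] * 40
print(drifted(past, replayed))  # True: the GL-5100 share jumped from 10% to 40%
```

The point of a distribution check rather than an accuracy check is that it needs no labels, so it can run on fresh production traffic every six hours.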
Treat eval failures the way you would treat a failing unit test. If a teammate disabled a unit test to ship, you would say something. Same rule.
What wakes us up at 3 a.m.
The classic 3 a.m. failure is silent drift. The model still answers. The answers look plausible. They are quietly wrong. This is the failure that cost us 11 days of GL cleanup. It is also the one most teams never plan for, because their evals only run pre-deploy.
The probe in production catches this. So does our second-favorite mechanism: a confidence threshold below which the system routes the case to a human. We hate hearing that an AI workflow is "fully autonomous." It should not be. Every system we ship has a place for a human to step in when the model is unsure.
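The confidence routing fits in a few lines. A sketch, assuming the model exposes some confidence score; the 0.8 floor is illustrative and should be tuned per workflow:

```python
CONFIDENCE_FLOOR = 0.8  # illustrative; tune against your own error costs

def route(prediction: str, confidence: float):
    """Below the floor, the case goes to a human instead of auto-posting."""
    if confidence < CONFIDENCE_FLOOR:
        return ("human_review", prediction)
    return ("auto", prediction)

print(route("GL-4200", 0.62))  # ('human_review', 'GL-4200')
```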
When to stop adding evals
Eval bloat is real. We have seen teams write so many evals that the test suite takes 40 minutes and nobody runs it. The fix is the same as for any test suite. Tier the evals. Keep the fast ones on every PR. Move the slow ones to nightly. Delete the ones that have not failed in six months.
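Tiering can be as dumb as a tag on each eval and a filter per pipeline stage. A sketch with made-up eval names; the only idea it encodes is that slower stages run a superset of the faster ones:

```python
# Each eval is tagged with the cheapest stage that can afford it.
EVALS = [
    {"name": "golden_invoice_types", "tier": "pr"},
    {"name": "regression_vendor_formats", "tier": "pr"},
    {"name": "adversarial_2024_incidents", "tier": "deploy"},
    {"name": "full_ledger_replay", "tier": "nightly"},
]

TIERS_BY_STAGE = {
    "pr": {"pr"},
    "deploy": {"pr", "deploy"},
    "nightly": {"pr", "deploy", "nightly"},
}

def select(stage):
    return [e["name"] for e in EVALS if e["tier"] in TIERS_BY_STAGE[stage]]

print(select("pr"))  # only the fast ones run on every PR
```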
The point of evals is not to feel safe. The point is to ship faster, with the safety baked in. If your eval suite is slowing your team down without catching anything, prune it. The good ones earn their place.
If you are shipping AI without an eval harness, the cheapest thing we can do for you is set one up. It is part of every Build engagement. Book a diagnostic if you want to talk through it.