An AI Wrote 200% Where It Meant 20%. The Bound Caught It.
A model is reading a structured document. It extracts a value into a config field that downstream code trusts as ground truth. The value should be 20%. Some artifact of the source makes the model output 200%: a reformatted table, a misread column header, a missing decimal. The model is just as confident as when it was right. The system applies the 200% wherever that value is used, and runs on the wrong number until someone notices it on a dashboard.
I built a system where this could happen. Then I added a layer that turned it from “a year before anyone notices” into “rejected at extraction time, flagged for human review, never reaches a live decision.”
The layer is small. It does not depend on the model. That is the point.
The trap
The architecture is reasonable. Document goes in. AI extracts structured values. Values go into a config table. Downstream code reads the config table to make decisions.
The failure mode is also reasonable. The model is good almost all of the time. Almost all of the values are right. The wrong ones are confidently wrong, look plausible, and are indistinguishable from the right ones at the moment of writing.
The first instinct is to add a confidence threshold. Reject anything below 0.9. Ship the rest. I built this version first. It worked for a while.
Why confidence does not save you
The model’s confidence is calibrated to plausibility, not correctness. When the model misreads a value as ten times its real magnitude because the source layout changed, it is not less confident: both numbers look like plausible values for the field. Confidence catches outputs the model itself is uncertain about: garbled extractions, malformed JSON, refusals. It does not catch outputs the model feels good about that happen to be wrong.
The other thing a confidence threshold does not survive: model upgrades. The better model you swap in next quarter will extract the same wrong number from the same source, and it will be at least as confident about it as today’s model. Confidence rises as the model improves, but it is not a measure of correctness for any particular field. It is a measure of self-consistency.
By the time I noticed this, the thresholding system was letting through almost everything, including the plausible wrong ones. Confidence had stopped doing the job I built it for.
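To make the failure concrete, here is a minimal sketch of a threshold-only gate. It is an illustration, not the production code: the `Extraction` type, the `passesConfidence` helper, and the 0.97 score are hypothetical, and only the 0.9 cutoff comes from the text above.

```go
package main

import "fmt"

// Extraction is a hypothetical shape for one model output.
type Extraction struct {
	Field      string
	Value      float64
	Confidence float64 // model-reported score in [0, 1]
}

// passesConfidence is the threshold-only gate.
// It looks at the score and never at the value itself.
func passesConfidence(e Extraction, threshold float64) bool {
	return e.Confidence >= threshold
}

func main() {
	// Same field, same (assumed) confidence, one value ten times too large.
	right := Extraction{Field: "discount_percent", Value: 20, Confidence: 0.97}
	wrong := Extraction{Field: "discount_percent", Value: 200, Confidence: 0.97}

	for _, e := range []Extraction{right, wrong} {
		fmt.Printf("%s=%v accepted=%v\n", e.Field, e.Value, passesConfidence(e, 0.9))
	}
}
```

Both extractions are accepted, because nothing in the gate ever inspects the number.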
The pattern
Every extracted field gets a bound. The bound is a numeric range, an allowlist, or a regex, whichever fits the field. The bound is set by domain knowledge, not by the model.
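package extract

import "fmt"

// Assumed definitions so the snippet compiles. Bound is left empty here
// because each kind of bound checks a different value type.
type Bound interface{}
type NumericRange struct{ Min, Max float64 }
type Allowlist []string

// bounds holds one entry per extracted field, set from domain knowledge.
var bounds = map[string]Bound{
	"rate_limit_per_min":  NumericRange{Min: 1, Max: 10000},
	"session_timeout_min": NumericRange{Min: 1, Max: 1440},
	"discount_percent":    NumericRange{Min: 0, Max: 100},
	"retention_days":      NumericRange{Min: 0, Max: 3650},
	"currency":            Allowlist{"USD", "EUR", "GBP", "JPY"},
}

// Check rejects anything outside [Min, Max]. The negated form also rejects
// NaN, which would slip through a plain "v < Min || v > Max" test.
func (r NumericRange) Check(v float64) error {
	if !(v >= r.Min && v <= r.Max) {
		return fmt.Errorf("value %v outside bound [%v, %v]", v, r.Min, r.Max)
	}
	return nil
}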
The bound check takes a few hundred microseconds. It runs on every extracted value, regardless of model confidence. If the bound rejects a value, that value does not enter the config table. It goes to a separate review queue, along with the source-document fragment and the bound it failed.
A 200% discount fails discount_percent immediately. The system never sees it as a valid config value.
The workflow
Three states for an extracted value:
- Proposed: extracted, bound-checked, passed the bound. Visible in the system but not used by any decision.
- Active: a human reviewed it, said “yes, this matches the source,” and promoted it. Downstream code reads from here.
- Quarantined: bound rejected, or human rejected. Flagged for re-extraction or manual entry. Never read by downstream code.
Values do not become “active” automatically, even if the model is confident and the bound passes. The bound is a necessary condition for promotion to “proposed,” not a sufficient condition for “active.” Activation requires a human.
This is heavier than it sounds. The bound catches the catastrophic case (off-by-10x). The human catches the subtle case: a value that is inside the bound but still wrong, like a four-week reference period extracted as a four-day one. Both numbers pass the bound; only one is correct. The bound and the human do different jobs. You need both.
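The three states and their allowed transitions can be sketched as a small state machine. This is a simplified illustration under stated assumptions: the `State`, `Value`, `Ingest`, `Promote`, and `Reject` names are hypothetical, the bound is reduced to a numeric range, and `reference_period_days` with a 1 to 365 range is an invented field for the four-week versus four-day case.

```go
package main

import (
	"errors"
	"fmt"
)

type State int

const (
	Proposed State = iota
	Active
	Quarantined
)

func (s State) String() string {
	switch s {
	case Proposed:
		return "proposed"
	case Active:
		return "active"
	case Quarantined:
		return "quarantined"
	}
	return "unknown"
}

// Value is one extracted field with its review state.
type Value struct {
	Field      string
	Raw        float64
	State      State
	ApprovedBy string
}

// Ingest applies the bound. It can yield Proposed or Quarantined, never Active.
func Ingest(field string, raw, min, max float64) Value {
	state := Quarantined
	if raw >= min && raw <= max {
		state = Proposed
	}
	return Value{Field: field, Raw: raw, State: state}
}

// Promote is the only path to Active, and it requires a named human reviewer.
func Promote(v Value, reviewer string) (Value, error) {
	if reviewer == "" {
		return v, errors.New("activation requires a human reviewer")
	}
	if v.State != Proposed {
		return v, fmt.Errorf("cannot promote a %s value", v.State)
	}
	v.State = Active
	v.ApprovedBy = reviewer
	return v, nil
}

// Reject records a human rejection by moving the value to Quarantined.
func Reject(v Value) Value {
	v.State = Quarantined
	return v
}

func main() {
	// A period that should be 28 days, misread as 4. It is inside the bound.
	v := Ingest("reference_period_days", 4, 1, 365)
	fmt.Println(v.Field, "after bound check:", v.State)

	// The reviewer compares it with the source and rejects it.
	v = Reject(v)
	fmt.Println(v.Field, "after review:", v.State)

	if _, err := Promote(v, "reviewer@example.com"); err != nil {
		fmt.Println("promotion refused:", err)
	}
}
```

The design choice worth noticing is that no function takes a confidence score, and no code path reaches `Active` without a reviewer name.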
Pairs with the self-healing parser pattern
I wrote previously about self-healing parsers: the AI patches the parser when the source layout drifts. That post was about text shape.
This post is about value shape. When the parser is right but the value is wrong, you need a separate layer. Bounds are that layer.
The two patterns compose. Self-healing keeps your extraction in sync with the source’s structure. Bounded extraction keeps your values inside the domain’s reality. Together, you have a pipeline where the AI does extraction work but its output never reaches a live decision without passing two independent guards.
What doesn’t work
- Bounds drift. The domain changes. A legal max changes. The currency list adds a new entry. Bounds need maintenance. Tight bounds reject genuine values; loose bounds let bad values through.
- Not every field has a natural bound. Free-text justifications, names, qualitative reasons: these have no numeric range. Bound them by type, length, and allowlisted patterns. Those bounds are looser, but they still catch the worst cases.
- Bottleneck risk. If 80% of fields need human approval, you have moved the problem from “the AI was wrong” to “the human is the bottleneck.” Mitigate with an unchanged-value fast path (a re-extracted value that matches the current active value skips review) and bulk approval for low-risk fields.
- Tight bounds reject outliers. A region with truly unusual rules looks the same as a hallucination. The system tells you “this is out of bound,” but sometimes the right response is to widen the bound, not reject the value. That is a workflow you have to design: “raise the bound” needs to be a real, audited operation.
- The proposed-to-active workflow is a product surface. Someone has to design the review UI, the diff view, the source-fragment viewer, the audit log. None of this is free. The bound check is twenty lines; the workflow around it is two months.
- A confidence-based system handles cold-start better. The first time the AI sees a new field type, you have no bound calibrated for the domain. You either ship without a bound (loose) or block until you have one (cold-start friction). I went with “block until calibrated, with a fast manual path to set the initial bound.”
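Two of the mitigations in the list above, the unchanged-value fast path and an audited "raise the bound" operation, can be sketched briefly. Everything here is an assumption made for illustration: the `needsReview` helper, the `BoundStore` and `BoundChange` types, the rule that a raised bound must contain the old one, and the example reviewer and reason. A real system would also persist the audit log rather than hold it in memory.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type NumericRange struct{ Min, Max float64 }

// needsReview is the unchanged-value fast path: a re-extracted value that
// equals the current active value skips human review.
func needsReview(active map[string]float64, field string, extracted float64) bool {
	current, ok := active[field]
	return !ok || current != extracted
}

// BoundChange is one audit-log entry for a widened bound.
type BoundChange struct {
	Field    string
	Old, New NumericRange
	Author   string
	Reason   string
	At       time.Time
}

// BoundStore keeps the current bounds and a history of every change.
type BoundStore struct {
	Bounds map[string]NumericRange
	Audit  []BoundChange
}

// RaiseBound widens a bound and records who did it and why.
// It refuses anonymous changes and changes that would narrow the range.
func (s *BoundStore) RaiseBound(field string, next NumericRange, author, reason string) error {
	old, ok := s.Bounds[field]
	if !ok {
		return fmt.Errorf("no bound configured for %q", field)
	}
	if author == "" || reason == "" {
		return errors.New("widening a bound requires an author and a reason")
	}
	if next.Min > old.Min || next.Max < old.Max {
		return errors.New("new bound must contain the old one")
	}
	s.Bounds[field] = next
	s.Audit = append(s.Audit, BoundChange{
		Field: field, Old: old, New: next,
		Author: author, Reason: reason, At: time.Now(),
	})
	return nil
}

func main() {
	active := map[string]float64{"retention_days": 90}
	fmt.Println("same value needs review:", needsReview(active, "retention_days", 90))
	fmt.Println("new value needs review:", needsReview(active, "retention_days", 120))

	store := &BoundStore{Bounds: map[string]NumericRange{"retention_days": {Min: 0, Max: 3650}}}
	err := store.RaiseBound("retention_days", NumericRange{Min: 0, Max: 7300},
		"reviewer@example.com", "hypothetical region with a 20-year retention rule")
	fmt.Println("widened:", err == nil, "audit entries:", len(store.Audit))
}
```

The fast path compares floats for exact equality, which is reasonable when both values come from the same extraction format, but it is worth revisiting if values are normalized or rounded along the way.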
The reframe
Confidence is what the model thinks of itself. Bounds are what your system thinks of the model. They are independent signals. Use both.
The version of this system that thresholded on confidence felt smarter. It also shipped wrong values, because the model’s confidence rose alongside its capability and the wrong-but-plausible outputs went through with the right ones. The version that bounds on the domain feels dumber: it rejects a model output because a number is too big, with no nuance. That dumbness is exactly what makes it useful. A bound does not know what the model said; it knows what the domain allows. That is the only thing you can trust when the model is wrong in a way that looks right.