Don't Read the PDF. Write the Parser.

A robot repairing itself in a workshop — self-healing parsers

Every morning a folder fills up with hospital PDFs. Bed maps, ER triage counts, admission flows, occupancy indices. Different units, different layouts, different quirks. Classic document-intelligence territory.

The default 2026 move is one API call: hand the PDF to a vision-capable model, ask for JSON, trust the answer. Fine for a weekend project. For a hospital director’s dashboard, I couldn’t get past four questions:

  1. Can I point at where a number came from? If the dashboard shows “43 occupied beds”, I want to highlight the three characters in the PDF that produced it. Vision models give you a number, not the source.
  2. What does it cost to reprocess a year of reports? Ten thousand PDFs through a vision model is real money. Ten thousand PDFs through Go regex is free.
  3. How do I write a regression test? “The model got it right last time” is not a test.
  4. What happens the morning the layout changes? I want a loud failure, not a silent one I discover three weeks later as “the numbers feel off.”

So I flipped it. Use AI to write the parser, not to be the parser. The parser runs on every document, for free. The AI runs once — when the parser breaks.

The architecture, in one picture

Pure Go at runtime. The extractor binary imports no LLM SDKs — grep -r "anthropic\|openai" extractor/ comes back empty. The AI lives one layer up, on my laptop, inside Claude Code. It only runs when a real parser fails on a real document. Zero tokens per document processed.

How much of the page did we actually read?

The first thing I needed was a number that tells me “the parser is still in sync with the document” without knowing the right answer in advance. I settled on character coverage: every successful regex match records its character range in the source text; after extraction, I merge those ranges and compare the covered character count to the total non-whitespace characters in the document.

// CalculateCoverage returns the percentage of non-whitespace
// characters in the source document matched by some regex.
func (a *Auditor) CalculateCoverage() float64 {
    merged := mergeOverlappingMatches(a.matches)
    var totalMeaningful, matchedMeaningful int
    for i, r := range a.SourceText {
        if unicode.IsSpace(r) {
            continue
        }
        totalMeaningful++
        if insideAnyMatch(i, merged) {
            matchedMeaningful++
        }
    }
    if totalMeaningful == 0 {
        return 0 // guard: empty document, nothing to match
    }
    return float64(matchedMeaningful) / float64(totalMeaningful) * 100
}

No ML. No embeddings. Just: how much of the text did we actually parse? If I matched 96% of the non-whitespace characters, the adapter is in sync with the layout. If coverage drops to 62%, something changed. I don’t need to know what changed to act on it — I just need to know.
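
A self-contained sketch of the idea: the names mergeOverlappingMatches and insideAnyMatch come from the snippet above, but their bodies, the span type, and the free-standing coverage function are my assumptions about how such bookkeeping could look:

```go
package main

import (
    "fmt"
    "sort"
    "unicode"
)

// span is a half-open byte range [Start, End) matched by some regex.
type span struct{ Start, End int }

// mergeOverlappingMatches sorts the spans and merges any that overlap
// or touch, so each character is counted at most once.
func mergeOverlappingMatches(matches []span) []span {
    if len(matches) == 0 {
        return nil
    }
    sorted := append([]span(nil), matches...)
    sort.Slice(sorted, func(i, j int) bool { return sorted[i].Start < sorted[j].Start })
    merged := []span{sorted[0]}
    for _, m := range sorted[1:] {
        last := &merged[len(merged)-1]
        if m.Start <= last.End {
            if m.End > last.End {
                last.End = m.End
            }
            continue
        }
        merged = append(merged, m)
    }
    return merged
}

// insideAnyMatch reports whether byte offset i falls inside a merged span.
func insideAnyMatch(i int, merged []span) bool {
    for _, m := range merged {
        if i >= m.Start && i < m.End {
            return true
        }
    }
    return false
}

// coverage is a free-standing version of CalculateCoverage.
func coverage(source string, matches []span) float64 {
    merged := mergeOverlappingMatches(matches)
    var total, matched int
    for i, r := range source {
        if unicode.IsSpace(r) {
            continue
        }
        total++
        if insideAnyMatch(i, merged) {
            matched++
        }
    }
    if total == 0 {
        return 0
    }
    return float64(matched) / float64(total) * 100
}

func main() {
    src := "Beds: 43 occupied"
    // Matches cover "Beds:" and "43" but miss "occupied".
    fmt.Printf("%.1f%%\n", coverage(src, []span{{0, 5}, {6, 8}})) // prints 46.7%
}
```

In a real extractor the spans would come straight from regexp's FindAllStringIndex, which already returns byte-offset pairs in exactly this shape.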

Do the numbers add up?

Coverage tells me how much I parsed. It doesn’t tell me whether what I parsed makes sense. That’s what domain checks are for.

// Occupied + Free + Reserved + Blocked must equal Total (±2 beds of jitter).
sum := occupied + free + reserved + blocked
if diff := sum - total; diff > 2 || diff < -2 {
    return fmt.Errorf("%s: parts (%d) don't match total (%d)",
        unit.name, sum, total)
}

A vision model will give you 42 + 7 + 2 + 0 = 51 against a total of 48, and you won’t notice until the director asks. A deterministic parser with a semantic check refuses the row and raises a warning. Two beds of slack covers the cases where a transfer is in progress and briefly double-counts.
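
Pulled out as a standalone function (the name checkBedSum and the signature are mine, not the project's), the check is easy to unit-test:

```go
package main

import "fmt"

// checkBedSum verifies that the bed-state parts sum to the reported
// total, allowing ±2 beds of in-transfer jitter.
func checkBedSum(unitName string, occupied, free, reserved, blocked, total int) error {
    sum := occupied + free + reserved + blocked
    if diff := sum - total; diff > 2 || diff < -2 {
        return fmt.Errorf("%s: parts (%d) don't match total (%d)",
            unitName, sum, total)
    }
    return nil
}

func main() {
    fmt.Println(checkBedSum("ICU", 42, 7, 2, 0, 48)) // off by 3: error
    fmt.Println(checkBedSum("ICU", 42, 5, 1, 0, 48)) // exact: <nil>
}
```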

Where bad data goes

If the confidence score (derived from coverage) drops below 0.80, or any semantic check fails, the extractor does not write to the production table. The report goes to quarantine instead.

if metadata.ConfidenceScore < 0.80 || len(metadata.Warnings) > 0 {
    log.Info().
        Float64("confidence", metadata.ConfidenceScore).
        Int("warnings", len(metadata.Warnings)).
        Msg("Quarantining report")
    // insert into quarantine table + move email to Quarantine folder
}
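
Keeping the gate itself as a pure function gives the 0.80 threshold exactly one home and makes it trivially testable. A minimal sketch; the constant and the signature are mine, shaped by the snippet above:

```go
package main

import "fmt"

const minConfidence = 0.80

// shouldQuarantine mirrors the runtime gate: low coverage-derived
// confidence or any semantic warning keeps a report out of production.
func shouldQuarantine(confidence float64, warnings []string) bool {
    return confidence < minConfidence || len(warnings) > 0
}

func main() {
    fmt.Println(shouldQuarantine(0.96, nil))                      // false: clean
    fmt.Println(shouldQuarantine(0.73, nil))                      // true: layout drift
    fmt.Println(shouldQuarantine(0.96, []string{"sum mismatch"})) // true: bad numbers
}
```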

Quarantine is a Postgres table plus a dedicated IMAP folder:

CREATE TABLE quarantine (
    id               SERIAL PRIMARY KEY,
    report_type      TEXT NOT NULL,
    source_file      TEXT NOT NULL,
    extracted_text   TEXT,
    confidence_score REAL,
    warnings         TEXT,
    status           TEXT DEFAULT 'pending',
    email_uid        BIGINT,
    email_folder     TEXT
);

Bad data has somewhere to go. The dashboard stays consistent with the reports I do trust. I have a clean list of what needs attention, and an exact copy of what the extractor saw — so I can reproduce the failure on my laptop every time.

The /heal command

This is the AI part, and it’s smaller than people expect. In its simplest form, no endpoint, no webhook, no scheduled model call. Just a slash command in Claude Code that runs a fixed checklist.

# /heal [ID]

1. Gather context
   - `make inspect-quarantine ID=[ID]` — pull raw text, report type, warnings
   - Locate the adapter in `extractor/adapters/`

2. Repair
   - Edit only the adapter
   - Do not touch the data structs

3. Dry-run
   - `make test-quarantine ID=[ID]`
   - Confidence must be >= 0.95, no warnings

4. Regress
   - `make test` must be green

5. Freeze the fixture
   - `make sanitize-report ID=[ID]`
   - Save to `testdata/<report_type>/<date>.txt`
   - Add a test that asserts the exact extracted struct

6. Stop and ask before committing

I type /heal 42. The AI walks the list. It reads the quarantined text, finds the adapter, proposes a change, runs the dry-run, runs the full test suite, writes a fixture, adds a test, and stops. I read the diff. If it’s obvious, I merge.

Most of the time, the fix is one line — a new alias for a column header:

configs := []fieldConfig{
    {[]string{"Total Beds", "Beds Total", "All Beds"}, setTotalBeds},
    // ...
}

The hospital renamed Total Beds to All Beds one Tuesday. Coverage dropped to 73%. The AI appended the new label, generated a fixture, wrote a test, opened a diff. Ten lines. Done. That report has come through clean every day since.
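
For context, here is one way an alias table like that can drive extraction. The article only shows the literal above, so the setter signature and the extract loop are my assumptions:

```go
package main

import (
    "fmt"
    "regexp"
    "strconv"
)

// BedMap holds the extracted fields; setter functions let one generic
// loop populate any of them.
type BedMap struct{ TotalBeds int }

func setTotalBeds(b *BedMap, v int) { b.TotalBeds = v }

// fieldConfig pairs every label a field may appear under with its setter.
type fieldConfig struct {
    labels []string
    set    func(*BedMap, int)
}

// extract tries each alias of each field as "<label>: <number>" until one hits.
func extract(text string, configs []fieldConfig) (BedMap, error) {
    var out BedMap
    for _, c := range configs {
        found := false
        for _, label := range c.labels {
            re := regexp.MustCompile(regexp.QuoteMeta(label) + `\s*:?\s*(\d+)`)
            if m := re.FindStringSubmatch(text); m != nil {
                v, _ := strconv.Atoi(m[1])
                c.set(&out, v)
                found = true
                break
            }
        }
        if !found {
            return out, fmt.Errorf("no known label for field %v", c.labels)
        }
    }
    return out, nil
}

func main() {
    configs := []fieldConfig{
        {[]string{"Total Beds", "Beds Total", "All Beds"}, setTotalBeds},
    }
    bm, err := extract("All Beds: 48", configs)
    fmt.Println(bm.TotalBeds, err) // 48 <nil>
}
```

Under this shape, a one-line heal really is one line: append a string to labels.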

One full morning, narrated

  • 07:02 — a bed-map report arrives. The hospital added a new unit called ICU-Paediatrics. Coverage: 71%. Quarantined.
  • 09:15 — I open Claude Code, type /heal 42. The AI runs make inspect-quarantine ID=42, fetches the raw text.
  • 09:16 — it reads bed-map.go, notices the unit-section regex stops at a hardcoded list of names, adds the new unit, broadens the stop-marker.
  • 09:17 — make test-quarantine ID=42 → 97.2% coverage. No warnings.
  • 09:17 — make test → all existing fixtures pass.
  • 09:18 — make sanitize-report ID=42 writes a clean fixture. A test is added that asserts the exact extracted struct.
  • 09:20 — I read the diff. Ten lines, no surprises. Merge.
  • next morning, 07:02 — the ICU-Paediatrics report comes through clean. No quarantine. No AI call. No cost.

Model time spent: about 90 seconds, on my laptop. Tokens spent in production: zero. Tests added: one, permanent.

The cost picture, over a year

  • AI reading every PDF. One call per document, forever. A hundred documents a day at, say, three cents each is on the order of a thousand dollars a year, per hospital, per report type. Reprocessing history is another bill.
  • AI fixing the parser. One parser written once. Maybe a dozen one-line heal PRs a year. Reprocessing is a make backfill target and costs nothing.

Cost isn’t only money. I can explain every number on the dashboard. I can replay any day. I can point at the exact regex that produced a field. I can write a test for a fix in thirty seconds.

The one thing vision buys you is a faster cold start on a brand-new report type. I paid that cost once per type, by hand with the AI as a pair. I haven’t paid it again.

The ingredients, if you want to copy this

You don’t need embeddings, a vector database, or a fine-tuned model. You need:

  • A clear contract. A struct with required fields and a Validate() method. Without this, “the parser works” has no definition.
  • A coverage signal. Something that tells you how much of the document you actually parsed, without needing to know the right answer. Character-range coverage is the cheapest version. It catches layout drift before the first wrong number ships.
  • Semantic checks. Parts equal total. Triage levels sum to total patients. Timestamps are monotonic. Domain invariants catch the cases where coverage looks fine but the values are nonsense.
  • A quarantine path. A Postgres table, an S3 bucket, an IMAP folder — somewhere bad data goes that is not the production table.
  • A dry-run verb. make test-quarantine ID=x runs the proposed fix without writing anything. This is the difference between “I think it works” and “I measured that it works.”
  • A fixed healing checklist. A slash command, a CLAUDE.md section, a script — it doesn’t matter which. What matters is that the AI follows the same steps every time, and the last step is “a human reviews the diff.”
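
The first ingredient, made concrete. A minimal sketch with hypothetical field names; Validate covers the structural requirements, and domain invariants like the parts-vs-total rule slot in alongside them:

```go
package main

import (
    "errors"
    "fmt"
)

// UnitBeds is the output contract for one unit of a bed-map report.
// "The parser works" means: every row produced passes Validate.
type UnitBeds struct {
    Unit     string
    Occupied int
    Free     int
    Reserved int
    Blocked  int
    Total    int
}

// Validate enforces the structural requirements of the contract.
func (u UnitBeds) Validate() error {
    if u.Unit == "" {
        return errors.New("missing unit name")
    }
    for _, n := range []int{u.Occupied, u.Free, u.Reserved, u.Blocked, u.Total} {
        if n < 0 {
            return fmt.Errorf("%s: negative bed count", u.Unit)
        }
    }
    return nil
}

func main() {
    ok := UnitBeds{Unit: "ICU", Occupied: 42, Free: 5, Reserved: 1, Total: 48}
    bad := UnitBeds{Occupied: 1, Total: 1}
    fmt.Println(ok.Validate(), bad.Validate())
}
```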

The whole healing loop in this project is about four hundred lines of Go and two markdown files.

What doesn’t work

  • First parsers still have a cold-start cost. You write the first one by hand, with the AI as a pair, per report type. Vision models skip that step. If you have hundreds of one-off document types, this approach is a bad fit.
  • Coverage is a proxy, not truth. A parser can match 99% of a document and still read one number wrong. Coverage catches structural drift, not field-level mistakes. Semantic checks catch those — which means your domain model needs real invariants, not just types.
  • Not every AI fix is a good fix. The AI sometimes broadens a regex to cover an edge case, and the broader regex then over-matches on a different report. The test suite catches most of these, not all. I still read every diff.
  • The checklist is the product. The moment you let the model skip the dry-run, skip make test, or skip the fixture step, the whole loop collapses. The value isn’t the model — it’s the checklist.
  • Don’t automate the merge. The first bad fix that goes into master breaks the whole approach.

One step further

The remaining friction is the /heal trigger itself. The quarantine INSERT is already an event — nothing stops you from firing a webhook that kicks off a CI job, runs the same checklist, and opens a draft PR automatically. A PR shows up in your queue instead. The rule stays the same: don’t automate the merge.

The pattern generalizes past PDFs. Anywhere you have a stable output contract and inputs that keep changing — CSV imports, web scrapers, log parsers, vendor API adapters — the same ingredients apply: a contract, a coverage signal, semantic checks, a quarantine, a dry-run, a fixed checklist. The AI sits in the development loop, writing and patching the thing, instead of in the hot path pretending to be the thing.

The side effect I like most: when the model does make a mistake — and it will — the damage is one pull request, not a year of corrupted data you didn’t catch.