Specialist Agents: Looking at Every Page with Different Eyes

This is Part 5 of my series on AI-assisted development. Part 4 covers how /qa-run works. This post goes deeper into the specialist agent model — how I get the AI to look at the same page from 7 different perspectives and turn findings into Linear issues.

The insight that changed everything

My first version of QA was one big prompt: “check this page for issues.” It was terrible. The AI would fixate on one thing — usually functional correctness — and miss everything else. Security, UX, performance? Invisible.

The breakthrough came from thinking about how real teams work. In a bigger company, you’d have a tester, a security person, a UX designer — each looking at the same build with different concerns. I realized I could simulate that by splitting one big QA prompt into separate specialist agents, each with its own narrow focus and checklist. They all run in parallel on the same page. The result is that each page gets evaluated 5-7 times, each time from a different angle.

This was the single biggest quality improvement in the whole workflow. Not because any individual specialist is brilliant, but because multiple focused perspectives catch things that one broad pass never will.

The 7 specialists (and why each one exists)

Each specialist is just a markdown file in qa/specialists/. I didn’t start with seven — I started with two (QA and security) and added the rest as I kept finding categories of issues I was missing. Here’s what I ended up with and what taught me to add each one.

QA (Functional Correctness)

File: qa/specialists/qa.md

This is the most important one. It checks whether things actually work:

  • Does the page load without errors?
  • Does the URL match the expected route?
  • Do protected routes redirect unauthorized users?
  • Do lists show the right data for this role?
  • Do status transitions follow the expected workflow?
  • Are empty states handled (not just blank areas)?

Its severity scale: BLOCKER (page crashes, data loss), BUG (wrong data, broken interaction), WARNING (minor inconsistency), OK.

This is the specialist that stops journeys. A QA BLOCKER means “this page is broken and the journey can’t continue.”

UX (Usability)

File: qa/specialists/ux.md

This one catches the stuff that works but confuses people:

  • Can you tell where you are? (active nav, breadcrumbs, page titles)
  • Are primary actions discoverable without instructions?
  • Does the system give feedback when you do something? (loading states, success/error messages)
  • Do destructive actions require confirmation?
  • Are related things grouped together?
  • Is the terminology consistent?

Its severity scale: UX-BLOCKER (user can’t complete the task), UX-ISSUE (task completable but confusing), UX-SUGGESTION (could be better).

I didn’t expect this one to be useful — can an AI really check UX? Turns out it catches real stuff. Things like: “the review queue has no empty state — if there are no pending reviews, the page is just blank” or “the confirm and cancel buttons are the same color — nothing distinguishes the destructive action.” I’d notice these eventually, but the specialist finds them on every page, every time.

UI (Visual Quality)

File: qa/specialists/ui.md

Layout and visual consistency:

  • Is the color palette coherent?
  • Is typography hierarchy clear (headings vs body vs labels)?
  • Is spacing consistent?
  • Are interactive elements visually distinct? (hover states, disabled states)
  • Is text contrast sufficient? (WCAG AA)
  • Do empty states have proper visual treatment?

It looks at both the page snapshot (what elements are on the page) and the screenshot (what it actually looks like).

Honestly, this is the weakest specialist. AI isn’t great at judging how things look from screenshots. But it does catch obvious stuff — like a table with no visible header, or a modal that’s missing its background overlay. I keep it because it costs nothing extra to run it alongside the others.

Security

File: qa/specialists/security.md

Auth, authorization, and data exposure:

  • Does login reject invalid credentials?
  • Is the auth token stored properly?
  • Does logout clear the session?
  • Do role-restricted routes actually block unauthorized access?
  • Do API responses contain data the user shouldn’t see?
  • Are error messages exposing internal details? (stack traces, SQL, file paths)
  • Are form inputs sanitized?
  • Are API calls using proper auth headers?

This one is especially useful for my app because I have 4 roles (Manager, Admin, Reviewer, User) with different permissions. Every journey logs in as multiple roles, and this specialist checks at every step that users can’t see stuff they shouldn’t.

A real example: the QA specialist reported [OK] because the page loaded fine, but the security specialist flagged [SEC-WARNING] because the API response for the user list included email addresses for all users, not just the ones the current role should see. The page was only showing the right ones — but the full data was there in the network response. A normal functional test would never catch that.
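That kind of over-exposure check is mechanical enough to sketch in code. Here's a minimal illustration, assuming each captured network entry is a dict with a `response_body` string — the capture format, the `allowed_emails` set, and the function name are all my own inventions, not the specialist's actual logic:

```python
import json
import re

# Naive email pattern; real PII detection would be broader than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def find_leaked_emails(network_log: list[dict], allowed_emails: set[str]) -> list[str]:
    """Return emails present in response bodies but not in the allowed set."""
    leaked = set()
    for request in network_log:
        body = request.get("response_body", "")
        for email in EMAIL_RE.findall(body):
            if email not in allowed_emails:
                leaked.add(email)
    return sorted(leaked)

log = [{"url": "/api/v1/users", "response_body": json.dumps(
    [{"email": "ana@example.com"}, {"email": "ceo@example.com"}])}]
print(find_leaked_emails(log, {"ana@example.com"}))  # ['ceo@example.com']
```

The specialist does this with judgment rather than a regex, but the principle is the same: compare what the API returned against what the current role is allowed to see, not just what the page rendered.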

Performance

File: qa/specialists/performance.md

Load times, responsiveness, and console health:

  • Does the page become interactive within 3 seconds?
  • Are there layout shifts after initial render?
  • Do click actions respond within 100ms?
  • Are there duplicate API calls for the same data?
  • Are there JavaScript errors or unhandled promise rejections?
  • Are there React warnings (missing keys, deprecated methods)?

This one mostly reports PERF-OK or PERF-WARNING. Haven’t seen a PERF-BLOCKER yet. But the warnings are handy — like when a page makes the same API call twice on load. Easy to miss when you’re writing code, easy to fix once someone points it out.
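The duplicate-call check in particular is trivially mechanical. A sketch, assuming each captured network entry is a dict with `method` and `url` keys (the real capture format may differ):

```python
from collections import Counter

def duplicate_requests(network_log: list[dict]) -> list[tuple[str, str]]:
    """Return (method, url) pairs that appear more than once in one page load."""
    counts = Counter((r["method"], r["url"]) for r in network_log)
    return [pair for pair, n in counts.items() if n > 1]

log = [
    {"method": "GET", "url": "/api/v1/entries"},
    {"method": "GET", "url": "/api/v1/entries"},  # duplicate fetch on mount
    {"method": "GET", "url": "/api/v1/me"},
]
print(duplicate_requests(log))  # [('GET', '/api/v1/entries')]
```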

Data Leakage

File: qa/specialists/data-leakage.md

This is the domain-specific specialist. The app handles sensitive user data. The data leakage specialist checks:

  • AI-generated content stays within the expected scope (no soliciting names or case IDs)
  • One user’s session data doesn’t appear in another’s session
  • Page snapshots don’t contain identifiable personal information
  • API responses are scoped to the requesting user
  • Client-side storage doesn’t contain sensitive content
  • Console logs don’t dump request payloads with user data

This specialist exists because of what the app does. If you’re building a todo app, you don’t need this. If your app handles sensitive data, it’s worth thinking about what kind of leakage could happen and writing a checklist for it.

Language

File: qa/specialists/language.md

The app UI is entirely in Spanish. This specialist catches:

  • English strings that leaked into the UI (untranslated labels, AI responses switching to English)
  • Inconsistent terminology (using “procedimiento” in one place and “protocolo” for the same concept elsewhere)
  • Tone consistency (the app uses the formal “usted” register — mixing in informal “tú” would be jarring)
  • Date/number formatting (Spanish conventions, not English)

I added this one after noticing that the AI sometimes generated content in English, or that error messages from the API showed up raw in English instead of being translated. Pretty niche, but if your app is in a language other than English, you’ll run into this.
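The date-format check is one of the few language checks that reduces to a pattern match. A rough sketch of the idea — the regex and sample strings are illustrative, not the specialist's real criteria:

```python
import re

# Flags English-formatted dates like "Feb 15, 2026" in UI text that
# should follow Spanish conventions ("15 feb 2026", "15 de febrero de 2026").
ENGLISH_DATE = re.compile(
    r"\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2}, \d{4}\b"
)

def english_dates(ui_strings: list[str]) -> list[str]:
    """Return strings containing an English-formatted date."""
    return [s for s in ui_strings if ENGLISH_DATE.search(s)]

labels = ["Creado: 15 de febrero de 2026", "Created: Feb 15, 2026"]
print(english_dates(labels))  # ['Created: Feb 15, 2026']
```

Most of what the language specialist catches (tone, terminology drift) can't be reduced to regexes like this — that's exactly why it's an AI agent and not a lint rule.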

How it works under the hood

The /qa-run command uses Claude Code’s Task tool to run sub-agents. Each specialist is a separate agent that gets:

  1. The specialist’s evaluation criteria (the markdown file)
  2. The step definition (what action was taken, what was expected)
  3. The captured page state (accessibility snapshot, console messages, network requests)

The prompt template looks roughly like this:

You are the {Specialist Name} evaluating Step {N} of QA Journey {ID}.

## Your Evaluation Criteria
{contents of qa/specialists/{name}.md}

## Step Definition
- Action: {what was done}
- Expected: {what should have happened}
- Blocking: {what counts as a blocker}

## Captured State
### Page Snapshot
{accessibility snapshot text}

### Console Messages
{console output}

### Network Requests
{network summary}

Evaluate this step against your checklist. Return findings in your output format.

For a step with 5 specialists, all 5 run at the same time. The main agent collects their findings and adds them to the report.
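Conceptually, the fan-out looks like this. The Task tool handles it inside Claude Code, so this Python sketch is just an analogy — `run_specialist` is a placeholder for spawning a sub-agent with the filled-in template, not a real API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_specialist(name: str, step: dict, state: dict) -> str:
    # Placeholder: in the real workflow this launches a sub-agent with the
    # specialist's criteria, the step definition, and the captured state.
    return f"[{name.upper()}-OK] step {step['n']}"

def evaluate_step(step: dict, state: dict, specialists: list[str]) -> list[str]:
    """Run every specialist on the same captured state in parallel."""
    with ThreadPoolExecutor(max_workers=len(specialists)) as pool:
        futures = [pool.submit(run_specialist, s, step, state) for s in specialists]
        return [f.result() for f in futures]

print(evaluate_step({"n": 1}, {}, ["qa", "ux", "security"]))
```

The key property is that every specialist sees the identical captured state — they differ only in the criteria they evaluate it against.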

How findings become work

The specialists produce findings. I turn those findings into Linear issues. This is the manual step — I read the report, decide what’s worth fixing, and create issues.

Some examples of specialist findings that became real issues:

Security specialist found:

[SEC-WARNING] API response for user list includes all user emails
- Page: /admin/users
- Vector: Network tab shows full email list regardless of role filter
- Evidence: GET /api/v1/users returns all 8 users for manager role
- Recommendation: Filter response based on role permissions

→ Linear issue: “Scope /api/v1/users response by role — manager should only see their team”

UX specialist found:

[UX-ISSUE] No feedback after user submits form entry
- Page: /entries/:id
- Context: User fills form and clicks submit
- Issue: Entry appears in list but no visual confirmation (no "saved" indicator, no loading state)
- Suggestion: Add save confirmation and loading indicators

→ Linear issue: “Add save confirmation and loading indicators to data entry flow”

Language specialist found:

[LANG-WARNING] Date format inconsistency
- Page: /reviews
- Element: Review creation date
- Found: "Feb 15, 2026"
- Expected: "15 feb 2026" or "15 de febrero de 2026"
- Source: UI label

→ Linear issue: “Fix date formatting across the app to use Spanish conventions”

Performance specialist found:

[PERF-WARNING] Duplicate API call on entries page mount
- Page: /entries
- Metric: 2x GET /api/v1/entries in rapid succession
- Expected: Single fetch
- Impact: Unnecessary network usage, potential race conditions

→ Linear issue: “Fix duplicate API call on /entries page mount”
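One nice side effect of the rigid finding format is that it's machine-friendly. I triage by hand, but a finding block like the ones above parses cleanly — here's a hypothetical parser, assuming the exact `[TAG] summary` plus `- Key: value` shape shown:

```python
def parse_finding(text: str) -> dict:
    """Parse a '[SEVERITY] summary' block with '- Key: value' detail lines."""
    lines = [line.strip() for line in text.strip().splitlines()]
    tag, summary = lines[0].split("] ", 1)
    finding = {"severity": tag.lstrip("["), "summary": summary}
    for line in lines[1:]:
        key, value = line.lstrip("- ").split(": ", 1)
        finding[key.lower()] = value
    return finding

raw = """[PERF-WARNING] Duplicate API call on entries page mount
- Page: /entries
- Metric: 2x GET /api/v1/entries in rapid succession"""
print(parse_finding(raw)["severity"])  # PERF-WARNING
```

If I ever automate the Linear step, this structure is what would make it possible.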

Each of these issues goes into Linear’s Backlog. When I run /plan-issue, they get refined into specs with acceptance criteria. When I run /work-issue, they get implemented. When I run /qa-run again, I verify the fixes.

Configuring which specialists run

Not every step needs every specialist. A login step needs QA and security, but probably not UI and performance. The journey definitions specify which specialists evaluate each step:

### Step 1: Manager Login & Dashboard
- **Specialists**: [qa, ux, ui, security]

### Step 7: User Submits Entry
- **Specialists**: [qa, ux, performance]

You can also override at the command line:

/qa-run 07 --specialists=qa,security    # Only functional and security

This keeps things focused. When I’m looking at performance after a batch of changes, I run with --specialists=performance. When I’m doing a full check, I let the journey definition decide.

What I’d tell you if you build your own

Adding a specialist is just adding a markdown file to qa/specialists/. The structure is:

  1. Role description — “You are a {X} specialist evaluating {Y}”
  2. Evaluation checklist — numbered sections with specific things to check
  3. Output format — the structure for each finding
  4. Severity tags — what each severity level means for this specialist

The checklist is the part that matters. Be specific. “Check if the page is accessible” is useless — the AI will just say OK. “Form fields have labels, focus goes in the right order, buttons are at least 44x44px” gives it something concrete to check.
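Put together, a specialist file following that structure might look like this. This is an illustrative sketch (an accessibility specialist I don't actually have), not one of my real files:

```markdown
You are an accessibility specialist evaluating a captured page state.

## Checklist
1. Form fields have programmatically associated labels.
2. Focus order follows the visual reading order.
3. Interactive targets are at least 44x44px.
4. Status changes are announced, not just shown visually.

## Output Format
[A11Y-SEVERITY] One-line summary
- Page: {route}
- Element: {what}
- Evidence: {snapshot or console excerpt}

## Severity
- A11Y-BLOCKER: task impossible via keyboard or screen reader
- A11Y-ISSUE: task completable but painful
- A11Y-OK: no findings
```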

My first versions were too vague — the specialist would just report [OK] on everything. I’ve been tightening the checklists over time based on what I actually care about and what I’ve seen go wrong.

Where I’d be honest about the limits

I don’t want to oversell this. Here’s what I’ve learned it can’t do:

It’s not real security testing. The security specialist checks for obvious stuff — auth, permissions, data showing up where it shouldn’t. It’s not doing pen testing or checking for server-side vulnerabilities. If you need real security testing, hire someone.

Visual evaluation is weak. AI isn’t great at judging how things look from screenshots. It catches obvious stuff (missing elements, broken layout) but misses subtle things (slightly off spacing, colors that are technically fine but look bad).

It’s not the same every time. Running the same journey twice might give slightly different findings. The AI might read a step differently, or flag something in one run and miss it in the next. That’s the trade-off with natural language journeys instead of coded tests.

There are false positives. Some findings are noise. The specialist might flag something that’s on purpose, or report something that doesn’t matter. I read every report and decide what to act on — I don’t just create issues for everything.

Still — I’m one person doing dev, QA, UX, security, all of it. Having 7 AI agents look at every page from different angles catches things I’d miss. It’s not a replacement for any of those jobs, but it’s better than not doing them at all.