I've Been Shipping 'Multi-Tenant' Wrong for a Decade

I’ve been writing “multi-tenant” applications since around 2014. Each time, I told myself this one would be done properly. Each time, “properly” meant a tenant_id column on every table, a base repository that injected WHERE tenant_id = $1, and a strongly worded paragraph in the onboarding doc.

That is not multi-tenancy. That is tenant awareness, enforced by discipline and code review.

This is the first project where I think I’ve actually done it right. The thing that changed is not my discipline. It is where the contract lives.

A chibi developer holding a SQL query scroll being refused by a paper Postgres elephant guarding labeled, padlocked tenant drawers

What I used to call multi-tenant

The pattern I shipped, over and over:

A tenant_id column on every business table.
A base repository or ORM scope that automatically adds WHERE tenant_id = :current.
A request middleware that sets :current from the session.
A pull request template asking you to confirm new queries include the scope.

I have, at various points, written base repository decorators, ORM scopes, query interceptors, and SQL preprocessors that all tried to make this enforcement automatic. Each one was forty to four hundred lines of cleverness, and each one had a handful of places that bypassed it within the first six months: usually a raw SQL query somewhere, sometimes a primary-key lookup that the ORM treated as obviously-safe, occasionally a background job that loaded a record “by ID, because the ID is unguessable anyway.” Always an admin script.

It works until it doesn’t. The cases where it broke for me, sorted by how embarrassing the post-mortem was:

A new raw SQL query written outside the repository. Forgot the WHERE. Shipped.
A JOIN to a table that has its own tenant_id and a different alias. Scope applied to one side, not the other.
A background job that loaded a record by primary key, ignoring scope.
A reporting query that aggregated across all tenants on purpose, then got copy-pasted into a per-tenant endpoint a year later.
An admin script. Always an admin script.
An ORM find_by_id that bypassed the scope because the framework treated primary-key lookups as obviously safe.

In every case the application was tenant-aware. It knew about tenants. It usually filtered by tenant. But the database itself did not care. If a single query forgot, the rows came back wrong.

Every multi-tenant SaaS that has not moved the check into the database is one careless query away from a cross-tenant leak. I am not going to name the codebases.

Moving the contract into the database

PostgreSQL has Row-Level Security. It is not new. It has been in core since 9.5, which is 2016. I had read about it. I had not used it on a real project until now.

The shape of it is small. For every tenant-scoped table:

ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON orders
    USING (tenant_id = current_setting('app.current_tenant_id')::uuid);

That is the policy. From now on, every SELECT, UPDATE, DELETE against orders is scoped by the database to rows where tenant_id matches the value of the session variable app.current_tenant_id.

The application is no longer responsible for filtering. The application is responsible for setting one variable correctly per request. That is a much smaller surface to get wrong.

The request middleware, in Go, looks like this:

tx, err := pool.Begin(r.Context())
if err != nil { /* ... */ }

_, err = tx.Exec(r.Context(),
    "SELECT set_config('app.current_tenant_id', $1, true)",
    tenantID,
)

Two notes on what is not obvious there:

The query runs inside a transaction. set_config(..., true) is the parameterized equivalent of SET LOCAL, and SET LOCAL is transaction-scoped. If I set it on the connection instead, the value would leak to the next request that picks up the same pooled connection. I do not want to find that bug in production.
tenantID comes from a verified JWT claim, validated as a UUID before it hits the database. I did not want the UUID check to be in SQL.

The handler that runs inside this middleware uses the transaction directly. It does not write WHERE tenant_id = ? anywhere. That clause is enforced one layer down, by the database, on every query, with no way for the handler to opt out.

That is the headline. The rest of this post is the four things I got wrong before I got it right.

The four things I missed

1. The table owner bypasses RLS by default

I enabled RLS, wrote my policy, ran the test, and it passed. Then I ran a second test where the connecting user was the database owner, which is what my application used, and the policy did nothing. Every row came back.

PostgreSQL has a quiet rule: the role that owns the table is exempt from the table’s RLS policies unless you say otherwise. The fix is one line per table:

ALTER TABLE orders FORCE ROW LEVEL SECURITY;

I have a whole migration that does nothing but apply FORCE to a dozen tables I missed on the first pass.

2. Superusers bypass RLS even with FORCE

FORCE makes the owner respect the policy. It does not stop a superuser. A fresh PostgreSQL install ships with one role: the bootstrap superuser created by initdb, usually called postgres. Most applications end up connecting as that role because every tutorial does. A superuser bypasses every RLS policy, with or without FORCE. Database ownership and superuser status are separate properties; you have to address both.

So I created a second role:

CREATE ROLE app_user
    LOGIN PASSWORD '...'
    NOSUPERUSER NOINHERIT NOCREATEDB NOCREATEROLE;

GRANT USAGE ON SCHEMA public TO app_user;
GRANT SELECT, INSERT, UPDATE, DELETE
    ON ALL TABLES IN SCHEMA public TO app_user;

Migrations run as the superuser. The application connects as app_user, which is not a superuser, owns nothing, and is subject to every policy. The application connection string never sees the superuser credentials.

This is the change that, in retrospect, I should have made on every multi-tenant project I have ever shipped. It is twenty lines of SQL.

3. The default of “no setting” is not “deny”

My first policy was the version above:

USING (tenant_id = current_setting('app.current_tenant_id')::uuid)

It looked correct. It was not. If app.current_tenant_id is not set on the connection, which happens for any code path that forgets to apply the middleware, current_setting raises an error. The query fails. That sounds fine: fail loud, find the bug. It is the wrong default.

The 500 is the wrong failure mode. A background job that forgot the middleware, a test fixture missing a setup step, a code path that bypasses the middleware: all of these should return zero rows, not an exception. An exception is a bug I have to chase across services. A default-deny match is a security property the database carries on its own, with no application help.

The version that defaults to deny:

CREATE POLICY tenant_isolation ON orders
    USING (
        tenant_id = COALESCE(
            NULLIF(current_setting('app.current_tenant_id', true), '')::uuid,
            '00000000-0000-0000-0000-000000000000'::uuid
        )
    );

current_setting('x', true) returns NULL instead of raising. NULLIF(..., '') collapses empty strings to NULL. COALESCE(..., nil-uuid) substitutes a UUID that does not match any real row. If the middleware is not applied, every query returns zero rows. Loud and visible in tests; safe in production.

4. Adding a new table is a place to forget

Even with all of the above, the policy still lives in a migration. Every new table that holds tenant data needs:

A tenant_id column.
ENABLE ROW LEVEL SECURITY.
FORCE ROW LEVEL SECURITY.
A tenant_isolation policy using the safe COALESCE pattern.
A grant to the application role.

Five things. Forgetting any of them is silent. The database does not warn me that I created a table with no policy. It just lets me read across tenants.

I do not have a clean answer for this. I have a checklist in the contributor guide, a test that enumerates every table and asserts a policy exists, and a code review habit. None of these is the database telling me.

The bug I am happy I found

A while back I ran a structured pass over every RLS policy in the codebase, comparing each one against the canonical pattern. One early migration contained this:

CREATE POLICY tenant_isolation ON notifications
    USING (tenant_id = current_setting('app.tenant_id')::uuid);

Look at the setting name. The application sets app.current_tenant_id. The policy reads app.tenant_id. They are not the same string.

That policy had been live for months. Every query against notifications raised unrecognized configuration parameter "app.tenant_id" at execution time. The application name and the policy name did not match, and current_setting did not have missing_ok set. The reason I never noticed is that the feature was new and lightly used, and the error path was soft: a notification fetch that errors gets logged and falls back to “no notifications.” The bug surfaced to the user as an empty list, not a 500 page. It looked like a feature not finished, not a security bug.

The audit caught it. I wrote a follow-up migration that fixes the name, applies FORCE, and switches to the safe default-deny pattern.

This is the bug I would never have caught with discipline-based multi-tenancy. The same misspelling in an application-level WHERE clause would have shipped data the wrong way for a year. Instead, the policy errored out on every call. No data ever flowed the wrong way; the feature just looked half-built until somebody looked.

That is the value of moving the contract down. The failure modes are different. Misspelled policy: 500 error, fix it. Misspelled WHERE: silent cross-tenant read, find out from a customer.

What does not work

I would be lying if I said this is free.

Transaction-per-request is the cost of SET LOCAL. Every authenticated request begins a transaction, sets the variable, runs the handler, and commits. For read-only handlers I let the response stream while the transaction is open and commit afterwards. For mutations I buffer the response, commit, and only flush bytes to the client if the commit succeeded. That second case adds memory pressure for large responses, and I have a couple of streaming endpoints that opt out of the middleware deliberately. They use a different, audited path.
There is still a bypass path. Platform support has to act across tenants sometimes: to debug, to restore, to migrate. That path uses the superuser connection and is exempt from RLS by construction. I audit it, log every action, and keep the surface as small as I can. Pretending the bypass does not exist would be dishonest. It is an attack surface; it is just a small, observable one.
The connection pool has to be the right pool. The application has two pools: one as the restricted role for normal traffic, one as the superuser for migrations and platform jobs. The wrong code path picking up the wrong pool would erase the entire benefit. I have a single function that hands out pools, and the superuser pool is only injected into a small set of components. It is a thin guarantee that needs a test, not a comment.
Old migrations stay wrong until you re-audit them. The misspelling above survived several “fix RLS in general” migrations because each one only touched the tables that broke its own test. The first migration that mentions a table is the one that has to be right; later sweep migrations rarely catch every table. I have a test now that walks every table and checks the policy. I should have had it on day one.
You still need tenant_id everywhere. RLS does not invent the column. Every table that holds tenant data carries an explicit tenant_id. Joins still need to mention it where the planner cannot infer scoping through a foreign key. The data model does not get simpler. The enforcement does.

Tenant-aware vs tenant-isolated

The reframe, for me, is small but blunt.

A tenant-aware system knows about tenants. It almost always filters correctly. It depends on every developer, every query, every code path, and every background job remembering. The failure mode is silent.

A tenant-isolated system has the check enforced by a layer that the application cannot bypass. The application is responsible for one thing: telling the database which tenant the current request belongs to. The failure modes are loud: a missing setting denies all rows; a misspelled policy raises an error; a forgotten policy on a new table fails a startup test.

I have shipped tenant-aware systems for ten years and called them multi-tenant. They were not. The first one I would let an auditor read is this one, and the reason is not that I got more careful. The reason is that the database now refuses to let me be careless.

If I had to compress everything above into a sentence I would want my past self to read: do not put your tenant boundary in your application code. Put it in the database role, the table policy, and the session variable. Then build the application against a database that will not return the wrong rows.