Role-Based Access Control Per Tenant
Per-tenant RBAC is the discipline of resolving and enforcing role-to-permission grants that are scoped to a single tenant, so that an administrator in one tenant can never act on another tenant's resources; it operates within the broader Auth Isolation & Cross-Tenant Access Control framework that governs how identity, sessions, and tokens stay separated across tenants.
The hard part is not defining roles — it is making sure that every permission check answers two questions at once: what can this role do and which tenant is it allowed to do it in. A role named admin means nothing on its own. It only becomes a grant when it is resolved against a specific tenant_id, and that resolution has to happen on every request, in every service, without a single code path that forgets the tenant dimension. Get the resolution model wrong and you do not get a bug — you get a cross-tenant breach.
This page walks the full implementation: the prerequisites, a step-by-step build of the resolution and enforcement layers, the query-scoping and connection mechanics that keep checks honest, the security layer that closes escalation paths, and the operational metrics that tell you whether your cache and policy engine are keeping up. Two decisions deserve their own deep dives — how you shape the role and permission schema itself, covered in designing tenant-scoped permission models, and how you prove what changed and when, covered in auditing RBAC changes across tenants.
Prerequisites
Before wiring per-tenant RBAC into a live system, confirm the following are in place. Each one is load-bearing; skipping any of them pushes enforcement to a layer that cannot see the tenant boundary.
- [ ] A trusted source of tenant identity on every request: a signed claim, not a client-supplied header. Align this with tenant-aware JWT & token management so the tenant_id is cryptographically bound to the session.
- [ ] A persistent store for role assignments keyed by (tenant_id, user_id, role), on PostgreSQL 14+ or an equivalent with composite-key indexing.
- [ ] A policy engine: Open Policy Agent (OPA) 0.60+, AWS Cedar, or an in-process evaluator. This guide shows in-process and OPA examples and compares all three.
- [ ] A distributed cache (Redis 7+) for compiled permission sets, with pub/sub for invalidation.
- [ ] Request-scoped context propagation (Node AsyncLocalStorage, Go context.Context, Python contextvars) so the active tenant never has to be passed by hand.
- [ ] An append-only audit sink (Kafka topic, partitioned table, or SIEM) for every grant, revoke, and denied check.
- [ ] Framework versions confirmed: Express 4.18+/5.x, Prisma 5+, or Spring Security 6. The examples below assume these.
Step-by-Step Implementation
The flow has five ordered stages. Each runs before the next; a failure at any stage is a hard stop, never a fall-through to a default.
1. Resolve the tenant from a trusted claim
Tenant resolution comes first because every later check is scoped by it. Prefer the signed token claim over any routing hint. If you also read a subdomain or X-Tenant-ID header for routing, treat them as untrusted and require them to match the token claim — a mismatch is an attack signal, not a recoverable state.
import { Request, Response, NextFunction } from 'express';
import { AsyncLocalStorage } from 'node:async_hooks';
export const tenantContext = new AsyncLocalStorage<{ tenantId: string; userId: string }>();
export function resolveTenant(req: Request, res: Response, next: NextFunction) {
  const claimTenant = req.auth?.tenantId; // verified JWT claim
  const routedTenant = (req.headers['x-tenant-id'] as string) || req.subdomains[0];
  if (!claimTenant) {
    return res.status(401).json({ error: 'No tenant claim in token' });
  }
  if (routedTenant && routedTenant !== claimTenant) {
    // Someone is asking to act in a tenant their token does not authorize.
    return res.status(403).json({ error: 'Tenant routing/claim mismatch' });
  }
  tenantContext.run({ tenantId: claimTenant, userId: req.auth.userId }, () => next());
}
2. Load the user's role assignments for that tenant
Roles are looked up by the composite key. A user may hold different roles in different tenants, so the query must filter on both tenant_id and user_id. Never cache role assignments without the tenant in the cache key.
SELECT role
FROM tenant_role_assignments
WHERE tenant_id = $1
AND user_id = $2;
-- Backing index:
-- CREATE INDEX idx_tra_lookup ON tenant_role_assignments (tenant_id, user_id);
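The lookup above can be sketched end to end in a few lines. This is a minimal illustration that uses Python's built-in sqlite3 as a stand-in for PostgreSQL so it runs anywhere; the table and column names follow the query, while the helper names (open_demo_store, roles_for, role_cache_key), the cache key layout, and the sample rows are assumptions made for the example.

```python
import sqlite3


def open_demo_store() -> sqlite3.Connection:
    """In-memory stand-in for the assignments table (sqlite3, not PostgreSQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tenant_role_assignments ("
        " tenant_id TEXT NOT NULL,"
        " user_id TEXT NOT NULL,"
        " role TEXT NOT NULL,"
        " PRIMARY KEY (tenant_id, user_id, role))"
    )
    conn.executemany(
        "INSERT INTO tenant_role_assignments VALUES (?, ?, ?)",
        [
            ("acme", "u1", "admin"),     # u1 is an admin in acme
            ("globex", "u1", "viewer"),  # but only a viewer in globex
        ],
    )
    return conn


def roles_for(conn: sqlite3.Connection, tenant_id: str, user_id: str) -> list[str]:
    """Filter on BOTH columns; a user_id-only lookup would merge tenants."""
    rows = conn.execute(
        "SELECT role FROM tenant_role_assignments"
        " WHERE tenant_id = ? AND user_id = ? ORDER BY role",
        (tenant_id, user_id),
    ).fetchall()
    return [row[0] for row in rows]


def role_cache_key(tenant_id: str, user_id: str) -> str:
    """Hypothetical cache key layout; the tenant is always part of the key."""
    return f"roles:{tenant_id}:{user_id}"
```

The same user resolves to different roles depending on the tenant passed in, and a tenant with no assignment yields an empty list, which the later stages treat as a deny.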
3. Compile the role into a permission set
Roles are indirection; permissions are what you actually check. Compile each role into a flat, hashable permission set so that the runtime check is a constant-time membership test rather than a graph walk. A permission is a resource:action pair, such as invoice:read, scoped implicitly by the tenant context already on the request.
type Permission = `${string}:${string}`; // e.g. "invoice:read"
const ROLE_PERMISSIONS: Record<string, Permission[]> = {
  admin: ['invoice:read', 'invoice:write', 'member:invite', 'member:remove'],
  editor: ['invoice:read', 'invoice:write'],
  viewer: ['invoice:read'],
};
export function compilePermissions(roles: string[]): Set<Permission> {
  const set = new Set<Permission>();
  for (const role of roles) {
    for (const perm of ROLE_PERMISSIONS[role] ?? []) set.add(perm);
  }
  return set; // O(1) membership checks downstream
}
4. Evaluate the access decision
The evaluation step takes the compiled set, the action being attempted, and the resource type. The default branch must be deny — any unknown action, missing role, or parse error returns false. The example uses an in-process evaluator; for declarative policy across many services, route the same inputs through OPA.
def evaluate_access(permissions: set[str], action: str, resource: str) -> bool:
    """Constant-time check. Unknown inputs deny by construction."""
    return f"{resource}:{action}" in permissions
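The membership test denies unknown actions on its own, but malformed inputs (a missing permission set, a non-string action) can still raise before it runs. One way to keep the deny-by-default contract at the call site is a thin wrapper like the sketch below; the decide function and the logger name are assumptions for illustration, and evaluate_access is repeated only so the block is self-contained.

```python
import logging

logger = logging.getLogger("rbac")


def evaluate_access(permissions: set[str], action: str, resource: str) -> bool:
    """Same membership test as above, repeated so this sketch runs on its own."""
    return f"{resource}:{action}" in permissions


def decide(permissions, action, resource) -> bool:
    """Fail closed: malformed inputs or any evaluator error become a deny."""
    try:
        if not isinstance(permissions, (set, frozenset)):
            return False  # missing or wrongly typed permission set
        if not isinstance(action, str) or not isinstance(resource, str):
            return False  # unparsed or missing request fields
        if not action or not resource:
            return False  # empty strings never match a real grant
        return evaluate_access(permissions, action, resource)
    except Exception:
        # Any unexpected failure is logged and treated as a deny, never an allow.
        logger.exception("access evaluation failed; denying")
        return False
```

The wrapper never returns anything other than True for an exact match, so a bug upstream degrades into a 403 rather than an implicit grant.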
For policy-as-code, the equivalent Rego keeps the tenant explicit in the input document and defaults to deny:
# policy.rego
package rbac
default allow = false
allow {
    input.tenant_id == input.resource.tenant_id  # never cross the boundary
    perm := sprintf("%s:%s", [input.resource.kind, input.action])
    input.permissions[perm]
}
5. Enforce, then audit the decision
The guard wraps a route or RPC handler. Every decision — allow and deny alike — is written to the audit sink with the tenant, user, action, and outcome. Denied checks are the early-warning system for probing; do not drop them.
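// Assumes an earlier middleware attached req.permissions (the compiled Set from step 3)
// and that `audit` is the application's client for the append-only audit sink.
export function requirePermission(perm: Permission) {
  return (req: Request, res: Response, next: NextFunction) => {
    const store = tenantContext.getStore();
    if (!store) return res.status(500).json({ error: 'No tenant context' }); // fail closed
    const allowed = req.permissions?.has(perm) === true; // a missing set means deny
    audit.emit({ tenantId: store.tenantId, userId: store.userId, action: perm, allowed, at: Date.now() });
    if (!allowed) return res.status(403).json({ error: 'Forbidden' });
    next();
  };
}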
The figure below shows how a single request threads these stages, and where the tenant boundary is enforced at each hop.
Choosing an Evaluation Model
Three approaches dominate per-tenant RBAC. The right one depends on how many services need to share policy and how dynamic your rules are.
| Model | Where it runs | Tenant scoping | Latency | Best fit |
|---|---|---|---|---|
| In-process Set lookup | Inside each service | Implicit via request context | Sub-microsecond | Monoliths, single hot path, few rules |
| OPA (Rego) | Sidecar or library | Explicit in input document | 1–3 ms (cached bundles) | Many services sharing one policy |
| AWS Cedar | Library / Verified Permissions | Explicit in entity store | 1–2 ms | Hierarchical resources, formal analysis |
For most teams the pragmatic path is in-process Set lookups behind a cache for the hot read path, with OPA or Cedar reserved for policies that must be authored once and enforced across many services. Mixing them is fine as long as the deny-by-default contract is identical in every evaluator you run.
There is a second axis that the table does not capture: how dynamic the grants are. Static role-to-permission maps (admin, editor, viewer) compile cleanly into Sets and rarely change, so an in-process lookup with a long cache TTL is ideal. Attribute-based rules, such as a user approving an invoice only below their own spend limit or only during business hours in the tenant's region, cannot be flattened into a Set ahead of time because the decision depends on request-time attributes. Those belong in OPA or Cedar, where the policy receives the full input document and can reason over it. The mistake is forcing attribute logic into the Set model with an explosion of synthetic permission strings: every new attribute multiplies the number of strings to maintain, and the audit trail becomes unreadable. Keep coarse role grants in the fast path and push genuinely conditional logic into the policy engine.
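To make the split concrete, here is a minimal sketch of a two-stage decision: a coarse Set lookup first, then a request-time attribute check. In a real deployment the second stage would typically live in OPA or Cedar; it is written in Python here only to show where the boundary falls. The invoice:approve permission, the ApprovalRequest fields, and the can_approve function are hypothetical names, not part of the role map shown earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovalRequest:
    tenant_id: str             # tenant from the verified token claim
    resource_tenant_id: str    # tenant that owns the invoice
    amount_cents: int          # invoice amount
    approver_limit_cents: int  # the approver's own spend limit


def can_approve(permissions: frozenset[str], req: ApprovalRequest) -> bool:
    # Stage 1 (fast path): coarse, precompiled role grant.
    if "invoice:approve" not in permissions:
        return False
    # Tenant boundary: the invoice must belong to the caller's tenant.
    if req.tenant_id != req.resource_tenant_id:
        return False
    # Stage 2 (conditional): depends on request-time attributes, so it
    # cannot be precompiled into a permission string.
    return req.amount_cents < req.approver_limit_cents
```

Only one permission string is needed regardless of how many distinct spend limits exist, which is the point of keeping attribute logic out of the Set.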
Whatever model you pick, the role and permission schema underneath it is the decision that ages worst if rushed — role hierarchies, permission granularity, and how to avoid a combinatorial matrix all need to be settled before the first grant is issued.
Dynamic Query Scoping & Connection Handling
A permission check that passes the application layer but lets a query read another tenant's rows is worthless. Enforcement has to reach the data layer, and the cleanest way is to inject the tenant filter where queries are built rather than trusting every caller to add a WHERE tenant_id = ... clause. This is the same principle that governs the broader tenant-aware data routing & query scoping layer.
A Prisma client extension reads the active tenant from context and refuses to run any query that lacks it — the absence of a tenant is a thrown error, never a silent full-table scan.
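import { PrismaClient } from '@prisma/client';
import { tenantContext } from './tenant-context';
const base = new PrismaClient();
// Assumes every model exposes a scalar tenantId field; upsert's create branch is not covered here.
export const prisma = base.$extends({
  query: {
    $allModels: {
      async $allOperations({ operation, args, query }) {
        const tenantId = tenantContext.getStore()?.tenantId;
        if (!tenantId) throw new Error('Refusing unscoped query: no tenant context');
        const scoped = args as any;
        if (operation.startsWith('create')) {
          // Creates take no `where`; stamp the tenant onto each new row instead.
          scoped.data = Array.isArray(scoped.data)
            ? scoped.data.map((row: any) => ({ ...row, tenantId }))
            : { ...scoped.data, tenantId };
        } else {
          // Reads, updates, deletes: tenantId is spread last so callers cannot override it.
          scoped.where = { ...scoped.where, tenantId };
        }
        return query(scoped);
      },
    },
  },
});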
For defense in depth, pair the application filter with PostgreSQL row-level security so the database rejects a cross-tenant read even if the application layer is bypassed. The policy reads the tenant from the app.tenant_id setting, which the application sets for each transaction:
ALTER TABLE invoices ENABLE ROW LEVEL SECURITY;
-- Note: the table owner bypasses RLS unless FORCE ROW LEVEL SECURITY is also set.
CREATE POLICY tenant_isolation ON invoices
  USING (tenant_id = current_setting('app.tenant_id')::uuid);
Connection handling matters here: with transaction-mode pooling (PgBouncer), SET app.tenant_id must use SET LOCAL inside the transaction so the setting does not leak to the next borrower of the pooled connection. A session-level SET on a pooled connection is a classic cross-tenant bug — the next request inherits the previous tenant's scope.
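A practical wrinkle is that SET LOCAL does not accept bind parameters, so the tenant id would have to be interpolated into the statement. PostgreSQL's set_config function with its third argument set to true has the same transaction-local effect and can be parameterized. The sketch below assumes a DB-API 2.0 connection to PostgreSQL with autocommit off (psycopg is one example); tenant_transaction is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def tenant_transaction(conn, tenant_id: str):
    """Run a block inside one transaction with app.tenant_id scoped to it.

    Assumes `conn` is a DB-API 2.0 connection with autocommit disabled,
    so the first execute implicitly opens the transaction.
    """
    cur = conn.cursor()
    try:
        # set_config(name, value, is_local=true) behaves like SET LOCAL but
        # takes a bind parameter, so the tenant id is never string-formatted.
        cur.execute("SELECT set_config('app.tenant_id', %s, true)", (tenant_id,))
        yield cur
        conn.commit()    # the setting is discarded when the transaction ends
    except Exception:
        conn.rollback()  # a rollback discards it as well
        raise
    finally:
        cur.close()
```

Because the setting lives and dies with the transaction, the next borrower of the pooled connection starts with no tenant set, and the RLS policy's current_setting call fails rather than silently reusing the previous tenant.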
Security Enforcement & Access Control
Per-tenant RBAC fails open in subtle ways. The controls below close the most common escalation paths.
| Layer | Control | What it stops |
|---|---|---|
| Token | Tenant claim signed and verified | Header-spoofed tenant switching |
| Routing | Reject claim/route mismatch | Acting in an unauthorized tenant |
| Role lookup | Composite (tenant_id, user_id) key | Inheriting another tenant's roles |
| Evaluation | Default deny on any error | Implicit grant from a parse failure |
| Data | RLS + SET LOCAL per transaction | Cross-tenant reads via pooled connections |
| Audit | Log allow and deny | Undetected probing and escalation |
Two rules deserve emphasis. First, the token claim is authoritative — any routing hint that disagrees with it is rejected, never reconciled. Second, revocation must be immediate. When a role is removed, the cached permission set has to be invalidated before the next request can use it; a stale grant is an open door. Tie revocation into session handling so that a role change also forces re-evaluation of active sessions, as described in the session isolation & state management layer. Map external identity-provider groups to internal roles through SSO mapping & identity federation, and keep that mapping tenant-scoped so an IdP group never grants a role outside its own tenant.
The cache layer is where revocation usually goes wrong. Compiled permission sets live in Redis under a tenant-and-role key with a bounded TTL, and a grant or revoke publishes an invalidation message that every node consumes.
import json
import redis
r = redis.Redis(host="localhost", port=6379, db=0)
TTL = 900  # 15 minutes; the ceiling on staleness, not the norm
def get_permissions(tenant_id: str, role: str) -> dict:
    key = f"rbac:{tenant_id}:{role}"
    cached = r.get(key)
    if cached:
        return json.loads(cached)
    matrix = fetch_matrix_from_db(tenant_id, role)
    r.setex(key, TTL, json.dumps(matrix))
    return matrix
def revoke(tenant_id: str, role: str) -> None:
    key = f"rbac:{tenant_id}:{role}"
    r.delete(key)
    r.publish("rbac:invalidate", key)  # other nodes drop their local copy
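The publishing side is shown above; the consuming side is where a node actually drops its local copy. The following is a rough sketch of that handler, assuming each node keeps a small in-process cache in front of Redis and that messages arrive in the dict shape redis-py's pub/sub delivers (with type and data fields). The LocalPermissionCache class and its method names are hypothetical.

```python
import threading
from typing import Optional


class LocalPermissionCache:
    """Per-process cache in front of Redis; keys mirror rbac:{tenant}:{role}."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: dict[str, dict] = {}

    def put(self, key: str, matrix: dict) -> None:
        with self._lock:
            self._entries[key] = matrix

    def get(self, key: str) -> Optional[dict]:
        with self._lock:
            return self._entries.get(key)

    def on_invalidate(self, message: dict) -> bool:
        """Handle one pub/sub message; return True if an entry was dropped."""
        if message.get("type") != "message":
            return False  # ignore subscribe/unsubscribe confirmations
        data = message.get("data")
        # Payloads are bytes unless the client was created with decode_responses.
        key = data.decode() if isinstance(data, bytes) else data
        with self._lock:
            return self._entries.pop(key, None) is not None
```

In a running node, a background thread would subscribe to the rbac:invalidate channel and pass each received message to on_invalidate. Only the named tenant-and-role key is dropped, so a revoke in one tenant never evicts another tenant's entries.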
Operational Overhead & Scaling Metrics
Per-tenant RBAC is cheap when cached and expensive when it stampedes. Track these metrics and act at the thresholds.
| Metric | Healthy threshold | Mitigation when breached |
|---|---|---|
| Permission-check p99 latency | < 2 ms | Move evaluation in-process; precompile sets |
| Cache hit ratio | > 95% | Raise TTL or warm hot tenant/role pairs |
| Invalidation lag (publish to drop) | < 100 ms | Co-locate Redis; use a dedicated pub/sub channel |
| DB role-lookup QPS | < cache backstop capacity | Add jittered TTL to prevent synchronized expiry |
| Denied-check rate per tenant | Baseline + alert on spike | Treat spikes as probing; rate-limit the principal |
The dominant cost in a microservice fleet is not the check itself but propagating tenant context across service hops; budget for it in gRPC or HTTP metadata and version that schema deliberately. Cache invalidation storms are the other scaling cliff: stagger TTLs with jitter so thousands of keys do not expire in the same second and stampede the database.
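Jittering the TTL takes only a few lines. The sketch below is one possible shape, assuming the 900-second ceiling used in the cache example and a 10% spread; jittered_ttl and its parameters are illustrative names. It only ever shortens the TTL, so the ceiling on staleness is never exceeded.

```python
import random
from typing import Optional

BASE_TTL = 900  # seconds; the staleness ceiling from the cache example


def jittered_ttl(base: int = BASE_TTL, spread: float = 0.1,
                 rng: Optional[random.Random] = None) -> int:
    """Return a TTL between base * (1 - spread) and base, inclusive."""
    if not 0.0 <= spread < 1.0:
        raise ValueError("spread must be in [0, 1)")
    source = rng or random            # allow a seeded generator in tests
    low = base - int(base * spread)   # jitter downward only
    return source.randint(low, base)
```

Passing the result to setex in place of a fixed TTL spreads expiries for keys written in the same second across a window of roughly 90 seconds instead of a single instant.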
Two failure shapes are worth instrumenting explicitly because they masquerade as healthy systems. The first is the silent stale grant: invalidation lag creeps above its threshold, a revoked role keeps working for tens of seconds, and nothing in your dashboards flags it because every check still returns a clean allow. Measure invalidation lag directly — timestamp the publish, timestamp the local drop, and alarm on the gap — rather than inferring it from cache hit ratio. The second is the denied-check spike: a sudden rise in 403s for one tenant is rarely a bug in your code. It is almost always a principal probing for permissions it does not have, often after a partial credential compromise. Route denied-check counts per tenant and per principal into your alerting, and wire a spike to a rate limit on the offending principal so probing is throttled rather than merely logged.
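Measuring invalidation lag directly could look like the sketch below. It assumes the invalidation payload is extended from a bare key to a small JSON document carrying the publish timestamp, which is a change from the cache example shown earlier, and it assumes node clocks are synchronized closely enough for a 100 ms threshold to be meaningful. The function names are hypothetical.

```python
import json
import time
from typing import Optional

LAG_ALARM_SECONDS = 0.100  # the 100 ms threshold from the metrics table


def build_invalidation(key: str, now: Optional[float] = None) -> str:
    """Publisher side: attach a wall-clock publish timestamp to the message."""
    published_at = time.time() if now is None else now
    return json.dumps({"key": key, "published_at": published_at})


def record_drop(payload: str, now: Optional[float] = None) -> tuple[str, float, bool]:
    """Subscriber side: call after the local drop.

    Returns (key, lag_seconds, breached) so the caller can emit a metric
    and raise an alarm when the gap exceeds the threshold.
    """
    message = json.loads(payload)
    dropped_at = time.time() if now is None else now
    lag = max(0.0, dropped_at - message["published_at"])  # clamp clock skew
    return message["key"], lag, lag > LAG_ALARM_SECONDS
```

Feeding lag_seconds into a histogram per node gives the publish-to-drop gap as a first-class metric instead of something inferred from hit ratio.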
Cost scaling is close to linear in the number of services that must perform a check, because each adds one context-propagation hop and one cache lookup. It is sub-linear in tenant count as long as the cache key includes the tenant and hot tenants dominate traffic — a handful of large tenants will hold most of the working set, so warming their role pairs at deploy time keeps the cold-start tail short without preloading the entire estate into memory.
Pitfalls & Anti-Patterns
- Tenant-blind role definitions. A global admin role table with no tenant column makes every admin a super-admin the moment one tenant's data shares storage with another's. Always key assignments on (tenant_id, role_name) and resolve them against the request's tenant.
- Filtering after fetch. Pulling rows and then dropping the ones that belong to other tenants means the data already crossed the boundary in memory, and a missed filter leaks it. Inject the tenant predicate at query construction, not in post-processing.
- Pooled connections with session-scoped tenant settings. Using SET app.tenant_id (session scope) on a PgBouncer transaction-pooled connection leaks the setting to the next borrower. Use SET LOCAL inside the transaction every time.
- Defaulting to allow on error. A policy parse failure or missing role that returns true turns every bug into an authorization bypass. The default branch of every evaluator must be deny.
- Caching without the tenant in the key. A cache key of rbac:{role} instead of rbac:{tenant}:{role} collapses every tenant's permissions into one entry: the first tenant to populate it defines access for all of them.
Frequently Asked Questions
How do I stop a tenant admin from acting on another tenant's resources?
Bind the tenant to a signed token claim, reject any request where a routing hint disagrees with the claim, and key every role lookup and cache entry on (tenant_id, ...). The boundary then travels with the request and no code path can resolve a role outside its own tenant.
Should I use OPA or an in-process check for permission evaluation?
Use an in-process Set lookup on the hot path for lowest latency, and reach for OPA or Cedar when the same policy must be authored once and enforced across many services. The deny-by-default contract must be identical whichever you choose.
How fast must role revocation take effect?
By the next request. Delete the compiled permission set from the cache and publish an invalidation so every node drops its local copy; tie this into session re-evaluation so an active session cannot keep using a removed role.
Does row-level security replace application-layer checks?
No — it backs them up. RLS catches a cross-tenant read if the application filter is bypassed, but it does not express action-level permissions like invoice:write. Run both, and use SET LOCAL so the tenant setting never leaks across pooled connections.