Per-Tenant Encryption & Key Management

Per-tenant encryption gives every tenant a distinct cryptographic boundary so that one tenant's plaintext cannot be recovered with another tenant's key, and so that deleting a single key destroys exactly one tenant's data. It is the encryption layer of the broader Multi-Tenant Compliance & Data Governance practice, and the mechanism most contracts mean when they demand "logical separation," "customer-managed keys," or "the right to be forgotten with cryptographic proof."

The pattern that makes this practical at scale is envelope encryption: a per-tenant data key encrypts the tenant's rows and blobs, and a customer master key (CMK) held in a hardware-backed key management service wraps that data key. The CMK never leaves the KMS, the data key is cheap to generate and cache, and revoking access reduces to disabling or destroying one wrapping key. Get the key hierarchy and rotation right and a single service encrypts millions of objects across thousands of tenants with a few milliseconds of overhead per request. Get it wrong and you either leak across the tenant boundary or lose a tenant's data permanently with no recovery path.

Prerequisites

Confirm the following before you encrypt a single column. Skipping any of these produces either a key you cannot rotate, a cache that leaks plaintext, or a deletion you cannot prove.

[ ] A KMS with envelope-encryption primitives — AWS KMS, Google Cloud KMS, Azure Key Vault, or HashiCorp Vault Transit. The CMK must be non-exportable and hardware-backed (FIPS 140-2/3 Level 3).
[ ] One CMK per tenant, or a CMK per tenant tier with per-tenant data keys — decide the granularity before provisioning, because re-keying later is expensive.
[ ] A tenant_id uuid NOT NULL on every encrypted table, and a key_metadata table mapping tenant_id to its current CMK ARN/resource name and active data-key version.
[ ] An authenticated-encryption cipher: AES-256-GCM or ChaCha20-Poly1305. Never AES-CBC without a separate MAC, never ECB.
[ ] A short-lived in-process data-key cache (seconds to low minutes) with an explicit TTL and a hard memory zeroing path on eviction.
[ ] IAM scoped so the application role can call Decrypt/GenerateDataKey on a tenant's CMK only within that tenant's request context — never a wildcard on all keys.
[ ] An immutable audit sink for every Decrypt, GenerateDataKey, rotation, and key-disable event, fed from KMS CloudTrail/audit logs.

Step-by-Step Implementation

The work breaks into five ordered steps: provision the per-tenant CMK, generate and wrap a data key, encrypt and persist with the wrapped key, decrypt through a guarded cache, and rotate without rewriting history. Run them in this order — encrypting before the key hierarchy and metadata exist leaves you with ciphertext you cannot attribute to a key.

1. Provision a customer master key per tenant

Each tenant gets a CMK that never leaves the KMS. Tag it with the tenant id so IAM policies and audit queries can scope to it. For bring-your-own-key tenants you import their key material into this same key resource instead of letting KMS generate it.

import boto3

kms = boto3.client("kms")

def provision_tenant_cmk(tenant_id: str) -> str:
    resp = kms.create_key(
        Description=f"CMK for tenant {tenant_id}",
        KeyUsage="ENCRYPT_DECRYPT",
        KeySpec="SYMMETRIC_DEFAULT",
        Origin="AWS_KMS",  # use "EXTERNAL" for BYOK imported material
        Tags=[{"TagKey": "tenant_id", "TagValue": tenant_id}],
    )
    key_arn = resp["KeyMetadata"]["Arn"]
    kms.create_alias(AliasName=f"alias/tenant/{tenant_id}", TargetKeyId=key_arn)
    return key_arn

2. Generate a data key and store only its wrapped form

GenerateDataKey returns both a plaintext data key (used immediately, never persisted) and a ciphertext blob (the data key wrapped by the CMK, safe to store). Persist only the wrapped blob alongside the tenant's metadata. The plaintext exists in memory just long enough to encrypt.

def issue_data_key(key_arn: str, tenant_id: str) -> tuple[bytes, bytes]:
    resp = kms.generate_data_key(
        KeyId=key_arn,
        KeySpec="AES_256",
        EncryptionContext={"tenant_id": tenant_id},  # bound into the wrap
    )
    plaintext_key = resp["Plaintext"]        # 32 bytes, use then zero
    wrapped_key = resp["CiphertextBlob"]     # store this, never the plaintext
    return plaintext_key, wrapped_key

3. Encrypt with AES-256-GCM and persist ciphertext plus the wrapped key

Use authenticated encryption so any tampering is detected on decrypt. Store the nonce, the ciphertext (the library appends the 16-byte GCM tag to it), the wrapped data key, and the data-key version together. Bind the tenant_id into the GCM associated data so a ciphertext copied into another tenant's row fails authentication.

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_field(plaintext: bytes, plaintext_key: bytes, tenant_id: str) -> dict:
    nonce = os.urandom(12)  # 96-bit random nonce; must never repeat under one key
    aesgcm = AESGCM(plaintext_key)
    aad = tenant_id.encode("utf-8")  # authenticated but not encrypted
    ciphertext = aesgcm.encrypt(nonce, plaintext, aad)  # 16-byte tag appended
    # Caller persists the wrapped data key and its version next to this record,
    # then discards plaintext_key immediately after this returns.
    return {"nonce": nonce, "ciphertext": ciphertext, "aad": aad}
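The read side of the same field is the mirror image. The following is a minimal sketch, not the author's implementation: decrypt_field is a hypothetical name, and a locally generated key stands in for a KMS-issued data key. It rebuilds the associated data from the requesting tenant, so a record copied into another tenant's row fails authentication.

```python
import os

from cryptography.exceptions import InvalidTag
from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def decrypt_field(record: dict, plaintext_key: bytes, tenant_id: str) -> bytes:
    """Decrypt a record holding "nonce" and "ciphertext" (GCM tag appended)."""
    # Rebuild the AAD from the caller's tenant rather than trusting a stored copy.
    aad = tenant_id.encode("utf-8")
    return AESGCM(plaintext_key).decrypt(record["nonce"], record["ciphertext"], aad)


# Demo: a locally generated key stands in for a KMS-issued data key.
demo_key = AESGCM.generate_key(bit_length=256)
demo_nonce = os.urandom(12)
demo_record = {
    "nonce": demo_nonce,
    "ciphertext": AESGCM(demo_key).encrypt(demo_nonce, b"secret", b"tenant-a"),
}

same_tenant = decrypt_field(demo_record, demo_key, "tenant-a")

try:
    decrypt_field(demo_record, demo_key, "tenant-b")
    cross_tenant_rejected = False
except InvalidTag:
    # The AAD no longer matches, so GCM authentication fails closed.
    cross_tenant_rejected = True
```

Deriving the AAD from the request context, instead of reading it back from the row, is the design choice that makes the binding meaningful: an attacker who can write rows cannot also choose the tenant the ciphertext is checked against.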

4. Decrypt through a guarded, short-lived data-key cache

Unwrapping the data key on every read is correct but slow and expensive in KMS calls. Cache the plaintext data key in process for seconds, keyed by tenant and data-key version, with a hard TTL. Pass the same EncryptionContext on Decrypt that you used on generate — a mismatch fails closed.

import time

_cache: dict[str, tuple[bytes, float]] = {}
_TTL_SECONDS = 60

def unwrap_data_key(wrapped_key: bytes, tenant_id: str, version: str) -> bytes:
    cache_key = f"{tenant_id}:{version}"
    hit = _cache.get(cache_key)
    if hit and hit[1] > time.monotonic():
        return hit[0]
    _cache.pop(cache_key, None)  # drop an expired entry even if the KMS call below fails
    resp = kms.decrypt(
        CiphertextBlob=wrapped_key,
        EncryptionContext={"tenant_id": tenant_id},  # must match generate
    )
    plaintext_key = resp["Plaintext"]
    _cache[cache_key] = (plaintext_key, time.monotonic() + _TTL_SECONDS)
    return plaintext_key
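The cache above stores immutable bytes, which Python cannot overwrite in place. One way to approximate the zeroing path named in the prerequisites is to hold each key in a bytearray and wipe it on eviction. This is a best-effort sketch under stated assumptions: ZeroingKeyCache, its methods, and the loader callback are hypothetical names, and the interpreter or the KMS client may still hold other copies of the key that this code cannot reach.

```python
import time
from typing import Callable


class ZeroingKeyCache:
    """TTL cache that keeps keys in mutable buffers and wipes them on eviction."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[tuple[str, str], tuple[bytearray, float]] = {}

    def get(self, tenant_id: str, version: str, loader: Callable[[], bytes]) -> bytearray:
        slot = (tenant_id, version)  # keyed by tenant and version, never by connection
        entry = self._entries.get(slot)
        if entry and entry[1] > self._clock():
            return entry[0]
        self.evict(tenant_id, version)
        buf = bytearray(loader())  # loader would call KMS Decrypt in real code
        self._entries[slot] = (buf, self._clock() + self._ttl)
        return buf

    def evict(self, tenant_id: str, version: str) -> None:
        entry = self._entries.pop((tenant_id, version), None)
        if entry:
            buf = entry[0]
            buf[:] = bytes(len(buf))  # overwrite in place before releasing

    def evict_tenant(self, tenant_id: str) -> None:
        for tenant, version in [s for s in self._entries if s[0] == tenant_id]:
            self.evict(tenant, version)
```

Callers receive the live buffer, so a reference held past eviction reads as zeros rather than as a stale key. The injectable clock is only there to make expiry testable.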

5. Rotate by issuing a new data key, not by rewriting every row

Rotation re-wraps or re-issues the data key without re-encrypting existing ciphertext eagerly. Stamp each row with the data-key version it was written under, generate a new version, and decrypt-old / encrypt-new lazily on next write (or as a background backfill). The CMK itself can rotate independently because it only ever wraps data keys.

def rotate_tenant_data_key(key_arn: str, tenant_id: str, store) -> str:
    plaintext_key, wrapped_key = issue_data_key(key_arn, tenant_id)
    version = store.next_data_key_version(tenant_id)
    store.save_wrapped_key(tenant_id, version, wrapped_key, active=True)
    store.deactivate_previous_versions(tenant_id, keep=version)
    # Old versions stay readable until backfill re-encrypts their rows.
    del plaintext_key  # zero in real code via a mutable buffer
    return version
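The lazy half of rotation, re-encrypting rows still stamped with an older version, can be sketched as follows. Row, stale_batches, and reencrypt_row are hypothetical names, and the decrypt and encrypt callables are assumed to wrap the unwrap-and-AES-GCM steps shown earlier. The toy functions at the bottom only tag bytes with a version so the sketch runs on its own; they do not encrypt anything.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Row:
    row_id: int
    key_version: int  # data-key version the payload was written under
    payload: bytes    # ciphertext in real code


def stale_batches(rows: Iterable[Row], active_version: int, batch_size: int) -> Iterator[list[Row]]:
    """Yield fixed-size batches of rows not yet on the active data-key version."""
    batch: list[Row] = []
    for row in rows:
        if row.key_version != active_version:
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch


def reencrypt_row(
    row: Row,
    active_version: int,
    decrypt: Callable[[int, bytes], bytes],
    encrypt: Callable[[int, bytes], bytes],
) -> Row:
    """Decrypt under the row's own version, then re-encrypt under the active one."""
    if row.key_version == active_version:
        return row  # already current, nothing to do
    plaintext = decrypt(row.key_version, row.payload)
    return Row(row.row_id, active_version, encrypt(active_version, plaintext))


# Toy stand-ins: they prefix a version tag and provide no confidentiality.
def toy_encrypt(version: int, plaintext: bytes) -> bytes:
    return f"v{version}:".encode("utf-8") + plaintext


def toy_decrypt(version: int, payload: bytes) -> bytes:
    prefix = f"v{version}:".encode("utf-8")
    if not payload.startswith(prefix):
        raise ValueError("payload was not written under this version")
    return payload[len(prefix):]
```

Batching gives the backfill a natural place to throttle: sleep or check load between batches so the rewrite never competes with tenant traffic.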

KMS handles the hard parts — hardware custody of the CMK, non-exportability, and tamper-evident audit logging. The granularity decision, IAM scoping, and BYOK/HYOK import flows are covered in depth in managing per-tenant encryption keys with KMS. Every Decrypt and rotation event above should land in the same pipeline you use for tenant audit logging architecture, because key access is a compliance event.

Choosing a Key Granularity and Custody Model

The two decisions that shape everything else are how finely you split keys and who holds custody. A CMK per tenant gives the cleanest crypto-shredding story and the strongest separation; a shared CMK with per-tenant data keys is cheaper and faster to provision, but shredding one tenant then depends on destroying every stored copy of that tenant's wrapped data key. Custody ranges from provider-managed through BYOK (the customer supplies key material, which the provider's KMS then holds) to HYOK (the key never leaves the customer's HSM and every decrypt is a remote call to them).

Factor	CMK per tenant	Shared CMK + per-tenant data keys	BYOK	HYOK
Tenant isolation	Strongest	Strong (data-key level)	Strong	Strongest (external custody)
Crypto-shred granularity	One tenant exactly	One tenant, if every copy of its wrapped data key is destroyed	One tenant exactly	One tenant exactly
Provisioning cost	Per-key KMS charge	One CMK, cheap	Import overhead	Integration heavy
Rotation control	Per tenant	Shared CMK rotates all	Customer-driven	Customer-driven
Latency	KMS call per cold key	KMS call per cold key	KMS call per cold key	Round-trip to customer HSM
Best fit	Regulated mid-market	Many small tenants	Enterprise mandate	Sovereignty / zero-trust-of-vendor

The deciding question is who the auditor needs to trust. If the contract says the customer must be able to revoke your access unilaterally, you need at least BYOK and often HYOK. If the goal is dense, cheap encryption with a clean per-tenant delete, a shared CMK with per-tenant data keys is usually enough.

The Envelope Key Hierarchy

The concept that trips teams up is the chain from the hardware-held master key down to the bytes on disk, and the inverse path on read. The figure below traces both directions and marks where the plaintext data key briefly lives — the one place a leak turns into cross-tenant exposure.

Figure: The CMK stays in hardware and only ever wraps data keys; the brief life of the plaintext data key in cache is the one window where a misconfiguration becomes cross-tenant exposure.
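Both directions of the hierarchy can also be traced in code. The sketch below uses an in-memory stand-in for the KMS so that it runs without cloud credentials. InMemoryKms, write_path, and read_path are hypothetical names, and a real CMK would never sit in process memory the way it does here.

```python
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM


class InMemoryKms:
    """Stand-in for a KMS; illustration only, since a real CMK stays in hardware."""

    def __init__(self) -> None:
        self._cmks: dict[str, bytes] = {}

    def create_key(self, key_id: str) -> None:
        self._cmks[key_id] = AESGCM.generate_key(bit_length=256)

    def generate_data_key(self, key_id: str, context: bytes) -> tuple[bytes, bytes]:
        data_key = AESGCM.generate_key(bit_length=256)
        nonce = os.urandom(12)
        # The wrapped form is the data key encrypted under the CMK, bound to the context.
        wrapped = nonce + AESGCM(self._cmks[key_id]).encrypt(nonce, data_key, context)
        return data_key, wrapped

    def decrypt(self, key_id: str, wrapped: bytes, context: bytes) -> bytes:
        nonce, body = wrapped[:12], wrapped[12:]
        return AESGCM(self._cmks[key_id]).decrypt(nonce, body, context)

    def destroy_key(self, key_id: str) -> None:
        del self._cmks[key_id]


def write_path(kms: InMemoryKms, key_id: str, tenant_id: str, plaintext: bytes) -> dict:
    context = tenant_id.encode("utf-8")
    data_key, wrapped_key = kms.generate_data_key(key_id, context)
    nonce = os.urandom(12)
    ciphertext = AESGCM(data_key).encrypt(nonce, plaintext, context)
    # Only the wrapped key is persisted; the plaintext data key goes out of scope here.
    return {"wrapped_key": wrapped_key, "nonce": nonce, "ciphertext": ciphertext}


def read_path(kms: InMemoryKms, key_id: str, tenant_id: str, record: dict) -> bytes:
    context = tenant_id.encode("utf-8")
    data_key = kms.decrypt(key_id, record["wrapped_key"], context)  # CMK unwraps
    return AESGCM(data_key).decrypt(record["nonce"], record["ciphertext"], context)
```

Calling destroy_key and then read_path raises KeyError in this stand-in, which mirrors the crypto-shredding property: once the wrapping key is gone, the stored record cannot be opened.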

Dynamic Query Scoping & Connection Handling

Encryption is metadata-driven at request time. Every read must resolve the tenant's active CMK and the data-key version a given row was written under before it can decrypt. That means the key_metadata lookup is on the hot path, and it must itself be tenant-scoped so one tenant's request can never resolve another tenant's key reference.

Field-level versus row-level encryption changes the scoping. Field-level (encrypt only sensitive columns) keeps the rest of the row queryable and indexable but means the query planner cannot filter on the encrypted column — you index a blind index or HMAC of the value instead. Row-level or blob-level encryption hides everything and is simplest for documents and attachments where you never query the contents.

SELECT k.cmk_arn, k.active_version
FROM key_metadata k
WHERE k.tenant_id = current_setting('app.tenant_id', true)::uuid;
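For field-level encryption, the blind index mentioned above can be sketched with the standard library. This is an illustration under assumptions: blind_index is a hypothetical name, index_key is assumed to be a per-tenant secret kept separate from the data key, and the normalization (trim plus lowercase) is one possible choice that suits values such as email addresses.

```python
import hashlib
import hmac


def blind_index(value: str, index_key: bytes, tenant_id: str) -> str:
    """Deterministic keyed hash of a normalized value, usable for equality lookups only."""
    normalized = value.strip().lower().encode("utf-8")
    tenant = tenant_id.encode("utf-8")
    # Length-prefix the tenant id so distinct (tenant, value) pairs cannot
    # produce the same message by concatenation.
    message = len(tenant).to_bytes(4, "big") + tenant + normalized
    return hmac.new(index_key, message, hashlib.sha256).hexdigest()
```

The application stores this digest in its own indexed column and queries it with equality, while the encrypted column stays opaque. Because the digest is deterministic, equal plaintexts within a tenant produce equal index values, which is the trade-off that makes the lookup possible.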

The data-key cache lives per process, and any map that serves more than one tenant must include the tenant id in its cache key. A connection pool that hands a connection to another tenant's request must not also hand over a warm plaintext key, so key the cache by tenant_id:version and never by connection. For how the tenant id arrives on the connection in the first place, the routing pillar covers tenant-aware data routing and query scoping end to end.

Security Enforcement & Access Control

Encryption only isolates tenants if the layers above it are scoped. The CMK is the boundary; IAM is what stops the application from reaching across it. The roles below must be genuinely distinct grants, never one role with a wildcard that can decrypt every tenant.

Layer	Mechanism	Enforced by	Failure if absent
Edge	tenant_id from JWT / subdomain	Gateway	Decrypt runs as no tenant
Key resolution	`key_metadata` scoped by tenant_id	Database / RLS	Wrong CMK resolved
IAM	`Decrypt` allowed only on that tenant's CMK	KMS key policy + IAM	One role decrypts all tenants
Cipher binding	`EncryptionContext` / GCM AAD = tenant_id	KMS + cipher	Ciphertext replays across tenants
Audit	every Decrypt / rotate logged immutably	KMS audit log	Untracked key access

Bind the tenant_id into both the KMS EncryptionContext and the GCM associated data. That single control means a ciphertext blob physically copied into another tenant's row cannot be decrypted — the context mismatch fails the unwrap, and the AAD mismatch fails the authentication. Tie the key-policy grants into the broader auth and cross-tenant access control model so the application's KMS permissions are reviewed alongside every other privileged path.
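A tenant-scoped grant can be expressed as an IAM policy document that names one CMK and requires the matching encryption context. The sketch below only builds the document; tenant_kms_policy is a hypothetical helper, and how the policy is attached (for example as a session policy when the application assumes a per-request role) is an assumption about your deployment, not something the text above prescribes.

```python
import json


def tenant_kms_policy(cmk_arn: str, tenant_id: str) -> dict:
    """Build an IAM policy allowing data-key use on one CMK for one tenant only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "TenantScopedDataKeyUse",
                "Effect": "Allow",
                "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
                "Resource": cmk_arn,  # a single key ARN, never "*"
                "Condition": {
                    # KMS condition key: the request's EncryptionContext must carry this tenant_id.
                    "StringEquals": {"kms:EncryptionContext:tenant_id": tenant_id}
                },
            }
        ],
    }


example_policy = tenant_kms_policy(
    "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
    "tenant-a",
)
example_policy_json = json.dumps(example_policy, indent=2)
```

With both the resource and the context condition in place, a request that resolves the wrong CMK or presents another tenant's context is denied by IAM before KMS attempts the unwrap.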

Operational Overhead & Scaling Metrics

Envelope encryption adds a KMS round-trip on cold reads and a few microseconds of AES-GCM per field. The cache turns thousands of decrypts into one. Watch the following and act at the thresholds.

Metric	Healthy	Warning threshold	Mitigation
KMS Decrypt calls / request	<0.05 (cache-served)	>0.5 sustained	Raise cache TTL; check cache key includes version
Data-key cache hit rate	>95%	<80%	Tune TTL; pre-warm hot tenants
Decrypt p99 latency added	<5 ms	>25 ms	Cache plaintext key; co-locate KMS region
KMS request throttling	none	`ThrottlingException` seen	Request quota increase; batch via data keys
Key versions per tenant	1–3 active	many old versions readable	Backfill re-encrypt to retire old versions

The single highest-leverage action is the data-key cache. Without it, every read is a KMS call, which is both slow and rate-limited; with a short TTL keyed by tenant_id:version, you collapse a tenant's read burst into one unwrap while keeping the blast radius of a cached key to seconds.
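The first two rows of the table reduce to simple arithmetic over cache counters. A small sketch, assuming each cache miss costs exactly one KMS Decrypt call; cache_health and the intermediate "watch" status are hypothetical and not part of the table above.

```python
def cache_health(hits: int, misses: int, requests: int) -> dict:
    """Classify data-key cache counters against the healthy and warning thresholds."""
    lookups = hits + misses
    hit_rate = hits / lookups if lookups else 1.0
    # Assumption: every cache miss results in one KMS Decrypt call.
    kms_calls_per_request = misses / requests if requests else 0.0

    if hit_rate < 0.80 or kms_calls_per_request > 0.5:
        status = "warning"
    elif hit_rate > 0.95 and kms_calls_per_request < 0.05:
        status = "healthy"
    else:
        status = "watch"  # between the two thresholds
    return {
        "hit_rate": hit_rate,
        "kms_calls_per_request": kms_calls_per_request,
        "status": status,
    }
```

For instance, 990 hits and 10 misses over 1,000 requests gives a 99% hit rate and 0.01 KMS calls per request, which falls in the healthy band.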

Crypto-Shredding for Deletion

The strongest reason to give each tenant its own key is deletion. Crypto-shredding satisfies "right to erasure" by destroying the key instead of overwriting every copy of the ciphertext — backups, replicas, and cold storage become permanently unreadable the moment the wrapping key is gone, with no need to reach into immutable archives.

def crypto_shred_tenant(key_arn: str, tenant_id: str, store) -> None:
    # 1. Schedule CMK deletion (KMS enforces a waiting period).
    kms.schedule_key_deletion(KeyId=key_arn, PendingWindowInDays=7)
    # 2. Drop every wrapped data-key version for the tenant.
    store.delete_all_wrapped_keys(tenant_id)
    # 3. Purge this process's cache; other processes hold a key until their TTL expires.
    for k in [c for c in _cache if c.startswith(f"{tenant_id}:")]:
        _cache.pop(k, None)
    # Ciphertext may remain in backups; it is now undecryptable.

Crypto-shredding is the cryptographic complement to record-level deletion. Use it together with per-tenant data deletion workflows when fulfilling a GDPR data subject request — delete the live records for immediate effect, then destroy the key so backups and replicas are provably unrecoverable. Log the key-destruction event as the auditable proof of erasure.
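One way to make the key-destruction record tamper-evident inside your own pipeline is a hash chain, where each entry commits to the one before it. Hash chaining is a technique introduced here for illustration and is not prescribed by the text above; append_event, verify_chain, and the event fields are hypothetical, and the KMS provider's own audit log remains the primary record.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(body: dict) -> str:
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_event(log: list, event: dict) -> dict:
    """Append an event that commits to the hash of the previous entry."""
    body = dict(event)
    body["prev_hash"] = log[-1]["hash"] if log else GENESIS
    entry = dict(body, hash=_digest(body))
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash and link; any edited or removed entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body.get("prev_hash") != prev or _digest(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True
```

A shred would append an event such as {"tenant_id": ..., "action": "ScheduleKeyDeletion", "key_arn": ..., "at": ...}, and an auditor can later run verify_chain over the exported log.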

Pitfalls & Anti-Patterns

One key for all tenants. A single data key shared across tenants, or a single CMK that encrypts tenant data directly with no per-tenant data keys, collapses the entire premise: you cannot crypto-shred one tenant without destroying all of them, and a compromised key exposes every tenant at once. The key boundary must follow the tenant boundary.

Caching the plaintext data key too long. A long-lived plaintext key cache erodes the security envelope — the data key now lives in memory far past the request that needed it, widening the window for a memory disclosure. Keep the TTL in seconds, key it by tenant_id:version, and zero the buffer on eviction.

Wildcard IAM on Decrypt. An application role with kms:Decrypt on * can read every tenant's data regardless of how cleanly the keys are split. Scope the key policy and IAM to the specific CMK in the request's tenant context, never a wildcard.

Omitting the encryption context. If you generate a data key with an EncryptionContext and then decrypt without it, or with a different one, the unwrap fails. The worse case is omitting it on both paths: KMS then unwraps the key for any caller that holds Decrypt permission, and a wrapped key or ciphertext can move between tenants undetected. Bind tenant_id into both the KMS context and the GCM AAD, and pass it identically on both paths.

Rotating by rewriting everything synchronously. Re-encrypting every row the instant you rotate a key turns a routine operation into a multi-hour table rewrite that locks tenants out. Stamp rows with a data-key version, issue a new version, and re-encrypt lazily or as a throttled backfill.

Frequently Asked Questions

Why envelope encryption instead of encrypting directly with the CMK? A CMK in a KMS cannot encrypt large payloads directly (AWS KMS, for example, caps Encrypt at 4,096 bytes of plaintext), and every operation is a network call that you pay for and that is subject to rate limits. Envelope encryption uses the CMK only to wrap a cheap, locally usable data key, so you make one KMS call per cold key instead of one per object, and the master key never leaves hardware.

Does per-tenant encryption let me query encrypted columns? Not directly — an encrypted column is opaque to the planner. For equality lookups, store a deterministic HMAC of the value as a separate blind-index column and query that; for ranges or full-text you must decrypt application-side. This is why most teams encrypt only sensitive fields and leave the rest queryable.

What is the difference between BYOK and HYOK? With BYOK you import your own key material into the provider's KMS, which then holds and uses it — the provider can still perform decrypts on your behalf. With HYOK the key never leaves your own HSM, so every decrypt is a remote call to you and the provider literally cannot read your data without your participation. HYOK is stronger isolation at a real latency and integration cost.

How does crypto-shredding satisfy GDPR erasure if the ciphertext still exists in backups? Erasure requires that the data can no longer be reconstructed, not that every byte is physically overwritten. Destroying the only key that can decrypt a tenant's ciphertext renders that ciphertext permanently unreadable everywhere it exists, including backups and replicas. This is commonly treated as effective erasure when paired with a logged key-destruction event, provided no other copy of the key survives.

How often should I rotate keys, and does rotation re-encrypt my data? Rotate the CMK on the schedule your policy or compliance regime requires (commonly annually or on suspected compromise); rotate data keys more freely. Neither forces an immediate re-encryption — the CMK only wraps data keys, and data keys are versioned per row, so existing ciphertext stays readable under its old version while new writes use the new one.