What is Data Masking?
Data masking replaces sensitive data (PII, payment cards, SSNs) with realistic-but-fake values for use in non-production environments, analytics, or when displaying to users with limited access. Use cases: (1) Developers need production-like data to test against; (2) Customer support agents should see “****4567” not the full card number; (3) Analytics exports should not contain real email addresses. Without masking, sensitive data leaks into systems that don’t need it — a major compliance and security risk.
Types of Data Masking
- Static masking: create a masked copy of the database for non-production use. Run once when provisioning dev/test environments. The masked copy is persistent.
- Dynamic masking: mask data at query time based on the requesting user’s role. The underlying data is unchanged; the masked view is presented to low-privilege users.
- Tokenization: replace sensitive value with a random token. The mapping is stored in a secure vault. Can de-tokenize with the right credentials. Used for payment cards (PCI-DSS compliance).
- Pseudonymization: deterministic substitution — same input always produces same output (HMAC-based). Allows correlation across records without revealing the original value.
Masking Techniques
import hashlib
import hmac
import secrets

# 1. Redaction: replace the value entirely
def redact_email(email):
    return '***@***.***'          # 'user@example.com' -> '***@***.***'

# 2. Partial masking: show only the last N chars
def mask_card(card):
    return '****-****-****-' + card[-4:]   # '4111111111111234' -> '****-****-****-1234'

def mask_phone(phone):
    return phone[:5] + '***' + phone[-4:]  # '+14155551234' -> '+1415***1234'

# 3. Pseudonymization (deterministic, HMAC-based)
def pseudo_email(email, secret_key):
    h = hmac.new(secret_key, email.encode(), hashlib.sha256).hexdigest()[:8]
    domain = email.split('@')[1]
    return f'user_{h}@{domain}'
# Same email always maps to the same pseudonym; a different secret yields a different mapping

# 4. Tokenization: random token, mapping held in a secure vault
def tokenize(value, vault):
    token = vault.get_token(value)      # reuse existing mapping if present
    if not token:
        token = secrets.token_urlsafe(16)
        vault.store(token, value)       # store token -> plaintext
    return token

def detokenize(token, vault):
    return vault.get(token)             # authorized lookup of the original value
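The `tokenize`/`detokenize` helpers above assume a `vault` object exposing `get_token`, `store`, and `get`. As a sketch, an in-memory stand-in (illustration only; a real deployment would use a hardened vault service or a provider like Stripe) could look like:

```python
import secrets

class InMemoryVault:
    """Toy vault: keeps token<->value maps in memory. Illustration only."""
    def __init__(self):
        self._by_token = {}   # token -> plaintext
        self._by_value = {}   # plaintext -> token

    def get_token(self, value):
        return self._by_value.get(value)

    def store(self, token, value):
        self._by_token[token] = value
        self._by_value[value] = token

    def get(self, token):
        return self._by_token.get(token)

def tokenize(value, vault):
    token = vault.get_token(value)
    if not token:
        token = secrets.token_urlsafe(16)
        vault.store(token, value)
    return token

vault = InMemoryVault()
t1 = tokenize('4111111111111234', vault)
t2 = tokenize('4111111111111234', vault)  # same value -> same token reused
print(t1 == t2)          # True
print(vault.get(t1))     # 4111111111111234
```

The token itself carries no information about the card number; compromise of the token alone reveals nothing without the vault.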
Dynamic Masking at the DB Layer
-- PostgreSQL: masking view based on the current user's role
CREATE VIEW users_masked AS
SELECT
    user_id,
    CASE WHEN current_setting('app.user_role') IN ('admin', 'finance')
         THEN email
         ELSE regexp_replace(email, '(.{2}).+(@.+)', '\1***\2')
    END AS email,
    name,
    CASE WHEN current_setting('app.user_role') = 'admin'
         THEN phone
         ELSE regexp_replace(phone, '(\+\d{4})\d+(\d{4})', '\1***\2')
    END AS phone
FROM users;

-- Application sets the role before querying
SET LOCAL app.user_role = 'support';
SELECT * FROM users_masked WHERE user_id = :id;
-- support agent sees: us***@example.com (masked email)
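The masking patterns in the view can be sanity-checked outside the database. A quick Python equivalent (the regex strings mirror the SQL above; note that Python's `re` syntax differs slightly from PostgreSQL's POSIX-style regexes):

```python
import re

def mask_email_partial(email):
    # Keep the first 2 chars and the domain, as in the users_masked view
    return re.sub(r'(.{2}).+(@.+)', r'\1***\2', email)

def mask_phone_partial(phone):
    # Keep '+' plus 4 leading digits and the last 4 digits
    return re.sub(r'(\+\d{4})\d+(\d{4})', r'\1***\2', phone)

print(mask_email_partial('user@example.com'))  # us***@example.com
print(mask_phone_partial('+14155551234'))      # +1415***1234
```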
Static Masking Pipeline for Dev/Test
# Run after cloning the production DB into the dev environment
def mask_database(db_conn):
    # Masking rules per table/column
    rules = {
        'users': {
            'email': lambda v: pseudo_email(v, SECRET),
            'phone': lambda v: mask_phone(v),
            'ssn':   lambda v: '***-**-' + v[-4:],
            'dob':   lambda v: v.replace(day=1),  # keep month/year, obscure day
        },
        'payment_methods': {
            'card_number': lambda v: '****-****-****-' + v[-4:],
            'cvv':         lambda v: '***',
        },
    }
    for table, columns in rules.items():
        rows = db_conn.query(f'SELECT * FROM {table}')
        for row in rows:
            updates = {col: fn(row[col])
                       for col, fn in columns.items()
                       if row[col] is not None}
            db_conn.update(table, row['id'], updates)
Key Design Decisions
- Pseudonymization over random masking for analytics — deterministic mapping preserves analytical relationships (all rows with same email still correlate)
- Dynamic masking via DB view — single source of truth; masking logic centralized, not scattered across application code
- Tokenization for payment cards (PCI-DSS) — tokens can be de-tokenized for authorized operations; raw card numbers never stored in application DB
- Column-level masking rules — mask only what’s necessary; over-masking breaks dev/test workflows
- Secret key rotation — re-running pseudonymization with a new key produces different pseudonyms; rotate annually
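Because pseudonymization is deterministic, masked tables still join on the pseudonymized column, and a key change rewires the whole mapping. A small sketch (reusing the HMAC-based `pseudo_email` from above; the sample data and key are made up):

```python
import hashlib
import hmac

def pseudo_email(email, secret_key):
    h = hmac.new(secret_key, email.encode(), hashlib.sha256).hexdigest()[:8]
    return f"user_{h}@{email.split('@')[1]}"

SECRET = b'dev-masking-key'   # hypothetical key; rotate per policy

orders  = [{'email': 'a@x.com', 'total': 10}, {'email': 'b@x.com', 'total': 5}]
refunds = [{'email': 'a@x.com', 'amount': 3}]

masked_orders  = [{**r, 'email': pseudo_email(r['email'], SECRET)} for r in orders]
masked_refunds = [{**r, 'email': pseudo_email(r['email'], SECRET)} for r in refunds]

# Rows for the same customer still correlate after masking
joined = [o for o in masked_orders
          if any(o['email'] == r['email'] for r in masked_refunds)]
print(len(joined))   # 1: a@x.com's order still matches its refund

# A new key produces a different mapping, breaking linkage to old exports
print(pseudo_email('a@x.com', SECRET) != pseudo_email('a@x.com', b'new-key'))  # True
```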