Data Masking System Low-Level Design

What is Data Masking?

Data masking replaces sensitive data (PII, payment cards, SSNs) with realistic-but-fake values for use in non-production environments, analytics, or when displaying to users with limited access. Use cases: (1) Developers need production-like data to test against; (2) Customer support agents should see “****4567” not the full card number; (3) Analytics exports should not contain real email addresses. Without masking, sensitive data leaks into systems that don’t need it — a major compliance and security risk.

Types of Data Masking

  • Static masking: create a masked copy of the database for non-production use. Run once when provisioning dev/test environments. The masked copy is persistent.
  • Dynamic masking: mask data at query time based on the requesting user’s role. The underlying data is unchanged; the masked view is presented to low-privilege users.
  • Tokenization: replace sensitive value with a random token. The mapping is stored in a secure vault. Can de-tokenize with the right credentials. Used for payment cards (PCI-DSS compliance).
  • Pseudonymization: deterministic substitution — same input always produces same output (HMAC-based). Allows correlation across records without revealing the original value.

Masking Techniques

import hashlib, hmac, secrets

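# 1. Redaction: replace entirely
def mask_email(email):
    return '***@***.***'
# mask_email('user@example.com') → '***@***.***'

# 2. Partial masking: show last N chars
def mask_card(card):
    return '****-****-****-' + card[-4:]
# mask_card('4111111111111234') → '****-****-****-1234'

def mask_phone(phone):
    # Assumes a format like +14155551234: keep the first 5 and last 4 characters
    return phone[:5] + '***' + phone[-4:]
# mask_phone('+14155551234') → '+1415***1234'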

# 3. Pseudonymization (deterministic, HMAC-based)
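def pseudo_email(email, secret_key):
    # secret_key must be bytes and should live outside the masked environment
    h = hmac.new(secret_key, email.encode(), hashlib.sha256).hexdigest()[:8]
    domain = email.split('@')[1]   # the real domain is kept, so it stays visible
    return f'user_{h}@{domain}'
# Same email always maps to same pseudonym; different secret = different mapping
# 8 hex chars is only 32 bits, so large tables can see collisions; lengthen if uniqueness matters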

# 4. Tokenization
def tokenize(value, vault):
    token = vault.get_token(value)      # check existing mapping
    if not token:
        token = secrets.token_urlsafe(16)
        vault.store(token, value)        # store token → plaintext
    return token

def detokenize(token, vault):
    return vault.get(token)
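
The tokenize and detokenize functions above rely on a vault object with get_token, store, and get methods, but the vault itself is not shown. A minimal in-memory sketch of that interface might look like the following. InMemoryVault is a hypothetical name and the dict storage is an assumption for illustration; a real vault would be an encrypted, access-controlled service.

```python
import secrets


class InMemoryVault:
    """Hypothetical dict-backed vault for illustration only; not secure storage."""

    def __init__(self):
        self._by_token = {}   # token -> plaintext
        self._by_value = {}   # plaintext -> token, so repeated values reuse a token

    def get_token(self, value):
        return self._by_value.get(value)

    def store(self, token, value):
        self._by_token[token] = value
        self._by_value[value] = token

    def get(self, token):
        return self._by_token.get(token)


def tokenize(value, vault):
    token = vault.get_token(value)          # check existing mapping
    if token is None:
        token = secrets.token_urlsafe(16)   # random, so it reveals nothing about the value
        vault.store(token, value)
    return token


def detokenize(token, vault):
    return vault.get(token)


vault = InMemoryVault()
t1 = tokenize('4111111111111234', vault)
t2 = tokenize('4111111111111234', vault)
print(t1 == t2)                 # True: the same card reuses its token
print(detokenize(t1, vault))    # 4111111111111234
```

Unlike the HMAC pseudonym, the token carries no information derived from the card number, so everything depends on who can reach the vault.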

Dynamic Masking at the DB Layer

-- PostgreSQL: masking view based on current user's role
CREATE VIEW users_masked AS
SELECT
    user_id,
    CASE WHEN current_setting('app.user_role') IN ('admin', 'finance')
         THEN email
         ELSE regexp_replace(email, '(.{2}).+(@.+)', '\1***\2')
    END AS email,
    name,
    CASE WHEN current_setting('app.user_role') = 'admin'
         THEN phone
         ELSE regexp_replace(phone, '(\+\d{4})\d+(\d{4})', '\1***\2')
    END AS phone
FROM users;

-- Application sets role before query (SET LOCAL only lasts for the current transaction)
BEGIN;
SET LOCAL app.user_role = 'support';
SELECT * FROM users_masked WHERE user_id = :id;
COMMIT;
-- support agent sees: us***@example.com (masked email)
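
To check what the view's email and phone patterns produce (with their regex backslashes written out), the same expressions can be run through Python's re module, which behaves the same way for these inputs. This is only a sketch for exploring behaviour: mask_email_view and mask_phone_view are hypothetical names, and the role lookup is simplified to a function argument.

```python
import re

EMAIL_RE = re.compile(r'(.{2}).+(@.+)')
PHONE_RE = re.compile(r'(\+\d{4})\d+(\d{4})')


def mask_email_view(email, role):
    if role in ('admin', 'finance'):
        return email
    # count=1 mirrors regexp_replace without the 'g' flag (first match only)
    return EMAIL_RE.sub(r'\1***\2', email, count=1)


def mask_phone_view(phone, role):
    if role == 'admin':
        return phone
    return PHONE_RE.sub(r'\1***\2', phone, count=1)


print(mask_email_view('user@example.com', 'support'))   # us***@example.com
print(mask_phone_view('+14155551234', 'support'))       # +1415***1234
print(mask_email_view('user@example.com', 'finance'))   # user@example.com
# Edge case: a local part of two characters or fewer does not match the
# pattern, so the address comes back unmasked
print(mask_email_view('ab@example.com', 'support'))     # ab@example.com
```

The last call shows a gap worth closing in the view: when the pattern does not match, the original value passes through, so the ELSE branch may need a fallback that masks the whole local part.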

Static Masking Pipeline for Dev/Test

# Run after cloning production DB to dev environment
def mask_database(db_conn):
    # Define masking rules per table/column
    rules = {
        'users': {
            'email':    lambda v: pseudo_email(v, SECRET),  # SECRET: bytes key kept outside the dev environment
            'phone':    mask_phone,
            'ssn':      lambda v: '***-**-' + v[-4:],
            'dob':      lambda v: v.replace(day=1),  # keep month/year, obscure day
        },
        'payment_methods': {
            'card_number': lambda v: '****-****-****-' + v[-4:],
            'cvv':         lambda v: '***',
        }
    }

    for table, columns in rules.items():
        rows = db_conn.query(f'SELECT * FROM {table}')
        for row in rows:
            updates = {col: fn(row[col]) for col, fn in columns.items()
                       if row[col] is not None}
            db_conn.update(table, row['id'], updates)
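
The loop above goes through a db_conn wrapper whose API is not shown. As a runnable sketch under stated assumptions, the same rule table can be applied with Python's built-in sqlite3 module. The schema, the sample row, and the apply_rules name are hypothetical, and a production run against a large table would likely batch its reads and writes.

```python
import sqlite3


def apply_rules(conn, rules, pk='id'):
    """Apply per-column masking functions to every row of each table in rules."""
    for table, columns in rules.items():
        col_list = ', '.join(columns)
        # Table and column names come from the trusted rules dict, not user input
        rows = conn.execute(f'SELECT {pk}, {col_list} FROM {table}').fetchall()
        for row in rows:
            row_id, values = row[0], row[1:]
            updates = {col: fn(val)
                       for (col, fn), val in zip(columns.items(), values)
                       if val is not None}           # leave NULLs as NULL
            if not updates:
                continue
            assignments = ', '.join(f'{col} = ?' for col in updates)
            conn.execute(f'UPDATE {table} SET {assignments} WHERE {pk} = ?',
                         (*updates.values(), row_id))
    conn.commit()   # commit once, after all tables are processed


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE users (id INTEGER PRIMARY KEY, ssn TEXT, phone TEXT)')
conn.execute("INSERT INTO users VALUES (1, '123-45-6789', NULL)")

rules = {'users': {'ssn':   lambda v: '***-**-' + v[-4:],
                   'phone': lambda v: v[:5] + '***' + v[-4:]}}
apply_rules(conn, rules)
print(conn.execute('SELECT ssn, phone FROM users').fetchall())
# [('***-**-6789', None)]
```

Selecting only the primary key and the columns being masked keeps the read smaller than SELECT *, and values are passed as query parameters rather than formatted into the SQL string.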

Key Design Decisions

  • Pseudonymization over random masking for analytics — deterministic mapping preserves analytical relationships (all rows with same email still correlate)
  • Dynamic masking via DB view — single source of truth; masking logic centralized, not scattered across application code
  • Tokenization for payment cards (PCI-DSS) — tokens can be de-tokenized for authorized operations; raw card numbers never stored in application DB
  • Column-level masking rules — mask only what’s necessary; over-masking breaks dev/test workflows
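  • Secret key rotation: re-running pseudonymization with a new key produces different pseudonyms, so data masked under the old key no longer correlates with data masked under the new one; rotate annually and re-mask together any datasets that must stay joinable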
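
The trade-off between deterministic pseudonyms and key rotation in the list above can be checked directly. The sketch below assumes two placeholder keys, OLD_KEY and NEW_KEY, and a hypothetical helper named pseudonym; real keys would come from a secrets manager rather than source code.

```python
import hashlib
import hmac

OLD_KEY = b'hypothetical-key-2024'   # placeholder values, not real secrets
NEW_KEY = b'hypothetical-key-2025'


def pseudonym(value, key, length=16):
    """Keyed, deterministic pseudonym: same value and key give the same output."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()[:length]


a = pseudonym('user@example.com', OLD_KEY)
b = pseudonym('user@example.com', OLD_KEY)
c = pseudonym('user@example.com', NEW_KEY)

print(a == b)   # True: rows with the same email still correlate under one key
print(a == c)   # False: after rotation, old and new exports no longer join
```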
