What is Data Masking?
Data masking replaces sensitive data (PII, payment card numbers, SSNs) with obscured or realistic-but-fake values for use in non-production environments, in analytics, or when displaying data to users with limited access. Use cases: (1) developers need production-like data to test against; (2) customer support agents should see “****4567”, not the full card number; (3) analytics exports should not contain real email addresses. Without masking, sensitive data leaks into systems that don’t need it, which is a major compliance and security risk.
Types of Data Masking
- Static masking: create a masked copy of the database for non-production use. Run once when provisioning dev/test environments. The masked copy is persistent.
- Dynamic masking: mask data at query time based on the requesting user’s role. The underlying data is unchanged; the masked view is presented to low-privilege users.
- Tokenization: replace the sensitive value with a random token. The mapping is stored in a secure vault, and callers with the right credentials can de-tokenize. Used for payment cards (PCI-DSS compliance).
- Pseudonymization: deterministic substitution — same input always produces same output (HMAC-based). Allows correlation across records without revealing the original value.
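To make the static/dynamic distinction concrete, here is a minimal application-side sketch of dynamic masking. The role names, the `UNMASKED_FIELDS` policy, and the `read_record` helper are all hypothetical; the point is only that the stored record is never modified and each role gets its own view at read time.

```python
# Hypothetical policy: which roles may see which fields unmasked.
UNMASKED_FIELDS = {
    "admin": {"email", "phone"},
    "finance": {"email"},
    "support": set(),
}

def mask_value(value: str) -> str:
    """Keep the last four characters and star out the rest."""
    return "*" * max(len(value) - 4, 0) + value[-4:]

def read_record(record: dict, role: str) -> dict:
    """Return a masked copy of `record` for `role`; the stored record is untouched."""
    allowed = UNMASKED_FIELDS.get(role, set())  # unknown roles see everything masked
    view = dict(record)
    for field in ("email", "phone"):
        if field not in allowed and view.get(field) is not None:
            view[field] = mask_value(view[field])
    return view

stored = {"user_id": 7, "email": "user@example.com", "phone": "+14155551234"}
support_view = read_record(stored, "support")
admin_view = read_record(stored, "admin")
```

A static masking job would instead overwrite `stored` itself in a cloned database, so no role could ever recover the original values from that copy.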
Masking Techniques
    import hashlib
    import hmac
    import secrets

    # 1. Redaction: replace entirely
    def mask_email(email):
        return '***@***.***'
    # mask_email('user@example.com') → '***@***.***'

    # 2. Partial masking: show last N chars (one possible implementation)
    def mask_card(card_number):
        return '****-****-****-' + card_number[-4:]
    # mask_card('4111111111111234') → '****-****-****-1234'

    def mask_phone(phone):
        return phone[:5] + '***' + phone[-4:]
    # mask_phone('+14155551234') → '+1415***1234'

    # 3. Pseudonymization (deterministic, HMAC-based)
    def pseudo_email(email, secret_key):
        # secret_key must be bytes
        h = hmac.new(secret_key, email.encode(), hashlib.sha256).hexdigest()[:8]
        domain = email.split('@')[1]
        return f'user_{h}@{domain}'
    # Same email always maps to same pseudonym; different secret = different mapping

    # 4. Tokenization (vault is any object exposing get_token, store, and get)
    def tokenize(value, vault):
        token = vault.get_token(value)       # check existing mapping
        if not token:
            token = secrets.token_urlsafe(16)
            vault.store(token, value)        # store token → plaintext
        return token

    def detokenize(token, vault):
        return vault.get(token)
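The tokenization functions above rely on a vault object that the article does not define. As a rough illustration, the sketch below supplies a toy in-memory vault with the three methods those functions call (`get_token`, `store`, `get`). The `InMemoryVault` class is an assumption for demonstration only; a real vault would be an encrypted, access-controlled service.

```python
import secrets

class InMemoryVault:
    """Toy stand-in for a token vault; not suitable for real card data."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def get_token(self, value):
        # Return the existing token for this value, or None if never tokenized.
        return self._value_to_token.get(value)

    def store(self, token, value):
        # Record both directions so repeated tokenization reuses the same token.
        self._token_to_value[token] = value
        self._value_to_token[value] = token

    def get(self, token):
        # Look up the original value; returns None for unknown tokens.
        return self._token_to_value.get(token)

def tokenize(value, vault):
    token = vault.get_token(value)
    if not token:
        token = secrets.token_urlsafe(16)  # random, carries no information about value
        vault.store(token, value)
    return token

def detokenize(token, vault):
    return vault.get(token)

vault = InMemoryVault()
card = "4111111111111234"
token = tokenize(card, vault)
```

Unlike the HMAC pseudonym, the token cannot be recomputed from the input: without access to the vault there is no way to get from token to card number or back.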
Dynamic Masking at the DB Layer
    -- PostgreSQL: masking view based on the current user's role
    CREATE VIEW users_masked AS
    SELECT
        user_id,
        CASE WHEN current_setting('app.user_role') IN ('admin', 'finance')
             THEN email
             ELSE regexp_replace(email, '(.{2}).+(@.+)', '\1***\2')
        END AS email,
        name,
        CASE WHEN current_setting('app.user_role') = 'admin'
             THEN phone
             ELSE regexp_replace(phone, '(\+\d{4})\d+(\d{4})', '\1***\2')
        END AS phone
    FROM users;

    -- Application sets the role inside a transaction before querying
    -- (SET LOCAL only takes effect within a transaction block)
    BEGIN;
    SET LOCAL app.user_role = 'support';
    SELECT * FROM users_masked WHERE user_id = :id;
    COMMIT;
    -- support agent sees: us***@example.com (masked email)
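To check what the view's two patterns actually produce, the same regular expressions can be tried in Python's `re` module. This is only an approximation of the database behavior: Python's regex engine is not PostgreSQL's, although these particular patterns use syntax common to both. The function names are illustrative.

```python
import re

def mask_email(email: str) -> str:
    # Keep the first two characters and the domain, as in the view's email rule.
    return re.sub(r"(.{2}).+(@.+)", r"\1***\2", email)

def mask_phone(phone: str) -> str:
    # Keep "+" plus four digits and the last four digits, as in the view's phone rule.
    return re.sub(r"(\+\d{4})\d+(\d{4})", r"\1***\2", phone)

print(mask_email("user@example.com"))   # us***@example.com
print(mask_phone("+14155551234"))       # +1415***1234
print(mask_email("ab@example.com"))     # ab@example.com (no match, left unmasked)
```

The last call shows a gap worth testing for: an email whose local part has only two characters does not match the pattern, so it passes through unmasked. A production rule would need a fallback for inputs the pattern does not cover.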
Static Masking Pipeline for Dev/Test
    # Run after cloning production DB to dev environment.
    # Assumes pseudo_email and mask_phone from above, SECRET holding the bytes
    # HMAC key, and a db_conn object exposing query() and update() helpers.
    def mask_database(db_conn):
        # Define masking rules per table/column
        rules = {
            'users': {
                'email': lambda v: pseudo_email(v, SECRET),
                'phone': lambda v: mask_phone(v),
                'ssn': lambda v: '***-**-' + v[-4:],
                'dob': lambda v: v.replace(day=1),  # keep month/year, obscure day
            },
            'payment_methods': {
                'card_number': lambda v: '****-****-****-' + v[-4:],
                'cvv': lambda v: '***',
            }
        }
        for table, columns in rules.items():
            # Table names come from the rules dict above, never from user input
            rows = db_conn.query(f'SELECT * FROM {table}')
            for row in rows:
                # Skip NULL columns so the masking functions never receive None
                updates = {col: fn(row[col]) for col, fn in columns.items()
                           if row[col] is not None}
                if updates:
                    db_conn.update(table, row['id'], updates)
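The pipeline above depends on an unspecified `db_conn` wrapper. As a self-contained sketch of the same loop, the version below runs against an in-memory SQLite database using only the standard library. The table layout, the `SECRET` value, and the reduced rule set are assumptions chosen to keep the example small.

```python
import hashlib
import hmac
import sqlite3

SECRET = b"demo-only-key"  # assumption: in practice, load from a secret manager

def pseudo_email(email: str, secret_key: bytes) -> str:
    digest = hmac.new(secret_key, email.encode(), hashlib.sha256).hexdigest()[:8]
    return f"user_{digest}@{email.split('@')[1]}"

# Masking rules per table and column (names are illustrative).
RULES = {
    "users": {
        "email": lambda v: pseudo_email(v, SECRET),
        "ssn": lambda v: "***-**-" + v[-4:],
    },
}

def mask_database(conn: sqlite3.Connection) -> None:
    conn.row_factory = sqlite3.Row
    for table, columns in RULES.items():
        # Table and column names come from RULES (trusted code), never user input.
        rows = conn.execute(f"SELECT * FROM {table}").fetchall()
        for row in rows:
            updates = {col: fn(row[col]) for col, fn in columns.items()
                       if row[col] is not None}
            if not updates:
                continue
            assignments = ", ".join(f"{col} = ?" for col in updates)
            conn.execute(f"UPDATE {table} SET {assignments} WHERE id = ?",
                         (*updates.values(), row["id"]))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, ssn TEXT)")
conn.executemany("INSERT INTO users (email, ssn) VALUES (?, ?)",
                 [("alice@example.com", "123-45-6789"), ("bob@example.com", None)])
mask_database(conn)
masked = [tuple(r) for r in conn.execute("SELECT id, email, ssn FROM users ORDER BY id")]
```

Row-by-row updates keep the sketch readable but would be slow on large tables; a real job would more likely batch the updates or push the masking into SQL.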
Key Design Decisions
- Pseudonymization over random masking for analytics: a deterministic mapping preserves analytical relationships (all rows with the same email still correlate).
- Dynamic masking via a DB view: a single source of truth, with masking logic centralized rather than scattered across application code.
- Tokenization for payment cards (PCI-DSS): tokens can be de-tokenized for authorized operations, and raw card numbers are never stored in the application DB.
- Column-level masking rules: mask only what’s necessary, because over-masking breaks dev/test workflows.
- Secret key rotation: re-running pseudonymization with a new key produces different pseudonyms, so data masked under the old key no longer correlates with data masked under the new one. Rotate annually, and re-mask any datasets that must still join.
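The trade-off between correlation and key rotation can be seen in a few lines. This sketch reuses the HMAC-based `pseudo_email` approach from earlier; the key values and the `orders`/`tickets` datasets are made up for illustration.

```python
import hashlib
import hmac

def pseudo_email(email: str, secret_key: bytes) -> str:
    digest = hmac.new(secret_key, email.encode(), hashlib.sha256).hexdigest()[:8]
    return f"user_{digest}@{email.split('@')[1]}"

old_key = b"key-2024"  # illustrative keys only
new_key = b"key-2025"

# Two exports masked with the same key: the pseudonyms match, so they still join.
orders = [pseudo_email("alice@example.com", old_key)]
tickets = [pseudo_email("alice@example.com", old_key)]
joinable = orders[0] == tickets[0]

# After rotation, freshly masked data no longer lines up with the old export.
rotated = pseudo_email("alice@example.com", new_key)
still_joinable = rotated == orders[0]
```

One caveat with this scheme: truncating the digest to 8 hex characters leaves only 32 bits, so two different emails on the same domain can plausibly collide once a dataset reaches tens of thousands of distinct addresses. Keeping more of the digest reduces that risk.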