Overview
A data anonymization service transforms raw datasets containing personally identifiable information (PII) into privacy-safe versions that can be used for analytics, testing, ML training, and sharing with third parties. In a low-level design interview, structure your answer around the pipeline stages: detect, classify, anonymize, validate, audit.
PII Detection
Regex patterns are the first line of detection. Maintain a library of patterns for common PII types: email addresses, phone numbers (with country-code variants), Social Security Numbers, credit card numbers (Luhn-validated), passport numbers, IP addresses, and dates of birth. Regex is fast and deterministic but cannot detect freeform text like names embedded in sentences or contextual PII (e.g., a combination of ZIP code + age + gender that uniquely identifies someone).
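A minimal sketch of regex-based detection with Luhn validation, using only the standard library. The patterns below are illustrative, not production-grade — real pattern libraries cover far more variants (international phone formats, obfuscated emails, and so on):

```python
import re

# Illustrative patterns only; a production library maintains many more.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def luhn_valid(number: str) -> bool:
    """Luhn checksum: filters out random digit runs that merely look like cards."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:       # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def detect_pii(text: str) -> list[tuple[str, str]]:
    """Return (pii_type, matched_span) pairs found in free text."""
    hits = []
    for pii_type, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group()
            if pii_type == "credit_card" and not luhn_valid(value):
                continue  # drop matches that fail the checksum
            hits.append((pii_type, value))
    return hits
```

The Luhn filter is what keeps the credit-card pattern's false-positive rate tolerable: without it, any 13-16 digit run (order IDs, timestamps) would be flagged.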
ML Named Entity Recognition (NER) handles what regex cannot. A fine-tuned NER model (spaCy, Flair, or a transformer-based model) tags spans in unstructured text as PERSON, LOCATION, ORG, DATE, etc. Run the model on text fields that cannot be fully covered by regex. The trade-off is latency and compute cost — run NER asynchronously in a background worker pool, not inline on the hot path.
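The off-the-hot-path pattern can be sketched with a worker pool. The `ner_tag` stub below stands in for a real model call (spaCy, Flair, or a transformer) — its tagging logic is a placeholder, not a real model's API:

```python
from concurrent.futures import ThreadPoolExecutor

def ner_tag(text: str) -> list[tuple[str, str]]:
    """Stub model: tags title-cased tokens as PERSON. Placeholder logic only;
    a real deployment would invoke a loaded NER model here."""
    return [(tok, "PERSON") for tok in text.split() if tok.istitle()]

def scan_fields_async(records: list[dict], text_field: str, workers: int = 4):
    """Fan text fields out to a worker pool so NER latency stays off the
    request path; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: ner_tag(r[text_field]), records))
```

In a real system the pool would be a separate background service consuming from a queue, but the shape is the same: the ingest path enqueues, the workers tag.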
Schema registry field tagging — for structured datasets (database tables, Avro/Protobuf schemas, Parquet files), register each field with a PII classification label in a schema registry. Upstream data producers tag fields at schema definition time: email: pii:email, user_name: pii:name, purchase_amount: non-pii. The anonymization pipeline reads field tags from the registry and applies the appropriate technique without scanning field contents. This is the most reliable approach for structured data and eliminates per-record detection overhead.
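A sketch of tag-driven rule resolution, using an in-memory dict as a hypothetical registry (in production this would be a lookup against a real schema registry service):

```python
# Hypothetical in-memory stand-in for a schema registry's field tags.
FIELD_TAGS = {
    "email": "pii:email",
    "user_name": "pii:name",
    "purchase_amount": "non-pii",
}

# Policy mapping each tag to an anonymization technique (illustrative).
TECHNIQUE_BY_TAG = {
    "pii:email": "pseudonymize",
    "pii:name": "mask",
    "non-pii": "pass-through",
}

def resolve_rules(schema_fields: list[str]) -> dict[str, str]:
    """Map each field to a technique from its registry tag.
    Untagged fields fall back to content scanning (regex/NER)."""
    rules = {}
    for field in schema_fields:
        tag = FIELD_TAGS.get(field, "untagged")
        rules[field] = TECHNIQUE_BY_TAG.get(tag, "scan")
    return rules
```

Note that the pipeline never reads field contents here — the per-record detection cost is paid only for fields the registry cannot vouch for.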
Anonymization Techniques
Masking — replace the field value with fixed characters (asterisks, Xs). Irreversible. Use for fields that downstream consumers never need to read (e.g., full credit card number → ****-****-****-1234 showing only the last four digits). Fast: O(1) per field.
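A small masking helper for the card-number example above (the grouping and keep-last-four choices are parameters, not fixed rules):

```python
def mask_card(pan: str, keep_last: int = 4, group: int = 4) -> str:
    """Mask all but the last `keep_last` digits, regrouped with hyphens."""
    digits = [d for d in pan if d.isdigit()]
    masked = ["*"] * (len(digits) - keep_last) + digits[-keep_last:]
    groups = ["".join(masked[i:i + group]) for i in range(0, len(masked), group)]
    return "-".join(groups)
```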
Pseudonymization — replace the PII value with a deterministic pseudonym derived from a keyed hash: HMAC-SHA256(secret_salt, email). The same input always produces the same output, so relational joins across pseudonymized datasets still work (user ID in events table matches user ID in profile table). The mapping is reversible only if you know the salt — store the salt in a secrets manager (Vault, AWS Secrets Manager), not in the application config. Rotate the salt to break existing pseudonyms when required by a deletion request.
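The keyed-hash construction is directly expressible with the standard library's `hmac` module. The salt values below are placeholders — in practice the salt is fetched from the secrets manager at runtime:

```python
import hashlib
import hmac

def pseudonymize(value: str, salt: bytes) -> str:
    """Deterministic keyed hash: same input + same salt -> same pseudonym,
    so relational joins across pseudonymized datasets still line up."""
    return hmac.new(salt, value.encode(), hashlib.sha256).hexdigest()

salt_v1 = b"fetched-from-secrets-manager"   # placeholder, never hardcode
p1 = pseudonymize("alice@example.com", salt_v1)
p2 = pseudonymize("alice@example.com", salt_v1)
assert p1 == p2                  # deterministic: joins work

# Rotating the salt severs the linkage, e.g. after a deletion request.
p3 = pseudonymize("alice@example.com", b"rotated-salt-v2")
assert p3 != p1
```

Using HMAC rather than a plain salted hash matters: with `sha256(salt + value)` an attacker who learns the salt can brute-force low-entropy inputs (emails, SSNs) offline just as easily, but HMAC at least forces key custody to be the single point of protection, which is why the salt belongs in a secrets manager.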
Tokenization — replace the PII value with a random token and store the mapping in a secure token vault. Unlike pseudonymization, there is no mathematical relationship between the token and the original value. Re-identification requires a round-trip to the vault. Used for payment card data (PCI-DSS scope reduction) where the raw value must be retrievable for authorized operations but invisible to most systems.
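A minimal in-memory vault sketch to show the contract — random token out, authorized round-trip back. A real vault is a hardened service with access control, audit logging, and durable encrypted storage; none of that is modeled here:

```python
import secrets

class TokenVault:
    """Toy vault: random tokens with a stored mapping, no math relationship
    between token and original value."""
    def __init__(self) -> None:
        self._by_token: dict[str, str] = {}
        self._by_value: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so one PAN maps to exactly one token.
        if value in self._by_value:
            return self._by_value[value]
        token = "tok_" + secrets.token_hex(16)
        self._by_token[token] = value
        self._by_value[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Authorized round-trip to recover the original value."""
        return self._by_token[token]
```

The deduplication in `tokenize` (same value, same token) preserves joinability the way pseudonymization does, but only inside systems that talk to the vault.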
Generalization — replace a precise value with a less precise one. Age 34 becomes age range 30-39. ZIP code 10001 becomes the first three digits 100**. City becomes region. This preserves analytical utility while reducing identifiability. The granularity of generalization is a parameter tuned per field based on re-identification risk scoring.
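The two generalizations from the text, with the granularity exposed as a parameter so it can be tuned per field by the risk-scoring step:

```python
def generalize_age(age: int, bucket: int = 10) -> str:
    """Collapse an exact age into a range, e.g. 34 -> '30-39'."""
    lo = (age // bucket) * bucket
    return f"{lo}-{lo + bucket - 1}"

def generalize_zip(zip_code: str, keep: int = 3) -> str:
    """Keep a ZIP prefix and star out the rest, e.g. '10001' -> '100**'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)
```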
Suppression — remove the field entirely. Used when no downstream consumer needs the field and its presence creates unnecessary risk. Suppressed fields are dropped at pipeline ingestion before any storage occurs.
Pipeline Orchestration for Bulk Anonymization
For bulk dataset anonymization (nightly export, historical backfill), use a distributed processing framework: Spark for large-scale batch, or a worker pool of containerized jobs for medium scale. The pipeline reads source data in partitions, applies the field-level anonymization rules from the schema registry, and writes anonymized output to a separate storage location. Never overwrite source data in place.
Pipeline stages: (1) schema resolution — fetch field tags from registry, (2) PII detection — apply regex and NER on untagged text fields, (3) technique application — apply masking/pseudonymization/tokenization/generalization/suppression per field tag, (4) output validation — sample anonymized records and run PII detection again to verify no PII leaked through, (5) delivery — write to destination and publish a completion event.
Parameterize the pipeline so the same code handles batch (all records) and streaming (Kafka consumer applying anonymization per message in real time) by swapping the source/sink adapters.
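The five stages and the adapter-swap idea can be sketched as a single driver function whose source, sink, and stage implementations are all injected. Every name here is illustrative; the demo stubs stand in for the registry client, rule engine, and leak re-scan:

```python
def run_pipeline(read_source, write_sink, resolve_rules, apply_rules, validate):
    """Batch vs. streaming is just a different read_source/write_sink pair;
    the stage logic in the middle is identical."""
    rules = resolve_rules()                                     # (1) schema resolution
    for batch in read_source():
        out = [apply_rules(rec, rules) for rec in batch]        # (2)+(3) detect + apply
        if not all(validate(rec) for rec in out):               # (4) re-scan for leaks
            raise RuntimeError("PII leaked through anonymization")
        write_sink(out)                                         # (5) delivery

# Minimal demo: mask the email field, pass everything else through.
sink: list = []
run_pipeline(
    read_source=lambda: [[{"email": "a@b.com", "amount": 9}]],
    write_sink=sink.extend,
    resolve_rules=lambda: {"email": "mask"},
    apply_rules=lambda rec, rules: {
        k: ("***" if rules.get(k) == "mask" else v) for k, v in rec.items()
    },
    validate=lambda rec: "@" not in str(rec.get("email", "")),
)
```

The key design choice is that stage (4) reuses the detection logic from stage (2) as an independent check on the output, so a mis-tagged field fails loudly instead of shipping PII downstream.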
Re-Identification Risk: k-Anonymity Scoring
A dataset satisfies k-anonymity if every record is indistinguishable from at least k-1 other records with respect to quasi-identifier fields (fields that are not PII individually but can identify someone in combination: ZIP, age, gender). Compute k for each equivalence class after generalization. If any class has k less than the target threshold (typically 5 or 10), increase generalization on quasi-identifiers — coarsen the age bucket, suppress the ZIP — until the threshold is met.
k-Anonymity alone does not protect against attribute disclosure (all records in a group share the same sensitive value). l-Diversity additionally requires that each equivalence class contains at least l distinct values for sensitive attributes. Implement both checks in the validation stage.
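Both validation checks reduce to grouping records by the quasi-identifier tuple. A sketch, assuming records are dicts keyed by field name:

```python
from collections import Counter

def k_anonymity(records: list[dict], quasi_ids: list[str]) -> int:
    """Smallest equivalence-class size over the quasi-identifier tuple.
    The dataset is k-anonymous for any k up to this value."""
    classes = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return min(classes.values())

def l_diversity(records: list[dict], quasi_ids: list[str], sensitive: str) -> int:
    """Fewest distinct sensitive values in any equivalence class."""
    groups: dict[tuple, set] = {}
    for r in records:
        key = tuple(r[q] for q in quasi_ids)
        groups.setdefault(key, set()).add(r[sensitive])
    return min(len(values) for values in groups.values())
```

In the validation stage, if `k_anonymity(...)` falls below the threshold, the response is to re-run generalization with coarser buckets and check again — a loop, not a one-shot pass.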
Differential Privacy for Analytics
For aggregate analytics outputs (counts, sums, averages published as dashboards or APIs), apply differential privacy by adding calibrated Laplace or Gaussian noise to query results. The privacy budget epsilon controls the trade-off: small epsilon means more noise and stronger privacy, large epsilon means less noise and weaker privacy. Track epsilon consumption per dataset — once the budget is exhausted, further queries are rejected or require a new data release cycle.
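For intuition only, here is the Laplace mechanism for a counting query plus a budget tracker, using just the standard library (production systems should use a vetted DP library for sensitivity analysis and noise calibration):

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) via the inverse-CDF method."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(true_count: int, epsilon: float) -> float:
    """A counting query has L1 sensitivity 1, so noise scale = 1/epsilon.
    Smaller epsilon -> larger scale -> more noise -> stronger privacy."""
    return true_count + laplace_noise(1.0 / epsilon)

class PrivacyBudget:
    """Track cumulative epsilon per dataset; refuse queries past the cap."""
    def __init__(self, total_epsilon: float) -> None:
        self.remaining = total_epsilon

    def spend(self, epsilon: float) -> None:
        if epsilon > self.remaining:
            raise PermissionError("privacy budget exhausted")
        self.remaining -= epsilon
```

Composition is what makes the budget tracker necessary: each noisy answer leaks a little, the epsilons add up, and once the total is spent further queries must be rejected rather than answered with fresh noise.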
Libraries: Google's differential-privacy library and OpenDP. These handle sensitivity calculation and noise calibration — do not implement noise addition manually in production.
Audit Trail and Compliance Reporting
Every anonymization job must produce an immutable audit record: job ID, timestamp, source dataset, anonymization rules applied per field, record count processed, validation results, output location, and operator identity. Store audit records in an append-only log (no UPDATE or DELETE). Use hash-chaining or write to WORM storage to provide tamper evidence.
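The hash-chaining idea in a minimal form: each entry stores the hash of its predecessor, so editing any past record breaks verification from that point forward. Field names are illustrative:

```python
import hashlib
import json

GENESIS = "0" * 64  # sentinel "previous hash" for the first entry

class AuditLog:
    """Append-only, hash-chained audit records (toy in-memory version;
    production would write to WORM storage or an append-only store)."""
    def __init__(self) -> None:
        self.entries: list[dict] = []

    def append(self, record: dict) -> None:
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        payload = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((prev + payload).encode()).hexdigest()
        self.entries.append({"record": record, "prev": prev, "hash": digest})

    def verify(self) -> bool:
        """Recompute the chain; any tampered entry fails the check."""
        prev = GENESIS
        for e in self.entries:
            payload = json.dumps(e["record"], sort_keys=True)
            expected = hashlib.sha256((prev + payload).encode()).hexdigest()
            if e["prev"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True
```

Tamper evidence, not tamper prevention: an attacker with write access could rewrite the whole chain, which is why anchoring the latest hash externally (WORM storage, a separate trust domain) completes the design.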
Compliance reports are generated from audit records on demand: for GDPR Article 30 (records of processing activities), pull all jobs that processed a given data subject ID. For auditor review, export the schema registry field tags and the anonymization technique applied per field in a structured format. The audit trail also supports right-to-erasure workflows: when a user requests deletion, identify all pseudonymized datasets derived from their data and rotate the salt or delete the token vault entries, rendering the pseudonyms unresolvable.
Frequently Asked Questions
What is a data anonymization service and why is it needed?
A data anonymization service transforms personally identifiable information (PII) so that individuals cannot be identified from the resulting dataset, enabling organizations to share, analyze, or retain data in compliance with regulations such as GDPR, CCPA, and HIPAA. It is needed because raw production data often must flow to analytics pipelines, machine-learning training sets, third-party vendors, or lower environments (staging, QA) where unrestricted PII exposure would create legal and reputational risk.
What is the difference between masking, pseudonymization, and tokenization?
Data masking replaces sensitive values with fixed characters or realistic but fictitious substitutes (e.g., asterisks over a card number, or a randomly generated name), and the original value is not recoverable from the masked output. Pseudonymization replaces identifiers with a deterministic pseudonym (e.g., a keyed hash) so records can be re-linked by parties holding the key, satisfying GDPR’s reduced-risk category. Tokenization replaces sensitive values with opaque tokens stored in a secure token vault; the original value can be retrieved by authorized systems by looking up the token, making it suitable for payment card data (PCI-DSS) where the original value must sometimes be retrieved.
How do you measure re-identification risk in an anonymized dataset?
Re-identification risk is assessed by estimating the probability that an attacker with access to auxiliary information can link records back to individuals. Common metrics include k-anonymity (every record is indistinguishable from at least k-1 others on quasi-identifiers), l-diversity (sensitive attributes within each equivalence class have at least l distinct values), and t-closeness (the distribution of sensitive attributes in each class is close to the global distribution). Formal privacy models such as differential privacy provide a mathematically provable bound on information leakage and are increasingly the standard for high-risk datasets.
What is differential privacy and when do you use it?
Differential privacy (DP) is a mathematical framework that guarantees the output of a computation changes by at most a bounded factor (controlled by the privacy parameter ε) whether or not any single individual’s record is included in the input dataset. It is implemented by adding calibrated random noise (Laplace or Gaussian mechanisms) to query results or model gradients. You use DP when you need a provable, auditable privacy guarantee — for example, when publishing aggregate statistics from sensitive census or health data, training ML models on user data (federated learning with DP-SGD), or releasing synthetic datasets. The trade-off is reduced accuracy, so ε must be tuned to balance utility and privacy protection.