Why PII Scrubbing Is Hard
PII (Personally Identifiable Information) appears in many forms: structured fields like email addresses and SSNs, and unstructured free text like customer support messages containing names, addresses, and dates of birth. A scrubber must catch both with high recall, avoid false positives that corrupt legitimate data, and process high-volume log streams with low latency.
PII Types to Detect
- Structured: email address, phone number, SSN, credit card number, IP address, date of birth (in standard formats)
- Unstructured: person names, physical addresses, organization names in free text
Detection Approach: Regex + NER
Two complementary detection mechanisms are combined:
- Regex: High precision for structured PII. Patterns are deterministic and fast. Email, SSN, credit card, and US phone numbers are well-suited to regex.
- NER (Named Entity Recognition): Machine-learned model that classifies tokens in free text as PERSON, LOCATION, ORGANIZATION, DATE. Required for names and addresses that have no fixed format.
The Microsoft Presidio library combines both: pattern-based recognizers for structured PII and a spaCy NER model for unstructured text, behind a unified detection interface with a configurable confidence threshold.
Regex Patterns
Email: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
SSN: \b\d{3}-\d{2}-\d{4}\b
Credit card: \b(?:\d[ -]?){13,16}\b (followed by Luhn check)
US phone: \b(\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b
IP address: \b(?:\d{1,3}\.){3}\d{1,3}\b
Credit card detection uses the Luhn algorithm checksum after regex match to reduce false positives — random digit sequences that match the pattern but fail the checksum are discarded.
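As a rough illustration of the regex-plus-Luhn step, the following Python sketch scans text with a subset of the patterns above and discards credit card candidates that fail the checksum. The names `luhn_valid` and `scan` and the `(type, start, end)` tuple shape are assumptions made for this example. The credit card pattern here is a slightly tightened variant that always starts and ends on a digit, so a trailing space or hyphen is not captured in the span.

```python
import re

# Subset of the patterns above, as raw strings. The CREDIT_CARD pattern is a
# tightened variant (assumption): 13 to 16 digits with optional single
# separators between them, always starting and ending on a digit.
PATTERNS = {
    "EMAIL": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def scan(text: str) -> list:
    """Return (pii_type, start, end) tuples sorted by start offset."""
    detections = []
    for pii_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Regex alone over-matches digit runs; the checksum filters them.
            if pii_type == "CREDIT_CARD" and not luhn_valid(match.group()):
                continue
            detections.append((pii_type, match.start(), match.end()))
    return sorted(detections, key=lambda d: d[1])
```

The checksum only filters random digit runs; it says nothing about whether a card number was ever issued, so a Luhn-valid order ID would still be flagged.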
Redaction Strategies
Different use cases call for different redaction approaches:
- Type-label replacement: Replace detected PII with its type: user@example.com → [EMAIL]. Simple, loses all information. Suitable for logs that only need compliance.
- Pseudonymization: Deterministic hash maps PII to a consistent fake value. The same email always becomes the same pseudonym. Preserves referential integrity for analytics. HMAC-SHA256(secret_key, email) → stable pseudonym.
- Partial masking: Keep first and last character, mask the middle: j***e@e***.com. Human-recognizable for support workflows while hiding the full value.
- Format-preserving replacement: Replace with a fake value in the same format. Replace a credit card with a Luhn-valid fake card number. Replace an email with a valid-format fake email at the same domain structure.
Redaction strategy is configured per PII type, not globally.
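A minimal sketch of three of these strategies with a per-type configuration might look like the following. The mapping `STRATEGY_BY_TYPE`, the function names, and the pseudonym format (type prefix plus the first 16 hex characters of the HMAC) are all assumptions for illustration. Format-preserving replacement is left out because it needs a type-specific generator.

```python
import hashlib
import hmac

# Hypothetical per-type configuration; unknown types fall back to a type label.
STRATEGY_BY_TYPE = {
    "EMAIL": "pseudonymize",
    "SSN": "type_label",
}


def type_label(pii_type: str) -> str:
    """Replace the value with its type, e.g. [EMAIL]."""
    return f"[{pii_type}]"


def pseudonymize(pii_type: str, value: str, secret_key: bytes) -> str:
    """Map a value to a stable pseudonym using HMAC-SHA256."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncating to 16 hex characters is an arbitrary choice for readability.
    return f"{pii_type.lower()}_{digest[:16]}"


def partial_mask_email(value: str) -> str:
    """Keep the outer characters of the local part and the first of the domain."""
    local, _, domain = value.partition("@")
    name, dot, tld = domain.rpartition(".")
    masked_local = local[0] + "***" + local[-1] if len(local) > 1 else "*"
    masked_domain = f"{name[0]}***{dot}{tld}" if name else "***"
    return f"{masked_local}@{masked_domain}"


def redact(pii_type: str, value: str, secret_key: bytes) -> str:
    """Apply the strategy configured for this PII type."""
    strategy = STRATEGY_BY_TYPE.get(pii_type, "type_label")
    if strategy == "pseudonymize":
        return pseudonymize(pii_type, value, secret_key)
    if strategy == "partial_mask" and pii_type == "EMAIL":
        return partial_mask_email(value)
    return type_label(pii_type)
```

Because the pseudonym depends on the secret key, rotating the key breaks joins against previously scrubbed data, and leaking it lets anyone confirm a guessed value by recomputing the HMAC.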
Streaming Log Scrubbing Pipeline
For high-volume application logs:
- Application writes logs to stdout / local file
- Log agent (Fluentd, Vector) tails log file and publishes to Kafka topic raw-logs
- Scrubber service consumes from raw-logs in parallel consumer group
- Each message is run through the detection + redaction pipeline
- Scrubbed message published to clean-logs Kafka topic
- Downstream consumers (Elasticsearch, S3) read from clean-logs only
Latency target: under 100ms per message at p99. NER model inference is the bottleneck; batching messages through the model (32 messages per batch) improves throughput significantly.
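The batching idea can be sketched independently of Kafka. In the Python sketch below, `batched` and `scrub_stream` are hypothetical helpers, and `scrub_batch` stands in for whatever function runs one batch through the NER model and redactor; none of these names come from a specific library.

```python
from typing import Callable, Iterable, Iterator, List


def batched(messages: Iterable[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Group an incoming message stream into lists of up to `batch_size`."""
    batch: List[str] = []
    for message in messages:
        batch.append(message)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final partial batch so no message is left behind.
        yield batch


def scrub_stream(
    messages: Iterable[str],
    scrub_batch: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> Iterator[str]:
    """Run each batch through `scrub_batch` and yield results in order."""
    for batch in batched(messages, batch_size):
        yield from scrub_batch(batch)
```

A production consumer would likely also flush a partial batch after a short timeout, since waiting indefinitely for 32 messages on a quiet topic would work against the p99 latency target.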
Precision vs Recall Trade-Off
Tuning the NER confidence threshold controls the precision-recall trade-off:
- False positive (more common when tuned for high recall, i.e. a low threshold): Non-PII text is redacted. Corrupts data and frustrates users who can't read logs.
- False negative (more common when tuned for high precision, i.e. a high threshold): Real PII passes through. Compliance violation.
For compliance use cases, tune for high recall (lower threshold). For analytics use cases where data quality matters, tune for high precision. A/B test threshold changes against a labeled evaluation set before deploying.
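One way to compare thresholds against a labeled set is to compute precision and recall at each candidate value. The sketch below assumes exact-match scoring on `(message_id, start, end, type)` tuples and a hypothetical `precision_recall` helper; real evaluations often also give credit for partial span overlap.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[str, int, int, str]  # (message_id, start, end, pii_type)


def precision_recall(
    scored_spans: Iterable[Tuple[Span, float]],
    gold_spans: Set[Span],
    threshold: float,
) -> Tuple[float, float]:
    """Precision and recall of detections whose score meets the threshold."""
    predicted = {span for span, score in scored_spans if score >= threshold}
    true_positives = len(predicted & gold_spans)
    # With nothing predicted (or nothing labeled) the ratio is undefined;
    # returning 1.0 in that case is a convention chosen for this sketch.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold_spans) if gold_spans else 1.0
    return precision, recall
```

Running this over a grid of thresholds gives the precision-recall curve from which the operating point for each use case can be picked.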
Allowlist
Some strings match PII patterns but are not PII. A configurable allowlist suppresses false positives:
- support@company.com: company email, not a user's PII
- 192.168.1.1: private IP, not a user's public IP
- 000-00-0000: SSN pattern used as a placeholder in test data
Allowlist entries are checked after detections are merged and before any span is passed to the redactor, so an allowlisted string is never redacted.
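A simple allowlist filter could be sketched as follows, assuming detections are `(type, start, end)` tuples over the original text. The `ALLOWLIST` contents mirror the examples above; the case-insensitive comparison and the `apply_allowlist` name are assumptions for this example.

```python
from typing import List, Set, Tuple

Detection = Tuple[str, int, int]  # (pii_type, start, end)

# Entries are stored lowercase so the lookup can be case-insensitive.
ALLOWLIST: Set[str] = {"support@company.com", "192.168.1.1", "000-00-0000"}


def apply_allowlist(
    text: str,
    detections: List[Detection],
    allowlist: Set[str] = ALLOWLIST,
) -> List[Detection]:
    """Drop detections whose matched text is an allowlisted string."""
    return [
        (pii_type, start, end)
        for pii_type, start, end in detections
        if text[start:end].lower() not in allowlist
    ]
```

Exact-string matching keeps the filter cheap; allowing whole private IP ranges rather than single addresses would need a range check instead of a set lookup.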
Scrubbing Pipeline Architecture
input text
→ tokenize
→ regex scan (structured PII)
→ NER model (unstructured PII)
→ merge detections, deduplicate overlapping spans
→ allowlist filter
→ apply redaction strategy per span type
→ output redacted text
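The merge and redaction stages are where span bookkeeping tends to go wrong, so a small sketch may help. It assumes each detection carries a confidence score (regex hits are given 1.0 here, which is an assumption), resolves overlaps by keeping the higher-scoring and then longer span, and uses type-label replacement for brevity. `Detection`, `merge_overlaps`, and `redact_spans` are hypothetical names.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    pii_type: str
    start: int
    end: int  # exclusive
    score: float


def merge_overlaps(detections: List[Detection]) -> List[Detection]:
    """Keep a non-overlapping subset, preferring high score then long span."""
    kept: List[Detection] = []
    ranked = sorted(detections, key=lambda d: (-d.score, -(d.end - d.start), d.start))
    for candidate in ranked:
        # Accept the candidate only if it is disjoint from every kept span.
        if all(candidate.end <= k.start or candidate.start >= k.end for k in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda d: d.start)


def redact_spans(text: str, detections: List[Detection]) -> str:
    """Replace each span with its type label, working right to left."""
    redacted = text
    # Going right to left keeps earlier offsets valid after each replacement.
    for d in sorted(detections, key=lambda d: d.start, reverse=True):
        redacted = redacted[:d.start] + f"[{d.pii_type}]" + redacted[d.end:]
    return redacted
```

For example, if the NER model tags "jane" inside "jane@example.com" as a PERSON, the overlapping higher-scoring EMAIL span wins and the address is redacted once rather than split into fragments.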
Audit Trail
The scrubber logs what was redacted (type and position in the text) but not the original value. Audit records: {timestamp, message_id, detections: [{type: EMAIL, start: 12, end: 29, strategy: PSEUDONYMIZED}]}. This allows compliance reporting without re-exposing PII in the audit log itself.
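An audit record of this shape could be built roughly as below. The field names follow the example above; the `build_audit_record` helper, the `(type, start, end)` input shape, and the `TYPE_LABEL` fallback are assumptions for illustration.

```python
import json
from datetime import datetime, timezone
from typing import Dict, List, Tuple


def build_audit_record(
    message_id: str,
    detections: List[Tuple[str, int, int]],
    strategy_by_type: Dict[str, str],
) -> dict:
    """Describe what was redacted without including the original values."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message_id": message_id,
        "detections": [
            {
                "type": pii_type,
                "start": start,
                "end": end,
                # Fallback label is an assumption for unconfigured types.
                "strategy": strategy_by_type.get(pii_type, "TYPE_LABEL"),
            }
            for pii_type, start, end in detections
        ],
    }


record = build_audit_record("msg-1", [("EMAIL", 12, 29)], {"EMAIL": "PSEUDONYMIZED"})
audit_line = json.dumps(record)  # one JSON line per scrubbed message
```

The function never receives the message text, which makes it structurally impossible for the original value to end up in the audit log.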