PII Scrubber Service Low-Level Design: Detection Pipeline, Redaction Strategies, and Streaming Processing

Why PII Scrubbing Is Hard

PII (Personally Identifiable Information) appears in many forms: structured fields like email addresses and SSNs, and unstructured free text like customer support messages containing names, addresses, and dates of birth. A scrubber must catch both with high recall, avoid false positives that corrupt legitimate data, and process high-volume log streams with low latency.

PII Types to Detect

  • Structured: email address, phone number, SSN, credit card number, IP address, date of birth (in standard formats)
  • Unstructured: person names, physical addresses, organization names in free text

Detection Approach: Regex + NER

Two complementary detection mechanisms are combined:

  • Regex: High precision for structured PII. Patterns are deterministic and fast. Email, SSN, credit card, and US phone numbers are well-suited to regex.
  • NER (Named Entity Recognition): Machine-learned model that classifies tokens in free text as PERSON, LOCATION, ORGANIZATION, DATE. Required for names and addresses that have no fixed format.

The Microsoft Presidio library combines both: regex analyzers for structured PII, spaCy NER model for unstructured, with a unified detection interface and configurable confidence threshold.

Regex Patterns

Email:       [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
SSN:         \b\d{3}-\d{2}-\d{4}\b
Credit card: \b(?:\d[ -]?){13,16}\b  (followed by Luhn check)
US phone:    \b(\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b
IP address:  \b(?:\d{1,3}\.){3}\d{1,3}\b

Credit card detection uses the Luhn algorithm checksum after regex match to reduce false positives — random digit sequences that match the pattern but fail the checksum are discarded.
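
A minimal Python sketch of the regex-then-Luhn step described above. The function names (luhn_valid, find_cards) are hypothetical and chosen for illustration; the pattern is the credit card regex from the list above.

```python
import re

# Credit card pattern from the list above: 13 to 16 digits with optional
# space or hyphen separators.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(c) for c in candidate if c.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_cards(text: str):
    """Return (start, end, value) for regex matches that also pass Luhn."""
    hits = []
    for m in CARD_RE.finditer(text):
        # The pattern may swallow one trailing separator; trim it.
        value = m.group().rstrip(" -")
        if luhn_valid(value):
            hits.append((m.start(), m.start() + len(value), value))
    return hits
```

Here 4111 1111 1111 1111 (a widely used test number) passes the checksum, while an arbitrary sequence such as 1234 5678 9012 3456 matches the regex but is discarded.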

Redaction Strategies

Different use cases call for different redaction approaches:

  • Type-label replacement: Replace detected PII with its type: user@example.com → [EMAIL]. Simple, loses all information. Suitable for logs that only need compliance.
  • Pseudonymization: Deterministic hash maps PII to a consistent fake value. The same email always becomes the same pseudonym. Preserves referential integrity for analytics. HMAC-SHA256(secret_key, email) → stable pseudonym.
  • Partial masking: Keep the first and last character of the local part and the first character of the domain, and mask the rest: j***e@e***.com. Human-recognizable for support workflows while hiding the full value.
  • Format-preserving replacement: Replace with a fake value in the same format. Replace a credit card with a Luhn-valid fake card number. Replace an email with a valid-format fake email at the same domain structure.

Redaction strategy is configured per PII type, not globally.
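
As a rough illustration of per-type configuration, the sketch below maps PII types to strategy functions. The names (STRATEGY_BY_TYPE, redact, SECRET_KEY), the pseudonym format, and the specific type-to-strategy assignments are assumptions for this example; in practice the key would come from a secrets manager rather than source code.

```python
import hashlib
import hmac

# Assumption: a demo key. A real deployment would load this from a secret store.
SECRET_KEY = b"demo-key-do-not-use"


def type_label(value: str, pii_type: str) -> str:
    """Type-label replacement: discard the value entirely."""
    return f"[{pii_type}]"


def pseudonymize(value: str, pii_type: str) -> str:
    """HMAC-SHA256 keyed hash: the same input always yields the same pseudonym."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated to 12 hex characters for readability (an illustrative choice).
    return f"{pii_type.lower()}_{digest[:12]}"


def partial_mask(value: str, pii_type: str) -> str:
    """Keep the first and last character, mask the middle."""
    if len(value) <= 2:
        return "*" * len(value)
    return value[0] + "***" + value[-1]


# Strategy is chosen per PII type; unknown types fall back to type labels.
STRATEGY_BY_TYPE = {
    "EMAIL": pseudonymize,
    "SSN": type_label,
    "PERSON": partial_mask,
}


def redact(value: str, pii_type: str) -> str:
    strategy = STRATEGY_BY_TYPE.get(pii_type, type_label)
    return strategy(value, pii_type)
```

With this mapping, redact("123-45-6789", "SSN") returns "[SSN]", and repeated calls with the same email return the same pseudonym, which is what keeps joins across log lines possible.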

Streaming Log Scrubbing Pipeline

For high-volume application logs:

  1. Application writes logs to stdout / local file
  2. Log agent (Fluentd, Vector) tails log file and publishes to Kafka topic raw-logs
  3. Scrubber service consumes from raw-logs in parallel consumer group
  4. Each message is run through the detection + redaction pipeline
  5. Scrubbed message published to clean-logs Kafka topic
  6. Downstream consumers (Elasticsearch, S3) read from clean-logs only

Latency target: under 100ms per message at p99. NER model inference is the bottleneck; batching messages through the model (32 messages per batch) improves throughput significantly.
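
The batching idea can be sketched independently of Kafka. Below, batched groups an incoming message stream into fixed-size chunks and scrub_stream hands each chunk to a batch scrubbing function. Both names, and the scrub_batch callable, are hypothetical; a production consumer would likely also flush a partial batch on a timer so that a slow trickle of messages still meets the latency target.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(messages: Iterable[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` messages, preserving order."""
    it = iter(messages)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def scrub_stream(
    messages: Iterable[str],
    scrub_batch: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> Iterator[str]:
    """Run each batch through `scrub_batch` (e.g. one NER inference call per batch)."""
    for batch in batched(messages, batch_size):
        yield from scrub_batch(batch)
```

The point of the design is that the model is invoked once per batch instead of once per message, which amortizes per-call overhead.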

Precision vs Recall Trade-Off

Tuning the NER confidence threshold controls the precision-recall trade-off:

  • False positive (the cost of a low threshold: high recall, low precision): Non-PII text is redacted. Corrupts data, frustrates users who can't read logs.
  • False negative (the cost of a high threshold: high precision, low recall): Real PII passes through. Compliance violation.

For compliance use cases, tune for high recall (lower threshold). For analytics use cases where data quality matters, tune for high precision (higher threshold). Evaluate threshold changes against a labeled evaluation set before deploying.
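
A small sketch of how a threshold might be scored against a labeled set. The input format, a list of (confidence, is_pii) pairs, and the function name precision_recall are assumptions for illustration, and the sample data is made up.

```python
from typing import List, Tuple


def precision_recall(scored: List[Tuple[float, bool]], threshold: float) -> Tuple[float, float]:
    """Compute precision and recall when spans scoring >= threshold are redacted.

    `scored` holds (model confidence, ground-truth is-PII label) per candidate span.
    """
    tp = sum(1 for score, is_pii in scored if score >= threshold and is_pii)
    fp = sum(1 for score, is_pii in scored if score >= threshold and not is_pii)
    fn = sum(1 for score, is_pii in scored if score < threshold and is_pii)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Illustrative labeled set (fabricated for the example, not real measurements).
LABELED = [
    (0.95, True), (0.80, True), (0.60, False),
    (0.55, True), (0.30, False), (0.20, True),
]

for t in (0.1, 0.5, 0.7):
    p, r = precision_recall(LABELED, t)
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, lowering the threshold to 0.1 reaches full recall at the price of precision, while raising it to 0.7 does the reverse.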

Allowlist

Some strings match PII patterns but are not PII. A configurable allowlist suppresses false positives:

  • support@company.com — company email, not a user's PII
  • 192.168.1.1 — private IP, not a user's public IP
  • 000-00-0000 — SSN pattern used as a placeholder in test data

Allowlist entries are checked after detections are merged and before they are passed to the redactor, so the filter applies to both regex and NER results.
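
One way the allowlist filter could look, as a sketch. The detection dictionaries with "type" and "value" keys are a hypothetical format, and treating every private-range IP as allowlisted (rather than only the literal 192.168.1.1) is an assumption that generalizes the example above.

```python
import ipaddress
from typing import Dict, List

# Exact-match entries taken from the examples above.
ALLOWLIST = {"support@company.com", "000-00-0000"}


def is_allowlisted(pii_type: str, value: str) -> bool:
    """Return True if a detected value should NOT be redacted."""
    if value.lower() in ALLOWLIST:
        return True
    if pii_type == "IP_ADDRESS":
        try:
            # Assumption: any private-range address is treated as non-PII.
            return ipaddress.ip_address(value).is_private
        except ValueError:
            # Matched the regex but is not a valid address (e.g. 999.1.1.1).
            return False
    return False


def filter_detections(detections: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop allowlisted detections before they reach the redactor."""
    return [d for d in detections if not is_allowlisted(d["type"], d["value"])]
```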

Scrubbing Pipeline Architecture

input text
  → tokenize
  → regex scan (structured PII)
  → NER model (unstructured PII)
  → merge detections, deduplicate overlapping spans
  → allowlist filter
  → apply redaction strategy per span type
  → output redacted text
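
The merge and redaction steps of the pipeline above can be sketched as follows. Spans are assumed to be (start, end, type, score) tuples, and the overlap rule (keep the higher score, then the longer span) is one reasonable policy rather than the only one. Type labels stand in for whichever redaction strategy is configured.

```python
from typing import List, Tuple

Span = Tuple[int, int, str, float]  # (start, end, pii_type, score)


def resolve_overlaps(spans: List[Span]) -> List[Span]:
    """Keep a non-overlapping subset, preferring higher score, then longer span."""
    kept: List[Span] = []
    ranked = sorted(spans, key=lambda s: (-s[3], -(s[1] - s[0]), s[0]))
    for span in ranked:
        # A span is kept only if it does not intersect anything already kept.
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s[0])


def apply_redactions(text: str, spans: List[Span]) -> str:
    """Replace each span with its type label, working right to left.

    Going right to left keeps earlier offsets valid as the text changes length.
    """
    out = text
    for start, end, pii_type, _score in sorted(spans, key=lambda s: s[0], reverse=True):
        out = out[:start] + f"[{pii_type}]" + out[end:]
    return out
```

For example, if the regex tags jane.doe@example.com as EMAIL and the NER model also tags the jane.doe prefix as PERSON with a lower score, only the EMAIL span survives the merge.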

Audit Trail

The scrubber logs what was redacted (type and position in the text) but not the original value. Audit records: {timestamp, message_id, detections: [{type: EMAIL, start: 12, end: 29, strategy: PSEUDONYMIZED}]}. This allows compliance reporting without re-exposing PII in the audit log itself.
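
A sketch of building such an audit record. The function name build_audit_record and the input detection format are hypothetical; the output fields follow the record shape shown above. The original value is deliberately never copied into the record, even if the detection dictionary carries it.

```python
import json
from datetime import datetime, timezone
from typing import Dict, List


def build_audit_record(message_id: str, detections: List[Dict]) -> Dict:
    """Build an audit record with type, position, and strategy only."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message_id": message_id,
        "detections": [
            {
                # Copy only whitelisted keys so raw values cannot leak through.
                "type": d["type"],
                "start": d["start"],
                "end": d["end"],
                "strategy": d["strategy"],
            }
            for d in detections
        ],
    }


record = build_audit_record(
    "msg-001",
    [{"type": "EMAIL", "start": 12, "end": 29, "strategy": "PSEUDONYMIZED",
      "value": "user@example.com"}],
)
print(json.dumps(record))
```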
