How does NER-based PII detection differ from regex-based detection?

Named Entity Recognition models use surrounding context, so they can identify a person's name like 'James' in a sentence without a fixed pattern, whereas regex relies on structural signatures (e.g., the SSN format \d{3}-\d{2}-\d{4}) and misses paraphrased or free-form PII. NER has higher recall for novel entity forms but costs more to run (often GPU inference at high volume) and produces false positives that regex-based systems avoid.

What redaction strategies preserve data utility while removing PII?

Format-preserving pseudonymization replaces a PII value with a deterministic fake of the same type (e.g., a real-looking but fictional email address) using keyed encryption, so referential integrity across records is maintained for analytics. Generalization replaces exact values with ranges (e.g., age 34 becomes '30-39') and is used when statistical distribution must be preserved for ML training.

How is a streaming PII scrubber implemented on a log pipeline?

A Kafka Streams or Flink job consumes raw log events, passes each record through a scrubbing function that applies regex and NER detection, replaces matches in-place, and publishes the sanitized event to a separate clean topic with the same partition key to preserve ordering. The scrubber runs as a stateless operator so it scales horizontally by adding consumer group members.

How is the precision-recall tradeoff tuned for a PII scrubber?

The detection confidence threshold is adjusted on a labeled evaluation set: lowering the threshold increases recall (fewer missed PII instances) at the cost of more false positives that redact legitimate data, while raising it does the opposite. In practice, separate thresholds are set per PII category based on regulatory risk—SSNs and credit card numbers use aggressive low thresholds, while generic names use higher thresholds to reduce noise.

PII Scrubber Service Low-Level Design: Detection Pipeline, Redaction Strategies, and Streaming Processing

Why PII Scrubbing Is Hard

PII (Personally Identifiable Information) appears in many forms: structured fields like email addresses and SSNs, and unstructured free text like customer support messages containing names, addresses, and dates of birth. A scrubber must catch both with high recall, avoid false positives that corrupt legitimate data, and process high-volume log streams with low latency.

PII Types to Detect

Structured: email address, phone number, SSN, credit card number, IP address, date of birth (in standard formats)
Unstructured: person names, physical addresses, organization names in free text

Detection Approach: Regex + NER

Two complementary detection mechanisms are combined:

Regex: High precision for structured PII. Patterns are deterministic and fast. Email, SSN, credit card, and US phone numbers are well-suited to regex.
NER (Named Entity Recognition): Machine-learned model that classifies tokens in free text as PERSON, LOCATION, ORGANIZATION, DATE. Required for names and addresses that have no fixed format.

The Microsoft Presidio library combines both: regex analyzers for structured PII, spaCy NER model for unstructured, with a unified detection interface and configurable confidence threshold.

Regex Patterns

Email:       [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
SSN:         \b\d{3}-\d{2}-\d{4}\b
Credit card: \b(?:\d[ -]?){13,16}\b  (followed by Luhn check)
US phone:    \b(\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b
IP address:  \b(?:\d{1,3}\.){3}\d{1,3}\b

Credit card detection uses the Luhn algorithm checksum after regex match to reduce false positives — random digit sequences that match the pattern but fail the checksum are discarded.
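
As a minimal sketch of the regex-then-checksum step, the following uses only the standard library. The function names luhn_valid and find_cards are illustrative, not taken from any particular scrubber. The credit card pattern can capture one trailing separator, so the sketch trims it before reporting the span.

```python
import re

# Credit card pattern from the list above.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 16:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_cards(text: str) -> list[tuple[int, int]]:
    """Return (start, end) spans of card-like matches that pass Luhn."""
    spans = []
    for m in CARD_RE.finditer(text):
        # Drop a trailing space or hyphen the pattern may have consumed.
        candidate = m.group().rstrip(" -")
        if luhn_valid(candidate):
            spans.append((m.start(), m.start() + len(candidate)))
    return spans
```

A digit run such as an order number usually fails the checksum and is dropped, while a well-formed card number is kept with its exact character offsets for the redactor.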

Redaction Strategies

Different use cases call for different redaction approaches:

Type-label replacement: Replace detected PII with its type: user@example.com → [EMAIL]. Simple, loses all information. Suitable for logs that only need compliance.
Pseudonymization: Deterministic hash maps PII to a consistent fake value. The same email always becomes the same pseudonym. Preserves referential integrity for analytics. HMAC-SHA256(secret_key, email) → stable pseudonym.
Partial masking: Keep first and last character, mask the middle: j***e@e***.com. Human-recognizable for support workflows while hiding the full value.
Format-preserving replacement: Replace with a fake value in the same format. Replace a credit card with a Luhn-valid fake card number. Replace an email with a valid-format fake email at the same domain structure.

Redaction strategy is configured per PII type, not globally.
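
A per-type strategy table might look like the sketch below. The key, the type names, the pseudonym format (type prefix plus 12 hex characters), and the choice of strategy for each type are all assumptions made for illustration. The masking helper here keeps the original length, which differs slightly from the fixed-width j***e style shown above.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real deployment would load it from a secret store.
SECRET_KEY = b"example-key-do-not-use"


def type_label(pii_type: str) -> str:
    """Type-label replacement: the value is discarded entirely."""
    return f"[{pii_type}]"


def pseudonymize(value: str, pii_type: str, key: bytes = SECRET_KEY) -> str:
    """Keyed, deterministic pseudonym: same input and key give the same output."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{pii_type}_{digest[:12]}"


def partial_mask(value: str) -> str:
    """Keep the first and last character, mask everything in between."""
    if len(value) <= 2:
        return "*" * len(value)
    return value[0] + "*" * (len(value) - 2) + value[-1]


# Strategy is chosen per PII type, not globally.
STRATEGY_BY_TYPE = {
    "EMAIL": lambda v: pseudonymize(v, "EMAIL"),
    "SSN": lambda v: type_label("SSN"),
    "PHONE": partial_mask,
}


def redact(value: str, pii_type: str) -> str:
    """Apply the configured strategy, falling back to a type label."""
    strategy = STRATEGY_BY_TYPE.get(pii_type, lambda v: type_label(pii_type))
    return strategy(value)
```

Because the pseudonym depends on the key, rotating the key breaks joins against previously scrubbed data, which is worth deciding on before analytics teams rely on it.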

Streaming Log Scrubbing Pipeline

For high-volume application logs:

Application writes logs to stdout / local file
Log agent (Fluentd, Vector) tails log file and publishes to Kafka topic raw-logs
Scrubber service consumes from raw-logs in parallel consumer group
Each message is run through the detection + redaction pipeline
Scrubbed message published to clean-logs Kafka topic
Downstream consumers (Elasticsearch, S3) read from clean-logs only

Latency target: under 100ms per message at p99. NER model inference is the bottleneck; batching messages through the model (32 messages per batch) improves throughput significantly.
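
The batching idea can be sketched independently of Kafka. Here scrub_batch stands in for whatever function runs the NER model over a list of messages; it is a hypothetical callable, not a real library API, and the message source is any iterable.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of up to `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def scrub_stream(
    messages: Iterable[str],
    scrub_batch: Callable[[list[str]], list[str]],
    batch_size: int = 32,
) -> Iterator[str]:
    """Run messages through the scrubber one batch at a time, preserving order."""
    for batch in batched(messages, batch_size):
        yield from scrub_batch(batch)
```

This sketch waits until a batch is full or the input ends. A live consumer would likely also flush on a short timeout so that a quiet topic does not hold messages past the latency target.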

Precision vs Recall Trade-Off

Tuning the NER confidence threshold controls the precision-recall trade-off:

False positive (more common when tuned for high recall, i.e., a low threshold): Non-PII text is redacted. Corrupts data, frustrates users who can't read logs.
False negative (more common when tuned for high precision, i.e., a high threshold): Real PII passes through. Compliance violation.

For compliance use cases, tune for high recall (lower threshold). For analytics use cases where data quality matters, tune for high precision (higher threshold). Evaluate threshold changes against a labeled evaluation set before deploying.
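
A threshold sweep over a labeled set can be sketched as follows. The input format, a list of (confidence, is_pii) pairs with one entry per candidate detection, is an assumption, and the sample scores are made up for illustration. The sketch also assumes every true PII span appears as a candidate; PII the model never proposed would not show up in the recall figure.

```python
def precision_recall(scored: list[tuple[float, bool]], threshold: float) -> tuple[float, float]:
    """Compute (precision, recall) when detections below `threshold` are dropped."""
    tp = sum(1 for score, is_pii in scored if score >= threshold and is_pii)
    fp = sum(1 for score, is_pii in scored if score >= threshold and not is_pii)
    fn = sum(1 for score, is_pii in scored if score < threshold and is_pii)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Illustrative labeled candidates: (model confidence, whether it is really PII).
SAMPLE = [(0.95, True), (0.80, True), (0.60, False),
          (0.55, True), (0.30, False), (0.20, True)]

for threshold in (0.1, 0.5, 0.7):
    p, r = precision_recall(SAMPLE, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this sample, lowering the threshold from 0.7 to 0.1 raises recall from 0.50 to 1.00 while precision falls from 1.00 to about 0.67, which is the trade-off described above.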

Allowlist

Some strings match PII patterns but are not PII. A configurable allowlist suppresses false positives:

support@company.com — company email, not a user's PII
192.168.1.1 — private IP, not a user's public IP
000-00-0000 — SSN pattern used as a placeholder in test data

Allowlist entries are checked after detections are merged and before the remaining spans are passed to the redactor.
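
One possible shape for the allowlist filter is sketched below. The Detection type, the type names, and the rule that all private-range IPs are allowlisted (rather than only the literal 192.168.1.1) are assumptions for illustration.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    type: str
    start: int
    end: int


# Exact-match entries per PII type, stored in lowercase.
ALLOWLIST = {
    "EMAIL": {"support@company.com"},
    "SSN": {"000-00-0000"},
}


def is_allowlisted(text: str, det: Detection) -> bool:
    """Return True if the detected value should not be redacted."""
    value = text[det.start:det.end]
    if det.type == "IP_ADDRESS":
        try:
            # Private ranges (e.g. 192.168.0.0/16) do not identify a user.
            return ipaddress.ip_address(value).is_private
        except ValueError:
            return False
    return value.lower() in ALLOWLIST.get(det.type, set())


def apply_allowlist(text: str, detections: list[Detection]) -> list[Detection]:
    """Drop detections whose value is allowlisted; keep the rest in order."""
    return [d for d in detections if not is_allowlisted(text, d)]
```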

Scrubbing Pipeline Architecture

input text
  → tokenize
  → regex scan (structured PII)
  → NER model (unstructured PII)
  → merge detections, deduplicate overlapping spans
  → allowlist filter
  → apply redaction strategy per span type
  → output redacted text
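
The merge and redaction steps of this flow can be sketched as follows. To stay self-contained, the sketch takes NER spans as an argument instead of calling a model, omits the tokenize and allowlist steps, and uses type-label replacement for every span. The Span type and the overlap rule (prefer higher score, then longer span) are assumptions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    type: str
    start: int
    end: int
    score: float


PATTERNS = {
    "EMAIL": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def regex_scan(text: str) -> list[Span]:
    """Structured PII: regex matches are given full confidence."""
    return [Span(t, m.start(), m.end(), 1.0)
            for t, p in PATTERNS.items() for m in p.finditer(text)]


def merge_spans(spans: list[Span]) -> list[Span]:
    """Resolve overlaps by keeping the higher-scoring, then longer, span."""
    kept: list[Span] = []
    for s in sorted(spans, key=lambda s: (-s.score, -(s.end - s.start), s.start)):
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)


def redact(text: str, spans: list[Span]) -> str:
    """Replace spans right to left so earlier offsets stay valid."""
    for s in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:s.start] + f"[{s.type}]" + text[s.end:]
    return text


def scrub(text: str, ner_spans: list[Span] = ()) -> str:
    """Regex scan, merge with NER spans, then redact."""
    return redact(text, merge_spans(regex_scan(text) + list(ner_spans)))
```

Replacing from the end of the string backwards avoids recomputing offsets after each substitution, which matters once replacements differ in length from the original values.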

Audit Trail

The scrubber logs what was redacted (type and position in the text) but not the original value. Audit records: {timestamp, message_id, detections: [{type: EMAIL, start: 12, end: 29, strategy: PSEUDONYMIZED}]}. This allows compliance reporting without re-exposing PII in the audit log itself.
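
A sketch of building such a record is shown below. The input detection dictionaries and the function name build_audit_record are hypothetical. The point of the sketch is that fields are copied explicitly, so an original value attached to a detection cannot reach the audit log by accident.

```python
import json
from datetime import datetime, timezone

# Only these detection fields are ever written to the audit log.
AUDIT_FIELDS = ("type", "start", "end", "strategy")


def build_audit_record(message_id: str, detections: list[dict]) -> dict:
    """Build an audit record holding type, position, and strategy only."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message_id": message_id,
        "detections": [{k: d[k] for k in AUDIT_FIELDS} for d in detections],
    }


record = build_audit_record(
    "msg-001",
    [{"type": "EMAIL", "start": 12, "end": 29,
      "strategy": "PSEUDONYMIZED", "value": "user@example.com"}],
)
print(json.dumps(record))
```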