Document Parser Service Low-Level Design: Format Detection, Extraction Pipeline, and Structured Output


A document parser service accepts uploaded files in multiple formats (PDF, DOCX, XLSX, HTML, plain text, images with text), detects the format, runs a format-specific extraction pipeline, normalizes extracted fields into a canonical schema, and returns structured JSON output. It is used in document intelligence platforms, data ingestion pipelines, and automated compliance workflows.

Requirements

Functional

  • Accept file uploads up to 100 MB via API or async job submission
  • Detect file format automatically using magic bytes and MIME type, falling back to file extension
  • Extract text content, tables, key-value pairs, and metadata (author, created_at, page count)
  • Apply field normalization: date parsing, currency parsing, entity recognition (names, addresses, amounts)
  • Return structured JSON with extracted fields, confidence scores, and bounding box coordinates where applicable
  • Support pluggable post-processing steps: redaction, classification, schema validation

Non-Functional

  • Synchronous mode: return results within 5 seconds for documents under 10 pages
  • Async mode: complete within 60 seconds for documents up to 100 pages
  • Throughput: 500 concurrent parse jobs

Data Model

  • parse_jobs: job_id (UUID), file_key (S3 key), file_name (TEXT), mime_type (TEXT), file_size_bytes (INT), status (ENUM: queued, processing, completed, failed), submitted_at, completed_at, error_message (TEXT), owner_id
  • parse_results: result_id (UUID), job_id, schema_version (TEXT), raw_text (TEXT), structured_output (JSONB), confidence_score (FLOAT), page_count (INT), created_at
  • extraction_schemas: schema_id (UUID), name, field_definitions (JSONB — list of {field_name, data_type, required, pattern}), created_at
  • redaction_rules: rule_id (UUID), owner_id, pattern_type (ENUM: regex, entity_type), pattern (TEXT), replacement (TEXT), active (BOOL)
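One way the first two tables above might be declared in Postgres. Column names and types follow the bullets; the enum type, constraints, and the GIN index are illustrative choices, not a fixed schema.

```sql
CREATE TYPE job_status AS ENUM ('queued', 'processing', 'completed', 'failed');

CREATE TABLE parse_jobs (
    job_id          UUID PRIMARY KEY,
    file_key        TEXT NOT NULL,          -- S3 object key
    file_name       TEXT,
    mime_type       TEXT,
    file_size_bytes INT,
    status          job_status NOT NULL DEFAULT 'queued',
    submitted_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    completed_at    TIMESTAMPTZ,
    error_message   TEXT,
    owner_id        UUID
);

CREATE TABLE parse_results (
    result_id         UUID PRIMARY KEY,
    job_id            UUID REFERENCES parse_jobs (job_id),
    schema_version    TEXT,
    raw_text          TEXT,
    structured_output JSONB,
    confidence_score  FLOAT,
    page_count        INT,
    created_at        TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- GIN index supporting field-level queries on the JSONB output
CREATE INDEX idx_parse_results_output
    ON parse_results USING GIN (structured_output);
```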

Core Algorithms

Format Detection

Magic byte inspection reads the first 512 bytes of the file. A lookup table maps byte signatures to MIME types: PDF files start with %PDF, DOCX and XLSX are ZIP archives containing specific XML entries ([Content_Types].xml), PNG starts with the 8-byte PNG signature. If magic bytes are ambiguous, the service checks the MIME type declared in the upload Content-Type header, then falls back to the file extension. Unknown formats are rejected with a 415 error.
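The detection cascade above can be sketched as follows. This is a minimal illustration, not the service's real code: `SIGNATURES` covers only a few formats, and `detect_format` is a hypothetical name.

```python
from typing import Optional

# Byte-signature lookup table; ZIP-based formats (DOCX, XLSX) would need a
# second pass that inspects the archive for [Content_Types].xml.
SIGNATURES = [
    (b"%PDF", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
]

def detect_format(header: bytes,
                  declared_mime: Optional[str] = None,
                  extension: Optional[str] = None) -> str:
    """Match the leading bytes, then fall back to Content-Type, then extension."""
    for sig, mime in SIGNATURES:
        if header.startswith(sig):
            return mime
    if declared_mime:                 # fallback 1: upload Content-Type header
        return declared_mime
    if extension:                     # fallback 2: file extension
        return {"txt": "text/plain", "html": "text/html"}.get(extension, "unknown")
    return "unknown"                  # caller maps this to an HTTP 415 response
```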

Multi-Format Extraction Pipeline

Each format has a dedicated extractor class registered in a format registry. The PDF extractor uses pdfminer for text-layer extraction with bounding box coordinates and falls back to Tesseract OCR for scanned pages (detected by low character density per page area). The DOCX extractor uses python-docx to traverse the XML DOM, preserving table structure as a 2D array. The XLSX extractor uses openpyxl in read-only mode with a sheet iterator, so only one batch of rows is held in memory at a time. HTML extraction uses BeautifulSoup with boilerplate removal heuristics (removing nav, footer, script, and style elements).
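The registry pattern described above might look like the following sketch. The decorator, base class, and the plain-text extractor are illustrative stand-ins for the real extractor classes.

```python
from abc import ABC, abstractmethod

EXTRACTOR_REGISTRY = {}

def register(mime_type):
    """Class decorator mapping a MIME type to its extractor class."""
    def wrap(cls):
        EXTRACTOR_REGISTRY[mime_type] = cls
        return cls
    return wrap

class Extractor(ABC):
    @abstractmethod
    def extract(self, data: bytes) -> dict:
        """Return a normalized intermediate representation of content blocks."""

@register("text/plain")
class PlainTextExtractor(Extractor):
    def extract(self, data: bytes) -> dict:
        return {"blocks": [{"type": "paragraph", "text": data.decode("utf-8")}]}

def get_extractor(mime_type: str) -> Extractor:
    try:
        return EXTRACTOR_REGISTRY[mime_type]()
    except KeyError:
        raise ValueError(f"unsupported format: {mime_type}")  # -> HTTP 415
```

Because every extractor returns the same block structure, downstream normalization never needs to know the source format.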

Field Normalization

After raw extraction, the normalization pipeline applies a sequence of NLP steps: date expressions are parsed using dateparser (handles 40+ locale formats); currency amounts are extracted using a regex with ISO 4217 currency code lookup; named entity recognition (using a fine-tuned spaCy model) identifies person names, organizations, and addresses. Each extracted field is scored with a confidence value derived from the extractor signal strength and entity model probability. Fields below a confidence threshold (0.5 by default) are flagged for human review.
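Two of the steps above, currency extraction and confidence scoring, can be sketched with the standard library alone. The tiny ISO 4217 subset, the regex, and the multiplicative scoring formula are illustrative assumptions, not the real pipeline.

```python
import re

ISO_4217 = {"USD", "EUR", "GBP", "JPY"}        # tiny illustrative subset
CURRENCY_RE = re.compile(r"\b([A-Z]{3})\s?([\d,]+(?:\.\d{1,2})?)\b")

def extract_currency(text):
    """Yield (code, amount) pairs whose code is a known ISO 4217 currency."""
    for code, amount in CURRENCY_RE.findall(text):
        if code in ISO_4217:
            yield code, float(amount.replace(",", ""))

def field_confidence(extractor_signal, model_prob, threshold=0.5):
    """Combine extractor signal strength with entity-model probability.

    Returns (confidence, needs_human_review); fields below the threshold
    are flagged for review rather than dropped.
    """
    score = extractor_signal * model_prob
    return score, score < threshold
```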

Scalability and Architecture

File uploads land in S3 via a pre-signed upload URL. The upload completion event (S3 event notification) triggers a message to an SQS queue. A fleet of parser workers (EC2 Auto Scaling Group or Kubernetes Deployment) polls the queue, downloads the file to local ephemeral storage (never to EBS to avoid IOPS contention), runs the extraction pipeline, writes results to Postgres, and deletes the queue message. Large files (over 10 MB) are processed on spot instances to manage cost.

  • Worker concurrency is controlled per instance: 4 parallel jobs for CPU-bound PDF OCR, 16 for text-only formats
  • GPU-accelerated OCR workers (using PaddleOCR or Tesseract with CUDA) are available as an optional tier for high-volume image-heavy documents
  • Parse results JSONB column is indexed with GIN for field-level queries (e.g., find all jobs where structured_output contains invoice_number)
  • Redaction is applied as a post-processing step before writing parse_results; original text is never persisted if redaction rules are active for the owner
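The worker loop described above can be sketched with an in-memory queue standing in for SQS (a real worker would use boto3 long polling). The key invariant is that the message is acknowledged only after results are durably stored, so a crashed worker's job becomes visible again and is retried.

```python
import queue

def run_worker_once(jobs: "queue.Queue", download, extract, store) -> bool:
    """Process at most one job; return True if a job was handled.

    download/extract/store are injected callables standing in for the
    S3 download, extraction pipeline, and Postgres write.
    """
    try:
        msg = jobs.get_nowait()            # a real SQS worker long-polls here
    except queue.Empty:
        return False
    # If any step below raises, we never ack: the message reappears after the
    # visibility timeout and repeated failures would go to a dead-letter queue.
    data = download(msg["file_key"])       # to local ephemeral storage
    result = extract(data)
    store(msg["job_id"], result)           # write the parse_results row
    jobs.task_done()                       # ack (delete message) only on success
    return True
```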

API Design

Synchronous Parse

POST /v1/parse — multipart file upload, returns structured JSON directly for documents processed within the timeout. Response includes job_id, mime_type, page_count, confidence_score, and the structured_output object.
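A response for the synchronous endpoint might look like the following. The field values and the per-field value/confidence/bbox shape are illustrative, inferred from the design above rather than a fixed contract.

```json
{
  "job_id": "a1b2c3d4-0000-4000-8000-000000000000",
  "mime_type": "application/pdf",
  "page_count": 4,
  "confidence_score": 0.92,
  "structured_output": {
    "invoice_number": {
      "value": "INV-1042",
      "confidence": 0.97,
      "bbox": {"page": 1, "x": 312, "y": 88, "w": 120, "h": 16}
    },
    "total_amount": {"value": "1234.50", "currency": "USD", "confidence": 0.88}
  }
}
```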

Async Job Submission

  • POST /v1/parse/async — accepts a file_key (pre-uploaded to S3) or multipart upload, returns job_id immediately
  • GET /v1/jobs/{job_id} — poll job status; on completion returns a result_url (pre-signed S3 link to full result JSON)
  • POST /v1/jobs/{job_id}/webhook — register a callback URL to receive the completed result via HTTP POST

Schema and Redaction Management

  • POST /v1/schemas — define a custom extraction schema; returned schema_id can be passed to parse requests
  • POST /v1/redaction-rules — create a redaction rule (regex or entity type) applied to all parse jobs for the owner

Interview Tips

Interviewers probe failure modes. What happens when OCR produces garbled output? Implement a character confidence threshold and flag low-confidence pages rather than silently returning garbage. How are password-protected PDFs handled? Return a specific error code, not a generic parse failure. How is memory managed for large XLSX files? Use openpyxl's read-only mode with row streaming to avoid loading the entire workbook into memory. Also discuss the tradeoff between synchronous and async modes: forcing all requests to async simplifies the service but adds latency for small documents, so a hybrid approach with a 10-second synchronous timeout before automatically falling back to async is a reasonable design choice.


