Low Level Design: Document OCR Service

An OCR service must ingest raw document images, extract structured text, assess confidence, and process jobs asynchronously at scale. This is the complete low-level design.

OCR Pipeline Overview

The pipeline runs sequentially per page:

  • Image upload and validation
  • Preprocessing
  • Text region detection
  • Character recognition
  • Layout analysis
  • Structured data extraction
  • Confidence scoring
  • Output assembly

PDF inputs are rendered to per-page images first (using PDFium or Ghostscript at 300 DPI), then each page runs through the full pipeline independently.

Image Preprocessing

Raw document scans often have skew, noise, and poor contrast. Preprocessing fixes these before recognition:

  • Deskew — detect the dominant text line angle using a Hough transform on binarized edges, then rotate the image to correct skew. Even 2-3 degrees of skew degrades recognition accuracy significantly.
  • Denoise — apply a non-local means filter or median filter to remove scan noise and speckle. Preserve edge sharpness (do not over-blur).
  • Binarize — convert to black-and-white using Otsu thresholding or adaptive thresholding. Required for traditional OCR engines; optional for deep learning models that work on grayscale.
  • Contrast enhancement — apply CLAHE to recover text in low-contrast regions, e.g., faded ink or shadow from document curl.
  • Border removal — detect and crop scanner borders, black frames, and punch-hole artifacts that confuse region detectors.
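In production the binarization step would typically call OpenCV (`cv2.threshold` with `THRESH_OTSU`), but the algorithm itself is simple enough to sketch. The following is a minimal pure-Python illustration of Otsu's method: pick the threshold that maximizes between-class variance of the grayscale histogram, then mark dark pixels as foreground. The `otsu_threshold` and `binarize` names are illustrative, not from any library.

```python
def otsu_threshold(pixels):
    """Find the binarization threshold that maximizes between-class
    variance over a flat list of 8-bit grayscale values (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * hist[i] for i in range(256))
    best_t, best_var = 0, -1.0
    weight_bg, sum_bg = 0, 0.0
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (total_sum - sum_bg) / weight_fg
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(pixels, threshold):
    # Dark pixels (<= threshold) become foreground (1), the rest background (0).
    return [1 if p <= threshold else 0 for p in pixels]
```

On a bimodal image (dark ink on light paper) the chosen threshold lands between the two intensity peaks, which is exactly the property that makes Otsu a good default for printed documents.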

Text Region Detection

Detect bounding boxes around text regions before running character recognition:

  • CRAFT (Character Region Awareness for Text Detection) — produces character-level and word-level affinity heatmaps. Good for curved or irregular text. Works well on forms and mixed-layout documents.
  • EAST (Efficient and Accurate Scene Text Detector) — faster, outputs rotated bounding boxes. Better for documents with uniform text orientation.
  • Connected component analysis — for binarized images, group connected dark pixels into candidate text blobs, then filter by size and aspect ratio. Faster than deep learning detectors but less robust on complex layouts.

Output of this stage: a list of bounding boxes (x, y, width, height, angle) with a detection confidence score per region.
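The connected-component approach from the list above can be sketched without any deep learning dependency. This is an assumed minimal implementation (BFS flood fill over a binary grid, 4-connectivity) that emits one axis-aligned bounding box per blob; a real detector would additionally filter boxes by size and aspect ratio and emit the angle field.

```python
from collections import deque

def connected_components(grid):
    """Group 4-connected foreground pixels (value 1) into blobs and
    return one bounding box (x, y, width, height) per blob."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if grid[y][x] == 1 and not seen[y][x]:
                # BFS flood fill from this seed pixel.
                q = deque([(x, y)])
                seen[y][x] = True
                min_x = max_x = x
                min_y = max_y = y
                while q:
                    cx, cy = q.popleft()
                    min_x, max_x = min(min_x, cx), max(max_x, cx)
                    min_y, max_y = min(min_y, cy), max(max_y, cy)
                    for nx, ny in ((cx+1, cy), (cx-1, cy), (cx, cy+1), (cx, cy-1)):
                        if 0 <= nx < w and 0 <= ny < h \
                                and grid[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((nx, ny))
                boxes.append((min_x, min_y, max_x - min_x + 1, max_y - min_y + 1))
    return boxes
```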

Character Recognition

Two approaches, selectable per deployment:

  • Tesseract — open source, battle-tested, supports 100+ languages. Uses LSTM-based recognition in v4+. Fast and low memory. Best for clean, printed documents.
  • CRNN with CTC loss — Convolutional Recurrent Neural Network with Connectionist Temporal Classification loss. The CNN extracts visual features per column of the text region; the RNN (LSTM or GRU) reads the feature sequence left-to-right; CTC decodes the output sequence without requiring character-level segmentation. Handles irregular spacing and handwritten text better than Tesseract.

For each detected region, crop the image patch, pass it to the recognition model, and receive a character sequence with per-character confidence scores.
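The CTC decoding step described above is worth making concrete. A minimal greedy CTC decoder takes the argmax class at each timestep, collapses consecutive repeats, and drops the blank symbol; the kept argmax probabilities double as the per-character confidences mentioned earlier. This is a sketch assuming plain Python lists of softmax distributions; a real CRNN would emit a tensor, and beam-search decoding would usually replace the greedy pass.

```python
def ctc_greedy_decode(logits, alphabet, blank=0):
    """Greedy CTC decoding: argmax per timestep, collapse consecutive
    repeats, then drop blanks. Returns the decoded string and the
    per-character confidences (argmax probability at each kept step)."""
    chars, confs = [], []
    prev = None
    for dist in logits:  # one softmax distribution per timestep
        k = max(range(len(dist)), key=lambda i: dist[i])
        if k != blank and k != prev:
            chars.append(alphabet[k])
            confs.append(dist[k])
        prev = k
    return "".join(chars), confs
```

Collapsing repeats before removing blanks is what lets CTC read "cc-aat" as "cat" without ever segmenting the image into individual characters.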

Layout Analysis

Raw region detections do not encode document structure. Layout analysis recovers it:

  • Column detection — cluster text regions by x-coordinate gaps wider than a threshold. A two-column document has two distinct x-range clusters.
  • Reading order — sort regions: primary sort by column assignment, secondary sort by y-coordinate (top to bottom). For multi-column documents, left column regions precede right column regions regardless of y.
  • Table detection — detect grid lines (horizontal and vertical rules) using morphological operations on the binarized image. Intersect lines to find cell boundaries. Assign text regions to cells by containment.
  • Heading and paragraph classification — classify regions by font size (estimated from bounding box height relative to median line height) and spatial isolation as headings or body text.
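The column-detection and reading-order rules above combine into a short routine. This is a sketch under simplifying assumptions: regions are `(x, y, width, height)` tuples, columns never overlap horizontally, and the gap threshold is a fixed pixel count rather than one derived from median character width.

```python
def reading_order(regions, gap_threshold=50):
    """Assign each region (x, y, w, h) to a column by clustering on
    horizontal gaps wider than gap_threshold, then emit regions
    left-to-right across columns, top-to-bottom within each column."""
    if not regions:
        return []
    # Scan regions by left edge; a wide gap past the running right
    # edge starts a new column cluster.
    by_x = sorted(regions, key=lambda r: r[0])
    columns = [[by_x[0]]]
    right_edge = by_x[0][0] + by_x[0][2]
    for r in by_x[1:]:
        if r[0] - right_edge > gap_threshold:
            columns.append([r])        # gap wide enough: new column
        else:
            columns[-1].append(r)
        right_edge = max(right_edge, r[0] + r[2])
    ordered = []
    for col in columns:                # columns are already left-to-right
        ordered.extend(sorted(col, key=lambda r: r[1]))  # top-to-bottom
    return ordered
```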

Structured Data Extraction

For forms and invoices, extract key-value pairs. Two approaches:

  • Rule-based — define field templates (label text + spatial offset to value region). E.g., find the region containing the text "Invoice Number" and read the text region immediately to its right.
  • ML-based (LayoutLM or Donut) — transformer models that jointly encode text content and 2D position. Fine-tune on labeled form examples. Handles variable layouts without hand-crafted rules.

Table extraction: parse the detected grid cells row by row, output as a 2D array with header row detection.
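The rule-based approach can be sketched as a small matcher. This is an assumed implementation, not from any library: it finds the region whose text equals the label, then returns the nearest region to its right whose vertical span overlaps the label's (i.e., the same printed line). The region dict shape and the sample field values below are illustrative.

```python
def extract_field(regions, label):
    """Rule-based key-value extraction: locate the label region, then
    read the nearest region to its right on the same line."""
    anchor = next((r for r in regions if r["text"].strip() == label), None)
    if anchor is None:
        return None
    ax, ay, aw, ah = anchor["box"]
    candidates = []
    for r in regions:
        x, y, w, h = r["box"]
        same_line = y < ay + ah and y + h > ay   # vertical spans overlap
        if r is not anchor and same_line and x >= ax + aw:
            candidates.append((x, r["text"]))
    # Nearest candidate by left edge wins; None if nothing sits to the right.
    return min(candidates)[1] if candidates else None
```

The obvious fragility (any layout shift breaks the template) is exactly the limitation that motivates the ML-based option.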

Confidence Scoring

Assign confidence at multiple levels: per character (from model softmax), per word (min character confidence in the word), per field (mean word confidence for the field value). Fields below a configurable threshold (e.g., 0.80) are flagged for human review. Store both the extracted value and the confidence in the output schema.
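The aggregation rules above (min over characters for a word, mean over words for a field, threshold for review) reduce to a few lines. A minimal sketch, assuming the 0.80 threshold from the text; the function names are illustrative.

```python
REVIEW_THRESHOLD = 0.80  # configurable per deployment

def word_confidence(char_confs):
    # Word confidence = minimum character confidence in the word.
    return min(char_confs)

def field_confidence(word_confs):
    # Field confidence = mean word confidence of the field value.
    return sum(word_confs) / len(word_confs)

def score_field(words):
    """words: one per-character confidence list per word in the field value.
    Returns (field confidence, needs_human_review)."""
    confs = [word_confidence(w) for w in words]
    field = field_confidence(confs)
    return field, field < REVIEW_THRESHOLD
```

Using the minimum at the word level is deliberately pessimistic: one garbled character can invert a digit in an amount, so a single weak character should drag the whole word down.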

Async Job Processing

OCR is compute-heavy. Do not process synchronously in the request path:

  • Client uploads document and receives a job ID immediately.
  • Job is published to a Kafka topic. Workers consume and process pages in parallel.
  • Results are written to object storage (S3) as a structured JSON output.
  • On completion, the service calls the client-provided webhook URL with the job ID and result URL.
  • Client can also poll a job status endpoint: pending → processing → done/failed.
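The job lifecycle above can be modeled as a small state machine. This in-memory `JobStore` is a stand-in sketch only: in the real service the submit step publishes to Kafka and results land in S3, but the ID issuance and the pending → processing → done/failed transitions that clients poll are the same shape.

```python
import uuid

VALID_TRANSITIONS = {
    "pending": {"processing"},
    "processing": {"done", "failed"},
}

class JobStore:
    """Minimal in-memory model of the job-status endpoint."""

    def __init__(self):
        self.jobs = {}

    def submit(self, document_uri):
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = {"status": "pending",
                             "document": document_uri,
                             "result_url": None}
        return job_id          # returned to the client immediately

    def transition(self, job_id, status, result_url=None):
        job = self.jobs[job_id]
        if status not in VALID_TRANSITIONS.get(job["status"], set()):
            raise ValueError(f"illegal transition {job['status']} -> {status}")
        job["status"] = status
        job["result_url"] = result_url

    def status(self, job_id):
        return self.jobs[job_id]["status"]
```

Enforcing the transition table matters: a retried worker must not move a job backwards from done to processing, or the client's webhook and polling views diverge.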

Output Schema

Each output document contains:

  • Raw full text — reading-order concatenation across pages.
  • An array of pages — each with an array of regions (bounding box, text, confidence).
  • Extracted fields — key, value, confidence, bounding box.
  • Extracted tables — array of rows, each an array of cell objects with text and bounding box.

Low-confidence fields are also included in a separate human review queue record so they can be surfaced to a reviewer without the client needing to implement that logic.

Frequently Asked Questions

What is a document OCR service in system design?

A document OCR (optical character recognition) service converts images of text-bearing documents into machine-readable structured data. It accepts scanned images, photos, or PDFs, runs them through an image preprocessing pipeline, applies a recognition model to extract text, and returns the result as plain text, JSON with bounding boxes, or structured fields (e.g., invoice line items). Enterprise OCR services also perform layout analysis (detecting tables, columns, headers) and integrate with downstream data pipelines for search indexing or data entry automation.

What is the OCR processing pipeline from image to structured data?

The pipeline has five stages: (1) Preprocessing — deskew, denoise, binarize, and normalize image resolution. (2) Layout analysis — detect and classify page regions (text blocks, tables, figures, headers) using a document layout model. (3) Text line segmentation — within each text region, identify individual text lines and their reading order. (4) Character recognition — pass each text line through a CNN-LSTM or Transformer-based OCR model to produce a character sequence with per-character confidence scores. (5) Post-processing — apply dictionary correction, entity extraction, and schema mapping to produce structured output. Results are assembled with bounding-box metadata for downstream verification and highlighting.

How does confidence scoring work in an OCR service?

Confidence scores are derived from the recognition model’s output probability distribution. For each character position the model emits a softmax probability over the character vocabulary; the top-1 probability serves as the character-level confidence. Word-level confidence is typically the product or minimum of its character confidences. Document-level confidence aggregates word scores, weighted by token frequency or importance. Low-confidence regions are flagged for human review — a threshold (e.g., 0.85) triggers routing to a review queue where an operator corrects the OCR output, and the corrected pair can be used to fine-tune the model. Confidence metadata is returned alongside the text so downstream applications can apply their own thresholds.

How do you handle PDF documents in an OCR pipeline?

PDFs require a pre-processing fork: if the PDF contains embedded text (a “digital” PDF), text can be extracted directly using a PDF parsing library without running OCR, which is faster and more accurate. If the PDF is image-only (a scanned PDF), each page is rasterized to a high-resolution image (300 DPI minimum) and passed to the OCR pipeline. Mixed PDFs require per-page detection to select the right path. Multi-page PDFs are split into pages and processed in parallel across workers, with results merged and re-paginated. Large documents are handled via a job queue with progress tracking so callers can poll for completion rather than blocking on a synchronous response.
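The per-page routing decision for mixed PDFs reduces to a simple rule. This sketch assumes the PDF parser reports each page's embedded text layer as a string (empty for image-only pages) and uses an illustrative minimum-character heuristic; the function name and threshold are assumptions, not a real library API.

```python
def route_pdf_pages(pages, min_chars=20):
    """Per-page routing for mixed PDFs: pages whose embedded text layer
    has at least min_chars characters take the direct-extraction path;
    the rest are rasterized and sent through OCR."""
    plan = []
    for number, embedded_text in enumerate(pages, start=1):
        if len(embedded_text.strip()) >= min_chars:
            plan.append((number, "extract_text"))
        else:
            plan.append((number, "rasterize_and_ocr"))
    return plan
```

A length heuristic alone can be fooled by PDFs with a sparse or corrupt text layer, so production routers often also compare the embedded-text coverage against the visible ink area before trusting the direct path.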

