OCR Service System Design Overview
An OCR (optical character recognition) service converts images of documents, receipts, screenshots, and scanned pages into structured text. Beyond raw character extraction, a production OCR service must handle image quality variation through preprocessing, analyze document layout to preserve reading order, assign confidence scores to extracted regions, and deliver structured output that downstream consumers can parse reliably.
Requirements
Functional Requirements
- Extract text from JPEG, PNG, TIFF, and PDF inputs up to 50MB.
- Preprocess images: deskew, denoise, binarize, and normalize contrast before OCR.
- Analyze document layout: identify text blocks, tables, figures, and reading order.
- Assign per-word and per-block confidence scores (0.0 to 1.0).
- Return structured output: plain text, word-level bounding boxes, and optionally a reconstructed HTML or Markdown representation preserving layout.
Non-Functional Requirements
- Single-page OCR latency under 2 seconds at P99 for standard A4 documents.
- Batch processing throughput of 1,000 pages per minute per worker cluster.
- Support for 50 languages with per-language model selection.
- Idempotent processing: resubmitting the same document returns cached results.
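The idempotency requirement reduces to a deterministic cache key: hash the document bytes together with the language and output format, and a resubmission maps to the same key. A minimal stdlib sketch (the key layout and function name are illustrative, not from the spec):

```python
import hashlib

def result_cache_key(document_bytes: bytes, language: str, output_format: str) -> str:
    """Deterministic key for the result cache: identical (document, language,
    output_format) triples always map to the same key, so a resubmitted
    document is answered from cache instead of re-running OCR."""
    document_hash = hashlib.sha256(document_bytes).hexdigest()
    return f"ocr:{document_hash}:{language}:{output_format}"

# Resubmitting the same document with the same options yields the same key.
k1 = result_cache_key(b"%PDF-1.7 ...", "en", "json")
k2 = result_cache_key(b"%PDF-1.7 ...", "en", "json")
assert k1 == k2
```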
Data Model
- ocr_jobs: job_id, document_hash, input_s3_path, language, output_format, status, submitted_at, completed_at, page_count
- ocr_pages: job_id, page_number, raw_text, structured_output (JSON), overall_confidence, processing_ms
- ocr_words: job_id, page_number, word_id, text, confidence, bounding_box (x, y, width, height), block_id, line_id
- result_cache: Redis hash keyed by (document_hash, language, output_format), TTL 30 days
Structured output JSON follows a hierarchical schema: document contains pages, each page contains blocks, each block contains lines, each line contains words, each word contains characters with individual confidence values for high-fidelity use cases.
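A minimal instance of that hierarchy might look like the following; the spec defines only the nesting (document, pages, blocks, lines, words, characters), so the exact key names here are illustrative assumptions:

```python
import json

# One page, one block, one line, one word, with character-level confidences
# for the high-fidelity use case described above.
structured_output = {
    "pages": [{
        "page_number": 1,
        "blocks": [{
            "block_id": "b1",
            "lines": [{
                "line_id": "l1",
                "words": [{
                    "word_id": "w1",
                    "text": "Invoice",
                    "confidence": 0.97,
                    "bounding_box": {"x": 120, "y": 80, "width": 310, "height": 42},
                    "characters": [
                        {"char": "I", "confidence": 0.99},
                        {"char": "n", "confidence": 0.98},
                    ],
                }],
            }],
        }],
    }],
}

# The structure round-trips cleanly as JSON.
assert json.loads(json.dumps(structured_output)) == structured_output
```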
Image Preprocessing Pipeline
Preprocessing runs as a sequential CPU pipeline before the OCR model receives the image:
- Format normalization: convert all inputs to 8-bit grayscale PNG at 300 DPI. PDFs are rasterized page by page using libpoppler.
- Deskew: detect document rotation angle using Hough transform on detected line segments; rotate to correct skew up to 45 degrees.
- Binarization: apply Sauvola adaptive thresholding (window size 25px, k=0.2) to convert grayscale to binary, handling uneven illumination from photographs of physical documents.
- Denoising: apply a 3×3 median filter to remove salt-and-pepper noise common in scanned documents.
- Border removal: detect and crop black borders introduced by scanner lids.
Preprocessing is implemented using OpenCV operations. A quality score is computed after preprocessing: if contrast ratio is below 2:1 or estimated DPI is below 150, the job is flagged with a low_quality warning included in the response.
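The quality gate described above reduces to two threshold checks. A stdlib-only sketch, using the 2:1 contrast and 150 DPI thresholds from the text (the luminance inputs would come from the preprocessed image; their representation here is an assumption):

```python
def low_quality_flag(min_luminance: float, max_luminance: float,
                     estimated_dpi: int) -> bool:
    """Flag a page as low quality when either check fails:
    - contrast ratio (brightest / darkest luminance) below 2:1, or
    - estimated DPI below 150.
    Luminance values are relative intensities in (0, 1]."""
    contrast_ratio = max_luminance / max(min_luminance, 1e-6)
    return contrast_ratio < 2.0 or estimated_dpi < 150

# A washed-out photo (contrast 1.5:1) is flagged even at 300 DPI.
assert low_quality_flag(0.60, 0.90, 300) is True
# A clean high-contrast 300 DPI scan passes.
assert low_quality_flag(0.05, 0.95, 300) is False
```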
Text Extraction and Layout Analysis
The OCR engine (Tesseract 5 with LSTM models, or a fine-tuned transformer-based model for high-accuracy use cases) runs after preprocessing. The engine returns an hOCR document (an XHTML-based format) containing per-word bounding boxes and confidence scores. A layout analysis post-processor then:
- Groups words into lines using Y-coordinate proximity clustering.
- Groups lines into blocks using whitespace gap analysis and font size consistency.
- Detects table regions by identifying grid-aligned bounding box patterns.
- Determines reading order using a topological sort of block positions weighted by left-to-right, top-to-bottom document conventions, with special handling for multi-column layouts detected via vertical whitespace gaps wider than 20% of page width.
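The first grouping step above (words into lines by Y-coordinate proximity) can be sketched as a single pass over vertically sorted word boxes; the tolerance value is an assumption, not from the spec:

```python
def group_words_into_lines(words, y_tolerance=10):
    """Cluster word bounding boxes into lines: sort by vertical center,
    then start a new line whenever a word's center falls more than
    y_tolerance pixels from the running center of the current line.
    Each word is a dict with x, y, width, height (y grows downward)."""
    if not words:
        return []

    def y_center(w):
        return w["y"] + w["height"] / 2

    lines = []
    for word in sorted(words, key=y_center):
        if lines and abs(y_center(word) - lines[-1]["center"]) <= y_tolerance:
            line = lines[-1]
            line["words"].append(word)
            # Update the running center so residual skew does not split lines.
            line["center"] = sum(map(y_center, line["words"])) / len(line["words"])
        else:
            lines.append({"center": y_center(word), "words": [word]})

    # Within each line, restore left-to-right reading order.
    for line in lines:
        line["words"].sort(key=lambda w: w["x"])
    return [line["words"] for line in lines]
```

Production layout analysis adds the block, table, and reading-order steps on top of this, but the core clustering idea is the same.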
Confidence Scoring
Word-level confidence is provided directly by the OCR engine (Tesseract character confidence aggregated to word level). Block-level confidence is the mean word confidence within the block, weighted by word length to down-weight single-character words that skew the average. Page-level overall_confidence is the 10th percentile word confidence across the page, which better reflects the worst-case extraction quality than the mean. Pages with overall_confidence below 0.60 are flagged for potential human review before being relied upon in automated downstream pipelines.
Scalability
OCR is CPU-bound for standard Tesseract models and GPU-accelerated for transformer-based models. Jobs are queued in a Redis list and consumed by a worker pool. Worker count autoscales based on queue depth: one additional worker per 50 pending pages, up to a maximum of 200 workers. Multi-page documents are split into individual page jobs and processed in parallel, then reassembled in order before writing the final ocr_job record. Page-level parallelism reduces end-to-end latency for 100-page documents from minutes to seconds.
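The autoscaling rule above is a pure function of queue depth. A minimal sketch; the one-worker-when-idle baseline is an assumption, since the text does not specify idle behavior:

```python
import math

MAX_WORKERS = 200
PAGES_PER_WORKER = 50  # one additional worker per 50 pending pages

def target_worker_count(pending_pages: int) -> int:
    """Scale the worker pool with queue depth, capped at MAX_WORKERS.
    Keeps one worker warm when the queue is empty (an assumption)."""
    return min(MAX_WORKERS, max(1, math.ceil(pending_pages / PAGES_PER_WORKER)))

assert target_worker_count(0) == 1
assert target_worker_count(51) == 2      # 51 pending pages need a second worker
assert target_worker_count(100_000) == 200  # cap holds under any backlog
```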
API Design
POST /v1/ocr
- Body: document_url or document_data (base64), language (ISO 639-1, default en), output_format (text, json, html, markdown), high_accuracy (boolean, routes to transformer model)
- Response: job_id, status (processing or done if cached), estimated_seconds
GET /v1/ocr/{job_id}
- Response: job_id, status, page_count, overall_confidence, low_quality_warning (boolean), output (text or structured JSON per page), pages array
GET /v1/ocr/{job_id}/pages/{page_number}
- Response: raw_text, words array (text, confidence, bounding_box), blocks array, overall_confidence, processing_ms
Results are stored in S3 and served via pre-signed URLs for large structured outputs exceeding 1MB, avoiding memory pressure on the API tier from large response bodies.
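That delivery decision is a size check at response-build time. The 1 MB threshold follows the text; the function and field names are illustrative, and the pre-signed URL is assumed to be generated elsewhere:

```python
import json

INLINE_LIMIT_BYTES = 1_000_000  # structured outputs above 1 MB are served from S3

def build_result_response(job_id: str, structured_output: dict,
                          presigned_url: str) -> dict:
    """Return the output inline when it is small; otherwise return a
    pre-signed S3 URL so large bodies never pass through the API tier."""
    body = json.dumps(structured_output).encode("utf-8")
    if len(body) <= INLINE_LIMIT_BYTES:
        return {"job_id": job_id, "status": "done", "output": structured_output}
    return {"job_id": job_id, "status": "done", "output_url": presigned_url}

small = build_result_response("j1", {"pages": []}, "https://s3.example/presigned")
assert "output" in small and "output_url" not in small
```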
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What image preprocessing steps are required before OCR?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Effective preprocessing includes: deskewing (rotating the image to align text horizontally), denoising (Gaussian or median filter to remove sensor noise), binarization (adaptive thresholding to produce a clean black-and-white image), contrast normalization, and resolution upscaling for low-DPI inputs. For photos of documents, perspective correction (homography transform) removes the 3-D distortion introduced by camera angle. Each step is applied conditionally based on detected image quality metrics."
      }
    },
    {
      "@type": "Question",
      "name": "How is the text extraction pipeline structured in an OCR service?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The pipeline stages are: (1) image ingestion and preprocessing, (2) text region detection (a CNN-based detector like EAST or DB-Net segments the image into word or line bounding boxes), (3) recognition (a sequence model such as CRNN with CTC loss reads each cropped region into a character sequence), (4) post-processing (dictionary lookup, language-model rescoring, and spell correction), and (5) output serialization to the requested format (plain text, hOCR, or structured JSON with bounding box coordinates)."
      }
    },
    {
      "@type": "Question",
      "name": "How does layout analysis work in an OCR service?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Layout analysis identifies the logical structure of a document page: paragraphs, columns, tables, headers, footers, images, and reading order. A document layout model (e.g., LayoutParser or a fine-tuned LayoutLM variant) segments the page into regions and classifies each region type. Table cells are detected and their row/column assignments inferred from bounding box geometry. Reading order is determined by a topological sort over region bounding boxes consistent with left-to-right, top-to-bottom flow."
      }
    },
    {
      "@type": "Question",
      "name": "How is confidence scoring used in an OCR service?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The recognition model outputs a probability for each character or word. Per-word confidence is the product (or geometric mean) of character probabilities. Low-confidence words are flagged in the output with their bounding boxes so downstream consumers can decide whether to accept, discard, or route them to human review. Aggregate page-level confidence drives SLA routing: high-confidence pages are delivered immediately; low-confidence pages trigger a fallback to a more expensive secondary model or a human transcription queue."
      }
    }
  ]
}