Address Validation Service Low-Level Design: Normalization, Geocoding, and Delivery Verification

What Is an Address Validation Service?

An address validation service takes raw user-entered address text, standardizes it to a canonical form, verifies it against authoritative postal data, enriches it with geocoordinates, and assigns a deliverability score. It is a critical component in e-commerce checkout, shipping systems, and CRM data hygiene pipelines. At low level it must handle abbreviation normalization, external API circuit breaking, fuzzy correction suggestions, and aggressive caching since validated addresses rarely change.

Validation Pipeline

Each raw address flows through five stages in sequence:

  1. Parse: tokenize the raw input into components — street number, street name, unit, city, state, zip, country.
  2. Standardize: normalize abbreviations (St → Street, Ave → Avenue, Apt → Apartment), uppercase state codes, zero-pad zip codes.
  3. Verify: call external postal verification API (SmartyStreets or USPS) to confirm the address exists in the postal authority database.
  4. Geocode: resolve the verified address to a latitude/longitude coordinate pair for proximity queries.
  5. Score deliverability: classify as residential, commercial, vacant, or active delivery point based on API response flags.

Standardization Rules

Street type abbreviations follow USPS Publication 28 standards. Common mappings:

  • St → Street, Ave → Avenue, Blvd → Boulevard, Dr → Drive, Ln → Lane, Ct → Court, Rd → Road
  • N/S/E/W directionals are expanded: N → North, S → South, E → East, W → West
  • Unit designators: Apt → Apartment, Ste → Suite, Fl → Floor

Normalization is applied before the external API call to improve cache hit rates — two inputs that differ only in abbreviation style should map to the same cache key.

Schema

CREATE TABLE Address (
  id              BIGSERIAL PRIMARY KEY,
  raw_input       TEXT NOT NULL,
  line1           VARCHAR(255),
  line2           VARCHAR(128),
  city            VARCHAR(128),
  state           VARCHAR(64),
  zip             VARCHAR(16),
  country         VARCHAR(8) NOT NULL DEFAULT 'US',
  lat             NUMERIC(10, 7),
  lng             NUMERIC(10, 7),
  deliverability  VARCHAR(32),  -- 'deliverable', 'undeliverable', 'vacant', 'unknown'
  validated_at    TIMESTAMPTZ
);

CREATE TABLE AddressValidationCache (
  normalized_key  VARCHAR(512) PRIMARY KEY,
  result          JSONB NOT NULL,
  cached_at       TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_address_geo ON Address USING GIST (
  ll_to_earth(lat, lng)
);

External API Integration with Circuit Breaker

The SmartyStreets or USPS API is a synchronous dependency. Network failures must not cascade into checkout failures. A circuit breaker wraps the API call:

  • Closed: normal operation, requests pass through.
  • Open: after N consecutive failures within a window, the breaker opens and all calls fast-fail with a cached or degraded response.
  • Half-open: after a timeout, one probe request is allowed; if it succeeds, the breaker closes.

In the open state, the service returns the standardized address without external verification, flagging deliverability as unknown. This is preferable to blocking checkout entirely.

Python Implementation

import hashlib
import json
import requests
from Levenshtein import distance as levenshtein_distance
from db import get_db
from circuit_breaker import CircuitBreaker

ABBREVIATION_MAP = {
    'St': 'Street', 'Ave': 'Avenue', 'Blvd': 'Boulevard',
    'Dr': 'Drive', 'Ln': 'Lane', 'Ct': 'Court', 'Rd': 'Road',
    'N': 'North', 'S': 'South', 'E': 'East', 'W': 'West',
    'Apt': 'Apartment', 'Ste': 'Suite', 'Fl': 'Floor'
}

smarty_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)
geocode_breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=60)

CACHE_TTL_DAYS = 30

def standardize_address(parsed: dict) -> dict:
    """Expand abbreviations in street name and unit designator."""
    result = dict(parsed)
    tokens = result.get('street_name', '').split()
    expanded = [ABBREVIATION_MAP.get(t, t) for t in tokens]
    result['street_name'] = ' '.join(expanded)
    if result.get('unit_designator') in ABBREVIATION_MAP:
        result['unit_designator'] = ABBREVIATION_MAP[result['unit_designator']]
    result['state'] = result.get('state', '').upper()
    result['zip'] = result.get('zip', '').zfill(5)
    return result

def _cache_key(standardized: dict) -> str:
    canonical = json.dumps(standardized, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def _get_cached(key: str) -> dict | None:
    db = get_db()
    row = db.execute("""
        SELECT result FROM AddressValidationCache
        WHERE normalized_key = %s
          AND cached_at > NOW() - INTERVAL '%s days'
    """, (key, CACHE_TTL_DAYS)).fetchone()
    return row['result'] if row else None

def _set_cached(key: str, result: dict):
    db = get_db()
    db.execute("""
        INSERT INTO AddressValidationCache (normalized_key, result, cached_at)
        VALUES (%s, %s, NOW())
        ON CONFLICT (normalized_key) DO UPDATE
          SET result = EXCLUDED.result, cached_at = NOW()
    """, (key, json.dumps(result)))
    db.commit()

def geocode_address(standardized: dict) -> tuple[float | None, float | None]:
    """Resolve address to lat/lng via Google Maps Geocoding API."""
    address_str = "{line1}, {city}, {state} {zip}".format(**standardized)
    with geocode_breaker:
        resp = requests.get(
            'https://maps.googleapis.com/maps/api/geocode/json',
            params={'address': address_str, 'key': 'GOOGLE_API_KEY'},
            timeout=3
        )
        data = resp.json()
        if data['status'] == 'OK':
            loc = data['results'][0]['geometry']['location']
            return loc['lat'], loc['lng']
    return None, None

def validate_address(raw_input: str) -> dict:
    """Full validation pipeline: parse, standardize, verify, geocode, score."""
    parsed = parse_address(raw_input)  # assume external parser (usaddress lib)
    standardized = standardize_address(parsed)
    cache_key = _cache_key(standardized)

    cached = _get_cached(cache_key)
    if cached:
        return cached

    result = {
        'line1': standardized.get('address_number', '') + ' ' + standardized.get('street_name', ''),
        'line2': standardized.get('unit', ''),
        'city':  standardized.get('city', ''),
        'state': standardized.get('state', ''),
        'zip':   standardized.get('zip', ''),
        'country': 'US',
        'deliverability': 'unknown',
        'lat': None,
        'lng': None
    }

    with smarty_breaker:
        ss_resp = requests.post(
            'https://us-street.api.smartystreets.com/street-address',
            json=[{'street': result['line1'], 'city': result['city'],
                   'state': result['state'], 'zipcode': result['zip']}],
            params={'auth-id': 'SS_AUTH_ID', 'auth-token': 'SS_AUTH_TOKEN'},
            timeout=3
        )
        candidates = ss_resp.json()
        if candidates:
            c = candidates[0]
            result['deliverability'] = c['analysis'].get('dpv_match_code', 'unknown')
            result['line1'] = c['delivery_line_1']
            result['zip']   = c['components']['zipcode']

    lat, lng = geocode_address(standardized)
    result['lat'] = lat
    result['lng']  = lng

    _set_cached(cache_key, result)

    db = get_db()
    db.execute("""
        INSERT INTO Address (raw_input, line1, line2, city, state, zip, country, lat, lng, deliverability, validated_at)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, NOW())
    """, (raw_input, result['line1'], result['line2'], result['city'],
          result['state'], result['zip'], result['country'],
          result['lat'], result['lng'], result['deliverability']))
    db.commit()

    return result

def fuzzy_suggest(raw_input: str, known_addresses: list[str], threshold: int = 5) -> list[str]:
    """Return known addresses within Levenshtein distance threshold of raw_input."""
    return [
        addr for addr in known_addresses
        if levenshtein_distance(raw_input.lower(), addr.lower()) <= threshold
    ]

Geocoding and Proximity Queries

Latitude and longitude are stored using PostgreSQL's earthdistance extension for efficient proximity queries. A query for all addresses within 10 km of a point runs as:

SELECT * FROM Address
WHERE earth_box(ll_to_earth(37.7749, -122.4194), 10000) @> ll_to_earth(lat, lng)
  AND earth_distance(ll_to_earth(37.7749, -122.4194), ll_to_earth(lat, lng)) < 10000;

Caching Strategy

Validated addresses are cached for 30 days keyed by the SHA-256 hash of the standardized address JSON. This is long relative to most cache TTLs because addresses change infrequently — streets are renamed or renumbered rarely. The cache key uses the standardized form so pre-normalization typos that resolve to the same address hit the same cache entry. Cache entries are invalidated proactively only when the postal authority issues bulk updates, which can be ingested via USPS monthly address change files.

See also: Shopify Interview Guide

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering

Scroll to Top