What Is an Address Validation Service?
An address validation service takes raw user-entered address text, standardizes it to a canonical form, verifies it against authoritative postal data, enriches it with geocoordinates, and assigns a deliverability score. It is a critical component in e-commerce checkout, shipping systems, and CRM data hygiene pipelines. At the implementation level, it must handle abbreviation normalization, circuit breaking around external APIs, fuzzy correction suggestions, and aggressive caching, since validated addresses rarely change.
Validation Pipeline
Each raw address flows through five stages in sequence:
- Parse: tokenize the raw input into components — street number, street name, unit, city, state, zip, country.
- Standardize: normalize abbreviations (St → Street, Ave → Avenue, Apt → Apartment), uppercase state codes, zero-pad zip codes.
- Verify: call external postal verification API (SmartyStreets or USPS) to confirm the address exists in the postal authority database.
- Geocode: resolve the verified address to a latitude/longitude coordinate pair for proximity queries.
- Score deliverability: classify as residential, commercial, vacant, or active delivery point based on API response flags.
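To make the Parse stage concrete, here is a minimal regular-expression sketch. The pattern, the `parse_simple` name, and the sample address are illustrative assumptions, and the pattern only covers simple single-line US addresses; a dedicated parsing library handles far more variation.

```python
import re
from typing import Optional

# Hypothetical pattern for inputs shaped like
# "123 N Main St Apt 4, Springfield, IL 62704". Not a general parser.
ADDRESS_RE = re.compile(
    r"""^\s*
    (?P<address_number>\d+)\s+
    (?P<street_name>[^,]+?)
    (?:\s+(?P<unit_designator>Apt|Ste|Fl|Unit)\.?\s+(?P<unit>[\w-]+))?
    \s*,\s*(?P<city>[^,]+?)
    \s*,\s*(?P<state>[A-Za-z]{2})
    \s+(?P<zip>\d{5}(?:-\d{4})?)
    \s*$""",
    re.VERBOSE | re.IGNORECASE,
)


def parse_simple(raw: str) -> Optional[dict]:
    """Split a raw address into components, or return None if it does not match."""
    match = ADDRESS_RE.match(raw)
    if match is None:
        return None
    # Optional groups that did not participate come back as None; use '' instead.
    return {name: (value or "") for name, value in match.groupdict().items()}
```

Returning None for unparseable input lets the caller decide whether to reject the address or pass the raw text straight to the verification provider.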
Standardization Rules
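Street type mappings follow the suffix table in USPS Publication 28. Publication 28 treats the abbreviated form (ST, AVE) as the postal standard; this service expands to the full word as its internal canonical form. Common mappings: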
- St → Street, Ave → Avenue, Blvd → Boulevard, Dr → Drive, Ln → Lane, Ct → Court, Rd → Road
- N/S/E/W directionals are expanded: N → North, S → South, E → East, W → West
- Unit designators: Apt → Apartment, Ste → Suite, Fl → Floor
Normalization is applied before the external API call to improve cache hit rates — two inputs that differ only in abbreviation style should map to the same cache key.
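The effect on cache keys can be sketched as follows. The `EXPANSIONS` subset, the `normalize_line` and `cache_key` helpers, and the choice to title-case unmapped tokens are assumptions made for illustration, not the service's actual key format.

```python
import hashlib
import json

# Small subset of the mappings listed above, keyed in lowercase.
EXPANSIONS = {"st": "Street", "ave": "Avenue", "n": "North", "apt": "Apartment"}


def normalize_line(line: str) -> str:
    """Expand known abbreviations; ignore case, periods, and commas."""
    tokens = line.replace(".", " ").replace(",", " ").split()
    # Title-casing unmapped tokens is only for key stability, not for display.
    return " ".join(EXPANSIONS.get(t.lower(), t.title()) for t in tokens)


def cache_key(line1: str, zip_code: str) -> str:
    """Hash the normalized components so equivalent inputs share one key."""
    canonical = json.dumps(
        {"line1": normalize_line(line1), "zip": zip_code.zfill(5)},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this sketch, "123 N. Main St" and "123 north main street" produce the same key, while a different house number produces a different one.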
Schema
-- ll_to_earth() comes from the earthdistance extension, which depends on cube.
CREATE EXTENSION IF NOT EXISTS cube;
CREATE EXTENSION IF NOT EXISTS earthdistance;

CREATE TABLE Address (
    id             BIGSERIAL PRIMARY KEY,
    raw_input      TEXT NOT NULL,
    line1          VARCHAR(255),
    line2          VARCHAR(128),
    city           VARCHAR(128),
    state          VARCHAR(64),
    zip            VARCHAR(16),
    country        VARCHAR(8) NOT NULL DEFAULT 'US',
    lat            NUMERIC(10, 7),
    lng            NUMERIC(10, 7),
    deliverability VARCHAR(32), -- 'deliverable', 'undeliverable', 'vacant', 'unknown'
    validated_at   TIMESTAMPTZ
);

CREATE TABLE AddressValidationCache (
    normalized_key VARCHAR(512) PRIMARY KEY,
    result         JSONB NOT NULL,
    cached_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Expression index used by the proximity queries.
CREATE INDEX idx_address_geo ON Address USING GIST (
    ll_to_earth(lat, lng)
);
External API Integration with Circuit Breaker
The SmartyStreets or USPS API is a synchronous dependency. Network failures must not cascade into checkout failures. A circuit breaker wraps the API call:
- Closed: normal operation, requests pass through.
- Open: after N consecutive failures within a window, the breaker opens and all calls fast-fail with a cached or degraded response.
- Half-open: after a timeout, one probe request is allowed; if it succeeds, the breaker closes.
In the open state, the service returns the standardized address without external verification, flagging deliverability as unknown. This is preferable to blocking checkout entirely.
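The three states can be sketched as a small context manager. This is a hypothetical, single-threaded illustration: the class body, the `CircuitOpenError` name, and the injectable `clock` are assumptions, and for brevity it counts consecutive failures without a time window and does not limit concurrent half-open probes.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half-open"
        return "open"

    def __enter__(self):
        if self.state == "open":
            raise CircuitOpenError("fast-fail: breaker is open")
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            # Success (including a half-open probe) closes the breaker.
            self._failures = 0
            self._opened_at = None
        else:
            self._failures += 1
            # A failed probe, or too many consecutive failures, (re)opens it.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
        return False  # never swallow the caller's exception
```

A caller wraps the API request in `with breaker:` and treats `CircuitOpenError` the same as a network failure, falling back to the degraded response described above.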
Python Implementation
import hashlib
import json
import os

import requests
from Levenshtein import distance as levenshtein_distance

from db import get_db                       # project-local helper; rows are assumed to be dict-like
from circuit_breaker import CircuitBreaker  # project-local breaker, used as a context manager
# parse_address is assumed to be provided elsewhere (e.g. a wrapper around the usaddress lib).

# Keys are lowercase so lookups are case-insensitive.
ABBREVIATION_MAP = {
    'st': 'Street', 'ave': 'Avenue', 'blvd': 'Boulevard',
    'dr': 'Drive', 'ln': 'Lane', 'ct': 'Court', 'rd': 'Road',
    'n': 'North', 's': 'South', 'e': 'East', 'w': 'West',
    'apt': 'Apartment', 'ste': 'Suite', 'fl': 'Floor',
}

# Map SmartyStreets dpv_match_code to the values stored in Address.deliverability.
# Codes S and D (secondary-unit problems) fall through to 'unknown' here.
DPV_TO_DELIVERABILITY = {'Y': 'deliverable', 'N': 'undeliverable'}

# Credentials come from the environment; the variable names are placeholders.
GOOGLE_API_KEY = os.environ.get('GOOGLE_API_KEY', '')
SS_AUTH_ID = os.environ.get('SS_AUTH_ID', '')
SS_AUTH_TOKEN = os.environ.get('SS_AUTH_TOKEN', '')

smarty_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)
geocode_breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=60)
CACHE_TTL_DAYS = 30


def _expand(token: str) -> str:
    """Expand one token, ignoring case and a trailing period ('St.' -> 'Street')."""
    return ABBREVIATION_MAP.get(token.rstrip('.').lower(), token)


def _line1(components: dict) -> str:
    """Join house number and street name into a single delivery line."""
    return f"{components.get('address_number', '')} {components.get('street_name', '')}".strip()


def standardize_address(parsed: dict) -> dict:
    """Expand abbreviations in street name and unit designator."""
    result = dict(parsed)
    tokens = (result.get('street_name') or '').split()
    result['street_name'] = ' '.join(_expand(t) for t in tokens)
    if result.get('unit_designator'):
        result['unit_designator'] = _expand(result['unit_designator'])
    result['state'] = (result.get('state') or '').upper()
    zip_code = result.get('zip') or ''
    # Pad only when a ZIP is present, so a missing ZIP stays empty instead of '00000'.
    result['zip'] = zip_code.zfill(5) if zip_code else ''
    return result


def _cache_key(standardized: dict) -> str:
    canonical = json.dumps(standardized, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def _get_cached(key: str) -> dict | None:
    db = get_db()
    # make_interval keeps the TTL a real bind parameter instead of
    # interpolating it inside a quoted SQL literal.
    row = db.execute("""
        SELECT result FROM AddressValidationCache
        WHERE normalized_key = %s
          AND cached_at > NOW() - make_interval(days => %s)
    """, (key, CACHE_TTL_DAYS)).fetchone()
    return row['result'] if row else None


def _set_cached(key: str, result: dict):
    db = get_db()
    db.execute("""
        INSERT INTO AddressValidationCache (normalized_key, result, cached_at)
        VALUES (%s, %s, NOW())
        ON CONFLICT (normalized_key) DO UPDATE
        SET result = EXCLUDED.result, cached_at = NOW()
    """, (key, json.dumps(result)))
    db.commit()


def geocode_address(standardized: dict) -> tuple[float | None, float | None]:
    """Resolve address to lat/lng via Google Maps Geocoding API."""
    # The standardized dict has address_number/street_name, not a line1 key.
    address_str = "{}, {}, {} {}".format(
        _line1(standardized), standardized.get('city', ''),
        standardized.get('state', ''), standardized.get('zip', ''))
    try:
        with geocode_breaker:
            resp = requests.get(
                'https://maps.googleapis.com/maps/api/geocode/json',
                params={'address': address_str, 'key': GOOGLE_API_KEY},
                timeout=3,
            )
            resp.raise_for_status()
            data = resp.json()
    except Exception:
        # Breaker open, timeout, or HTTP error (the breaker's exception type
        # depends on the circuit_breaker module): degrade to "no coordinates".
        return None, None
    if data.get('status') == 'OK' and data.get('results'):
        loc = data['results'][0]['geometry']['location']
        return loc['lat'], loc['lng']
    return None, None


def validate_address(raw_input: str) -> dict:
    """Full validation pipeline: parse, standardize, verify, geocode, score."""
    parsed = parse_address(raw_input)
    standardized = standardize_address(parsed)
    cache_key = _cache_key(standardized)
    cached = _get_cached(cache_key)
    if cached:
        return cached

    result = {
        'line1': _line1(standardized),
        'line2': standardized.get('unit', ''),
        'city': standardized.get('city', ''),
        'state': standardized.get('state', ''),
        'zip': standardized.get('zip', ''),
        'country': 'US',
        'deliverability': 'unknown',
        'lat': None,
        'lng': None,
    }

    verified = False
    candidates = []
    try:
        with smarty_breaker:
            ss_resp = requests.post(
                'https://us-street.api.smartystreets.com/street-address',
                json=[{'street': result['line1'], 'city': result['city'],
                       'state': result['state'], 'zipcode': result['zip']}],
                params={'auth-id': SS_AUTH_ID, 'auth-token': SS_AUTH_TOKEN},
                timeout=3,
            )
            ss_resp.raise_for_status()
            candidates = ss_resp.json()
        verified = True
    except Exception:
        # Breaker open or API failure: keep deliverability 'unknown'
        # rather than failing the caller (e.g. checkout).
        pass

    if candidates:
        c = candidates[0]
        analysis = c.get('analysis', {})
        if analysis.get('dpv_vacant') == 'Y':
            result['deliverability'] = 'vacant'
        else:
            result['deliverability'] = DPV_TO_DELIVERABILITY.get(
                analysis.get('dpv_match_code'), 'unknown')
        result['line1'] = c['delivery_line_1']
        result['zip'] = c['components']['zipcode']
    elif verified:
        # The API answered but returned no candidates: treat as no match.
        result['deliverability'] = 'undeliverable'

    lat, lng = geocode_address(standardized)
    result['lat'] = lat
    result['lng'] = lng

    # Cache only verified results so a degraded response is not served for 30 days.
    if verified:
        _set_cached(cache_key, result)

    db = get_db()
    db.execute("""
        INSERT INTO Address (raw_input, line1, line2, city, state, zip, country, lat, lng, deliverability, validated_at)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, NOW())
    """, (raw_input, result['line1'], result['line2'], result['city'],
          result['state'], result['zip'], result['country'],
          result['lat'], result['lng'], result['deliverability']))
    db.commit()
    return result


def fuzzy_suggest(raw_input: str, known_addresses: list[str], threshold: int = 5) -> list[str]:
    """Return known addresses within Levenshtein distance threshold of raw_input."""
    return [
        addr for addr in known_addresses
        if levenshtein_distance(raw_input.lower(), addr.lower()) <= threshold
    ]
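The fixed threshold in fuzzy_suggest treats short and long strings alike. One possible refinement, sketched below, normalizes the edit distance by string length and ranks candidates by closeness. The `ranked_suggestions` name and the 0.2 cutoff are illustrative assumptions, and the pure-Python `edit_distance` stands in for the Levenshtein library only to keep the example self-contained.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def ranked_suggestions(raw: str, known: list, max_ratio: float = 0.2) -> list:
    """Return candidates whose length-normalized distance is within max_ratio, closest first."""
    scored = []
    for addr in known:
        d = edit_distance(raw.lower(), addr.lower())
        ratio = d / max(len(raw), len(addr), 1)
        if ratio <= max_ratio:
            scored.append((d, addr))
    return [addr for _, addr in sorted(scored)]
```

Ranking by distance lets the caller present the closest match first while still leaving the final choice to the user.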
Geocoding and Proximity Queries
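Latitude and longitude are stored as plain NUMERIC columns and indexed through PostgreSQL's earthdistance extension (ll_to_earth) for efficient proximity queries. A query for all addresses within 10 km of a point first narrows candidates with the indexed earth_box test, which can include points slightly beyond the radius, then applies the exact earth_distance filter: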
SELECT * FROM Address
WHERE earth_box(ll_to_earth(37.7749, -122.4194), 10000) @> ll_to_earth(lat, lng)
AND earth_distance(ll_to_earth(37.7749, -122.4194), ll_to_earth(lat, lng)) < 10000;
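For sanity-checking query results outside the database, the same great-circle distance can be approximated in application code. The sketch below uses the haversine formula with the spherical radius that earthdistance assumes by default (6378168 m); the `haversine_m` name is hypothetical, and results should agree closely with earth_distance rather than exactly.

```python
import math

EARTH_RADIUS_M = 6378168.0  # default radius returned by earthdistance's earth()


def haversine_m(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in meters between two lat/lng points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_radius(origin: tuple, point: tuple, radius_m: float = 10_000) -> bool:
    """Mirror of the SQL filter: keep points strictly closer than radius_m."""
    return haversine_m(origin[0], origin[1], point[0], point[1]) < radius_m
```

This is useful in tests, for example to confirm that a row returned by the 10 km query really lies inside the radius.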
Caching Strategy
Validated addresses are cached for 30 days, keyed by the SHA-256 hash of the standardized address JSON. The TTL is long relative to most caches because addresses change infrequently: streets are rarely renamed or renumbered. Because the key is derived from the standardized form, inputs that differ only in abbreviation style, state-code case, or ZIP padding hit the same entry; genuine typos produce a different key and are left to fuzzy suggestion. Outside of TTL expiry, entries are invalidated proactively only when the postal authority issues bulk updates, for example by ingesting USPS monthly address change files.
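The expiry and bulk-invalidation behavior can be sketched with an in-memory stand-in for the cache table. The `TTLCache` class and its method names are hypothetical; the real service keeps entries in AddressValidationCache, and how bulk-update files map to cache keys is left as an assumption.

```python
from datetime import datetime, timedelta, timezone


class TTLCache:
    """In-memory stand-in for the cache table, with TTL expiry and bulk invalidation."""

    def __init__(self, ttl: timedelta = timedelta(days=30)):
        self.ttl = ttl
        self._entries = {}  # key -> (result, cached_at)

    def set(self, key, result, now=None):
        self._entries[key] = (result, now or datetime.now(timezone.utc))

    def get(self, key, now=None):
        entry = self._entries.get(key)
        if entry is None:
            return None
        result, cached_at = entry
        now = now or datetime.now(timezone.utc)
        if now - cached_at >= self.ttl:
            # Expired: drop the entry so the caller re-validates.
            del self._entries[key]
            return None
        return result

    def invalidate(self, keys) -> int:
        """Drop the given keys (e.g. derived from a bulk update file); return how many existed."""
        removed = 0
        for key in keys:
            if self._entries.pop(key, None) is not None:
                removed += 1
        return removed
```

Passing `now` explicitly keeps the expiry logic easy to test without waiting for real time to pass.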
Frequently Asked Questions
How do you handle international addresses when the service is designed for US addresses?
International addresses require a different pipeline: USPS and SmartyStreets are US-only. For international validation, services like Google Address Validation API, Loqate, or Melissa Data support global coverage. The pipeline detects country from the raw input or a separate country field, routes to the appropriate provider, and stores results in the same Address table with country code populated. Standardization rules are country-specific (UK postcodes, Canadian postal codes, etc.) and loaded from provider-specific configuration. For unsupported countries, the service falls back to basic format checking and geocoding without deliverability scoring.
How accurate is geocoding for address validation purposes?
Geocoding accuracy varies by provider and address type. Google Maps Geocoding API achieves rooftop-level accuracy for most US addresses, returning a location_type of ROOFTOP. For rural or newly constructed addresses it may return RANGE_INTERPOLATED or GEOMETRIC_CENTER, which are less precise; the remaining value is APPROXIMATE. The location_type is stored alongside the coordinates in the Address row so downstream systems can apply different proximity thresholds based on accuracy and warn users when lat/lng precision is low. For shipping, rooftop accuracy is required; for analytics, GEOMETRIC_CENTER is sufficient.
Why cache validated addresses for 30 days instead of indefinitely?
A 30-day TTL balances freshness against cost. Addresses do change: buildings are demolished, streets renamed, zip code boundaries redrawn. Indefinite caching would serve stale deliverability data to new orders. 30 days covers typical re-order cycles while ensuring that significant address changes are picked up within a month. High-traffic addresses (warehouses, popular return addresses) are refreshed proactively by a background job that re-validates any address accessed more than 100 times in the TTL window.
How is the address validation cache keyed?
The raw input is normalized (lowercase, whitespace-collapsed, abbreviations expanded) before hashing with SHA-256; the hash is the cache key, ensuring semantically identical addresses hit the same cache entry.
How is fuzzy correction confidence thresholded to avoid wrong suggestions?
The Levenshtein distance threshold is tuned per field. For zip codes (5 characters), a threshold of 1 catches single-digit typos without false positives. For street names, a threshold of 3-5 relative to string length (normalized edit distance below 0.2) works well. Suggestions are ranked by distance and presented to the user as options rather than auto-applied, since the system cannot determine user intent from keystroke errors alone. Auto-correction is only applied when exactly one candidate falls within threshold and the deliverability score of that candidate is significantly higher.
What confidence threshold triggers a fuzzy correction suggestion?
Levenshtein distance below 3 edits and a valid deliverability score on the corrected address triggers a suggestion; higher distances are discarded as too uncertain to recommend.