Missing country code

Imagine you get a data set from a client that contains addresses from 150 countries around the world, and your task is to verify them. The data is stored in four fields: Address Line 1, Address Line 2, City, and ZIP code. You also have an address verification solution available for each country, but the data set does not include the country code. How would you design the logic to process the data and use the verification components in the most efficient way?
Hint: Running all of those 150 address verification components against each record is not considered efficient.
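One possible direction, sketched here under assumptions the problem does not state: use a cheap signal such as the postal code format to narrow the candidate countries, then run only those verifiers. The names `ZIP_PATTERNS` and `candidate_countries` are hypothetical, and the patterns are deliberately simplified.

```python
import re

# Simplified postal-code patterns for a handful of countries.
# Illustrative assumptions only, not authoritative formats.
ZIP_PATTERNS = {
    "US": re.compile(r"\d{5}(-\d{4})?"),
    "GB": re.compile(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}"),
    "CA": re.compile(r"[A-Z]\d[A-Z] ?\d[A-Z]\d"),
    "DE": re.compile(r"\d{5}"),
    "NL": re.compile(r"\d{4} ?[A-Z]{2}"),
}

def candidate_countries(zip_code):
    """Return the countries whose postal-code pattern matches zip_code."""
    normalized = zip_code.strip().upper()
    return [country for country, pattern in ZIP_PATTERNS.items()
            if pattern.fullmatch(normalized)]

print(candidate_countries("90210"))     # matches both five-digit formats
print(candidate_countries("SW1A 1AA"))
```

With this table, a ZIP code like "90210" narrows the work to two verifiers instead of 150. Other fields, such as the city name, could plausibly narrow the candidates further before any verifier runs.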

2026 Update: Phone Number Normalization — A Systems Design Problem

Parsing and normalizing international phone numbers is harder than it looks, and it appears in systems design interviews at companies that handle global users (WhatsApp, Twilio, Stripe, Uber).

import re

# Naive approach (what NOT to do)
def extract_phone_naive(text: str) -> str:
    return re.sub(r'\D', '', text)  # Strips all non-digits, ignores country codes

# Better: Use E.164 format (+[country_code][number])
COUNTRY_CODES = {
    '1': 'US/CA',   # NANP: US, Canada, Caribbean
    '44': 'UK',
    '91': 'India',
    '86': 'China',
    '49': 'Germany',
    '33': 'France',
    '81': 'Japan',
    '61': 'Australia',
    '55': 'Brazil',
    '7': 'Russia',
}

def normalize_phone(raw: str, default_country='1') -> dict:
    """
    Normalize phone number to E.164 format.
    Returns dict with country_code, national_number, e164.
    """
    # Strip all formatting, keeping only digits and '+'
    digits = re.sub(r'[^\d+]', '', raw)

    # The international dialing prefix 00 is equivalent to '+'
    if digits.startswith('00'):
        digits = '+' + digits[2:]

    # Determine if international format (+XX...)
    if digits.startswith('+'):
        digits = digits[1:]
        # Try to match country code (1-3 digits), longest first
        for length in [3, 2, 1]:
            code = digits[:length]
            if code in COUNTRY_CODES:
                return {
                    'country_code': code,
                    'national_number': digits[length:],
                    'e164': f'+{digits}',
                    'country': COUNTRY_CODES[code]
                }
        return {'error': 'Unknown country code', 'raw': raw}

    # Try default country
    return {
        'country_code': default_country,
        'national_number': digits,
        'e164': f'+{default_country}{digits}',
        'country': COUNTRY_CODES.get(default_country, 'Unknown')
    }

# Test cases
tests = [
    "+1-800-555-0100",
    "+44 20 7946 0958",
    "0044 20 7946 0958",
    "918001234567",    # Missing country code prefix — ambiguous!
    "(415) 555-0100",  # US local format
    "+91-800-123-4567",
]

for t in tests:
    result = normalize_phone(t)
    print(f"{t!r:35} → {result.get('e164', 'ERROR: ' + result.get('error', ''))}")

Production solution: Use Google’s libphonenumber library (Python: phonenumbers package). It handles 200+ countries, formatting rules, carrier lookup, and validation. The naive regex approach fails on ~20% of real-world phone numbers due to region-specific rules (leading zeros, variable national number lengths, special service numbers).
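As a rough illustration of the library route, the sketch below wraps the `phonenumbers` package in a hypothetical helper named `to_e164`. It assumes the package is installed (`pip install phonenumbers`); the guarded import only keeps the snippet loadable when it is not.

```python
try:
    import phonenumbers  # third-party package, not part of the standard library
except ImportError:
    phonenumbers = None  # the helper below then returns None for every input

def to_e164(raw, region=None):
    """Return raw in E.164 form, or None if it cannot be parsed or validated.

    region is a two-letter default region such as "US", used only when
    raw has no explicit international prefix.
    """
    if phonenumbers is None:
        return None
    try:
        parsed = phonenumbers.parse(raw, region)
    except phonenumbers.NumberParseException:
        return None
    if not phonenumbers.is_valid_number(parsed):
        return None
    return phonenumbers.format_number(
        parsed, phonenumbers.PhoneNumberFormat.E164
    )

print(to_e164("+44 20 7946 0958"))
print(to_e164("(415) 555-0100", region="US"))
```

Returning None for anything unparseable or invalid is a design choice of this sketch; a real pipeline might instead keep the raw value and flag it for review.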

System design follow-up: “How would you deduplicate 100M users who entered their phone number in different formats?” → Normalize every number to E.164 first (remove formatting characters, replace an international dialing prefix such as 00 with the country code form, and add the country code where it is missing), then hash the result. Two numbers that normalize to the same E.164 string refer to the same phone.
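A minimal sketch of the normalize-then-hash idea follows. The helpers `simple_e164`, `phone_key`, and `dedupe` are hypothetical names, and the normalizer is intentionally simplified: it assumes a default country code of 1 and does no trunk-prefix handling.

```python
import hashlib
import re

def simple_e164(raw, default_country="1"):
    """Simplified normalizer: digits only, with a leading '+' and country code."""
    digits = re.sub(r"[^\d+]", "", raw)
    if digits.startswith("00"):          # international dialing prefix
        digits = "+" + digits[2:]
    if not digits.startswith("+"):       # assume the default country
        digits = "+" + default_country + digits
    return digits

def phone_key(raw):
    """Hash of the normalized number, usable as a deduplication key."""
    return hashlib.sha256(simple_e164(raw).encode("ascii")).hexdigest()

def dedupe(users):
    """Map each phone key to the first user id seen with that number."""
    first_seen = {}
    for user_id, phone in users:
        first_seen.setdefault(phone_key(phone), user_id)
    return first_seen

users = [
    ("a", "+1 (415) 555-0100"),
    ("b", "415-555-0100"),
    ("c", "0044 20 7946 0958"),
    ("d", "+44 20 7946 0958"),
]
print(sorted(dedupe(users).values()))
```

At the scale in the question, the same key could serve as a unique index or a partitioning key rather than an in-memory dictionary; that choice depends on the storage system and is not specified here.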

💡Strategies for Solving This Problem

Finding Missing Element

This is usually "find missing number in array" in disguise. Common variation: array has numbers 1 to n, but one is missing. Find it.

The Setup

Imagine you have phone numbers with country codes 1 to n. One country code is missing from your dataset. Which one?

Approach 1: Sum Formula

Sum of 1 to n = n(n+1)/2

Calculate expected sum, subtract actual sum. The difference is the missing number.

O(n) time, O(1) space. Clean and simple.
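The sum approach can be sketched in a few lines. The function name `find_missing_sum` is illustrative, and the input is assumed to hold the numbers 1 to n with exactly one missing.

```python
def find_missing_sum(nums, n):
    """Return the one number from 1..n that is absent from nums."""
    expected = n * (n + 1) // 2   # sum of 1..n
    return expected - sum(nums)   # the gap is the missing number

print(find_missing_sum([1, 2, 4, 5], 5))
```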

Approach 2: XOR

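XOR has two useful properties: a ⊕ a = 0 and a ⊕ 0 = a. Because XOR is also commutative and associative, every number that appears in both sequences cancels out.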

XOR all numbers 1 to n, then XOR all array elements. The result is the missing number.

O(n) time, O(1) space. No overflow risk (unlike sum).
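A short sketch of the XOR approach, under the same assumption that the input holds 1 to n with exactly one value missing (`find_missing_xor` is an illustrative name):

```python
def find_missing_xor(nums, n):
    """Return the one number from 1..n that is absent from nums."""
    result = 0
    for i in range(1, n + 1):   # XOR in every expected value
        result ^= i
    for value in nums:          # values that are present cancel out
        result ^= value
    return result               # only the missing value remains

print(find_missing_xor([1, 2, 4, 5], 5))
```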

Approach 3: Hash Set

Add all array elements to set. Check which number from 1 to n is not in set.

O(n) time, O(n) space. Works but uses more memory.
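For comparison, the set-based version might look like this. The name `find_missing_set` is illustrative, and it returns None when nothing is missing, which is a choice made for this sketch.

```python
def find_missing_set(nums, n):
    """Return the first number from 1..n not present in nums, else None."""
    seen = set(nums)              # O(n) extra memory
    for i in range(1, n + 1):
        if i not in seen:         # average O(1) membership test
            return i
    return None

print(find_missing_set([1, 2, 4, 5], 5))
```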

Why XOR is Better

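In languages with fixed-width integers, the sum n(n+1)/2 can overflow when n is large. XOR never produces a value wider than its inputs, so it doesn't have this problem.

XOR is also more versatile: the same cancellation idea works for finding a duplicate, a missing number, or the single unpaired number in an array.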

At Various Companies

This shows up all the time with different stories: missing file number, missing ID, missing floor number, etc. The solution is always the same.

Key question: What if there are multiple missing numbers? Or one duplicate? Each variation has a trick.
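As one example of such a trick, the two-missing-numbers variation can be handled by partitioning on a bit where the two missing values differ. This is a sketch of that common technique, not necessarily the approach an interviewer expects; `find_two_missing` is an illustrative name, and the input is assumed to hold 1 to n with exactly two values missing.

```python
def find_two_missing(nums, n):
    """Return the two numbers from 1..n absent from nums, in ascending order."""
    # Step 1: XOR everything, leaving a ^ b for the missing values a and b.
    combined = 0
    for i in range(1, n + 1):
        combined ^= i
    for value in nums:
        combined ^= value

    # Step 2: pick the lowest bit where a and b differ.
    low_bit = combined & -combined

    # Step 3: XOR only the values with that bit set; this isolates one of them.
    a = 0
    for i in range(1, n + 1):
        if i & low_bit:
            a ^= i
    for value in nums:
        if value & low_bit:
            a ^= value

    b = combined ^ a
    return (min(a, b), max(a, b))

print(find_two_missing([1, 2, 4, 6], 6))
```

This keeps the O(n) time and O(1) extra space of the single-number XOR solution.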
