Detect Phone Numbers Missing a Country Code: Heuristics and libphonenumber

Parse Phone Numbers Without a Country Code: Heuristic Detection

“Given a list of phone numbers in mixed formats, identify which are missing a country code.” This is a classic Apple / consumer-product engineering interview question that tests string parsing, regular expression skill, and the ability to reason about heuristics under ambiguous data. The base problem is simple — check whether a phone number has a country code prefix — but real-world phone-number data is messy enough that pure rule-based detection misses cases. This guide covers the standard approaches and the trade-offs between strict pattern matching and heuristic-based detection.

Problem Statement

Given a string representing a phone number in arbitrary format, determine whether it includes a country code. Country codes are 1 to 3 digits long and, in international notation, are preceded by “+” or by an international dialing prefix such as “00”.

Examples:

  • "+1 555-123-4567" → has country code (1 = US)
  • "555-123-4567" → missing country code
  • "+44 20 7123 4567" → has country code (44 = UK)
  • "00 49 30 12345678" → has country code (49 = Germany, prefix 00)
  • "(555) 123-4567" → missing country code

Approach 1: Strict Prefix Detection

Check for “+” or “00” at the start. Naive but works for clean data.

def has_country_code_strict(phone: str) -> bool:
    """Detect country code via prefix."""
    s = phone.strip()
    return s.startswith("+") or s.startswith("00")


# Tests
print(has_country_code_strict("+1 555-123-4567"))      # True
print(has_country_code_strict("555-123-4567"))          # False
print(has_country_code_strict("00 49 30 12345678"))     # True
print(has_country_code_strict("(555) 123-4567"))        # False

Acceptable for input with consistent formatting. Misses cases where the country code is implied without an explicit “+” — e.g., “1-800-…” in US format where “1” is the country code, or a leading “1” before a 10-digit number.

Approach 2: Length-Based Heuristic

Count the digits in the number. Most national formats have 7-10 digits; numbers with 11+ digits often imply an embedded country code.

import re

def count_digits(phone: str) -> int:
    # \D matches any non-digit character; removing them leaves only the digits
    return len(re.sub(r"\D", "", phone))


def has_country_code_heuristic(phone: str) -> bool:
    """Heuristic combining prefix and length."""
    s = phone.strip()
    if s.startswith("+") or s.startswith("00"):
        return True
    digit_count = count_digits(s)
    # US format with leading 1 (11 digits): country code present
    if digit_count == 11 and re.sub(r"\D", "", s).startswith("1"):
        return True
    # Strictly more than 10 digits: probably has country code
    if digit_count > 10:
        return True
    return False


# Tests
print(has_country_code_heuristic("+1 555-123-4567"))       # True
print(has_country_code_heuristic("1-555-123-4567"))         # True (11 digits, leading 1)
print(has_country_code_heuristic("555-123-4567"))           # False (10 digits, no prefix)
print(has_country_code_heuristic("(555) 123-4567"))         # False

The heuristic catches “1-555-…” style US numbers without an explicit “+”. Trade-off: digit count alone cannot distinguish a national number from an international one written without a prefix. A 10-digit string may be a US number with no country code, or a shorter foreign number with its country code already included.

Approach 3: Library-Based (Production)

For real-world phone-number parsing, use Google’s libphonenumber. It handles every country’s formats, normalizes input, and provides validation.

# pip install phonenumbers
import phonenumbers

def has_country_code_lib(phone: str, default_region: str = "US") -> bool:
    """Use libphonenumber for production-quality parsing."""
    try:
        # keep_raw_input=True makes the parser record where the country code came from
        parsed = phonenumbers.parse(phone, default_region, keep_raw_input=True)
    except phonenumbers.phonenumberutil.NumberParseException:
        return False
    # FROM_DEFAULT_COUNTRY means the code was filled in from default_region,
    # i.e. the input itself carried no country code
    return parsed.country_code_source != phonenumbers.CountryCodeSource.FROM_DEFAULT_COUNTRY

For interview purposes, mention libphonenumber as the production answer; implement a manual heuristic to demonstrate skill.

Approach 4: Country Code Lookup

Maintain a list of valid country codes (1, 7, 20, 27, 30, 31, …, 998). Strip non-digits; check whether the leading digits match any valid country code.

COUNTRY_CODES = {
    "1", "7", "20", "27", "30", "31", "32", "33", "34", "36", "39", "40", "41",
    "43", "44", "45", "46", "47", "48", "49", "51", "52", "53", "54", "55",
    "56", "57", "58", "60", "61", "62", "63", "64", "65", "66", "81", "82",
    "84", "86", "90", "91", "92", "93", "94", "95", "98",
    # ... (full list ~250 codes)
}

def has_country_code_lookup(phone: str) -> bool:
    # \D matches any non-digit character
    digits = re.sub(r"\D", "", phone)
    # Drop a leading "00" dialing prefix so the country code comes first
    if digits.startswith("00"):
        digits = digits[2:]
    # Try 1, 2, 3-digit prefixes
    for length in (1, 2, 3):
        if digits[:length] in COUNTRY_CODES:
            # Validate remaining length
            remaining = len(digits) - length
            if 7 <= remaining <= 12:  # plausible national number length
                return True
    return False

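This rejects prefixes that are not real country codes, but it requires maintaining the country-code list, and it still cannot tell a bare national number from an international one: any number whose leading digits happen to match a code passes the check. libphonenumber is preferable for production.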
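Applied to bare digits, a prefix lookup can misfire: "555-123-4567" starts with 55, which is Brazil's country code. One possible way to limit this is to consult the table only after an explicit "+" or "00" has been seen. The sketch below assumes that policy; `SAMPLE_CODES` is a deliberately tiny illustrative subset and `split_country_code` is a hypothetical helper name, neither taken from any library.

```python
import re

# Tiny illustrative subset; a real table has roughly 250 entries.
SAMPLE_CODES = {"1", "44", "49", "55", "91"}


def split_country_code(phone: str):
    """Return (country_code, national_digits), or None when the input has
    no explicit international prefix or the code is not in the table."""
    s = phone.strip()
    if s.startswith("+"):
        digits = re.sub(r"\D", "", s)
    elif s.startswith("00"):
        # Remove the two leading zeros of the dialing prefix
        digits = re.sub(r"\D", "", s)[2:]
    else:
        return None
    # Country codes are prefix-free, so the first match is the only match
    for length in (1, 2, 3):
        if digits[:length] in SAMPLE_CODES:
            return digits[:length], digits[length:]
    return None


print(split_country_code("+44 20 7123 4567"))
print(split_country_code("555-123-4567"))
```

Gating on the prefix trades recall for precision: "1-555-123-4567" is reported as having no country code here, so this would normally be combined with the length heuristic from Approach 2.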

Common Variations

Format / normalize phone numbers

Convert various formats to a canonical E.164 format (e.g., +14155551234). libphonenumber’s format_number handles this.
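For interview purposes, a standard-library sketch of E.164 normalization might look like the following. It assumes a single default country whose national numbers have a fixed length (the defaults model US numbers), and it does not handle trunk prefixes such as the "(0)" seen in some European formats. `to_e164`, `default_country_code`, and `national_length` are hypothetical names chosen for this example.

```python
import re


def to_e164(phone: str, default_country_code: str = "1", national_length: int = 10) -> str:
    """Best-effort conversion to E.164 under the assumptions stated above."""
    s = phone.strip()
    digits = re.sub(r"\D", "", s)
    if s.startswith("+"):
        return "+" + digits
    if s.startswith("00"):
        # "00" dialing prefix: drop it and use "+" instead
        return "+" + digits[2:]
    if len(digits) == national_length:
        # Bare national number: prepend the assumed default country code
        return "+" + default_country_code + digits
    if (digits.startswith(default_country_code)
            and len(digits) == national_length + len(default_country_code)):
        # Country code present but written without "+"
        return "+" + digits
    raise ValueError(f"cannot normalize {phone!r} without more context")


print(to_e164("(415) 555-1234"))
print(to_e164("1-415-555-1234"))
```

Raising on anything that does not fit keeps the function from silently inventing a country code for numbers it does not understand.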

Detect emergency numbers

Numbers like 911 (US), 999 (UK), 112 (EU). Special-case these; they don’t follow standard country-code rules.
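A minimal sketch of that special case, meant to run before any country-code heuristic. The set contains only the three numbers named above; `EMERGENCY_NUMBERS` and `is_emergency_number` are hypothetical names, and a real system would need a per-region list.

```python
import re

# Only the examples mentioned in the text; not a complete list.
EMERGENCY_NUMBERS = {"911", "999", "112"}


def is_emergency_number(phone: str) -> bool:
    """True if the digits of the input are exactly a known emergency number."""
    return re.sub(r"\D", "", phone) in EMERGENCY_NUMBERS


print(is_emergency_number("9-1-1"))
print(is_emergency_number("911-555-0100"))
```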

Detect short codes / SMS shortcodes

Marketing shortcodes (4-6 digits) aren’t traditional phone numbers. Different validation; libphonenumber distinguishes them.

Detect VoIP / mobile / landline

By number prefix within a country, you can sometimes infer line type. libphonenumber provides this; rule-based detection is fragile.

Common Mistakes

  • Assuming all phone numbers are 10 digits. US/Canada are 10; UK is 10–11; Germany is 10–13; Sweden is 7–10. National lengths vary widely.
  • Not normalizing whitespace and separators. Strip all non-digits before pattern matching, except the leading “+” which is meaningful.
  • Confusing “+” with “00” prefixes. Both indicate international format. “+” is more common; “00” is older but still seen in Europe.
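  • Assuming a leading “1” is always the US country code. Within the North American Numbering Plan, length disambiguates: area codes never begin with 0 or 1, so a 10-digit number cannot start with “1”, while 11 digits starting with “1” means the country code is present. Elsewhere the rule breaks down: São Paulo numbers in Brazil, for example, use area code 11.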
  • Hand-rolling regex when libphonenumber exists. For production, use libphonenumber. Hand-rolled regex misses edge cases (extensions, special services, regional formats). For interview purposes, hand-rolling demonstrates skill, but mention the production answer.
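The normalization and prefix points above can be sketched in a few lines. This is one possible shape, not a canonical one: `normalize_phone` is a hypothetical helper that strips separators, keeps a leading "+", and maps a leading "00" to "+" so later checks only have to look for one marker.

```python
import re


def normalize_phone(phone: str) -> str:
    """Reduce a phone string to digits, with a single leading '+' if the
    input was written in international notation."""
    s = phone.strip()
    digits = re.sub(r"\D", "", s)
    if s.startswith("+"):
        return "+" + digits
    if s.startswith("00"):
        # Treat the "00" dialing prefix as equivalent to "+"
        return "+" + digits[2:]
    return digits


print(normalize_phone("00 49 30 12345678"))
print(normalize_phone("(555) 123-4567"))
```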

Frequently Asked Questions

What’s the expected interview answer?

Combine prefix detection (“+”, “00”) with digit-count heuristics (11+ digits typically indicates country code). Mention libphonenumber for production; implement the heuristic to demonstrate skill. Walk through edge cases: 10-digit US, 11-digit US-with-1, “+44”, “00 49”. Strong candidates anticipate that real data is messy and design for it.

Why is this hard?

Phone-number formats vary by country, region, and provider. Real-world data is dirty: missing prefixes, mixed separators, inconsistent formatting. A clean rule that works for one country fails for another. The interview question tests whether you account for this messiness instead of writing a one-size-fits-all rule.

How accurate are heuristic approaches?

Treat the following as rough rules of thumb rather than measured benchmarks. For mostly US-centric data, simple heuristics are around 95% accurate. For globally mixed data, hand-rolled rules drop to roughly 80%. Library-based parsing (libphonenumber) approaches 99% when given the correct default region. In an interview, quote these as estimates and don’t claim 100%.

What about “extensions” like “+1 555-123-4567 ext. 200”?

Extensions are written as part of the phone number but are separate from the routable number. libphonenumber parses them as a distinct field. For interview purposes, strip the extension marker (“ext.”, “x”, “extension”) together with the digits that follow it before processing the main number; otherwise the extension digits inflate the digit count.
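A hedged sketch of that pre-processing step follows. The pattern covers only the three markers mentioned above and assumes the extension sits at the end of the string; `split_extension` is a hypothetical helper name.

```python
import re

# Marker ("ext", "ext.", "extension", or "x"), optional spaces, then digits at end of string
_EXT_PATTERN = re.compile(r"\s*(?:ext\.?|extension|x)\s*(\d+)\s*$", re.IGNORECASE)


def split_extension(phone: str):
    """Return (main_number, extension); extension is None when absent."""
    match = _EXT_PATTERN.search(phone)
    if match is None:
        return phone.strip(), None
    return phone[:match.start()].strip(), match.group(1)


print(split_extension("+1 555-123-4567 ext. 200"))
print(split_extension("555-123-4567"))
```

Only the first element should be passed on to the country-code checks.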

How does this generalize to other phone-number tasks?

The same parsing pipeline (strip non-digits → check prefix → look up country code → validate remaining length) applies to formatting, validating, and routing. Building a robust phone-number system is a real engineering investment; the interview question is a small slice of that broader problem.


💡Strategies for Solving This Problem

Finding Missing Element

This is usually "find missing number in array" in disguise. Common variation: array has numbers 1 to n, but one is missing. Find it.

The Setup

Imagine you have phone numbers with country codes 1 to n. One country code is missing from your dataset. Which one?

Approach 1: Sum Formula

Sum of 1 to n = n(n+1)/2

Calculate expected sum, subtract actual sum. The difference is the missing number.

O(n) time, O(1) space. Clean and simple.
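A minimal sketch of the sum approach, assuming the input holds each value from 1 to n exactly once except for the single missing one (`missing_by_sum` is a name chosen for this example):

```python
def missing_by_sum(nums, n):
    """Expected sum of 1..n minus the actual sum is the missing value."""
    expected = n * (n + 1) // 2
    return expected - sum(nums)


print(missing_by_sum([1, 2, 4, 5], 5))
```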

Approach 2: XOR

XOR has property: a ⊕ a = 0 and a ⊕ 0 = a

XOR all numbers 1 to n, then XOR all array elements. The result is the missing number.

O(n) time, O(1) space. No overflow risk (unlike sum).
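The same problem sketched with XOR, under the same assumption of exactly one missing value in 1 to n (`missing_by_xor` is a name chosen for this example):

```python
def missing_by_xor(nums, n):
    """XOR of 1..n combined with XOR of the array leaves the missing value,
    because every value that appears in both cancels itself out."""
    acc = 0
    for i in range(1, n + 1):
        acc ^= i
    for x in nums:
        acc ^= x
    return acc


print(missing_by_xor([1, 2, 4, 5], 5))
```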

Approach 3: Hash Set

Add all array elements to set. Check which number from 1 to n is not in set.

O(n) time, O(n) space. Works but uses more memory.
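A sketch of the set-based version (`missing_by_set` is a name chosen for this example). It returns the first absent value and `None` when nothing is missing:

```python
def missing_by_set(nums, n):
    """Return the smallest value in 1..n that does not appear in nums."""
    seen = set(nums)
    for candidate in range(1, n + 1):
        if candidate not in seen:
            return candidate
    return None


print(missing_by_set([1, 2, 4, 5], 5))
```

Unlike the sum and XOR versions, this one tolerates duplicates in the input and extends naturally to collecting several missing values.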

Why XOR is Better

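In languages with fixed-width integers, the sum formula can overflow if n is large. XOR doesn't have this problem, because the result never needs more bits than the inputs. (Python integers are arbitrary-precision, so neither approach overflows there.)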

XOR is also more versatile - works for finding duplicate, missing, or single number problems.

At Various Companies

This shows up all the time with different stories: missing file number, missing ID, missing floor number, etc. The solution is always the same.

Key question: What if there are multiple missing numbers? Or one duplicate? Each variation has a trick.
