Parse Phone Numbers Without a Country Code: Heuristic Detection
“Given a list of phone numbers in mixed formats, identify which are missing a country code.” This is a classic Apple / consumer-product engineering interview question that tests string parsing, regular expression skill, and the ability to reason about heuristics under ambiguous data. The base problem is simple — check whether a phone number has a country code prefix — but real-world phone-number data is messy enough that pure rule-based detection misses cases. This guide covers the standard approaches and the trade-offs between strict pattern matching and heuristic-based detection.
Problem Statement
Given a string representing a phone number in arbitrary format, determine whether it includes a country code. Country codes are 1 to 3 digits long and may be prefixed with “+” or “00”.
Examples:
"+1 555-123-4567" → has country code (1 = US)
"555-123-4567" → missing country code
"+44 20 7123 4567" → has country code (44 = UK)
"00 49 30 12345678" → has country code (49 = Germany, prefix 00)
"(555) 123-4567" → missing country code
Approach 1: Strict Prefix Detection
Check for “+” or “00” at the start. Naive but works for clean data.
def has_country_code_strict(phone: str) -> bool:
    """Detect country code via prefix."""
    s = phone.strip()
    return s.startswith("+") or s.startswith("00")

# Tests
print(has_country_code_strict("+1 555-123-4567"))    # True
print(has_country_code_strict("555-123-4567"))       # False
print(has_country_code_strict("00 49 30 12345678"))  # True
print(has_country_code_strict("(555) 123-4567"))     # False
Acceptable for input with consistent formatting. Misses cases where the country code is implied without an explicit “+” — e.g., “1-800-…” in US format where “1” is the country code, or a leading “1” before a 10-digit number.
Approach 2: Length-Based Heuristic
Count the digits in the number. Most national formats have 7-10 digits; numbers with 11+ digits often imply an embedded country code.
import re

def count_digits(phone: str) -> int:
    """Count digits only, ignoring spaces, dashes, and parentheses."""
    return len(re.sub(r"\D", "", phone))

def has_country_code_heuristic(phone: str) -> bool:
    """Heuristic combining prefix and length."""
    s = phone.strip()
    if s.startswith("+") or s.startswith("00"):
        return True
    digit_count = count_digits(s)
    # US format with leading 1 (11 digits): country code present
    if digit_count == 11 and re.sub(r"\D", "", s).startswith("1"):
        return True
    # Strictly more than 10 digits: probably has country code
    if digit_count > 10:
        return True
    return False

# Tests
print(has_country_code_heuristic("+1 555-123-4567"))  # True
print(has_country_code_heuristic("1-555-123-4567"))   # True (11 digits, leading 1)
print(has_country_code_heuristic("555-123-4567"))     # False (10 digits, no prefix)
print(has_country_code_heuristic("(555) 123-4567"))   # False
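The heuristic catches “1-555-…” style US numbers without explicit “+”. Trade-off: the 10-digit cutoff is US-centric. National numbers that run to 11 or more digits in other countries are misreported as having a country code, and a country code followed by a short national number (10 digits or fewer in total, written without “+” or “00”) is missed.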
Approach 3: Library-Based (Production)
For real-world phone-number parsing, use Google’s libphonenumber. It handles every country’s formats, normalizes input, and provides validation.
# pip install phonenumbers
import phonenumbers
from phonenumbers import CountryCodeSource

def has_country_code_lib(phone: str, default_region: str = "US") -> bool:
    """Use libphonenumber (Python port) for production-quality parsing."""
    try:
        # keep_raw_input=True makes the parser record where the country code came from
        parsed = phonenumbers.parse(phone, default_region, keep_raw_input=True)
    except phonenumbers.phonenumberutil.NumberParseException:
        return False
    # FROM_DEFAULT_COUNTRY means the code was inferred from default_region,
    # i.e. the input itself carried no country code
    return parsed.country_code_source != CountryCodeSource.FROM_DEFAULT_COUNTRY
For interview purposes, mention libphonenumber as the production answer; implement a manual heuristic to demonstrate skill.
Approach 4: Country Code Lookup
Maintain a list of valid country codes (1, 7, 20, 27, 30, 31, …, 998). Strip non-digits; check whether the leading digits match any valid country code.
import re

COUNTRY_CODES = {
    "1", "7", "20", "27", "30", "31", "32", "33", "34", "36", "39", "40", "41",
    "43", "44", "45", "46", "47", "48", "49", "51", "52", "53", "54", "55",
    "56", "57", "58", "60", "61", "62", "63", "64", "65", "66", "81", "82",
    "84", "86", "90", "91", "92", "93", "94", "95", "98",
    # ... (full list ~250 codes)
}

def has_country_code_lookup(phone: str) -> bool:
    digits = re.sub(r"\D", "", phone)
    # Drop a leading "00" international prefix so the code itself comes first
    if digits.startswith("00"):
        digits = digits[2:]
    # Try 1, 2, 3-digit prefixes
    for length in (1, 2, 3):
        if digits[:length] in COUNTRY_CODES:
            # Validate remaining length
            remaining = len(digits) - length
            if 7 <= remaining <= 12:  # plausible national number length
                return True
    return False
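Checking the prefix against real country codes rejects impossible prefixes that a pure length check would accept, but the lookup has its own false positives: a bare national number whose leading digits happen to match a code (for example, “555-123-4567” begins with 55, Brazil’s code) is reported as having a country code. Use it together with the prefix or length checks rather than alone, and note that it requires maintaining the country-code list. libphonenumber is preferable for production.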
Common Variations
Format / normalize phone numbers
Convert various formats to a canonical E.164 format (e.g., +14155551234). libphonenumber’s format_number handles this.
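As a rough illustration of what normalizing to E.164 involves, here is a minimal hand-rolled sketch. The function name to_e164 and the default_country_code parameter are hypothetical, and the sketch assumes US-style data: it does not strip national trunk prefixes (such as a leading 0) or validate lengths, which is where a library earns its keep.

```python
import re

def to_e164(phone: str, default_country_code: str = "1") -> str:
    """Simplified E.164 normalizer (illustrative helper, not libphonenumber)."""
    s = phone.strip()
    digits = re.sub(r"\D", "", s)
    if s.startswith("+"):
        # Already international: keep the digits as given
        return "+" + digits
    if s.startswith("00"):
        # "00" international prefix: drop it and add "+"
        return "+" + digits[2:]
    # Assumption: with default code "1", 11 digits starting with 1 already include it
    if default_country_code == "1" and len(digits) == 11 and digits.startswith("1"):
        return "+" + digits
    # Otherwise treat the input as a national number in the default country
    return "+" + default_country_code + digits

print(to_e164("(415) 555-1234"))     # +14155551234
print(to_e164("00 49 30 12345678"))  # +493012345678
```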
Detect emergency numbers
Numbers like 911 (US), 999 (UK), 112 (EU). Special-case these; they don’t follow standard country-code rules.
Detect short codes / SMS shortcodes
Marketing shortcodes (4-6 digits) aren’t traditional phone numbers. Different validation; libphonenumber distinguishes them.
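Emergency numbers and short codes can be screened out before any country-code detection runs. The sketch below is one possible way to do that; the function name and the returned labels are hypothetical, the emergency set holds only the three examples named above, and the 4-6 digit range is the one given in the text.

```python
import re

# Only the examples named in the text; a real list is country-specific.
EMERGENCY_NUMBERS = {"911", "999", "112"}

def classify_special(phone: str) -> str:
    """Return 'emergency', 'short_code', or 'regular' (illustrative labels)."""
    digits = re.sub(r"\D", "", phone)
    if digits in EMERGENCY_NUMBERS:
        return "emergency"
    if 4 <= len(digits) <= 6:
        return "short_code"
    return "regular"

print(classify_special("911"))           # emergency
print(classify_special("46645"))         # short_code
print(classify_special("555-123-4567"))  # regular
```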
Detect VoIP / mobile / landline
By number prefix within a country, you can sometimes infer line type. libphonenumber provides this; rule-based detection is fragile.
Common Mistakes
- Assuming all phone numbers are 10 digits. US/Canada are 10; UK is 10–11; Germany is 10–13; Sweden is 7–10. National lengths vary widely.
- Not normalizing whitespace and separators. Strip all non-digits before pattern matching, except the leading “+” which is meaningful.
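- Treating “+” and “00” as different things. Both mark international format. “+” is the universal notation; “00” is the international dialing prefix used in Europe and many other regions, while the US and Canada dial 011 instead.
- Assuming a leading “1” always means the US country code. For US/Canada data, length disambiguates: 11 digits starting with 1 → has country code; 10 digits → no country code. North American area codes start with 2–9, so a 10-digit US number never begins with 1. In other countries a national number can start with 1, so the rule does not carry over to global data.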
- Hand-rolling regex when libphonenumber exists. For production, use libphonenumber. Hand-rolled regex misses edge cases (extensions, special services, regional formats). For interview purposes, hand-rolling demonstrates skill, but mention the production answer.
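The normalization point in the list above can be sketched as follows: strip separators but remember whether the input began with “+”. The helper name normalize is hypothetical.

```python
import re

def normalize(phone: str) -> str:
    """Strip separators but keep a leading '+', which is meaningful."""
    s = phone.strip()
    has_plus = s.startswith("+")
    digits = re.sub(r"\D", "", s)
    return "+" + digits if has_plus else digits

print(normalize(" +44 20 7123 4567 "))  # +442071234567
print(normalize("(555) 123-4567"))      # 5551234567
```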
Frequently Asked Questions
What’s the expected interview answer?
Combine prefix detection (“+”, “00”) with digit-count heuristics (11+ digits typically indicates country code). Mention libphonenumber for production; implement the heuristic to demonstrate skill. Walk through edge cases: 10-digit US, 11-digit US-with-1, “+44”, “00 49”. Strong candidates anticipate that real data is messy and design for it.
Why is this hard?
Phone-number formats vary by country, region, and provider. Real-world data is dirty: missing prefixes, mixed separators, inconsistent formatting. A clean rule that works for one country fails for another. The interview question tests whether you account for this messiness instead of writing a one-size-fits-all rule.
How accurate are heuristic approaches?
As rough rules of thumb rather than measured benchmarks: simple heuristics can reach 95%+ on US-centric data, hand-rolled rules drop to around 80% on globally mixed data, and library-based parsing (libphonenumber) approaches 99% when given the correct default region. In an interview, present these as estimates and don’t claim 100%.
What about “extensions” like “+1 555-123-4567 ext. 200”?
Extensions are part of the original phone number’s format but typically separate from the routable number. libphonenumber parses them as a distinct field. For interview purposes, strip the extension marker (“ext.”, “x”, “extension”) and the digits that follow it before processing the main number.
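A minimal sketch of that stripping step, assuming the extension always sits at the end of the string and uses one of the three markers mentioned above. The function name split_extension and the regular expression are illustrative, not libphonenumber’s own extension rules.

```python
import re
from typing import Optional, Tuple

# Trailing "ext.", "ext", "extension", or "x" followed by digits (assumed markers)
_EXT_RE = re.compile(r"\s*(?:ext\.?|extension|x)\s*(\d+)\s*$", re.IGNORECASE)

def split_extension(phone: str) -> Tuple[str, Optional[str]]:
    """Return (main_number, extension), with extension None when absent."""
    match = _EXT_RE.search(phone)
    if match is None:
        return phone.strip(), None
    return phone[:match.start()].strip(), match.group(1)

print(split_extension("+1 555-123-4567 ext. 200"))  # ('+1 555-123-4567', '200')
print(split_extension("555-123-4567"))              # ('555-123-4567', None)
```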
How does this generalize to other phone-number tasks?
The same parsing pipeline (strip non-digits → check prefix → look up country code → validate remaining length) applies to formatting, validating, and routing. Building a robust phone-number system is a real engineering investment; the interview question is a small slice of that broader problem.
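The four pipeline steps can be sketched in one function. This is an illustration under stated assumptions: SAMPLE_COUNTRY_CODES is a three-entry stand-in for the full list, extract_country_code is a hypothetical name, and only numbers with an explicit “+” or “00” prefix are treated as carrying a code.

```python
import re
from typing import Optional

SAMPLE_COUNTRY_CODES = {"1", "44", "49"}  # tiny illustrative subset

def extract_country_code(phone: str) -> Optional[str]:
    """Return the country code as a string, or None if none is detected."""
    s = phone.strip()
    digits = re.sub(r"\D", "", s)          # step 1: strip non-digits
    if s.startswith("00"):                 # step 2: check prefix
        digits = digits[2:]
    elif not s.startswith("+"):
        return None
    for length in (1, 2, 3):               # step 3: look up country code
        prefix = digits[:length]
        remaining = len(digits) - length
        # step 4: validate the remaining national-number length
        if prefix in SAMPLE_COUNTRY_CODES and 7 <= remaining <= 12:
            return prefix
    return None

print(extract_country_code("+44 20 7123 4567"))   # 44
print(extract_country_code("00 49 30 12345678"))  # 49
print(extract_country_code("555-123-4567"))       # None
```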
💡Strategies for Solving This Problem
Finding Missing Element
This is usually "find missing number in array" in disguise. Common variation: array has numbers 1 to n, but one is missing. Find it.
The Setup
Imagine you have phone numbers with country codes 1 to n. One country code is missing from your dataset. Which one?
Approach 1: Sum Formula
Sum of 1 to n = n(n+1)/2
Calculate expected sum, subtract actual sum. The difference is the missing number.
O(n) time, O(1) space. Clean and simple.
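A short sketch of the sum approach, assuming the input holds the numbers 1 to n with exactly one missing. The function name is illustrative.

```python
from typing import List

def missing_by_sum(nums: List[int], n: int) -> int:
    """Expected sum of 1..n minus the actual sum is the missing number."""
    expected = n * (n + 1) // 2
    return expected - sum(nums)

print(missing_by_sum([1, 2, 4, 5], 5))  # 3
```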
Approach 2: XOR
XOR has property: a ⊕ a = 0 and a ⊕ 0 = a
XOR all numbers 1 to n, then XOR all array elements. The result is the missing number.
O(n) time, O(1) space. No overflow risk (unlike sum).
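The XOR approach can be sketched the same way, again assuming exactly one number from 1 to n is missing. Every value that appears in both passes cancels itself, leaving only the missing one.

```python
from typing import List

def missing_by_xor(nums: List[int], n: int) -> int:
    """XOR of 1..n combined with XOR of the array leaves the missing number."""
    acc = 0
    for i in range(1, n + 1):
        acc ^= i
    for x in nums:
        acc ^= x
    return acc

print(missing_by_xor([1, 2, 4, 5], 5))  # 3
```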
Approach 3: Hash Set
Add all array elements to set. Check which number from 1 to n is not in set.
O(n) time, O(n) space. Works but uses more memory.
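For comparison, a sketch of the hash-set version. Returning None when nothing is missing is a choice made for this example, not something the text specifies.

```python
from typing import List, Optional

def missing_by_set(nums: List[int], n: int) -> Optional[int]:
    """Return the first number in 1..n absent from nums, or None."""
    seen = set(nums)
    for i in range(1, n + 1):
        if i not in seen:
            return i
    return None

print(missing_by_set([1, 2, 4, 5], 5))  # 3
```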
Why XOR is Better
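In languages with fixed-width integers (C, Java, Go), the sum formula can overflow if n is large. XOR never produces a value wider than its inputs, so it doesn't have this problem. Python integers are arbitrary-precision, so the concern does not arise there.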
XOR is also more versatile - works for finding duplicate, missing, or single number problems.
At Various Companies
This shows up all the time with different stories: missing file number, missing ID, missing floor number, etc. The solution is always the same.
Key question: What if there are multiple missing numbers? Or one duplicate? Each variation has a trick.