Contact Book Service Low-Level Design: Contact Storage, Deduplication, and Group Management

Contact Schema

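A contact represents a person or organization owned by a user. Multi-value fields (phones, emails, addresses) could be stored as JSON arrays on the contact row; this design stores them in child tables so that individual values can be indexed and searched: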

contacts {
  contact_id    UUID PK
  owner_user_id UUID FK
  display_name  VARCHAR(255)
  birthday      DATE nullable
  notes         TEXT
  photo_url     VARCHAR(500) nullable
  source        ENUM(manual, google, linkedin, csv_import)
  external_ids  JSONB   -- {"google": "abc123", "linkedin": "xyz"}
  created_at    TIMESTAMP
  updated_at    TIMESTAMP
}

contact_phones    { contact_id, number VARCHAR(50), type ENUM(mobile,work,home), primary BOOLEAN }
contact_emails    { contact_id, address VARCHAR(255), type ENUM(work,personal), primary BOOLEAN }
contact_addresses { contact_id, street, city, state, country, postal_code, type ENUM(home,work) }
contact_tags      { contact_id, tag VARCHAR(100) }

The primary flag on phones and emails designates the default contact value shown in list views.

Multiple Values Per Field

Each contact can have multiple phone numbers, emails, and addresses, each with a type. Phones and emails also carry a primary flag; on display, the primary entry is shown first. When searching by phone number, all of a contact's numbers are indexed, not only the primary. Uniqueness of the primary flag is enforced in the application layer: at most one primary phone and one primary email per contact.
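
Application-layer enforcement of the primary flag might look like the following minimal sketch. The in-memory dict shape and the function names set_primary and display_order are assumptions for illustration; a real service would apply the same update inside a database transaction.

```python
def set_primary(entries: list, number: str) -> None:
    """Mark exactly one phone entry as primary and clear the flag on the rest.

    `entries` is a hypothetical in-memory list of dicts with "number" and
    "primary" keys, mirroring rows of contact_phones for one contact.
    """
    if not any(e["number"] == number for e in entries):
        raise ValueError(f"unknown number: {number}")
    for e in entries:
        e["primary"] = (e["number"] == number)


def display_order(entries: list) -> list:
    """Return entries with the primary one first; other entries keep their order."""
    return sorted(entries, key=lambda e: not e["primary"])
```

Because sorted is stable, secondary entries keep their stored order behind the primary one.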

Deduplication

Fuzzy duplicate detection runs at import time and on demand. The algorithm compares candidates using two signals:

Phonetic name similarity: apply Metaphone or Soundex to display_name; contacts with matching phonetic codes are candidates
Exact phone or email match: if any phone number or email address matches exactly after normalization, score as high-confidence duplicate

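A similarity score is computed per candidate pair: an exact phone or email match scores 1.0, a phonetic name match alone scores 0.6, and both together score 1.0. Pairs above a configurable threshold (default 0.7) are surfaced to the user as duplicate suggestions; they are never auto-merged. Note that at the default threshold a phonetic-only match (0.6) is not surfaced; lowering the threshold below 0.6 includes name-only candidates.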
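
The scoring rule can be sketched as follows. This is a minimal illustration, not the service's actual implementation: it uses American Soundex (one of the two phonetic algorithms named above), applies it per word of display_name (an assumption), and expects phones and emails to be normalized already. The names Candidate, phonetic_key, similarity, and is_suggested are hypothetical.

```python
import re
from dataclasses import dataclass, field

_SOUNDEX_CODES = {
    ch: digit
    for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                           "4": "L", "5": "MN", "6": "R"}.items()
    for ch in letters
}


def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits, e.g. 'Robert' -> 'R163'."""
    letters = re.sub(r"[^A-Za-z]", "", word).upper()
    if not letters:
        return ""
    code = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with the same code
            prev = digit
    return (code + "000")[:4]


def phonetic_key(display_name: str) -> str:
    """Assumption: one Soundex code per word, joined with spaces."""
    codes = [soundex(token) for token in display_name.split()]
    return " ".join(c for c in codes if c)


@dataclass
class Candidate:
    display_name: str
    phones: set = field(default_factory=set)   # normalized E.164 strings
    emails: set = field(default_factory=set)   # lowercased addresses


def similarity(a: Candidate, b: Candidate) -> float:
    """1.0 for any shared phone or email, 0.6 for a phonetic name match only."""
    if a.phones & b.phones or a.emails & b.emails:
        return 1.0
    key_a, key_b = phonetic_key(a.display_name), phonetic_key(b.display_name)
    if key_a and key_a == key_b:
        return 0.6
    return 0.0


def is_suggested(a: Candidate, b: Candidate, threshold: float = 0.7) -> bool:
    """Surface the pair as a duplicate suggestion only above the threshold."""
    return similarity(a, b) > threshold
```

With the default threshold, "Jon Smith" and "John Smyth" share a phonetic key but are only suggested if they also share a phone or email, or if the threshold is lowered below 0.6.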

Contact Merge

When the user confirms a merge, the system combines the two contacts into one canonical record:

Union of all unique phone numbers, emails, and addresses (deduplicate exact matches)
User selects the canonical display_name
Union of all tags
Merge external_ids maps
Keep the earliest created_at
Re-point group memberships from the secondary contact to the merged contact (dropping rows that would duplicate an existing membership), then delete the secondary contact record
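
The merge rules above can be sketched in memory as follows. The Contact dataclass, merge_contacts, and repoint_memberships are hypothetical names; resolving conflicting external_ids in favor of the surviving contact is an assumption, since the text does not say which side wins.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Contact:
    contact_id: str
    display_name: str
    created_at: datetime
    phones: list = field(default_factory=list)
    emails: list = field(default_factory=list)
    addresses: list = field(default_factory=list)   # hashable tuples
    tags: set = field(default_factory=set)
    external_ids: dict = field(default_factory=dict)


def _union(a: list, b: list) -> list:
    """Order-preserving union that drops exact duplicates."""
    return list(dict.fromkeys(a + b))


def merge_contacts(primary: Contact, secondary: Contact, chosen_name: str) -> Contact:
    """Build the canonical record; `chosen_name` is the user's selection."""
    return Contact(
        contact_id=primary.contact_id,
        display_name=chosen_name,
        created_at=min(primary.created_at, secondary.created_at),
        phones=_union(primary.phones, secondary.phones),
        emails=_union(primary.emails, secondary.emails),
        addresses=_union(primary.addresses, secondary.addresses),
        tags=primary.tags | secondary.tags,
        # Assumption: on a key conflict the surviving contact's ID wins.
        external_ids={**secondary.external_ids, **primary.external_ids},
    )


def repoint_memberships(memberships: set, old_id: str, new_id: str) -> set:
    """Rewrite (group_id, contact_id) pairs; the set drops duplicate rows."""
    return {(g, new_id if c == old_id else c) for g, c in memberships}
```

In a relational store the same steps would run in one transaction, with the membership update applied before the secondary row is deleted.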

Contact Groups

Two grouping mechanisms are supported:

Tag-based (flexible): contacts are tagged via contact_tags; querying by tag returns all matching contacts; contacts can have unlimited tags
Named groups (explicit): a named group has a membership list stored separately

contact_groups        { group_id, owner_user_id, name, created_at }
contact_group_members { group_id, contact_id }

A contact can belong to multiple groups. Group membership is many-to-many.

Import from CSV

CSV import presents the user with a column-mapping UI: each CSV header is mapped to a contact field (display_name, phone, email, etc.). The import engine:

Parses rows and maps values to contact fields per the mapping
Normalizes phone numbers (strip dashes, spaces, parentheses; apply E.164 formatting)
Runs deduplication check: if an existing contact matches on normalized email, skip and log as duplicate
Inserts new contacts in a batch transaction; returns a summary: created, skipped (duplicates), errors
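
A minimal sketch of the import steps above, using Python's csv module. The function name import_csv, the dict-based contact shape, and treating a row without display_name as an error are assumptions; phone handling here only strips formatting characters, and full E.164 conversion is left out.

```python
import csv
import io
import re


def import_csv(text: str, mapping: dict, existing_emails: set):
    """Map CSV columns to contact fields and skip rows whose email already exists.

    `mapping` maps CSV header -> contact field (the result of the mapping UI).
    Returns (contacts_to_insert, summary); the caller would insert the
    contacts in a single batch transaction.
    """
    summary = {"created": 0, "skipped": 0, "errors": 0}
    seen = {e.lower() for e in existing_emails}
    to_insert = []
    for row in csv.DictReader(io.StringIO(text)):
        contact = {fld: (row.get(header) or "").strip()
                   for header, fld in mapping.items()}
        if not contact.get("display_name"):
            summary["errors"] += 1          # assumption: a name is required
            continue
        if "phone" in contact:
            # Strip dashes, spaces, and parentheses only.
            contact["phone"] = re.sub(r"[\s\-()]", "", contact["phone"])
        email = contact.get("email", "").lower()
        if email:
            if email in seen:
                summary["skipped"] += 1     # duplicate on normalized email
                continue
            seen.add(email)
            contact["email"] = email
        to_insert.append(contact)
    summary["created"] = len(to_insert)
    return to_insert, summary
```

Adding each accepted email to the seen set also catches duplicates that occur within the same file.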

Import from vCard

vCard (.vcf) files are parsed according to the vCard specifications: version 4.0 is defined by RFC 6350 and version 3.0 by RFC 2426, and the parser also handles the older 2.1 format. Key fields extracted: FN (display name), TEL, EMAIL, ADR, BDAY, NOTE, PHOTO (stored to object storage, URL saved to photo_url). Multiple VCARD blocks (BEGIN:VCARD ... END:VCARD) in a single .vcf file are imported as separate contacts.
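
As a rough illustration of splitting a .vcf file into contacts, the sketch below unfolds continuation lines and collects FN, TEL, and EMAIL from each VCARD block. It is deliberately incomplete: it ignores value escaping, quoted-printable encoding from version 2.1, and the remaining properties. The function name parse_vcards and the output dict shape are assumptions.

```python
def parse_vcards(text: str) -> list:
    """Return one dict per BEGIN:VCARD ... END:VCARD block."""
    # Unfold: a line starting with a space or tab continues the previous line.
    lines = []
    for raw in text.splitlines():
        if raw[:1] in (" ", "\t") and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)

    contacts, current = [], None
    for line in lines:
        marker = line.strip().upper()
        if marker == "BEGIN:VCARD":
            current = {"display_name": "", "phones": [], "emails": []}
        elif marker == "END:VCARD" and current is not None:
            contacts.append(current)
            current = None
        elif current is not None and ":" in line:
            head, value = line.split(":", 1)
            # Drop parameters (";TYPE=...") and any group prefix ("item1.").
            name = head.split(";", 1)[0].split(".")[-1].upper()
            if name == "FN":
                current["display_name"] = value.strip()
            elif name == "TEL":
                current["phones"].append(value.strip())
            elif name == "EMAIL":
                current["emails"].append(value.strip())
    return contacts
```

Phone values are returned as found in the file; they would still go through the same normalization as CSV imports before storage.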

Export to vCard

Individual contacts or full contact books can be exported as .vcf files. The export generates v3.0 vCards by default for maximum compatibility. Group exports create a single .vcf with multiple VCARD blocks.
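
A minimal sketch of generating v3.0 output follows. The function names and the input dict shape are assumptions. vCard 3.0 requires an N property alongside FN; because the schema stores only display_name, this sketch places the whole name in the first N component as a placeholder. Line folding for long lines is omitted.

```python
_TEL_TYPES = {"mobile": "CELL", "work": "WORK", "home": "HOME"}


def _escape(value: str) -> str:
    """Escape backslash, newline, comma, and semicolon in a text value."""
    return (value.replace("\\", "\\\\").replace("\n", "\\n")
                 .replace(",", "\\,").replace(";", "\\;"))


def to_vcard(contact: dict) -> str:
    """Render one contact as a v3.0 VCARD block with CRLF line endings."""
    name = _escape(contact["display_name"])
    lines = [
        "BEGIN:VCARD",
        "VERSION:3.0",
        f"FN:{name}",
        f"N:{name};;;;",   # placeholder: structured name parts are not stored
    ]
    for number, kind in contact.get("phones", []):
        lines.append(f"TEL;TYPE={_TEL_TYPES.get(kind, 'VOICE')}:{number}")
    for address in contact.get("emails", []):
        lines.append(f"EMAIL;TYPE=INTERNET:{address}")
    lines.append("END:VCARD")
    return "\r\n".join(lines) + "\r\n"


def export_vcards(contacts: list) -> str:
    """A group or full-book export is the concatenation of VCARD blocks."""
    return "".join(to_vcard(c) for c in contacts)
```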

Contact Sharing

A contact can be shared via a generated public URL. The share record:

contact_shares { share_token VARCHAR(64) PK, contact_id, shared_by, fields_included[], expires_at, view_count }

The URL /share/{share_token} returns a limited view of the contact (only fields listed in fields_included). Expiry is configurable: after first view, after N views, or after a time duration.
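
The share lifecycle can be sketched as below. The max_views field is hypothetical: the contact_shares record above lists only expires_at and view_count, so a view-count limit needs some stored value like it. "After first view" is modeled as max_views=1. The function names are also assumptions.

```python
import secrets
from datetime import datetime, timedelta, timezone
from typing import Optional


def create_share(contact_id: str, shared_by: str, fields: list,
                 ttl: Optional[timedelta] = None,
                 max_views: Optional[int] = None) -> dict:
    """Create a share record with an unguessable URL-safe token."""
    now = datetime.now(timezone.utc)
    return {
        "share_token": secrets.token_urlsafe(32),   # 43 chars, fits VARCHAR(64)
        "contact_id": contact_id,
        "shared_by": shared_by,
        "fields_included": list(fields),
        "expires_at": now + ttl if ttl is not None else None,
        "max_views": max_views,                     # hypothetical field
        "view_count": 0,
    }


def view_share(share: dict, contact: dict,
               now: Optional[datetime] = None) -> Optional[dict]:
    """Return the limited view, or None once the share has expired."""
    now = now or datetime.now(timezone.utc)
    if share["expires_at"] is not None and now >= share["expires_at"]:
        return None
    if share["max_views"] is not None and share["view_count"] >= share["max_views"]:
        return None
    share["view_count"] += 1
    return {f: contact[f] for f in share["fields_included"] if f in contact}
```

In a concurrent service the view_count check and increment would need to be a single atomic update so that two simultaneous requests cannot both pass a max_views of 1.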

Search and Phone Normalization

Full-text search indexes display_name, all phone numbers, and all email addresses. Phone numbers are normalized before indexing: strip formatting characters (spaces, dashes, parentheses), add the default country code when the number has none, and store the canonical E.164 form. Search input is normalized the same way, so a search for "(415) 555-1234" finds a contact stored as "+1 415 555 1234": both resolve to +14155551234 when the default country code is +1.
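
A simplified normalizer is sketched below. It assumes North American numbering (10-digit national numbers, default country code 1) and is only an illustration; the function name normalize_phone is hypothetical, and a production system would more likely rely on a dedicated library such as phonenumbers, which knows per-country rules.

```python
import re
from typing import Optional


def normalize_phone(raw: str, default_country_code: str = "1") -> Optional[str]:
    """Return an E.164 string such as '+14155551234', or None if unrecognized.

    Assumption: national numbers follow the 10-digit North American format.
    """
    raw = raw.strip()
    digits = re.sub(r"\D", "", raw)
    if raw.startswith("+"):
        candidate = digits                          # already international
    elif len(digits) == 10:
        candidate = default_country_code + digits   # national number
    elif len(digits) == 11 and digits.startswith(default_country_code):
        candidate = digits                          # country code without '+'
    else:
        return None
    # E.164 allows at most 15 digits after the plus sign.
    if not candidate or len(candidate) > 15:
        return None
    return "+" + candidate
```

Applying the same function to stored numbers and to search input is what lets differently formatted inputs meet on one indexed value.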

Birthday Reminders

A scheduled job finds contacts whose birthday falls within the next 7 days (matching month and day, ignoring year). It generates in-app notifications and optional email reminders for the contact owner. Reminders go out at 08:00 in the owner's configured timezone, so the job is evaluated per timezone rather than once globally.
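
The month-and-day match, including the wrap from December into January, can be sketched as follows. Treating the window as today through today plus 7 days is an assumption, as is the function name. A February 29 birthday only matches in leap years in this sketch; a real job would need a rule for other years.

```python
from datetime import date, timedelta


def has_upcoming_birthday(birthday: date, today: date, window_days: int = 7) -> bool:
    """True if the birthday's month and day fall within the window, ignoring year."""
    for offset in range(window_days + 1):
        day = today + timedelta(days=offset)
        if (day.month, day.day) == (birthday.month, birthday.day):
            return True
    return False
```

Walking the calendar day by day avoids the special cases that a day-of-year comparison would need at year boundaries.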

Frequently Asked Questions

What data model would you use to store contacts that can have multiple phone numbers, emails, and addresses?

Use a normalized schema: a contacts table (id, owner_id, display_name, created_at, updated_at) with child tables: contact_emails (id, contact_id, email, label, is_primary), contact_phones (id, contact_id, e164_phone, label, is_primary), contact_addresses (id, contact_id, street, city, region, postal_code, country, label). Store phone numbers in E.164 format for consistent lookup and SMS/call integration. Index on (owner_id, email) and (owner_id, e164_phone) for fast search. Use a single contacts_fts table or a search engine index for full-text name search.

How do you detect and merge duplicate contacts automatically?

Generate a set of deduplication signals for each contact: normalized E.164 phone numbers, lowercased email addresses, and a phonetic name key (e.g., Soundex or Double Metaphone of the display name). Store these signals in a contact_signals table. On insert or update, query for contacts with overlapping signals. Score candidate pairs: exact email or phone match = strong duplicate (merge automatically or flag for review); name phonetic match alone = weak signal (suggest to user). When merging, designate one record as the canonical contact, re-point all foreign key references (e.g., group memberships, interaction history) to the canonical ID, and soft-delete the duplicate.

How would you design contact group management so a contact can belong to many groups and groups can be nested?

Use a contact_groups table (id, owner_id, name, parent_group_id nullable) for the group hierarchy and a contact_group_members junction table (contact_id, group_id) for membership. parent_group_id forms a tree; enforce no cycles at the application layer. To query all contacts in a group including descendants, use a recursive CTE (WITH RECURSIVE) to collect all descendant group IDs, then JOIN with contact_group_members. Cache the flattened membership list in Redis for frequently accessed large groups and invalidate on membership or hierarchy changes. Limit nesting depth (e.g., 5 levels) to bound recursive query cost.

How do you implement fast full-text contact search across name, email, and phone for a user with 100,000 contacts?

Maintain a search index per user, either using PostgreSQL's tsvector full-text search or a dedicated search engine (Elasticsearch, Typesense). For PostgreSQL, store a generated tsvector column on the contact combining display_name tokens (weight A), email prefixes (weight B), and normalized phone digits (weight C). Create a GIN index on this column. Query with to_tsquery and prefix matching (:*) for autocomplete. For phone lookups, store digits-only strings and match with LIKE '1234%' against an indexed column. At 100K contacts per user, PostgreSQL handles this well; migrate to a dedicated search cluster only when query latency degrades or tenant counts make shared index size unmanageable.