Low Level Design: SMS Gateway – Tech Interview Dot Org

Overview

An SMS gateway is the internal service that abstracts one or more upstream SMS providers (Twilio, Nexmo/Vonage, AWS SNS, Sinch) behind a unified API. It handles number provisioning, message queuing, delivery receipt reconciliation, number masking, per-country routing, rate limiting, and cost optimization across providers. Without a gateway layer, every application team integrates directly with a single provider and has to handle reliability, cost control, and compliance on its own. The gateway centralizes that logic.

Data Model

CREATE TABLE sms_providers (
    id          TINYINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name        VARCHAR(64) NOT NULL,
    api_base    VARCHAR(256) NOT NULL,
    is_active   TINYINT(1) NOT NULL DEFAULT 1,
    priority    TINYINT UNSIGNED NOT NULL DEFAULT 10,
    cost_per_sms DECIMAL(8,6) NOT NULL DEFAULT 0.0
) ENGINE=InnoDB;

CREATE TABLE phone_numbers (
    id          BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    number      VARCHAR(32) NOT NULL,
    provider_id TINYINT UNSIGNED NOT NULL,
    country     CHAR(2) NOT NULL,
    type        ENUM('longcode','shortcode','tollfree','alphanumeric') NOT NULL,
    is_active   TINYINT(1) NOT NULL DEFAULT 1,
    monthly_cost DECIMAL(8,2) NOT NULL DEFAULT 0.0,
    UNIQUE KEY uq_number (number),
    INDEX idx_country_type (country, type, is_active)
) ENGINE=InnoDB;

CREATE TABLE sms_messages (
    id              BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    external_id     CHAR(36) NOT NULL COMMENT 'UUID exposed to callers',
    from_number     VARCHAR(32) NOT NULL,
    to_number       VARCHAR(32) NOT NULL,
    body            TEXT NOT NULL,
    direction       ENUM('outbound','inbound') NOT NULL DEFAULT 'outbound',
    status          ENUM('queued','sending','sent','delivered','failed','undelivered') NOT NULL DEFAULT 'queued',
    provider_id     TINYINT UNSIGNED,
    provider_msg_id VARCHAR(256),
    segment_count   TINYINT UNSIGNED NOT NULL DEFAULT 1,
    cost            DECIMAL(8,6),
    error_code      VARCHAR(64),
    error_message   VARCHAR(512),
    attempt_count   TINYINT UNSIGNED NOT NULL DEFAULT 0,
    next_retry_at   DATETIME,
    created_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    sent_at         DATETIME,
    delivered_at    DATETIME,
    UNIQUE KEY uq_external (external_id),
    INDEX idx_status_retry (status, next_retry_at),
    INDEX idx_to_number (to_number, created_at),
    INDEX idx_provider_msg (provider_id, provider_msg_id)
) ENGINE=InnoDB;

CREATE TABLE number_masks (
    id              BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    real_number     VARCHAR(32) NOT NULL,
    masked_number   VARCHAR(32) NOT NULL,
    context_id      VARCHAR(256) NOT NULL COMMENT 'e.g. order_id or session_id',
    expires_at      DATETIME NOT NULL,
    is_active       TINYINT(1) NOT NULL DEFAULT 1,
    INDEX idx_real (real_number, context_id),
    INDEX idx_masked (masked_number),
    INDEX idx_expiry (expires_at, is_active)
) ENGINE=InnoDB;

CREATE TABLE inbound_webhooks (
    id              BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    provider_id     TINYINT UNSIGNED NOT NULL,
    raw_payload     JSON NOT NULL,
    message_id      BIGINT UNSIGNED,
    processed       TINYINT(1) NOT NULL DEFAULT 0,
    received_at     DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_processed (processed, received_at)
) ENGINE=InnoDB;

The sms_providers table drives routing decisions. phone_numbers is the pool of owned numbers per provider and country. sms_messages is the central ledger — the external_id UUID is the ID returned to callers so internal IDs are never exposed. number_masks supports the two-way masking use case (e.g., rider-driver communication). inbound_webhooks stores raw provider callbacks before processing, enabling replay.

Core Algorithm / Workflow

1. Send Request Intake

The caller sends a POST to /v1/messages with to_number, body, and optional from_number or country hint. The gateway:

Validates E.164 format for to_number using a library like libphonenumber.
Checks the number_masks table — if to_number is a masked number, resolve to the real destination and rewrite the request.
Detects the destination country from the to_number prefix (ITU country code lookup table, cached in memory).
Selects a provider and from_number via the routing engine (see below).
Calculates segment_count: a standard SMS is 160 GSM-7 characters or 70 UCS-2 characters. Concatenated messages use 153 / 67 characters per segment. Segment count affects cost and must be precomputed for billing.
Inserts a row into sms_messages with status = 'queued' and returns external_id to the caller synchronously. The actual send is async.
Publishes the message ID to a send queue (Redis list or Kafka topic).
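The validation and segment-count steps above can be sketched as follows. This is a minimal illustration, not a full implementation: the regex checks only the E.164 shape (a production gateway would still use libphonenumber for real validation), the function names are hypothetical, and the segment count is a simple ceiling that ignores the rare case where a two-septet GSM-7 character or a surrogate pair would straddle a segment boundary.

```python
import math
import re

# Structural E.164 check only: "+" then 2 to 15 digits, no leading zero.
E164_RE = re.compile(r"\+[1-9]\d{1,14}")

# GSM 03.38 basic character set: one septet per character.
GSM7_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Extension table: each character costs two septets (escape + character).
GSM7_EXTENDED = set("^{}\\[~]|€\f")


def is_e164(number: str) -> bool:
    """Return True if the string has the E.164 shape (hypothetical helper)."""
    return E164_RE.fullmatch(number) is not None


def segment_count(body: str) -> int:
    """Estimate the number of SMS segments needed for a message body."""
    if all(ch in GSM7_BASIC or ch in GSM7_EXTENDED for ch in body):
        # GSM-7: 160 septets in a single SMS, 153 per concatenated segment.
        units = sum(2 if ch in GSM7_EXTENDED else 1 for ch in body)
        single, multi = 160, 153
    else:
        # UCS-2/UTF-16: 70 code units single, 67 per concatenated segment.
        units = len(body.encode("utf-16-be")) // 2
        single, multi = 70, 67
    if units <= single:
        return 1
    return math.ceil(units / multi)
```

For example, a 161-character ASCII body needs two segments, while a single emoji anywhere in the body switches the whole message to the 70/67 limits.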

2. Routing Engine

Routing selects the best (provider, from_number) pair for a given destination. The decision tree:

Compliance filter: Some countries require a local registered number or sender ID. Eliminate providers that cannot legally send to the destination country.
Capability filter: Short codes only work domestically in most markets. Filter from_number pool by country and required type (shortcode for high-volume marketing, longcode for transactional).
Cost sort: Rank remaining (provider, from_number) options by cost_per_sms ascending. Pick the cheapest that meets latency SLAs.
Health check: Skip providers with error rate > 5% in the last 5 minutes (computed from a rolling Redis counter). This implements automatic failover.
Load balance: If multiple from_numbers are tied, use round-robin to distribute send volume and avoid carrier flagging high-velocity single numbers.
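A minimal in-memory sketch of this decision tree, assuming the candidate routes and per-provider error rates have already been loaded. The `Route` and `Router` names, the `compliant_countries` field, and the shape of `error_rates` are illustrative assumptions, not part of the schema above. The health filter is applied before the cost comparison, which yields the same result as "cheapest healthy route".

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional

MAX_ERROR_RATE = 0.05  # 5% threshold from the health check step


@dataclass(frozen=True)
class Route:
    provider: str
    from_number: str
    country: str                     # country the from_number belongs to
    number_type: str                 # 'longcode', 'shortcode', ...
    cost_per_sms: float
    compliant_countries: frozenset   # assumed: destinations this provider may serve


class Router:
    def __init__(self, routes):
        self.routes = list(routes)
        self._rr = count()  # round-robin cursor shared across calls

    def select(self, dest_country, required_type, error_rates) -> Optional[Route]:
        candidates = [
            r for r in self.routes
            if dest_country in r.compliant_countries                 # compliance filter
            and r.number_type == required_type                       # capability filter
            and (r.number_type != "shortcode" or r.country == dest_country)
            and error_rates.get(r.provider, 0.0) <= MAX_ERROR_RATE   # health check
        ]
        if not candidates:
            return None  # caller leaves the message queued
        cheapest = min(r.cost_per_sms for r in candidates)           # cost sort
        tied = sorted(
            (r for r in candidates if r.cost_per_sms == cheapest),
            key=lambda r: r.from_number,
        )
        return tied[next(self._rr) % len(tied)]                      # load balance
```

Returning `None` rather than raising keeps the "all providers unhealthy" case an ordinary outcome that the intake path can handle by leaving the message queued.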

3. Send Worker

Workers consume from the send queue. Each worker:

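Fetches the sms_messages row and verifies status = 'queued' (idempotency guard).
Updates status to 'sending' with a conditional update (WHERE status = 'queued') so that two workers cannot both claim the same message. A worker proceeds only if the update affected one row.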
Calls the provider API. Each provider has an adapter implementing a common interface: send(from, to, body) -> (provider_msg_id, error).
On success: update status = 'sent', store provider_msg_id, record sent_at and cost.
On transient failure (5xx, timeout): set status = 'queued', compute next_retry_at with exponential backoff, increment attempt_count. After max attempts, set status = 'failed' and stop.
On permanent failure (invalid number, carrier block): set status = 'undelivered', store error_code, do not retry.
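Increments the provider's attempt counter in Redis on every send and its error counter on every failure (INCR on keys with a 5-minute TTL window), so the routing engine can compute a recent error rate.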
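The state transitions above can be sketched as a pure function over an in-memory message record. The constants `MAX_ATTEMPTS`, `BASE_DELAY`, and `MAX_DELAY` are assumed values, since the article does not fix them, and the `Message` class only mirrors the relevant sms_messages columns. The sketch counts every attempt, not only failed ones, and omits jitter for readability.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MAX_ATTEMPTS = 5                   # assumed limit
BASE_DELAY = timedelta(seconds=30)  # assumed first retry delay
MAX_DELAY = timedelta(hours=1)      # assumed cap


@dataclass
class Message:
    status: str = "sending"
    attempt_count: int = 0
    next_retry_at: Optional[datetime] = None
    provider_msg_id: Optional[str] = None
    error_code: Optional[str] = None
    sent_at: Optional[datetime] = None


def backoff_delay(attempt_count: int) -> timedelta:
    """Exponential backoff: BASE_DELAY doubled per prior attempt, capped."""
    return min(BASE_DELAY * (2 ** (attempt_count - 1)), MAX_DELAY)


def apply_send_result(msg, outcome, now, provider_msg_id=None, error_code=None):
    """outcome is 'success', 'transient', or 'permanent' (adapter's verdict)."""
    msg.attempt_count += 1
    if outcome == "success":
        msg.status = "sent"
        msg.provider_msg_id = provider_msg_id
        msg.sent_at = now
    elif outcome == "permanent":
        msg.status = "undelivered"      # do not retry
        msg.error_code = error_code
    elif msg.attempt_count >= MAX_ATTEMPTS:
        msg.status = "failed"           # retries exhausted
        msg.error_code = error_code
    else:
        msg.status = "queued"           # eligible again at next_retry_at
        msg.error_code = error_code
        msg.next_retry_at = now + backoff_delay(msg.attempt_count)
    return msg
```

With these assumed constants the retry delays are 30 s, 60 s, 120 s, and 240 s before the fifth attempt marks the message failed.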

4. Delivery Receipt Processing

Each provider posts delivery receipts (DLRs) to a webhook URL. The gateway:

Receives the HTTP POST, stores the raw payload in inbound_webhooks, responds 200 immediately.
Runs an async processor that reads inbound_webhooks rows where processed = 0, normalizes the provider-specific payload to a canonical status, looks up sms_messages by (provider_id, provider_msg_id), updates status and delivered_at, and then marks the webhook row as processed.
Fires an outbound webhook to the original caller if they registered a callback URL.

Storing raw payloads before processing is critical: provider webhook schemas change without notice, and stored raw payloads can be replayed against updated parsers.
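A rough sketch of the normalize-and-apply step, using plain dicts in place of the inbound_webhooks and sms_messages tables. The two providers, their field names, and their status vocabularies are hypothetical placeholders; real payload shapes must come from each provider's documentation. Rows that cannot be parsed or matched are left unprocessed so they can be replayed later.

```python
import json

# Hypothetical payload shapes: (message id field, status field) per provider.
FIELDS = {
    "provider_a": ("message_id", "status"),
    "provider_b": ("msgid", "stat"),
}
# Hypothetical provider status -> canonical sms_messages.status.
STATUS_MAPS = {
    "provider_a": {"delivered": "delivered", "undelivered": "undelivered", "failed": "failed"},
    "provider_b": {"DELIVRD": "delivered", "UNDELIV": "undelivered", "REJECTD": "failed"},
}


def normalize(provider: str, raw_payload: str):
    """Return (provider_msg_id, canonical_status), or None if not understood."""
    payload = json.loads(raw_payload)
    id_field, status_field = FIELDS[provider]
    msg_id = payload.get(id_field)
    status = STATUS_MAPS[provider].get(payload.get(status_field))
    if msg_id is None or status is None:
        return None
    return msg_id, status


def process_pending(webhooks, messages):
    """webhooks: list of row dicts; messages: dict keyed by (provider, provider_msg_id)."""
    for row in webhooks:
        if row["processed"]:
            continue
        result = normalize(row["provider"], row["raw_payload"])
        if result is None:
            continue  # keep the raw row for replay after a parser fix
        msg_id, status = result
        message = messages.get((row["provider"], msg_id))
        if message is None:
            continue  # no matching message yet; retry on a later pass
        message["status"] = status
        if status == "delivered":
            message["delivered_at"] = row["received_at"]
        row["processed"] = True
```

Because the parser works only from the stored raw payload, fixing a mapping in `STATUS_MAPS` and re-running `process_pending` is enough to recover rows that were skipped earlier.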

5. Number Masking

For use cases where two parties should communicate via SMS without revealing their real numbers (marketplace, ride-sharing):

The application calls POST /v1/masks with both real numbers and a context_id (e.g., order_id) and TTL.
The gateway assigns two masked numbers from a pool of provisioned longcode numbers. Each masked number routes to the other real number.
When an inbound SMS arrives on a masked number, the gateway looks up the active mask by (masked_number), finds the other party's real number, and forwards the message.
On expiry or mask deactivation, the numbers return to the pool for reassignment.
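The masking flow can be sketched with an in-memory stand-in for number_masks. This sketch assumes one reading of the schema: each row's masked_number stands in for its real_number, so a message arriving on a masked number is forwarded to that row's real number, and the sender's own mask (found by real_number and context_id) is used as the outgoing from_number. The `MaskService` class and its method names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Mask:
    real_number: str
    masked_number: str
    context_id: str
    expires_at: datetime
    is_active: bool = True


class MaskService:
    def __init__(self, pool):
        self.pool = list(pool)  # free longcode numbers
        self.masks = []         # stands in for the number_masks table

    def create(self, real_a, real_b, context_id, ttl, now):
        """Assign one masked number per party; returns (mask_for_a, mask_for_b)."""
        if len(self.pool) < 2:
            raise RuntimeError("mask pool exhausted")
        expires = now + ttl
        pair = [Mask(real, self.pool.pop(0), context_id, expires) for real in (real_a, real_b)]
        self.masks.extend(pair)
        return pair[0].masked_number, pair[1].masked_number

    def _active(self, now):
        return [m for m in self.masks if m.is_active and m.expires_at > now]

    def route_inbound(self, sender, masked_number, now):
        """Return (from_number, to_number) for forwarding, or None."""
        active = self._active(now)
        target = next((m for m in active if m.masked_number == masked_number), None)
        if target is None:
            return None
        own = next(
            (m for m in active
             if m.real_number == sender and m.context_id == target.context_id),
            None,
        )
        if own is None:
            return None  # sender is not a party to this conversation
        return own.masked_number, target.real_number

    def expire(self, now):
        """Deactivate expired masks and return their numbers to the pool."""
        for m in self.masks:
            if m.is_active and m.expires_at <= now:
                m.is_active = False
                self.pool.append(m.masked_number)
```

Rejecting senders that have no mask in the same context keeps a stranger who learns a masked number from reaching the real party behind it.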

Key Design Decisions and Trade-offs

Synchronous intake, async send: The caller gets an external_id immediately without waiting for the provider API. This decouples caller latency from provider SLA variance (provider API p99 latency can exceed 500 ms). The trade-off is that the caller must poll or use webhooks to learn the final status, which is acceptable for most SMS use cases.
Provider adapter pattern: Each provider implements a thin adapter. Adding a new provider requires only a new adapter class, not changes to routing or retry logic. The cost is maintaining N adapters and testing each against provider sandbox environments.
Cost-based routing vs. quality-based routing: Cheapest is not always best — some low-cost providers have poor deliverability in certain markets. A weighted score combining cost and recent delivery rate gives better outcomes than pure cost minimization.
Short code vs. long code: Short codes have higher throughput (on the order of 100 msg/s) and better deliverability for marketing, but typically cost $500-$1000/month and require carrier approval. Long codes are cheaper but are typically limited to about 1 msg/s per number by carrier filtering. Exact limits and prices vary by country and carrier. Use dedicated short codes for high-volume campaigns.
Webhook raw storage: Storing inbound_webhooks.raw_payload adds storage cost but is essential for debugging delivery discrepancies, replaying after parser bugs, and compliance audits.

Failure Handling and Edge Cases

Provider outage: The health check in the routing engine automatically bypasses a failing provider. If all providers for a country are unhealthy, the message stays queued and delivery is delayed until a provider recovers. Alert on queue depth exceeding a threshold.
Duplicate sends: The send worker checks status = 'queued' before sending. If the worker crashes after the provider accepts the message but before the DB update, the row never reaches 'sent', and any retry of it (for example, after a recovery job re-queues rows stuck in 'sending') will attempt a second send. For providers whose send API accepts an idempotency key, pass sms_messages.id as the key to prevent double delivery; support varies, so check each provider's documentation. For providers without this, accept rare duplicates and log them.
DLR storms: After a provider outage resolves, thousands of DLRs may arrive simultaneously. The inbound_webhooks table absorbs the burst; the async processor drains at a controlled rate.
Character encoding issues: Emoji and non-GSM-7 characters force UCS-2 encoding, cutting per-segment capacity from 160 to 70 characters (153 to 67 for concatenated messages). The intake validation must detect encoding and recompute segment_count to avoid billing surprises. Some providers silently truncate or drop unsupported characters, so test each adapter against the full Unicode range.
Number porting: A to_number may be ported from one carrier to another, changing routing requirements. Some providers handle this transparently; others do not. Use a carrier lookup API (e.g., Twilio Lookup) for high-value transactional messages to get current carrier data before routing.
Opt-out / STOP handling: Carriers mandate that STOP replies immediately suppress future messages to that number from your short/long code. The gateway must maintain an opt-out list, updated by inbound STOP messages, and check it at intake time. Sending to an opted-out number is a carrier violation that can get your short code suspended.
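The opt-out check can be sketched as a small suppression list keyed by (recipient, sender number). Only STOP comes from the text above; the additional keywords and the `OptOutList` class are assumptions for illustration, and an in-memory set stands in for whatever durable store the gateway would actually use.

```python
# STOP is mandated; the other keywords are assumed common variants.
STOP_KEYWORDS = {"STOP", "UNSUBSCRIBE", "CANCEL", "END", "QUIT"}


class OptOutList:
    def __init__(self):
        # (recipient number, our sending number) pairs that must be suppressed
        self._blocked = set()

    def handle_inbound(self, sender: str, our_number: str, body: str) -> bool:
        """Record an opt-out if the inbound body is a STOP keyword."""
        if body.strip().upper() in STOP_KEYWORDS:
            self._blocked.add((sender, our_number))
            return True
        return False

    def can_send(self, to_number: str, from_number: str) -> bool:
        """Intake-time check: False if the recipient opted out of this sender."""
        return (to_number, from_number) not in self._blocked
```

Calling `can_send` during intake, before the row is inserted as 'queued', means an opted-out message is rejected synchronously instead of failing later in a worker.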

Scalability Considerations

Throughput: A shortcode supports 100 msg/s per number. For 10,000 msg/s, you need 100 shortcodes. The routing engine must track per-number send rate and not exceed carrier limits. Use Redis sliding window counters (ZADD with score = timestamp, ZCOUNT over window) for per-number rate limiting.
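Database growth: At 1M messages/day, sms_messages grows by 365M rows/year. Partition by created_at (monthly); note that MySQL requires the partitioning column to be part of every unique key, so the primary key and uq_external would have to include created_at. Archive delivered/failed rows older than 90 days to cold storage, so that hot storage holds only recent history plus the queued/sending rows the retry scheduler needs.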
Webhook ingestion: At 1M DLRs/day, the inbound_webhooks table receives 1M rows/day. The async processor must keep processed = 0 rows near zero to avoid table scan growth. Index on (processed, received_at) and process in small batches with DELETE after processing (or set processed = 1 and run a nightly purge).
Routing engine performance: Routing logic runs on every send request. Cache provider health scores and per-country number pools in Redis with a 10-second TTL. A cold routing decision (full DB query) should take < 5ms; a cached decision < 1ms.
Multi-region compliance: GDPR and local data residency laws may require that message logs for EU numbers are stored in EU-region databases. The gateway must route writes to the correct regional shard based on the destination country.
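The per-number sliding-window limit mentioned under Throughput can be illustrated in memory. This sketch keeps timestamps in a deque per from_number, which mirrors the Redis sorted-set approach (add a timestamp, drop entries older than the window, count what remains) without requiring a Redis server. The `SlidingWindowLimiter` class is hypothetical, and a shared deployment would need the Redis version so that all workers see the same counts.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit              # max sends per window, per number
        self.window = window_seconds
        self._sent = defaultdict(deque)  # from_number -> send timestamps

    def try_acquire(self, from_number: str, now: float) -> bool:
        """Record a send at `now` if the number is under its limit."""
        timestamps = self._sent[from_number]
        # Drop timestamps that have fallen out of the window.
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False  # caller should pick another number or delay
        timestamps.append(now)
        return True
```

A router could call `try_acquire` for each tied from_number in turn and fall through to the next one when a number is at its carrier limit.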

Summary

An SMS gateway is a routing, queuing, and normalization layer that turns the messy reality of multiple SMS providers, per-country regulations, carrier rate limits, and unpredictable delivery receipts into a clean internal API. The core design choices are: async send with a synchronous external_id response, a cost-and-health routing engine with automatic failover, provider adapters for extensibility, and a raw-payload webhook store for reliable DLR processing. Number masking, opt-out enforcement, and segment-count billing are non-obvious requirements that must be built into the intake and routing layers from the start — retrofitting them onto an existing gateway is expensive and error-prone.
