Low Level Design: LLM Gateway Service

What Is an LLM Gateway?

An LLM gateway is a service layer that sits between your application and one or more large language model providers (OpenAI, Anthropic, Google Gemini, Cohere, etc.). It centralizes authentication, routing, quota enforcement, cost tracking, and reliability logic so that application teams consume a single internal API regardless of which model is actually serving a request.

This is a common low-level design question in AI infrastructure interviews. Candidates are expected to walk through the core subsystems, data models, and operational concerns in detail.

Requirements

Functional

  • Unified API surface: clients call one endpoint regardless of provider.
  • Provider abstraction: swap or add providers without changing client code.
  • Request routing: route by model name, cost tier, latency target, or policy.
  • Token quota management: enforce per-user, per-team, and global token budgets.
  • Streaming response handling: proxy server-sent events (SSE) from providers to clients.
  • Cost tracking: record token usage and dollar cost per request.
  • Fallback logic: retry on provider error, fall back to alternate provider.

Non-Functional

  • Low added latency (target < 5 ms gateway overhead for non-streaming).
  • High availability (99.9% uptime independent of any single provider).
  • Horizontal scalability for thousands of concurrent streaming connections.
  • Audit log for compliance and debugging.

High-Level Architecture

Client
  |
  v
[API Gateway / Load Balancer]
  |
  v
[LLM Gateway Service]
  |-- Auth & Quota Middleware
  |-- Router
  |-- Provider Adapter Layer
  |      |-- OpenAI Adapter
  |      |-- Anthropic Adapter
  |      |-- Gemini Adapter
  |-- Streaming Proxy
  |-- Cost Tracker (async)
  |-- Audit Logger (async)
  |
  v
[Provider APIs]

Core Data Models

Request

{
  request_id: UUID,
  client_id: string,
  model: string,          // e.g. "gpt-4o", "claude-3-5-sonnet"
  messages: [...],
  max_tokens: int,
  stream: bool,
  metadata: map
}

ProviderConfig

{
  provider_id: string,
  name: string,
  base_url: string,
  api_key_secret: string,   // reference to secrets manager
  models: [string],
  cost_per_1k_input_tokens: float,
  cost_per_1k_output_tokens: float,
  rate_limit_rpm: int,
  enabled: bool
}

QuotaRecord

{
  entity_id: string,        // user_id or team_id
  entity_type: enum('user', 'team', 'global'),
  period: enum('minute', 'day', 'month'),
  token_limit: long,
  token_used: long,
  cost_limit_usd: float,
  cost_used_usd: float,
  reset_at: timestamp
}

UsageRecord

{
  request_id: UUID,
  client_id: string,
  provider_id: string,
  model: string,
  input_tokens: int,
  output_tokens: int,
  cost_usd: float,
  latency_ms: int,
  status: enum('success', 'error', 'fallback'),
  created_at: timestamp
}

Provider Abstraction Layer

Each provider is wrapped in an adapter that implements a common interface:

interface LLMProvider {
  complete(request: NormalizedRequest): Response
  stream(request: NormalizedRequest): Stream<Chunk>
  countTokens(messages: Message[]): int
  isHealthy(): bool
}

The adapter translates the internal normalized request format into the provider-specific API payload, handles authentication headers, and normalizes error codes back to a common error taxonomy (rate_limit, context_length, auth_error, server_error).
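A minimal Python sketch of this adapter pattern is below. The `NormalizedRequest` fields mirror the Request model above; the OpenAI payload shape and the 4-characters-per-token estimate are illustrative assumptions, not exact provider behavior, and `stream()` is omitted for brevity.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class NormalizedRequest:
    model: str
    messages: list
    max_tokens: int = 512
    stream: bool = False

# Common error taxonomy shared by all adapters (names from the text above).
ERROR_TAXONOMY = {401: "auth_error", 429: "rate_limit", 500: "server_error"}

class LLMProvider(ABC):
    @abstractmethod
    def complete(self, request: NormalizedRequest) -> dict: ...
    @abstractmethod
    def count_tokens(self, messages: list) -> int: ...
    @abstractmethod
    def is_healthy(self) -> bool: ...

class OpenAIAdapter(LLMProvider):
    """Translates the normalized format into an OpenAI-style payload."""

    def complete(self, request):
        payload = {
            "model": request.model,
            "messages": request.messages,
            "max_tokens": request.max_tokens,
        }
        # A real adapter would POST this with auth headers and normalize
        # the response and error codes; here we return the translated payload.
        return payload

    def count_tokens(self, messages):
        # Crude pre-check estimate: ~4 characters per token (assumption).
        return sum(len(m["content"]) for m in messages) // 4

    def is_healthy(self):
        return True

req = NormalizedRequest(model="gpt-4o",
                        messages=[{"role": "user", "content": "Hello"}])
adapter = OpenAIAdapter()
print(adapter.complete(req)["model"])   # gpt-4o
```

The router and quota middleware only ever see `LLMProvider`, so adding a provider means writing one new adapter class and registering it.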

Request Routing

The router selects a provider+model combination based on a routing policy attached to the request or the calling client:

  • Explicit model routing: client specifies model name, gateway resolves to provider.
  • Cost-optimized routing: gateway selects cheapest capable provider for the task.
  • Latency-optimized routing: gateway selects provider with lowest p95 latency from recent health metrics.
  • A/B routing: percentage split between providers for evaluation.

Routing rules are stored in a config table and hot-reloaded via a watch mechanism (e.g., etcd watch or polling a config service every 30 seconds).
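As an illustration of one of these policies, a cost-optimized route over the ProviderConfig records can be a simple filter-then-min. The provider entries below are hypothetical examples, not real price data:

```python
def route_cheapest(providers, model):
    """Pick the cheapest enabled provider that serves the requested model."""
    candidates = [p for p in providers
                  if p["enabled"] and model in p["models"]]
    if not candidates:
        raise LookupError(f"no provider serves {model}")
    return min(candidates, key=lambda p: p["cost_per_1k_input_tokens"])

# Illustrative ProviderConfig rows (prices are made up for the example).
providers = [
    {"provider_id": "openai", "enabled": True, "models": ["gpt-4o"],
     "cost_per_1k_input_tokens": 2.5},
    {"provider_id": "azure-openai", "enabled": True, "models": ["gpt-4o"],
     "cost_per_1k_input_tokens": 2.0},
]
print(route_cheapest(providers, "gpt-4o")["provider_id"])  # azure-openai
```

Latency-optimized routing has the same shape, with the `min` key swapped for a recent p95 latency metric from the health cache.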

Token Quota Management

Quota enforcement runs as middleware before the request is dispatched to a provider.

  1. Load quota record for client_id (from Redis with TTL aligned to quota period).
  2. Pre-check: estimate input tokens using a lightweight tokenizer. If projected usage would exceed limit, reject with 429.
  3. Reserve tokens optimistically (increment counter before dispatch).
  4. After response: adjust reservation with actual token counts reported by provider.
  5. On error: release the reservation.

Using an atomic Redis INCRBY with an EXPIRE aligned to the quota window avoids double-spending under concurrent requests. A sliding-window or fixed-window strategy is chosen per quota tier.
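The reserve/reconcile steps above can be sketched as follows. An in-memory dict stands in for Redis here so the example is self-contained; the increments map directly onto INCRBY with a window-aligned EXPIRE, and the key layout (`quota:<client>:minute`) is an assumption for illustration:

```python
import time

class QuotaStore:
    """In-memory stand-in for Redis: key -> (tokens_used, window_expiry)."""
    def __init__(self):
        self.counters = {}

    def incrby(self, key, amount, ttl):
        used, expires = self.counters.get(key, (0, time.time() + ttl))
        if time.time() >= expires:                 # fixed-window reset
            used, expires = 0, time.time() + ttl
        self.counters[key] = (used + amount, expires)
        return used + amount

def reserve_tokens(store, client_id, estimated, limit, ttl=60):
    """Optimistically reserve tokens before dispatch (step 3)."""
    key = f"quota:{client_id}:minute"
    if store.incrby(key, estimated, ttl) > limit:
        store.incrby(key, -estimated, ttl)   # roll back; caller returns 429
        return False
    return True

def reconcile(store, client_id, estimated, actual, ttl=60):
    """Replace the estimate with provider-reported usage (step 4)."""
    store.incrby(f"quota:{client_id}:minute", actual - estimated, ttl)

store = QuotaStore()
reserve_tokens(store, "team-a", estimated=800, limit=1000)   # allowed
reserve_tokens(store, "team-a", estimated=500, limit=1000)   # rejected
reconcile(store, "team-a", estimated=800, actual=620)
```

Reserving before dispatch is what closes the parallel-request loophole discussed in the follow-ups section: concurrent requests each pay their estimate up front, so the counter can never be overdrawn by in-flight work.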

Streaming Response Handling

Streaming is the latency-sensitive path. The gateway must not buffer the full response before forwarding.

  • Open an HTTP/2 or chunked-transfer connection to the provider.
  • As SSE chunks arrive, forward them directly to the client over a persistent connection.
  • Accumulate token counts from streamed chunks for quota and cost accounting.
  • On stream completion or error, write the final UsageRecord asynchronously.
  • Apply a stream timeout (e.g., 120 s) to prevent zombie connections.

The streaming proxy is implemented as an async event loop (Node.js, Go goroutines, or Python asyncio) rather than a thread-per-request model to support high concurrency.
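A minimal asyncio sketch of the forward-as-you-go loop is below. The provider stream is faked with three chunks so the example runs standalone; the chunk shape (`delta`, `tokens`) and the `send` callback are assumptions standing in for the real SSE parser and client connection:

```python
import asyncio

async def fake_provider_stream():
    """Stand-in for the provider's SSE stream (three chunks, assumed shape)."""
    for chunk in ("Hel", "lo ", "world"):
        await asyncio.sleep(0)            # yield control, like real network I/O
        yield {"delta": chunk, "tokens": 1}

async def proxy_stream(provider_stream, send, timeout_s=120):
    """Forward chunks as they arrive; never buffer the full response."""
    output_tokens = 0

    async def pump():
        nonlocal output_tokens
        async for chunk in provider_stream:
            output_tokens += chunk["tokens"]   # accumulate for quota/cost
            await send(chunk["delta"])         # write straight to the client

    # Stream timeout guards against zombie connections (the 120 s above).
    await asyncio.wait_for(pump(), timeout=timeout_s)
    return output_tokens                       # feeds the final UsageRecord

received = []
async def send(delta):
    received.append(delta)

tokens = asyncio.run(proxy_stream(fake_provider_stream(), send))
print("".join(received), tokens)   # Hello world 3
```

Because each connection is just a suspended coroutine rather than a thread, a single instance can hold thousands of concurrent streams open.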

Cost Tracking

Cost is computed post-response using the formula:

cost = (input_tokens / 1000) * cost_per_1k_input
     + (output_tokens / 1000) * cost_per_1k_output

Cost records are written to a time-series store (ClickHouse or BigQuery) via an async message queue (Kafka or SQS) to avoid adding latency to the hot path. A daily aggregation job rolls up per-request records into per-client cost summaries for billing and alerting.
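The formula applied to a ProviderConfig record looks like this; the rates are made-up example numbers, not real pricing:

```python
def request_cost(input_tokens, output_tokens, cfg):
    """Apply the per-1k-token rates from ProviderConfig."""
    return (input_tokens / 1000) * cfg["cost_per_1k_input_tokens"] \
         + (output_tokens / 1000) * cfg["cost_per_1k_output_tokens"]

# Hypothetical rates for illustration only.
cfg = {"cost_per_1k_input_tokens": 2.50, "cost_per_1k_output_tokens": 10.00}
print(request_cost(1200, 300, cfg))   # 3.00 + 3.00 -> approximately 6.00 USD
```

In production this runs in the async consumer off the queue, so a pricing change never touches request-path code.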

Fallback Logic

Fallback is triggered by transient provider errors (5xx, rate limit 429, timeout):

  1. Attempt primary provider. On failure, classify error.
  2. For retryable errors: retry primary up to N times with exponential backoff.
  3. After max retries: select next provider from the fallback chain defined in routing config.
  4. Record fallback event in UsageRecord for observability.
  5. Circuit breaker per provider: if error rate exceeds threshold, mark provider as degraded and skip it in routing for a cooldown period.
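Steps 1 through 3 can be sketched as a loop over the fallback chain. The error-classification convention (exception message carrying the taxonomy name) and the simulated provider calls are assumptions for the example; a real implementation would plug in the adapter layer's normalized errors and a per-provider circuit-breaker check:

```python
import time

RETRYABLE = {"rate_limit", "server_error", "timeout"}

def dispatch(chain, call, max_retries=2, backoff_base=0.0):
    """Try each provider in the fallback chain; retry retryable errors."""
    for provider in chain:
        for attempt in range(max_retries + 1):
            try:
                return call(provider), provider
            except RuntimeError as err:   # message carries the error class
                if str(err) not in RETRYABLE:
                    break                 # non-retryable: next provider
                time.sleep(backoff_base * (2 ** attempt))  # exponential backoff
    raise RuntimeError("all providers exhausted")

# Simulated calls: the primary always returns a 5xx, the fallback succeeds.
def call(provider):
    if provider == "openai":
        raise RuntimeError("server_error")
    return {"text": "ok", "provider": provider}

result, used = dispatch(["openai", "anthropic"], call)
print(used)   # anthropic -- recorded as status='fallback' in the UsageRecord
```

The circuit breaker from step 5 slots in as a guard at the top of the outer loop: a provider whose error rate has tripped the threshold is skipped until its cooldown expires.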

API Design

POST /v1/chat/completions
Authorization: Bearer <client_token>
Content-Type: application/json

{
  "model": "gpt-4o",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": true,
  "max_tokens": 512
}

The gateway exposes an OpenAI-compatible API surface so existing client SDKs work without modification. Provider-specific extensions are passed via an optional x-gateway-options header.

Observability

  • Metrics: request rate, error rate, p50/p95/p99 latency, token throughput, cost per minute — all broken down by provider, model, and client.
  • Tracing: distributed trace per request with spans for auth, quota check, provider call, and response forwarding.
  • Audit log: immutable log of every request (request_id, client_id, model, token counts, cost, outcome) written to append-only storage.
  • Alerting: PagerDuty alerts on provider error rate spike, quota exhaustion approaching, and cost anomaly.

Scale Considerations

  • Stateless gateway instances scale horizontally behind a load balancer.
  • Quota counters live in Redis Cluster; sharded by entity_id.
  • Provider health metrics aggregated in a shared cache (Redis or memcached) updated by a background health-check loop.
  • Usage records written to Kafka; consumers write to ClickHouse for analytics.
  • Config hot-reload avoids restarts for routing rule changes.

Common Interview Follow-Ups

  • How do you handle prompt injection filtering? Add an input validation middleware layer before routing; integrate with a content safety service.
  • How do you support multi-modal requests? Extend NormalizedRequest to include image/audio parts; adapters handle provider-specific encoding.
  • How do you prevent quota gaming via parallel requests? Use Redis INCR reservation before dispatch rather than post-hoc accounting.
  • How do you support fine-tuned models? ProviderConfig includes a models registry; fine-tuned model IDs are registered with their base provider.


