Feature Toggle Service Low-Level Design: Toggle Types, Evaluation Order, and Kill Switch

What Is a Feature Toggle Service?

A feature toggle service stores toggle definitions, evaluates them against a user and request context at runtime, enforces evaluation order rules across toggle types, provides a kill switch to disable functionality instantly without a deployment, and maintains a full audit history of every toggle change. It decouples feature rollout from code deployment and is used to safely release, experiment with, and operate features across large user populations.

Requirements

Functional Requirements

  • Support three toggle types: release (gradual rollout by percentage), experiment (A/B test variant gate), and ops (on/off operational controls).
  • Evaluate toggles for a given user and request context, applying targeting rules (user segment, device, region).
  • Enforce evaluation order: ops toggles override release toggles, which override experiment toggles.
  • Provide a kill switch: an ops toggle that immediately disables a feature for 100% of traffic.
  • Log every toggle change with actor, timestamp, and before/after values.
  • Support local SDK caching so toggle evaluation requires no network call in the hot path.

Non-Functional Requirements

  • Toggle evaluation under 1 ms in the client SDK (local cache).
  • Config changes propagated to all SDK instances within 10 seconds.
  • Audit history retained indefinitely; queryable by toggle name and date.

Data Model

Toggle

  • toggle_id UUID — primary key.
  • name VARCHAR — unique human-readable identifier used in code.
  • toggle_type ENUM: RELEASE, EXPERIMENT, OPS.
  • enabled BOOLEAN — master switch; false means the toggle is off for everyone.
  • rollout_percentage SMALLINT — for RELEASE toggles; 0-100 (PostgreSQL has no TINYINT type, so use SMALLINT with a range check).
  • targeting_rules JSONB — array of rule objects with field, operator, and value.
  • version INTEGER — incremented on every change for cache invalidation.

ToggleAuditEntry

  • entry_id UUID — primary key.
  • toggle_id UUID — foreign key to Toggle.
  • changed_by — user ID of the operator who made the change.
  • change_type ENUM: CREATED, UPDATED, DELETED.
  • before_state, after_state JSONB — full snapshots of the toggle record.
  • changed_at TIMESTAMP — when the change was made.
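
The two records above could be modelled in application code roughly as follows. This is a minimal sketch, not a definitive schema: the class names, the default values, and the `snapshot` helper are assumptions made for illustration, and the range check stands in for whatever constraint the database would enforce.

```python
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class ToggleType(Enum):
    RELEASE = "RELEASE"
    EXPERIMENT = "EXPERIMENT"
    OPS = "OPS"


class ChangeType(Enum):
    CREATED = "CREATED"
    UPDATED = "UPDATED"
    DELETED = "DELETED"


@dataclass(frozen=True)
class Toggle:
    name: str
    toggle_type: ToggleType
    enabled: bool = True              # master switch
    rollout_percentage: int = 100     # only meaningful for RELEASE toggles
    targeting_rules: tuple = ()       # rule dicts: field, operator, value
    version: int = 1                  # incremented on every change
    toggle_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __post_init__(self):
        # Mirrors the 0-100 range from the data model.
        if not 0 <= self.rollout_percentage <= 100:
            raise ValueError("rollout_percentage must be between 0 and 100")


@dataclass(frozen=True)
class ToggleAuditEntry:
    toggle_id: str
    changed_by: str
    change_type: ChangeType
    before_state: Optional[dict]      # None for CREATED (assumption)
    after_state: Optional[dict]       # None for DELETED (assumption)
    changed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    entry_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def snapshot(toggle: Toggle) -> dict:
    """Hypothetical helper: a JSON-friendly copy for before_state/after_state."""
    data = asdict(toggle)
    data["toggle_type"] = toggle.toggle_type.value
    data["targeting_rules"] = list(data["targeting_rules"])
    return data
```

Making the dataclasses frozen means every change produces a new object, which fits the idea of capturing full before and after snapshots for the audit trail.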

Core Algorithms

Evaluation Order

When evaluating whether a user has access to a feature, the service processes toggle types in strict priority order. First, OPS toggles are evaluated: if any OPS toggle for the feature has enabled=false, evaluation short-circuits and returns disabled regardless of other toggles. This is the kill switch path. Next, RELEASE toggles are evaluated against rollout percentage and targeting rules. Finally, EXPERIMENT toggles are evaluated to gate feature variants. This ordering ensures that operational controls can never be overridden by experiment or release logic.
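
The priority order described above can be sketched as a single function. Several things here are assumptions rather than part of the design: the caller passes in the list of toggles that guard one feature, the `bucket` hashing scheme (SHA-256 of toggle name and user ID) is just one common way to get a stable percentage bucket, and targeting rules and experiment variant assignment are left out to keep the ordering visible.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Toggle:
    name: str
    toggle_type: str                  # "OPS", "RELEASE", or "EXPERIMENT"
    enabled: bool = True
    rollout_percentage: int = 100     # used by RELEASE toggles only


def bucket(toggle_name: str, user_id: str) -> int:
    """Map (toggle, user) to a stable bucket in 0..99 (assumed scheme)."""
    digest = hashlib.sha256(f"{toggle_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def of_type(toggles, kind):
    return [t for t in toggles if t.toggle_type == kind]


def is_feature_enabled(toggles, user_id: str) -> bool:
    # 1. OPS first: any disabled OPS toggle is the kill switch path.
    if any(not t.enabled for t in of_type(toggles, "OPS")):
        return False

    # 2. RELEASE next: master switch, then percentage rollout.
    for t in of_type(toggles, "RELEASE"):
        if not t.enabled or bucket(t.name, user_id) >= t.rollout_percentage:
            return False

    # 3. EXPERIMENT last: here reduced to an on/off gate.
    for t in of_type(toggles, "EXPERIMENT"):
        if not t.enabled:
            return False

    return True
```

Because the bucket depends only on the toggle name and user ID, a given user stays in or out of a rollout as the percentage grows, rather than flipping between requests.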

Targeting Rule Evaluation

Each targeting rule specifies a context field (e.g., user.country, device.type), an operator (IN, NOT_IN, EQUALS, MATCHES_REGEX), and a value set. Rules within a toggle are combined with AND logic by default; OR grouping is supported via a rule group field. The SDK evaluates rules entirely from the locally cached toggle definition against the request context object, requiring no server round-trip. Unsupported operators default to false to fail safe.
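
A rule evaluator along these lines might look like the sketch below. The `group` key used for OR grouping, the dotted-path lookup, and the choice to treat a missing context field as a non-match are all assumptions; the text above only specifies the operators, the AND default, and that unsupported operators evaluate to false.

```python
import re
from collections.abc import Mapping


def lookup(context, path):
    """Resolve a dotted path such as 'user.country'; None if absent."""
    current = context
    for part in path.split("."):
        if not isinstance(current, Mapping) or part not in current:
            return None
        current = current[part]
    return current


def rule_matches(rule, context) -> bool:
    actual = lookup(context, rule["field"])
    if actual is None:
        return False  # assumption: a missing field never matches
    operator, expected = rule["operator"], rule["value"]
    if operator == "IN":
        return actual in expected
    if operator == "NOT_IN":
        return actual not in expected
    if operator == "EQUALS":
        return actual == expected
    if operator == "MATCHES_REGEX":
        return isinstance(actual, str) and re.search(expected, actual) is not None
    return False  # unsupported operator: fail safe


def rules_match(rules, context) -> bool:
    """AND across rules; rules sharing a 'group' value are OR-ed together."""
    groups = {}
    for index, rule in enumerate(rules):
        key = rule.get("group", f"_ungrouped_{index}")
        groups.setdefault(key, []).append(rule)
    return all(
        any(rule_matches(rule, context) for rule in members)
        for members in groups.values()
    )
```

An empty rule list matches every context, which is consistent with a toggle that has no targeting rules applying to all traffic.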

Kill Switch

A kill switch is simply an OPS toggle with no targeting rules; the feature is killed when the toggle is set to enabled=false. Setting it to false via the admin API writes the change to the database with an incremented toggle version, appends an audit entry, and publishes a TOGGLE_UPDATED event to a pub/sub channel. SDK instances subscribed to the channel update their local cache as soon as the event arrives, so the kill switch propagates across the fleet within seconds and without any deployment.
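
The write path for flipping a kill switch can be sketched with in-memory stand-ins. Everything named here is hypothetical: the `InMemoryToggleAdmin` class, the dict that plays the role of the database, the list that plays the role of the audit table, and the subscriber callbacks that play the role of the pub/sub channel. A real service would wrap the first two steps in a database transaction.

```python
import copy
from datetime import datetime, timezone


class InMemoryToggleAdmin:
    def __init__(self):
        self.toggles = {}       # name -> toggle dict (stand-in for the database)
        self.audit_log = []     # append-only audit entries
        self.subscribers = []   # callables (stand-in for the pub/sub channel)

    def set_enabled(self, name, enabled, actor):
        before = copy.deepcopy(self.toggles[name])

        # 1. Write the change and bump the version together.
        after = copy.deepcopy(before)
        after.update(enabled=enabled, version=before["version"] + 1)
        self.toggles[name] = after

        # 2. Append an audit entry with full before/after snapshots.
        self.audit_log.append({
            "toggle_name": name,
            "changed_by": actor,
            "change_type": "UPDATED",
            "before_state": before,
            "after_state": copy.deepcopy(after),
            "changed_at": datetime.now(timezone.utc),
        })

        # 3. Publish the event so subscribed SDKs can refresh.
        event = {"type": "TOGGLE_UPDATED", "name": name, "version": after["version"]}
        for notify in self.subscribers:
            notify(event)
        return after

    def kill(self, name, actor):
        """Kill switch: disable the toggle for all traffic."""
        return self.set_enabled(name, False, actor)
```

Publishing only after the write and the audit entry succeed means an SDK that reacts to the event will always find the new version when it fetches the toggle.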

API Design

  • GET /v1/toggles — returns all toggle definitions; used by SDK on initial load and periodic refresh.
  • GET /v1/toggles/{name} — returns a single toggle definition by name.
  • POST /v1/toggles — create a new toggle; body includes name, type, rules, rollout_percentage.
  • PATCH /v1/toggles/{name} — update a toggle (change rules, percentage, enabled state).
  • POST /v1/toggles/{name}/evaluate — server-side evaluation for contexts where SDK is not available; body: user context object.
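  • GET /v1/toggles/{name}/audit — paginated audit history for a toggle.
  • GET /v1/toggles/stream — server-sent events stream of toggle change events; used by the SDK for real-time push updates.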

Scalability and SDK Architecture

Client SDK

The SDK loads the full toggle configuration on initialization via the GET /v1/toggles endpoint and stores it in a local in-memory map. It subscribes to a server-sent events stream at GET /v1/toggles/stream for real-time push updates. On receiving a TOGGLE_UPDATED event, the SDK fetches only the changed toggle by name and updates its local map. This hybrid pull-then-push model keeps the SDK's config fresh without polling in the normal case; if the SSE connection drops, the SDK falls back to periodic polling until the stream is re-established.
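
The local cache at the heart of the SDK might be sketched as below. The `ToggleCache` class and its injected `fetch_all` and `fetch_one` callables are hypothetical stand-ins for the HTTP calls to GET /v1/toggles and GET /v1/toggles/{name}; the version comparison that ignores stale responses is an added assumption based on the version field in the data model.

```python
import threading


class ToggleCache:
    def __init__(self, fetch_all, fetch_one):
        self._fetch_all = fetch_all    # () -> list of toggle dicts
        self._fetch_one = fetch_one    # (name) -> toggle dict
        self._toggles = {}
        self._lock = threading.Lock()

    def load(self):
        """Full load: used at startup and by the periodic-poll fallback."""
        fresh = {t["name"]: t for t in self._fetch_all()}
        with self._lock:
            self._toggles = fresh

    def on_event(self, event):
        """Handle one event from the push stream."""
        if event.get("type") != "TOGGLE_UPDATED":
            return
        fresh = self._fetch_one(event["name"])
        with self._lock:
            current = self._toggles.get(fresh["name"])
            # Only move forward: skip responses older than what is cached.
            if current is None or fresh["version"] > current["version"]:
                self._toggles[fresh["name"]] = fresh

    def get(self, name):
        """Hot path: a dictionary lookup, no network call."""
        with self._lock:
            return self._toggles.get(name)
```

Reusing `load` for the polling fallback keeps one code path for full refreshes, so a dropped stream only changes how often that method is called.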

Scalability

The toggle service itself is stateless; all toggle state is in PostgreSQL. The GET /v1/toggles response is served from a Redis cache with a 5-second TTL, making the initial SDK load fast even under heavy traffic. The SSE stream endpoint is handled by a lightweight streaming server that holds long-lived HTTP connections open and fans out toggle change events from Redis pub/sub to all connected SDK instances, scaling to tens of thousands of concurrent SDK connections per service instance.
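
The effect of the 5-second TTL on the GET /v1/toggles path can be illustrated with a small read-through cache. This is a sketch under stated assumptions: the `TtlCache` class is hypothetical, an in-process value stands in for the Redis entry, and `loader` stands in for the PostgreSQL query.

```python
import time


class TtlCache:
    def __init__(self, loader, ttl_seconds=5.0, clock=time.monotonic):
        self._loader = loader          # called on a miss or after expiry
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so tests can control time
        self._loaded = False
        self._value = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if not self._loaded or now >= self._expires_at:
            # Miss or expired: reload from the source of truth.
            self._value = self._loader()
            self._expires_at = now + self._ttl
            self._loaded = True
        return self._value
```

With this arrangement the database sees at most one full-config query per TTL window per cache entry, regardless of how many SDK instances start up at once.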

Frequently Asked Questions

What is the toggle type taxonomy for a feature toggle system?

Feature toggles fall into three main types: release toggles (short-lived, gate incomplete features until ready to ship), experiment toggles (tied to A/B tests, drive variant assignment), and ops toggles (long-lived operational controls like kill switches and load-shedding flags). Each type has different lifecycle expectations: release toggles should be removed after launch, while ops toggles may be permanent infrastructure.

How does evaluation order work with an explicit DENY rule?

The evaluator processes rules in priority order: explicit DENY rules are checked first and short-circuit evaluation, returning off regardless of other matching rules. After DENY, explicit ALLOW rules are evaluated in order. A default value (on or off) applies when no rules match. This order ensures that blocklisting specific users or environments always takes precedence over broader allow rules, which is critical for safety and compliance use cases.

What are the semantics of a kill switch in a feature toggle system?

A kill switch is an ops toggle configured to disable a feature for all users instantly, without a deployment. It is evaluated first in the rule chain and, when activated, returns off for 100% of traffic. Kill switches must propagate to all service instances within seconds (push or short-poll interval), so they're typically backed by a distributed config store (etcd, Consul, or a CDN-cached config endpoint) with aggressive TTLs. Services must fail open or closed predictably when the toggle service is unreachable.

How is audit history retained for feature toggle changes?

Every mutation to a toggle's configuration (create, update, delete, enable, disable) is written to an immutable audit log with actor identity, timestamp, before/after state, and an optional change reason. The audit log is append-only (stored in a separate table or event stream) and retained per compliance policy (commonly 1–7 years). A UI surfaces the audit trail per toggle, enabling teams to correlate incidents with config changes.
