Feature Toggle Service Low-Level Design: Toggle Types, Targeting Rules, and Kill Switch

Why Feature Toggles?

Feature toggles (also called feature flags or feature switches) decouple code deployment from feature release. You ship code to production behind a toggle that is off, then gradually enable it for users without a new deployment. This enables continuous delivery: every commit goes to prod, and business stakeholders control when features become visible. Toggles also provide an instant kill switch during incidents: one change disables a misbehaving feature without a rollback deployment.

Toggle Types

Not all toggles serve the same purpose. Mixing them into one undifferentiated list creates confusion about expected lifecycle and ownership:

Release toggles: Hide incomplete features during development. Short-lived — should be removed once the feature reaches 100% rollout. Owned by the engineering team.
Ops toggles (kill switches): Disable specific functionality during incidents or overload. Can be long-lived. Owned by the on-call team. Must evaluate first in the decision chain.
Experiment toggles: Split traffic for A/B tests. Medium-lived — cleaned up after the experiment concludes and a winner is chosen. Owned by product/data teams.
Permission toggles: Enable features for specific customer segments (enterprise tier, beta users). Long-lived — they reflect product packaging, not a transient state. Owned by product.

Toggle Schema

{
  "toggle_id": "new_checkout_flow",
  "type": "release",
  "status": "active",
  "targeting_rules": [
    {"attribute": "account_type", "operator": "eq", "value": "beta"},
    {"attribute": "country", "operator": "in", "value": ["US", "CA"]}
  ],
  "rollout_percentage": 25,
  "created_by": "eng-team@example.com",
  "updated_at": "2025-10-15T09:00:00Z"
}

The schema captures type and lifecycle context alongside the evaluation rules. The status field can be active, inactive, or archived. Archived toggles are retired: their code paths are deleted or scheduled for deletion.
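As a rough illustration, the document above could be loaded into a typed structure and validated at the SDK boundary. This is a minimal sketch, not the service's actual model: the names Rule, Toggle, and parse_toggle are hypothetical, and the type spellings other than "release" ("ops", "experiment", "permission") are assumed from the toggle types listed earlier.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Tuple

# Assumed spellings; only "release" appears in the sample document.
VALID_TYPES = {"release", "ops", "experiment", "permission"}
VALID_STATUSES = {"active", "inactive", "archived"}


@dataclass(frozen=True)
class Rule:
    attribute: str
    operator: str
    value: Any


@dataclass(frozen=True)
class Toggle:
    toggle_id: str
    type: str
    status: str
    targeting_rules: Tuple[Rule, ...]
    rollout_percentage: int
    created_by: str
    updated_at: datetime


def parse_toggle(raw: str) -> Toggle:
    """Parse one toggle document and reject obviously invalid values."""
    doc = json.loads(raw)
    if doc["type"] not in VALID_TYPES:
        raise ValueError(f"unknown toggle type: {doc['type']!r}")
    if doc["status"] not in VALID_STATUSES:
        raise ValueError(f"unknown status: {doc['status']!r}")
    percentage = doc.get("rollout_percentage", 0)
    if not 0 <= percentage <= 100:
        raise ValueError(f"rollout_percentage out of range: {percentage}")
    return Toggle(
        toggle_id=doc["toggle_id"],
        type=doc["type"],
        status=doc["status"],
        targeting_rules=tuple(Rule(**r) for r in doc.get("targeting_rules", [])),
        rollout_percentage=percentage,
        created_by=doc["created_by"],
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        updated_at=datetime.fromisoformat(doc["updated_at"].replace("Z", "+00:00")),
    )


EXAMPLE = """
{
  "toggle_id": "new_checkout_flow",
  "type": "release",
  "status": "active",
  "targeting_rules": [
    {"attribute": "account_type", "operator": "eq", "value": "beta"},
    {"attribute": "country", "operator": "in", "value": ["US", "CA"]}
  ],
  "rollout_percentage": 25,
  "created_by": "eng-team@example.com",
  "updated_at": "2025-10-15T09:00:00Z"
}
"""

toggle = parse_toggle(EXAMPLE)
```

Validating at parse time keeps malformed configuration out of the in-memory cache, so evaluation code can assume well-formed toggles.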

Targeting Rules

Targeting rules evaluate attributes of the requesting entity (user, account, request context) against conditions. Each rule has the form attribute operator value. Supported operators: eq, neq, in, not_in, starts_with, regex. Rules are evaluated in order, and the first matching rule determines the outcome. If no rule matches, evaluation falls through to the percentage rollout and then to the default value.

Examples: route beta users to the new feature (account_type eq beta), restrict to specific markets (country in [US, CA, GB]), enable for enterprise accounts (plan eq enterprise). Attributes come from the evaluation context passed by the calling application — the toggle service itself is stateless with respect to user data.
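The ordered, first-match evaluation described above might look like the following sketch. The function names are hypothetical, rules are plain dicts in the shape of the schema's targeting_rules entries, and one behavior is an assumption the text does not specify: a rule whose attribute is missing from the context never matches, regardless of operator.

```python
import re
from typing import Any, Mapping, Optional, Sequence

# One predicate per supported operator: (actual value, rule value) -> bool.
OPERATORS = {
    "eq": lambda actual, expected: actual == expected,
    "neq": lambda actual, expected: actual != expected,
    "in": lambda actual, expected: actual in expected,
    "not_in": lambda actual, expected: actual not in expected,
    "starts_with": lambda actual, expected: (
        isinstance(actual, str) and actual.startswith(expected)
    ),
    "regex": lambda actual, expected: (
        isinstance(actual, str) and re.search(expected, actual) is not None
    ),
}


def rule_matches(rule: Mapping[str, Any], context: Mapping[str, Any]) -> bool:
    """Return True if the context satisfies a single rule."""
    attribute = rule["attribute"]
    if attribute not in context:
        # Assumption: a missing attribute never matches, even for neq/not_in.
        return False
    return OPERATORS[rule["operator"]](context[attribute], rule["value"])


def first_matching_rule(
    rules: Sequence[Mapping[str, Any]], context: Mapping[str, Any]
) -> Optional[int]:
    """Return the index of the first rule that matches, or None."""
    for index, rule in enumerate(rules):
        if rule_matches(rule, context):
            return index
    return None


rules = [
    {"attribute": "account_type", "operator": "eq", "value": "beta"},
    {"attribute": "country", "operator": "in", "value": ["US", "CA"]},
]
```

Because the evaluator only reads the context mapping passed in by the caller, it needs no user data of its own, which matches the stateless design described above.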

Percentage Rollout with Consistent Hashing

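Percentage rollout must be consistent: the same user must always get the same toggle value, not a random flip on each request. The implementation uses deterministic hash bucketing (often loosely called consistent hashing): hash(toggle_id + user_id) mod 100 yields a bucket from 0 to 99. If the bucket is less than rollout_percentage, the toggle is enabled for this user. The hash must be stable across processes and SDK languages, so use a fixed algorithm rather than a language's built-in hash, which may be seeded per process. Including toggle_id in the hash input prevents all toggles from splitting traffic at identical boundaries (which would create correlated experiment groups).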

Increasing rollout percentage from 10% to 20% enables the toggle for a new consistent 10% slice of users — the original 10% remain enabled. Decreasing percentage disables it for a slice, which is useful for rolling back a gradual rollout without a full kill.
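The bucketing can be sketched in a few lines. The text only specifies hash(toggle_id + user_id) mod 100; the choice of SHA-256 and the ":" separator between the two IDs are assumptions made here for stability and to avoid ambiguous concatenations.

```python
import hashlib


def bucket(toggle_id: str, user_id: str) -> int:
    """Map (toggle, user) to a stable bucket in the range 0..99."""
    # Assumption: SHA-256 over "toggle_id:user_id". Any hash that is stable
    # across processes works; Python's built-in hash() is not, because string
    # hashing is randomized per process by default.
    key = f"{toggle_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(toggle_id: str, user_id: str, rollout_percentage: int) -> bool:
    """A user is enabled when their bucket falls below the percentage."""
    return bucket(toggle_id, user_id) < rollout_percentage


# Everyone enabled at 10% is still enabled at 20%, because buckets 0..9
# are a subset of buckets 0..19.
users = [f"user-{n}" for n in range(1000)]
at_10 = {u for u in users if in_rollout("new_checkout_flow", u, 10)}
at_20 = {u for u in users if in_rollout("new_checkout_flow", u, 20)}
```

Since a user's bucket never changes, raising the percentage only adds users and lowering it only removes those in the highest enabled buckets.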

Kill Switch Behavior

Ops toggles function as kill switches: when an ops toggle is set to inactive, it overrides all other toggle types and evaluation rules. The evaluation order must enforce this: kill switch check → targeting rules → percentage rollout → default. An on-call engineer disabling a feature during an incident should not need to understand targeting rules or percentage settings — setting the toggle inactive is sufficient and immediate.
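The evaluation order can be made explicit in code. The sketch below rests on several assumptions the text leaves open: the kill switch is modeled as the toggle's own status being inactive, a matching targeting rule means "enabled", users without a user_id skip the percentage step, and only the eq and in operators are handled to keep it short. The helper names are hypothetical.

```python
import hashlib
from typing import Any, Mapping


def _bucket(toggle_id: str, user_id: str) -> int:
    # Assumed hash scheme: SHA-256 over "toggle_id:user_id", reduced mod 100.
    digest = hashlib.sha256(f"{toggle_id}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def _matches(rule: Mapping[str, Any], context: Mapping[str, Any]) -> bool:
    if rule["attribute"] not in context:
        return False
    actual = context[rule["attribute"]]
    if rule["operator"] == "eq":
        return actual == rule["value"]
    if rule["operator"] == "in":
        return actual in rule["value"]
    raise ValueError(f"operator not handled in this sketch: {rule['operator']}")


def evaluate(
    toggle: Mapping[str, Any], context: Mapping[str, Any], default: bool = False
) -> bool:
    # 1. Kill switch: an inactive toggle is off, whatever else is configured.
    if toggle["status"] == "inactive":
        return False
    # 2. Targeting rules: the first match wins (assumed to mean "enabled").
    for rule in toggle.get("targeting_rules", []):
        if _matches(rule, context):
            return True
    # 3. Percentage rollout, only when there is a user to bucket.
    user_id = context.get("user_id")
    if user_id is not None and "rollout_percentage" in toggle:
        return _bucket(toggle["toggle_id"], user_id) < toggle["rollout_percentage"]
    # 4. Default value supplied by the caller.
    return default


toggle = {
    "toggle_id": "new_checkout_flow",
    "status": "active",
    "targeting_rules": [
        {"attribute": "account_type", "operator": "eq", "value": "beta"}
    ],
    "rollout_percentage": 0,
}
beta_user = {"user_id": "u1", "account_type": "beta"}
killed = {**toggle, "status": "inactive", "rollout_percentage": 100}
```

Here evaluate(toggle, beta_user) returns True through the targeting rule, while evaluate(killed, beta_user) returns False: the status check runs first, so the on-call engineer does not need to touch rules or percentages.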

SDK Design and Client-Side Caching

The SDK runs inside the application process. At startup it fetches all toggles for the application's namespace and caches them in memory. A background thread refreshes the cache every 30 seconds via polling, or immediately on push notification from the toggle service. Toggle evaluation is a pure in-memory operation — no network call per toggle check. This is critical: toggles may be evaluated on every request, and a network call per check would add unacceptable latency and create a dependency on toggle service availability.

If the toggle service is unreachable during a refresh, the SDK serves the last known values. This fallback behavior must be documented and tested: toggles should not fail open or closed in undefined ways when the service is down.
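A minimal sketch of the cache-and-refresh loop, assuming a hypothetical fetch callable that returns a dict of toggle_id to toggle document and raises on failure. Push notifications are left out, and ToggleCache is an illustrative name rather than a real SDK class.

```python
import threading
from typing import Any, Callable, Dict, Optional


class ToggleCache:
    """In-memory toggle snapshot refreshed by a background polling thread."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, Any]],
        interval_seconds: float = 30.0,
    ) -> None:
        self._fetch = fetch
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._toggles: Dict[str, Any] = {}
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def refresh(self) -> bool:
        """Fetch a new snapshot; on failure keep the last known values."""
        try:
            snapshot = self._fetch()
        except Exception:
            return False
        with self._lock:
            self._toggles = dict(snapshot)
        return True

    def get(self, toggle_id: str) -> Optional[Any]:
        """Pure in-memory lookup; never touches the network."""
        with self._lock:
            return self._toggles.get(toggle_id)

    def start(self) -> None:
        """Load once synchronously, then poll in the background."""
        self.refresh()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            self.refresh()

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

Replacing the whole snapshot under a lock means readers see either the old configuration or the new one, never a mix. What get should return for a toggle that was never fetched (None here) is exactly the fallback decision that needs to be documented and tested.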

Toggle Lifecycle and Debt Prevention

Stale toggles accumulate quickly. The lifecycle is: created → gradual rollout → 100% → archived → code deleted. Each toggle should have a planned removal date set at creation. The toggle service can alert when a toggle has been at 100% for more than 30 days without being archived — this signals toggle debt. Archived toggles should trigger a code review ticket to remove the branching logic. Toggle debt is a form of technical debt: the more live toggles, the harder it is to reason about code behavior and the more combinations need testing.
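The 30-day alert could be computed by a periodic job along these lines. The schema shown earlier has no "reached 100% at" field, so this sketch assumes updated_at approximates that moment; a real service would more likely record the timestamp explicitly. find_stale_toggles is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, List, Mapping, Sequence

STALE_AFTER = timedelta(days=30)


def find_stale_toggles(
    toggles: Sequence[Mapping[str, Any]], now: datetime
) -> List[str]:
    """Return IDs of toggles at 100% for over 30 days and not yet archived."""
    stale = []
    for toggle in toggles:
        if toggle["status"] == "archived":
            continue
        if toggle.get("rollout_percentage") != 100:
            continue
        # Assumption: updated_at marks when the toggle reached 100%.
        reached_full = datetime.fromisoformat(
            toggle["updated_at"].replace("Z", "+00:00")
        )
        if now - reached_full > STALE_AFTER:
            stale.append(toggle["toggle_id"])
    return stale


toggles = [
    {"toggle_id": "new_checkout_flow", "status": "active",
     "rollout_percentage": 100, "updated_at": "2025-10-15T09:00:00Z"},
    {"toggle_id": "fresh_feature", "status": "active",
     "rollout_percentage": 100, "updated_at": "2025-11-20T09:00:00Z"},
    {"toggle_id": "partial_rollout", "status": "active",
     "rollout_percentage": 25, "updated_at": "2025-09-01T09:00:00Z"},
    {"toggle_id": "old_banner", "status": "archived",
     "rollout_percentage": 100, "updated_at": "2025-06-01T09:00:00Z"},
]
now = datetime(2025, 12, 1, tzinfo=timezone.utc)
stale = find_stale_toggles(toggles, now)
```

With the sample data, only new_checkout_flow is reported: it has been at 100% for more than 30 days and is not archived.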

Frequently Asked Questions

How does consistent hashing ensure the same user always sees the same variant?

The SDK computes hash(featureKey + userId) mod 100 to produce a stable bucket in [0, 100), then compares it against the toggle's configured rollout threshold; any user whose bucket falls below the threshold receives the treatment variant deterministically without server-side state. Because the hash is a pure function of the user ID and feature key, the assignment is identical across every application instance and every SDK invocation.

What is a kill switch toggle and when is it used?

A kill switch is a boolean toggle that disables a feature for 100% of users in one atomic operation, typically used when a newly launched feature causes elevated error rates or latency regressions that require an immediate rollback faster than a code deployment. Unlike a gradual rollout toggle, a kill switch has no percentage or targeting rules; it is binary and designed for operational emergencies.

How are feature toggles cleaned up to prevent toggle debt?

Each toggle is created with a mandatory expiry date stored in the toggle service metadata; a CI linting job fails the build if it detects a reference to an expired toggle key in the codebase, forcing the team to either delete the toggle code path or extend the expiry with justification. This automated enforcement prevents the accumulation of dead code branches that degrade readability and increase the combinatorial complexity of testing.

How does the SDK handle toggle service unavailability?

The SDK maintains a local in-process cache of the last successfully fetched toggle configuration and continues serving evaluations from that cache during a toggle service outage, falling back to compiled-in default values if no cache entry exists. The cache is populated on startup via a synchronous bootstrap fetch with a short timeout, after which background refresh keeps it current so that a transient outage does not affect running application instances.