Why Feature Toggles?
Feature toggles (also called feature flags or feature switches) decouple code deployment from feature release. You ship code to production behind a toggle that is off, then gradually enable it for users without a new deployment. This enables continuous delivery: every commit can go to production, and business stakeholders control when features become visible. Toggles also provide an instant kill switch during incidents: one change disables a misbehaving feature without a rollback deployment.
Toggle Types
Not all toggles serve the same purpose. Mixing them into one undifferentiated list creates confusion about expected lifecycle and ownership:
- Release toggles: Hide incomplete features during development. Short-lived — should be removed once the feature reaches 100% rollout. Owned by the engineering team.
- Ops toggles (kill switches): Disable specific functionality during incidents or overload. Can be long-lived. Owned by the on-call team. Must evaluate first in the decision chain.
- Experiment toggles: Split traffic for A/B tests. Medium-lived — cleaned up after the experiment concludes and a winner is chosen. Owned by product/data teams.
- Permission toggles: Enable features for specific customer segments (enterprise tier, beta users). Long-lived — they reflect product packaging, not a transient state. Owned by product.
Toggle Schema
{
  "toggle_id": "new_checkout_flow",
  "type": "release",
  "status": "active",
  "targeting_rules": [
    {"attribute": "account_type", "operator": "eq", "value": "beta"},
    {"attribute": "country", "operator": "in", "value": ["US", "CA"]}
  ],
  "rollout_percentage": 25,
  "created_by": "eng-team@example.com",
  "updated_at": "2025-10-15T09:00:00Z"
}
The schema captures type and lifecycle context alongside the evaluation rules. The status field can be active, inactive, or archived. Archived toggles are retired, and the code paths that check them are deleted.
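As a rough sketch of how a client might load a document shaped like the schema above, the following parses the JSON into typed objects and rejects unknown enumerated values. The names `Toggle`, `Rule`, and `parse_toggle` are illustrative assumptions, and the type strings other than "release" are guesses based on the toggle types listed earlier.

```python
import json
from dataclasses import dataclass
from typing import Any

# Assumed spellings: only "release" appears in the example document.
VALID_TYPES = {"release", "ops", "experiment", "permission"}
VALID_STATUSES = {"active", "inactive", "archived"}


@dataclass(frozen=True)
class Rule:
    attribute: str
    operator: str
    value: Any


@dataclass(frozen=True)
class Toggle:
    toggle_id: str
    type: str
    status: str
    targeting_rules: tuple
    rollout_percentage: int
    created_by: str
    updated_at: str


def parse_toggle(raw: str) -> Toggle:
    """Parse one toggle document, validating the enumerated fields."""
    doc = json.loads(raw)
    if doc["type"] not in VALID_TYPES:
        raise ValueError(f"unknown toggle type: {doc['type']!r}")
    if doc["status"] not in VALID_STATUSES:
        raise ValueError(f"unknown status: {doc['status']!r}")
    pct = doc.get("rollout_percentage", 0)
    if not 0 <= pct <= 100:
        raise ValueError(f"rollout_percentage out of range: {pct}")
    rules = tuple(Rule(**r) for r in doc.get("targeting_rules", []))
    return Toggle(
        toggle_id=doc["toggle_id"],
        type=doc["type"],
        status=doc["status"],
        targeting_rules=rules,
        rollout_percentage=pct,
        created_by=doc.get("created_by", ""),
        updated_at=doc.get("updated_at", ""),
    )
```

Validating at load time means a malformed toggle is rejected when the cache refreshes rather than surfacing as a surprise during evaluation.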
Targeting Rules
Targeting rules evaluate attributes of the requesting entity (user, account, request context) against conditions. Rule structure: attribute operator value. Supported operators: eq, neq, in, not_in, starts_with, regex. Rules are evaluated in order; the first matching rule determines the outcome. If no rule matches, fall through to percentage rollout or default value.
Examples: route beta users to the new feature (account_type eq beta), restrict to specific markets (country in [US, CA, GB]), enable for enterprise accounts (plan eq enterprise). Attributes come from the evaluation context passed by the calling application — the toggle service itself is stateless with respect to user data.
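The rule structure and first-match semantics described above could be sketched as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and treating a missing attribute as "no match" is an assumption the text does not specify.

```python
import re
from typing import Any, Callable, Mapping, Optional, Sequence

# One predicate per operator named in the text: (actual, expected) -> bool.
OPERATORS: dict[str, Callable[[Any, Any], bool]] = {
    "eq": lambda actual, expected: actual == expected,
    "neq": lambda actual, expected: actual != expected,
    "in": lambda actual, expected: actual in expected,
    "not_in": lambda actual, expected: actual not in expected,
    "starts_with": lambda actual, expected: str(actual).startswith(expected),
    "regex": lambda actual, expected: re.search(expected, str(actual)) is not None,
}


def rule_matches(rule: Mapping[str, Any], context: Mapping[str, Any]) -> bool:
    """Return True if the context attribute satisfies the rule."""
    if rule["attribute"] not in context:
        return False  # assumption: a missing attribute never matches
    predicate = OPERATORS[rule["operator"]]
    return predicate(context[rule["attribute"]], rule["value"])


def first_match(
    rules: Sequence[Mapping[str, Any]], context: Mapping[str, Any]
) -> Optional[Mapping[str, Any]]:
    """Return the first rule that matches, or None to signal fall-through."""
    for rule in rules:
        if rule_matches(rule, context):
            return rule
    return None
```

With the two rules from the schema example, a context of `{"account_type": "standard", "country": "CA"}` matches the country rule, while `{"account_type": "standard", "country": "DE"}` matches nothing and falls through to percentage rollout.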
Percentage Rollout with Consistent Hashing
Percentage rollout must be consistent: the same user must always get the same toggle value, not a random flip on each request. The implementation uses deterministic hashing (often loosely called consistent hashing): hash(toggle_id + user_id) mod 100, where hash is a stable function that returns the same value in every process. If the result is less than rollout_percentage, the toggle is enabled for this user. Including toggle_id in the hash input prevents all toggles from splitting traffic at identical boundaries, which would put the same users in the early cohort of every rollout and create correlated experiment groups.
Increasing the rollout percentage from 10% to 20% enables the toggle for an additional 10% slice of users; the original 10% remain enabled because their hash values are still below the threshold. Decreasing the percentage disables the toggle for the most recently added slice, which is useful for partially rolling back a gradual rollout without a full kill.
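A minimal sketch of the bucketing step, assuming SHA-256 as the hash (the text does not name one). Python's built-in `hash()` is unsuitable here because string hashing is randomized per process by default, so two servers would disagree about the same user. The `:` separator and the function names are also assumptions.

```python
import hashlib


def bucket(toggle_id: str, user_id: str) -> int:
    """Map (toggle, user) to a stable bucket in the range 0..99."""
    # A separator avoids ambiguous concatenations such as "ab"+"c" vs "a"+"bc".
    key = f"{toggle_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce mod 100.
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(toggle_id: str, user_id: str, rollout_percentage: int) -> bool:
    """True if the user's bucket falls below the rollout percentage."""
    return bucket(toggle_id, user_id) < rollout_percentage
```

Because a user's bucket never changes, everyone enabled at 10% (buckets 0 to 9) is still enabled at 20% (buckets 0 to 19), which is the monotonic behavior described above.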
Kill Switch Behavior
Ops toggles function as kill switches: when an ops toggle is set to inactive, it overrides all other toggle types and evaluation rules. The evaluation order must enforce this: kill switch check → targeting rules → percentage rollout → default. An on-call engineer disabling a feature during an incident should not need to understand targeting rules or percentage settings — setting the toggle inactive is sufficient and immediate.
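The evaluation order can be sketched as a single function that checks each stage in turn. This sketch simplifies in two ways that are assumptions rather than statements from the text: the kill switch is modeled as the toggle's own status field, and any status other than active (including archived) is treated as off. The rule matcher and rollout check are passed in as callables so the ordering logic stays isolated; their signatures are hypothetical.

```python
from typing import Any, Callable, Mapping

Context = Mapping[str, Any]


def evaluate(
    toggle: Mapping[str, Any],
    context: Context,
    rule_matches: Callable[[Mapping[str, Any], Context], bool],
    in_rollout: Callable[[str, str, int], bool],
    default: bool = False,
) -> bool:
    """Evaluate one toggle in the order: kill switch, rules, rollout, default."""
    # 1. Kill switch: a non-active toggle is off no matter what else is configured.
    if toggle.get("status") != "active":
        return False
    # 2. Targeting rules: the first matching rule enables the toggle.
    for rule in toggle.get("targeting_rules", []):
        if rule_matches(rule, context):
            return True
    # 3. Percentage rollout, keyed on the user id from the evaluation context.
    user_id = context.get("user_id")
    percentage = toggle.get("rollout_percentage")
    if user_id is not None and percentage is not None:
        return in_rollout(toggle["toggle_id"], str(user_id), percentage)
    # 4. Nothing applied: fall back to the default.
    return default
```

Putting the status check first is what lets an on-call engineer flip one field and ignore everything below it.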
SDK Design and Client-Side Caching
The SDK runs inside the application process. At startup it fetches all toggles for the application's namespace and caches them in memory. A background thread refreshes the cache every 30 seconds via polling, or immediately on push notification from the toggle service. Toggle evaluation is a pure in-memory operation — no network call per toggle check. This is critical: toggles may be evaluated on every request, and a network call per check would add unacceptable latency and create a dependency on toggle service availability.
If the toggle service is unreachable during a refresh, the SDK serves the last known values. This fallback behavior must be documented and tested: toggles should not fail open or closed in undefined ways when the service is down.
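One way the caching and fallback behavior might look, as a sketch only. `ToggleCache` and the `fetch` callable are hypothetical: `fetch` stands in for whatever client call retrieves the namespace's toggles, and push-based refresh is omitted.

```python
import threading
from typing import Callable, Dict, Mapping, Optional


class ToggleCache:
    """In-memory toggle snapshot refreshed by a background polling thread."""

    def __init__(self, fetch: Callable[[], Mapping[str, dict]], interval_s: float = 30.0):
        self._fetch = fetch  # hypothetical callable that queries the toggle service
        self._interval_s = interval_s
        self._toggles: Dict[str, dict] = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        """Load the initial snapshot, then start polling."""
        self.refresh()
        self._thread.start()

    def stop(self) -> None:
        """Ask the polling thread to exit after its current wait."""
        self._stop.set()

    def refresh(self) -> bool:
        """Fetch a new snapshot; keep the last known values if the fetch fails."""
        try:
            snapshot = dict(self._fetch())
        except Exception:
            return False  # service unreachable: keep serving the previous snapshot
        with self._lock:
            self._toggles = snapshot
        return True

    def get(self, toggle_id: str) -> Optional[dict]:
        """Pure in-memory lookup; never touches the network."""
        with self._lock:
            return self._toggles.get(toggle_id)

    def _run(self) -> None:
        # Event.wait returns False on timeout, so this loops once per interval
        # until stop() sets the event.
        while not self._stop.wait(self._interval_s):
            self.refresh()
```

Note that if the very first fetch fails, the cache is empty and `get` returns None, so the caller still needs an explicit default for that case.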
Toggle Lifecycle and Debt Prevention
Stale toggles accumulate quickly. The lifecycle is: created → gradual rollout → 100% → archived → code deleted. Each toggle should have a planned removal date set at creation. The toggle service can alert when a toggle has been at 100% for more than 30 days without being archived — this signals toggle debt. Archived toggles should trigger a code review ticket to remove the branching logic. Toggle debt is a form of technical debt: the more live toggles, the harder it is to reason about code behavior and the more combinations need testing.
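The 30-day alert could be approximated with a periodic scan like the one below. It assumes each toggle records when it reached 100% in a `full_rollout_at` field; that field is hypothetical and does not appear in the schema shown earlier, which only has `updated_at`.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Mapping

STALE_AFTER = timedelta(days=30)


def find_stale_toggles(toggles: Iterable[Mapping], now: datetime) -> List[str]:
    """Return ids of toggles at 100% for more than 30 days and not yet archived."""
    stale = []
    for t in toggles:
        if t["status"] == "archived" or t["rollout_percentage"] < 100:
            continue
        # Hypothetical field; "Z" is rewritten because older Python versions
        # of datetime.fromisoformat do not accept it.
        reached = datetime.fromisoformat(t["full_rollout_at"].replace("Z", "+00:00"))
        if now - reached > STALE_AFTER:
            stale.append(t["toggle_id"])
    return stale


if __name__ == "__main__":
    example = [{
        "toggle_id": "new_checkout_flow",
        "status": "active",
        "rollout_percentage": 100,
        "full_rollout_at": "2025-10-15T09:00:00Z",
    }]
    print(find_stale_toggles(example, datetime.now(timezone.utc)))
```

A real check would likely exempt long-lived types such as ops and permission toggles, since staying enabled indefinitely is expected for them.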