Feature flags (feature toggles) decouple code deployment from feature release. Code ships to production with a feature disabled; the flag is enabled when ready — without a new deployment. This enables dark launches (test with real traffic before release), gradual rollouts (enable for 1%, 10%, 100% of users), A/B testing (different behavior for different user cohorts), and instant kill switches (disable a broken feature in seconds without a rollback deployment).
Flag Storage and Evaluation
Flags are stored in a central configuration store (database, Redis, or a dedicated service like LaunchDarkly or GrowthBook). Flag evaluation: given a flag name and a user context (user_id, email, account_type, region), return true or false. Evaluation must be fast (< 1ms) since it may happen on every request. Client-side evaluation: the SDK downloads the full flag ruleset at startup and evaluates locally (no network call per evaluation). Server-side evaluation: the SDK calls the flag service per evaluation — slower but allows real-time rule changes without SDK restart.
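The client-side evaluation path above can be sketched as a local lookup plus a deterministic hash, with no network call per evaluation. This is a minimal illustration, not any vendor's SDK; the flag names and ruleset shape are invented for the example.

```python
import hashlib

# Ruleset downloaded once at startup from the flag service (shape is
# illustrative). Each evaluation is then a local dict lookup plus a hash.
RULESET = {
    "new-checkout": {"enabled": True, "rollout_pct": 10},
}

def bucket(user_id: str, flag_name: str) -> int:
    """Deterministic bucket in [0, 100): same (user, flag) -> same bucket."""
    digest = hashlib.sha256(f"{user_id}:{flag_name}".encode()).hexdigest()
    return int(digest, 16) % 100

def evaluate(flag_name: str, user_id: str) -> bool:
    flag = RULESET.get(flag_name)
    if flag is None or not flag["enabled"]:
        return False  # fail-safe default for unknown or disabled flags
    return bucket(user_id, flag_name) < flag["rollout_pct"]
```

Because the only per-evaluation work is a hash and a comparison, this comfortably meets a sub-millisecond budget.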
Targeting Rules
Targeting rules control who sees a feature. Rule types: user allowlist (enable for specific user IDs — internal testers), attribute match (enable for users where account_type="enterprise"), percentage rollout (enable for X% of users, determined by hash(user_id + flag_name) mod 100 — ensures sticky assignment: the same user always gets the same value), and environment (enable in staging, disable in production). Rules are evaluated in order; the first matching rule determines the flag value. A default rule (fallback) applies when no targeting rules match.
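The ordered, first-match-wins evaluation can be sketched as below. The rule schema (allowlist, attribute, percentage) is invented for illustration and does not match any particular vendor's format.

```python
import hashlib

def pct_bucket(user_id: str, flag_name: str) -> int:
    """Deterministic bucket in [0, 100) from hash(user_id + flag_name)."""
    return int(hashlib.sha256(f"{user_id}{flag_name}".encode()).hexdigest(), 16) % 100

# Rules evaluated top to bottom; first match wins, then the default applies.
RULES = [
    {"type": "allowlist", "ids": {"u42", "u99"}, "value": True},              # internal testers
    {"type": "attribute", "attr": "account_type", "eq": "enterprise", "value": True},
    {"type": "percentage", "pct": 25, "value": True},                         # 25% rollout
]
DEFAULT = False

def evaluate(flag_name: str, ctx: dict, rules=RULES, default=DEFAULT) -> bool:
    for rule in rules:
        if rule["type"] == "allowlist" and ctx["user_id"] in rule["ids"]:
            return rule["value"]
        if rule["type"] == "attribute" and ctx.get(rule["attr"]) == rule["eq"]:
            return rule["value"]
        if rule["type"] == "percentage" and pct_bucket(ctx["user_id"], flag_name) < rule["pct"]:
            return rule["value"]
    return default  # fallback when no targeting rule matches
```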
Gradual Rollout
Gradual rollout reduces risk by exposing a new feature to increasing percentages of users over time. Start at 1% — monitor error rate, latency, and business metrics. Increase to 5%, 10%, 25%, 50%, 100% if metrics remain healthy. Use hash-based assignment (hash(user_id + flag_name) mod 100, consistent with targeting rules) to ensure users have a consistent experience throughout the rollout — a user in the 1% cohort stays in the feature as the rollout increases to 100%. If metrics degrade, reduce the percentage instantly. Rollback is a config change (seconds) rather than a deployment (minutes to hours).
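The stickiness property has a simple mechanical explanation: a user whose bucket is below 1 is also below 5, 25, and 100, so raising the percentage only ever adds users, never removes or reshuffles them. A small sketch (the hash here is salted with the user ID only, for brevity):

```python
import hashlib

def bucket(user_id: str) -> int:
    """Deterministic bucket in [0, 100) per user."""
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100

def in_rollout(user_id: str, pct: int) -> bool:
    return bucket(user_id) < pct

# Every user enabled at the 1% stage remains enabled at every later stage.
stages = [1, 5, 10, 25, 50, 100]
early_cohort = [u for u in (f"user-{i}" for i in range(10_000)) if in_rollout(u, 1)]
assert all(in_rollout(u, p) for u in early_cohort for p in stages)
```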
Flag Types
Boolean flags: on/off. Simplest; use for feature gates. String flags: return a string value (e.g., button_color: "blue" vs "green" for A/B testing). Number flags: return a numeric value (e.g., timeout_ms: 1000 vs 2000). JSON flags: return a structured object (e.g., feature configuration with multiple parameters). Multivariate flags: A/B/n testing with multiple variants and traffic allocation per variant. Use the simplest flag type that meets the need; JSON flags are powerful but harder to audit and reason about.
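Multivariate allocation extends the percentage-bucket idea: variants carry traffic weights summing to 100, and the user's bucket selects a variant deterministically. A sketch, with invented variant names and weights:

```python
import hashlib

# Illustrative multivariate flag: weights must sum to 100.
VARIANTS = [("control", 50), ("blue", 25), ("green", 25)]

def bucket(user_id: str, flag_name: str) -> int:
    return int(hashlib.sha256(f"{user_id}:{flag_name}".encode()).hexdigest(), 16) % 100

def variant(flag_name: str, user_id: str) -> str:
    """Walk the cumulative weight ranges; the user's bucket picks a variant."""
    b = bucket(user_id, flag_name)
    cumulative = 0
    for name, weight in VARIANTS:
        cumulative += weight
        if b < cumulative:
            return name
    return VARIANTS[0][0]  # unreachable when weights sum to 100
```

Assignment is sticky for the same reason as boolean rollouts: the bucket depends only on (user_id, flag_name).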
Flag Lifecycle and Technical Debt
Flags accumulate over time and become technical debt. Each flag is a code branch that must be maintained. Establish a lifecycle: short-lived flags (release flags, experiment flags) are removed after the feature is fully rolled out or the experiment concludes — typically within 1-4 weeks. Long-lived flags (ops toggles, permission flags) persist indefinitely but are documented as intentional. Assign an owner and expiry date to every flag at creation. Add lint checks that fail the build if a flag has been 100%-enabled for more than N days without code removal.
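The expiry-and-lint discipline can be automated with a small CI check. The registry format below (owner, expires, fully_enabled_since) is hypothetical, standing in for wherever flag metadata actually lives:

```python
from datetime import date

MAX_STALE_DAYS = 30  # flags 100%-enabled longer than this must be removed from code

def stale_flags(registry: dict, today: date) -> list:
    """Return (flag, owner) pairs that are past expiry or fully enabled too long."""
    stale = []
    for name, meta in registry.items():
        enabled_since = meta.get("fully_enabled_since")
        if enabled_since and (today - enabled_since).days > MAX_STALE_DAYS:
            stale.append((name, meta["owner"]))
        elif meta.get("expires") and today > meta["expires"]:
            stale.append((name, meta["owner"]))
    return stale

# Example: a flag fully enabled since December, checked in March.
registry = {
    "new-checkout": {"owner": "payments", "expires": date(2024, 1, 1),
                     "fully_enabled_since": date(2023, 12, 1)},
}
problems = stale_flags(registry, today=date(2024, 3, 1))
# a CI wrapper would fail the build (exit nonzero) when `problems` is non-empty
```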
Bootstrapping and SDK Initialization
On application startup, the flag SDK downloads the current ruleset from the flag service. Until initialization completes, flags must have a default value (fail-safe). For server-rendered pages, the SDK should block startup until flags are loaded (or time out after 500ms and use defaults). For client-side SDKs, bootstrap the SDK with server-side flag values embedded in the HTML to prevent a flash of default content while the SDK initializes. Cache the ruleset in Redis with a short TTL (30-60 seconds) so changes propagate quickly without the SDK polling continuously.
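The block-with-timeout startup pattern can be sketched as follows. fetch_ruleset is a hypothetical placeholder for the SDK's download call; the 500 ms budget and default values are from the text above.

```python
import concurrent.futures

# Compiled-in fail-safe defaults, served if the flag service is unreachable.
DEFAULTS = {"new-checkout": False}

def fetch_ruleset() -> dict:
    # Placeholder for the HTTP call to the flag service.
    return {"new-checkout": True}

def load_flags(timeout_s: float = 0.5) -> dict:
    """Block startup until flags load, or fall back to defaults on timeout."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch_ruleset)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return dict(DEFAULTS)  # fail-safe: serve defaults, let the SDK retry later
    finally:
        pool.shutdown(wait=False)  # don't let a hung fetch block startup
```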
Metrics and Experimentation
Feature flags power A/B testing: assign users to control (flag off) and treatment (flag on) cohorts using hash-based assignment. Log flag exposures: record (user_id, flag_name, variant, timestamp) for every evaluation. Join flag exposures with business metric events (purchase, signup, engagement) to measure the effect of the feature. Use a statistics framework (frequentist t-test, Bayesian analysis) to determine if observed metric differences are statistically significant. Minimum detectable effect and sample size calculations determine how long the experiment must run before drawing conclusions.
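The exposure-join-test pipeline can be illustrated end to end with a two-proportion z-test (one common frequentist choice). The in-memory data here is fabricated for the example; a real pipeline would join exposure and conversion event logs in a warehouse.

```python
import math

# Fabricated exposure log: user_id -> variant, recorded at evaluation time.
exposures = {
    **{f"c{i}": "control" for i in range(1000)},
    **{f"t{i}": "treatment" for i in range(1000)},
}
# Fabricated conversion events: 10% control, 13% treatment converted.
conversions = {f"c{i}" for i in range(100)} | {f"t{i}" for i in range(130)}

def conversion_rate(variant_name: str):
    users = [u for u, v in exposures.items() if v == variant_name]
    converted = sum(1 for u in users if u in conversions)
    return converted, len(users)

def z_score() -> float:
    """Two-proportion z-test on treatment vs control conversion rates."""
    x1, n1 = conversion_rate("control")
    x2, n2 = conversion_rate("treatment")
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se
```

With these numbers the z-score exceeds 1.96, so the difference would be significant at the 5% level; in practice the test should only be read after the precomputed sample size is reached, to avoid peeking bias.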