Question 1

Why use MD5 hash bucketing instead of a random assignment for percentage rollouts?

Accepted Answer

Random assignment (Math.random() < 0.1 for a 10% rollout) is non-deterministic: a user who loads the page twice in the same session may get different results, for example a green button on first load and a blue one on refresh. This "flickering" is a poor user experience and pollutes experiment metrics, because the same user is counted in both groups. MD5(feature_key + user_id) % 100 is deterministic: the same user_id always hashes to the same bucket for the same feature_key. Because feature_key is part of the hash input, a user's bucket for one flag does not determine their bucket for another. No database lookup is needed to assign the bucket; the hash is computed in memory in microseconds. MD5 is used here for its even distribution, not for security. Sticky assignment follows directly: a user is enabled when bucket < rollout_pct, so a user in bucket 9 who was included at 10% (buckets 0-9) is still included at 20% (buckets 0-19). As long as rollout_pct only increases, no enabled user is "demoted" out of a rollout.
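A minimal sketch of the bucketing described above, in Python. The function names bucket and in_rollout are illustrative, and the ":" separator between feature_key and user_id is an assumption added here so that distinct (feature_key, user_id) pairs cannot concatenate to the same string; the answer itself only specifies MD5(feature_key + user_id) % 100.

```python
import hashlib


def bucket(feature_key: str, user_id: str) -> int:
    """Map (feature_key, user_id) to a stable bucket in the range 0..99."""
    # Assumption: a ":" separator keeps ("ab", "c") and ("a", "bc") distinct.
    raw = f"{feature_key}:{user_id}".encode("utf-8")
    digest = hashlib.md5(raw).hexdigest()
    return int(digest, 16) % 100


def in_rollout(feature_key: str, user_id: str, rollout_pct: int) -> bool:
    """A user is enabled when their bucket falls below the rollout percentage."""
    return bucket(feature_key, user_id) < rollout_pct


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    at_10 = {u for u in users if in_rollout("new_checkout", u, 10)}
    at_20 = {u for u in users if in_rollout("new_checkout", u, 20)}
    # Everyone enabled at 10% is still enabled at 20%.
    print(at_10 <= at_20, len(at_10), len(at_20))
```

Because the comparison is bucket < rollout_pct, raising the percentage only ever adds buckets, which is what makes assignment sticky.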

Question 2

How do you implement a kill switch that takes effect within seconds, not minutes?

Accepted Answer

The Redis cache has a 60-second TTL, so if only the database is updated, setting rollout_pct=0 can take up to 60 seconds to propagate. For an emergency kill switch, where a bug is causing production errors: (1) write the change to Postgres (durable); (2) immediately delete the Redis cache key (redis.delete("flag:feature_key")). Order matters: the write must be committed before the delete, otherwise a concurrent reader can re-cache the old value. The next evaluation for any user misses the cache, reads Postgres (status='off'), and re-populates the cache with the new value. The kill switch is therefore effective after one cache-miss round trip, typically under 100 ms. If application servers also keep a local in-process flag cache, deleting the Redis key does not reach it: use Redis pub/sub to broadcast a "flag_invalidated" message, and have each server subscribe and purge its local copy on receipt. That gives sub-second propagation across the fleet. Pub/sub delivery is fire-and-forget, so a short TTL on the local cache remains the backstop for a server that misses the message.
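The sequence above can be modelled without any infrastructure. The sketch below is an in-memory stand-in, not a client for real services: the FlagStore class, its db and cache dictionaries, and the listeners list are all hypothetical names, playing the roles of Postgres, Redis, and pub/sub subscribers respectively. It shows why a database-only update stays hidden until the TTL expires, and why write-then-delete takes effect on the next read.

```python
import time

CACHE_TTL_SECONDS = 60  # the TTL stated in the answer


class FlagStore:
    """In-memory model: `db` plays Postgres, `cache` plays Redis."""

    def __init__(self):
        self.db = {}         # feature_key -> config dict (source of truth)
        self.cache = {}      # cache key -> (config, expires_at)
        self.listeners = []  # callbacks standing in for pub/sub subscribers

    def get_flag(self, feature_key, now=None):
        now = time.time() if now is None else now
        key = f"flag:{feature_key}"
        hit = self.cache.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]                    # cache hit, possibly stale
        config = dict(self.db[feature_key])  # cache miss: read the source of truth
        self.cache[key] = (config, now + CACHE_TTL_SECONDS)
        return config

    def kill_switch(self, feature_key):
        # 1. Durable write first, so any later cache miss sees the new value.
        self.db[feature_key] = {"status": "off", "rollout_pct": 0}
        # 2. Drop the shared cache entry (the redis.delete step in the answer).
        self.cache.pop(f"flag:{feature_key}", None)
        # 3. Notify every server so it can purge its local in-process copy.
        for notify in self.listeners:
            notify(feature_key)


if __name__ == "__main__":
    store = FlagStore()
    store.db["new_checkout"] = {"status": "on", "rollout_pct": 50}
    store.get_flag("new_checkout", now=0)         # warms the cache
    store.kill_switch("new_checkout")
    print(store.get_flag("new_checkout", now=1))  # reads the "off" config
```

In a real deployment, steps 2 and 3 would be a key delete and a publish on the Redis client in use; the exact calls depend on that client.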

Question 3

How do override rules interact with percentage rollouts for canary deployments?

Accepted Answer

Override priority: user override > org override > country override > plan override > rollout bucket. The first matching override wins, and the rollout bucket is consulted only when no override matches. This enables the canonical canary pattern: (1) enable for the engineering team via org override (enabled=True for org_id=engineering_org); (2) enable for beta users via user overrides; (3) once validated, advance to a 1% rollout for general users. Beta users and engineers always see the feature regardless of rollout_pct, because they are explicitly overridden. Everyone else gets the deterministic bucket treatment. As a result, rollout_pct=1 can correspond to a noticeably higher effective coverage (say 5%) once overridden beta users and orgs are counted. Track "actual reach" separately in the exposure log: SELECT COUNT(DISTINCT user_id) FROM FlagExposureLog WHERE feature_key='X' AND enabled=TRUE AND date=CURRENT_DATE.
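The priority chain above can be sketched as a single evaluation function. The flag and context shapes used here (a dict with feature_key, status, rollout_pct and an overrides mapping keyed by (scope, value); a context with user_id, org_id, country, plan) are assumptions for illustration, as is the early return when status is 'off'. The bucket helper repeats the MD5 scheme from Question 1 with an assumed ":" separator.

```python
import hashlib

# Highest priority first: (override scope, key to read from the context).
OVERRIDE_ORDER = (
    ("user", "user_id"),
    ("org", "org_id"),
    ("country", "country"),
    ("plan", "plan"),
)


def bucket(feature_key: str, user_id: str) -> int:
    """Deterministic bucket in 0..99 (the ':' separator is an assumption)."""
    raw = f"{feature_key}:{user_id}".encode("utf-8")
    return int(hashlib.md5(raw).hexdigest(), 16) % 100


def is_enabled(flag: dict, ctx: dict) -> bool:
    # Assumption: a killed flag (status 'off') beats every override.
    if flag["status"] == "off":
        return False
    overrides = flag.get("overrides", {})
    for scope, ctx_key in OVERRIDE_ORDER:
        value = ctx.get(ctx_key)
        if value is not None and (scope, value) in overrides:
            return overrides[(scope, value)]  # first matching override wins
    # No override matched: fall back to the percentage rollout.
    return bucket(flag["feature_key"], ctx["user_id"]) < flag["rollout_pct"]


if __name__ == "__main__":
    flag = {
        "feature_key": "new_checkout",
        "status": "on",
        "rollout_pct": 1,
        "overrides": {("org", "engineering_org"): True, ("user", "u-beta"): True},
    }
    print(is_enabled(flag, {"user_id": "u-1", "org_id": "engineering_org"}))
```

An override can also carry enabled=False, which lets a specific user be excluded even when their org is included.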

Question 4

How do you run a gradual rollout that automatically advances based on error rate?

Accepted Answer

Automated progressive delivery: a background job monitors error metrics and advances or halts the rollout. Every 15 minutes, the job queries the error rate of the enabled bucket and compares it with the control bucket: SELECT SUM(error_count)::float / NULLIF(SUM(request_count), 0) AS error_rate FROM RequestMetrics WHERE feature_flag='X' AND enabled=TRUE AND created_at > NOW() - INTERVAL '15 minutes'. The cast avoids integer division and NULLIF avoids division by zero when the window has no traffic; the same query with enabled=FALSE gives the control rate. If error_rate < threshold (e.g. 0.1%, against a baseline below 0.05%), advance rollout_pct by 10 percentage points. If error_rate > kill_threshold (e.g. 1%), call kill_switch() immediately. Between the two thresholds, hold at the current percentage and re-check on the next run. This is the canary analysis pattern from the Kubernetes ecosystem, where Argo Rollouts and Flagger implement it for deployments. In a bespoke system, the automation job is a simple cron process that reads flag config, reads metrics, and calls set_rollout(). Alert the on-call engineer on any automated kill switch.
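The decision step of that job can be isolated as a pure function, which keeps it easy to test apart from the metrics query and the cron wiring. This is a sketch under stated assumptions: the name next_action, its parameters, and the min_requests guard (hold when the window has too little traffic to judge) are not from the answer; the default thresholds and the 10-point step are the example values it gives.

```python
def next_action(
    error_count: int,
    request_count: int,
    rollout_pct: int,
    *,
    advance_below: float = 0.001,  # 0.1%, the example advance threshold
    kill_above: float = 0.01,      # 1%, the example kill threshold
    step: int = 10,                # percentage points per advance
    min_requests: int = 100,       # assumption: minimum traffic to decide
):
    """Return (action, new_rollout_pct) for one evaluation window."""
    # Too little traffic: no decision (this also avoids dividing by zero).
    if request_count < max(min_requests, 1):
        return ("hold", rollout_pct)
    error_rate = error_count / request_count
    if error_rate > kill_above:
        return ("kill", 0)
    if error_rate < advance_below:
        return ("advance", min(100, rollout_pct + step))
    # Between the thresholds: stay put and re-check on the next run.
    return ("hold", rollout_pct)


if __name__ == "__main__":
    print(next_action(5, 10_000, 10))    # 0.05% error rate
    print(next_action(50, 10_000, 10))   # 0.5% error rate
    print(next_action(200, 10_000, 10))  # 2% error rate
```

The cron process would feed this function the counts from the metrics query, then call set_rollout() or kill_switch() according to the returned action.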

Question 5

How do you clean up stale feature flags that were shipped 100% and never removed?

Accepted Answer

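Flags accumulate: a codebase with 6 months of development can have 200+ flags, most of which are shipped (status=on) with the flag checks never cleaned up. These dead flags add evaluation overhead and confusion. Cleanup process: (1) in code, replace each is_enabled('flag_key', ctx) call site with the hardcoded True/False once the flag is fully shipped, and delete the branch that can no longer run; (2) delete the FeatureOverride rows; (3) delete the Feature row; (4) delete the Redis cache entry. Deploy the code change in step 1 before deleting any rows, so no running code evaluates a flag that no longer exists. Automate detection: any flag with status=on and no evaluation events in the last 30 days is a candidate for removal. Report: SELECT f.feature_key FROM Feature f WHERE f.status='on' AND NOT EXISTS (SELECT 1 FROM FlagExposureLog e WHERE e.feature_key = f.feature_key AND e.evaluated_at > NOW() - INTERVAL '30 days'). NOT EXISTS is used instead of NOT IN so that a NULL feature_key in the log cannot empty the result. This report only finds flags that are no longer evaluated; a flag that is still evaluated but has sat at 100% for a long time needs a separate check. Alert the team that owns the flag.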
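The detection report can be tried locally with an in-memory SQLite database. This is a sketch, not the production query: it uses sqlite3 from the standard library as a stand-in for Postgres, stores evaluated_at as ISO-8601 text, and passes the cutoff as a parameter because SQLite has no INTERVAL syntax. The function name stale_flags is illustrative, and the two-column table shapes are assumptions reduced to what the report needs.

```python
import sqlite3
from datetime import datetime, timedelta


def stale_flags(conn: sqlite3.Connection, now: datetime, max_idle_days: int = 30):
    """Return feature_keys that are 'on' but have no recent exposure rows."""
    # ISO-8601 strings in one consistent format sort chronologically as text.
    cutoff = (now - timedelta(days=max_idle_days)).isoformat()
    rows = conn.execute(
        """
        SELECT f.feature_key
        FROM Feature f
        WHERE f.status = 'on'
          AND NOT EXISTS (
              SELECT 1
              FROM FlagExposureLog e
              WHERE e.feature_key = f.feature_key
                AND e.evaluated_at > ?
          )
        ORDER BY f.feature_key
        """,
        (cutoff,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Feature (feature_key TEXT PRIMARY KEY, status TEXT)")
    conn.execute("CREATE TABLE FlagExposureLog (feature_key TEXT, evaluated_at TEXT)")
    now = datetime(2024, 6, 30, 12, 0, 0)
    conn.executemany(
        "INSERT INTO Feature VALUES (?, ?)",
        [("new_checkout", "on"), ("old_banner", "on"), ("dark_mode", "off")],
    )
    conn.executemany(
        "INSERT INTO FlagExposureLog VALUES (?, ?)",
        [
            ("new_checkout", (now - timedelta(days=1)).isoformat()),
            ("old_banner", (now - timedelta(days=45)).isoformat()),
        ],
    )
    print(stale_flags(conn, now))
```

A flag with no exposure rows at all is also reported, which matches the intent: nothing has evaluated it recently.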

Feature Rollout System Low-Level Design: Flag Evaluation, Percentage Rollout, and Kill Switches

Feature Rollout System: Low-Level Design

Core Data Model

Flag Evaluation Algorithm

Gradual Rollout API

SDK Usage in Application Code

Observability: Exposure Logging

Key Design Decisions