Question 1

How do you define degradation tiers and what should each tier preserve?

Accepted Answer

Design a tiered degradation ladder before any incident occurs. Example for an e-commerce site:

Tier 0 (full): all features operational.

Tier 1 (recommendations down): product search, catalog, and checkout remain functional; recommendations are replaced with bestsellers.

Tier 2 (search degraded): return cached search results or pre-computed top results; no real-time inventory check.

Tier 3 (catalog from cache): serve product pages entirely from cache; disable add-to-cart if the inventory service is down.

Tier 4 (static fallback): serve a static maintenance page with the phone number for placing orders.

Define each tier in a runbook with: what is disabled, what the user impact is, which service failure triggers it, and how to recover. The key principle: each tier should preserve the single most important user action for as long as possible. For e-commerce that is completing a purchase, which is why even the Tier 4 static page still offers a phone number for placing orders.
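One way to keep the runbook and the running system in agreement is to encode the ladder as data. The sketch below is only an illustration: the dependency names ("recommendations", "search", "catalog", "cache") and the recovery wording are assumptions, not part of the answer above, and a real system would derive the set of failed dependencies from health checks or circuit breaker state.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable


class Tier(IntEnum):
    FULL = 0
    NO_RECOMMENDATIONS = 1
    SEARCH_DEGRADED = 2
    CATALOG_FROM_CACHE = 3
    STATIC_FALLBACK = 4


@dataclass(frozen=True)
class RunbookEntry:
    trigger: str      # dependency whose failure activates the tier (assumed name)
    disabled: str     # what is switched off
    user_impact: str  # what the user sees
    recovery: str     # illustrative wording only


RUNBOOK = {
    Tier.NO_RECOMMENDATIONS: RunbookEntry(
        "recommendations", "personalised recommendations",
        "bestsellers shown instead", "restore service, confirm breaker closed"),
    Tier.SEARCH_DEGRADED: RunbookEntry(
        "search", "live search and real-time inventory check",
        "cached or pre-computed results", "restore search, warm the index"),
    Tier.CATALOG_FROM_CACHE: RunbookEntry(
        "catalog", "live product data",
        "product pages served from cache", "restore catalog, purge stale pages"),
    Tier.STATIC_FALLBACK: RunbookEntry(
        "cache", "the whole storefront",
        "static page with phone number for orders", "restore cache, step down tiers"),
}


def current_tier(failed: Iterable[str]) -> Tier:
    """Return the most degraded tier whose trigger is among the failed dependencies."""
    failed_set = set(failed)
    active = [tier for tier, entry in RUNBOOK.items() if entry.trigger in failed_set]
    return max(active, default=Tier.FULL)
```

Because the tiers are ordered integers, several simultaneous failures resolve to the most degraded tier, and the same table can be rendered into the human-readable runbook.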

Question 2

How does the circuit breaker pattern enable automatic graceful degradation?

Accepted Answer

The circuit breaker wraps calls to a dependency and tracks the failure rate over a rolling window. When failures exceed a threshold (e.g., 50% of calls in 60 seconds fail), the breaker OPENS: subsequent calls fail immediately without attempting the network call, returning a cached result or an error response instead. After a configurable timeout (e.g., 30 seconds), the breaker enters HALF-OPEN and allows one test call. If it succeeds, the breaker CLOSES and normal operation resumes. If it fails, the breaker returns to OPEN and the timeout starts again. This prevents cascading failures: without a circuit breaker, slow or failing dependency calls hold threads and connections, exhaust the thread pool, and take down the caller. The circuit breaker makes degradation explicit and automatic instead of letting the failure cascade silently.
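The state machine described above can be sketched in a few dozen lines. This is a minimal, single-threaded illustration rather than a production implementation: the class name, the `min_calls` guard (which avoids tripping on the very first failure), and the injectable `clock` are assumptions added for the example, and a real breaker would also need locking and a way to limit concurrent trial calls.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, window_seconds=60.0,
                 open_seconds=30.0, min_calls=10, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.open_seconds = open_seconds
        self.min_calls = min_calls  # assumed guard: need this many samples to trip
        self._clock = clock
        self.state = "CLOSED"
        self._calls = deque()       # (timestamp, succeeded) within the window
        self._opened_at = 0.0

    def call(self, fn, fallback):
        now = self._clock()
        if self.state == "OPEN":
            if now - self._opened_at < self.open_seconds:
                return fallback()     # fail fast: the dependency is not called
            self.state = "HALF_OPEN"  # timeout elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self._on_failure(now)
            return fallback()
        self._on_success(now)
        return result

    def _on_success(self, now):
        if self.state == "HALF_OPEN":
            self.state = "CLOSED"     # trial call succeeded: resume normal operation
            self._calls.clear()
        self._calls.append((now, True))
        self._trim(now)

    def _on_failure(self, now):
        if self.state == "HALF_OPEN":
            self._trip(now)           # trial call failed: back to OPEN, timer restarts
            return
        self._calls.append((now, False))
        self._trim(now)
        failures = sum(1 for _, ok in self._calls if not ok)
        enough = len(self._calls) >= self.min_calls
        if enough and failures / len(self._calls) > self.failure_threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = "OPEN"
        self._opened_at = now
        self._calls.clear()

    def _trim(self, now):
        # Drop samples that have fallen out of the rolling window.
        while self._calls and now - self._calls[0][0] > self.window_seconds:
            self._calls.popleft()
```

A caller would wrap the dependency as `breaker.call(lambda: fetch_recommendations(user), lambda: bestsellers)`, where both callables are hypothetical; the fallback runs whenever the call fails or the breaker is open.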

Question 3

How do you use feature flags as kill switches for graceful degradation?

Accepted Answer

Every non-critical feature should have a feature flag that can be disabled in under 30 seconds without a deployment. Store flags in a configuration service (LaunchDarkly, Unleash, or a simple Redis key). The application checks the flag on each request: if recommendations_enabled is false, skip the recommendation service call and return an empty array. During an incident, an on-call engineer flips the flag, and all instances see the change within seconds (for example, within about 10 seconds if each instance caches the Redis value locally with a 10-second TTL). This is faster and safer than deploying a code change under incident pressure. Flags provide targeted degradation: disable only the affected feature, not the entire application. Maintain a canonical list of kill switches together with the expected fallback behavior of each.
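A per-request flag check with a short local cache might look like the sketch below. The `KillSwitches` class and its `fetch` callable are hypothetical: `fetch` stands in for whatever reads the flag from the configuration service (a Redis GET, for instance), and the choice to fall back to the last known value when the flag store itself is unreachable is an assumption, not something the answer above prescribes.

```python
import time


class KillSwitches:
    def __init__(self, fetch, ttl_seconds=10.0, clock=time.monotonic):
        self._fetch = fetch       # callable: flag name -> "true"/"false" or None
        self._ttl = ttl_seconds   # how long an instance trusts its local copy
        self._clock = clock
        self._cache = {}          # flag name -> (fetched_at, value)

    def is_enabled(self, name, default=True):
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        try:
            raw = self._fetch(name)
        except Exception:
            # Flag store unreachable: keep the last known value, else the default.
            return cached[1] if cached is not None else default
        value = default if raw is None else raw.lower() == "true"
        self._cache[name] = (now, value)
        return value


def get_recommendations(user_id, flags, fetch_recs):
    # Kill switch off: skip the service call entirely and return an empty list.
    if not flags.is_enabled("recommendations_enabled"):
        return []
    return fetch_recs(user_id)
```

With a 10-second TTL, a flipped flag reaches every instance within roughly 10 seconds, and the flag store is read at most once per flag per TTL per instance rather than on every request.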

Question 4

How do you serve stale cache content when the origin is down?

Accepted Answer

Set Cache-Control: stale-if-error=3600 on API responses. Caches that implement this extension (defined in RFC 5861), which includes several major CDNs, may then serve a stale cached copy for up to 1 hour when the origin returns a 5xx error such as 500, 502, 503, or 504. Browser support is limited, so check your CDN's documentation and do not rely on this client-side. For internal service caches (Redis): when an entry has expired and the refresh call to the upstream service fails, return the last-known cached value rather than propagating the error. To make this possible, store cache entries with two TTLs: a primary TTL (normal expiry, e.g., 5 minutes) and an extended stale TTL (e.g., 24 hours). In Redis this typically means setting the key's expiry to the extended TTL and recording the primary expiry inside the stored value. On primary TTL expiry, attempt to refresh; if the refresh fails, serve the stale value until the extended TTL expires. This makes the service resilient to brief upstream outages. Alert separately on stale cache hits so engineers know degradation is occurring.
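The two-TTL scheme can be illustrated with an in-memory cache. This is a sketch under stated assumptions: `StaleOnErrorCache`, `loader`, and the `stale_hits` counter are names invented for the example, and a dictionary stands in for Redis.

```python
import time
from dataclasses import dataclass
from typing import Any


@dataclass
class Entry:
    value: Any
    fresh_until: float  # primary TTL: serve without contacting the upstream
    stale_until: float  # extended TTL: serve only if a refresh fails


class StaleOnErrorCache:
    def __init__(self, loader, fresh_ttl=300.0, stale_ttl=86400.0,
                 clock=time.monotonic):
        self._loader = loader    # callable: key -> value, raises when upstream is down
        self._fresh_ttl = fresh_ttl
        self._stale_ttl = stale_ttl
        self._clock = clock
        self._entries = {}
        self.stale_hits = 0      # export as a metric and alert on it

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry.fresh_until:
            return entry.value   # still fresh: no upstream call
        try:
            value = self._loader(key)
        except Exception:
            if entry is not None and now < entry.stale_until:
                self.stale_hits += 1
                return entry.value  # refresh failed: serve the stale copy
            raise                   # nothing usable left: propagate the error
        self._entries[key] = Entry(value, now + self._fresh_ttl,
                                   now + self._stale_ttl)
        return value
```

Counting stale hits separately from ordinary hits is what makes the degradation visible, since from the caller's point of view a stale response looks like a success.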

Question 5

How do you monitor that degradation is happening and recovery has occurred?

Accepted Answer

Instrument each fallback path explicitly. When the recommendations service is down and you return bestsellers, emit a metric: degradation.recommendations.fallback_count++. Dashboard these metrics alongside error rates. Two signals matter: (1) Fallback activation rate: if the recommendations fallback is firing on >1% of requests, something is wrong even if users are not seeing errors. (2) Recovery detection: monitor the circuit breaker state per dependency. Alert when a circuit breaker OPENS (degradation started) and when it CLOSES (recovery confirmed). Without explicit fallback metrics, degradation can persist undetected: users see a degraded experience, no errors are logged (because the fallback succeeded), and the on-call team does not know a dependency is down.
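As a rough illustration of the first signal, the sketch below counts fallback activations per feature and compares the rate with the 1% threshold. The `FallbackMetrics` class is hypothetical and keeps counters in memory; a real service would send the same counters to its metrics backend and evaluate the threshold in the alerting system over a time window.

```python
from collections import Counter


class FallbackMetrics:
    def __init__(self):
        self.counters = Counter()

    def run(self, feature, primary, fallback):
        """Call primary(); on failure, record the fallback and call fallback()."""
        self.counters[feature + ".requests"] += 1
        try:
            return primary()
        except Exception:
            # The request still succeeds for the user, so this counter is the
            # only trace that degradation happened.
            self.counters["degradation." + feature + ".fallback_count"] += 1
            return fallback()

    def fallback_rate(self, feature):
        total = self.counters[feature + ".requests"]
        if total == 0:
            return 0.0
        return self.counters["degradation." + feature + ".fallback_count"] / total

    def should_alert(self, feature, threshold=0.01):
        # Fires when the fallback handles more than 1% of requests by default.
        return self.fallback_rate(feature) > threshold
```

The same pattern extends to the second signal by incrementing a counter, or emitting an event, on each circuit breaker state transition.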

Graceful Degradation Low-Level Design

Degradation Tiers

Fallback Chain Implementation

Feature Flags for Degradation Control

Read-Through Cache as Degradation Buffer

Degradation vs Error Monitoring

Key Interview Points