Graceful Degradation Low-Level Design

Graceful degradation keeps a system functional when individual components fail, returning reduced-quality responses rather than errors. It is the implementation layer beneath circuit breakers and fallbacks: what do you actually return when the recommendation service is down? This design question comes up at Netflix, Amazon, and any company operating complex distributed systems.

Degradation Tiers

Define explicit tiers of response quality per feature:

Feature: Product recommendations
  Tier 1 (full): personalized ML recommendations from recommendation service
  Tier 2 (degraded): cached recommendations from 1 hour ago (Redis)
  Tier 3 (minimal): top-selling items from database (static query)
  Tier 4 (none): empty recommendations section (hide the widget entirely)

Feature: User profile
  Tier 1: full profile with all computed fields
  Tier 2: basic profile from cache (may be slightly stale)
  Tier 3: minimal profile (name, avatar only) from fast DB query

Never: return a 500 Internal Server Error to the user for non-critical features.
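The tier ladder above can be expressed as data rather than prose, so every feature declares its tiers in one place. This is a minimal sketch under stated assumptions: the `Tier` class and `run_tiers` helper are illustrative names, not from the codebase, and each tier's `fetch` is assumed to be a zero-argument callable that either returns items or raises.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tier:
    name: str        # e.g. 'full', 'degraded', 'minimal'
    source: str      # label reported in the response and in metrics
    fetch: Callable  # zero-argument callable returning items, or raising

def run_tiers(tiers):
    """Try each tier in order; return the first result that succeeds."""
    for tier in tiers:
        try:
            items = tier.fetch()
            if items:
                return {'source': tier.source, 'items': items}
        except Exception:
            continue  # this tier failed; fall through to the next one
    return {'source': 'unavailable', 'items': []}  # the "none" tier
```

Declaring tiers as a list also makes them testable in staging: point each `fetch` at a fake that raises, and verify the chain lands on the expected tier.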

Fallback Chain Implementation

def get_recommendations(user_id, limit=10):
    """Returns recommendations, degrading gracefully through tiers."""

    # Tier 1: personalized recommendations
    try:
        recs = recommendation_service.get(user_id, limit=limit, timeout=200)  # timeout in ms
        if recs:
            metrics.increment('recommendations.tier1.hit')
            return {'source': 'personalized', 'items': recs}
    except (TimeoutError, ServiceUnavailable):
        metrics.increment('recommendations.tier1.miss')

    # Tier 2: cached recommendations
    cached = redis.get(f'recs:cached:{user_id}')
    if cached:
        metrics.increment('recommendations.tier2.hit')
        return {'source': 'cached', 'items': json.loads(cached)}

    # Tier 3: top-selling items (always available if DB is up)
    try:
        top_items = db.execute("""
            SELECT product_id FROM ProductStats
            ORDER BY sales_last_7d DESC LIMIT %(lim)s
        """, {'lim': limit}).fetchall()
        if top_items:
            metrics.increment('recommendations.tier3.hit')
            return {'source': 'popular', 'items': [r.product_id for r in top_items]}
    except DatabaseError:
        metrics.increment('recommendations.tier3.miss')

    # Tier 4: empty (hide widget)
    metrics.increment('recommendations.tier4.empty')
    return {'source': 'unavailable', 'items': []}

Feature Flags for Degradation Control

# During an incident: disable expensive features to shed load

def handle_request(user_id):
    page = {}

    # Check feature flags for each section
    if feature_flag('show_recommendations', user_id):
        page['recommendations'] = get_recommendations(user_id)
    else:
        page['recommendations'] = []  # Disabled: reduce load on recommendation service

    if feature_flag('show_social_feed', user_id):
        page['feed'] = get_social_feed(user_id)
    # else: omit the section entirely

    if feature_flag('show_ads', user_id):
        page['ads'] = get_ads(user_id)

    return page

# Incident runbook: when DB hits 90% CPU:
#   1. Disable 'show_ads' (removes 30% of DB queries)
#   2. Disable 'show_recommendations' (removes another 20%)
#   3. Switch 'show_social_feed' to cached-only mode
# Each kill switch has a known impact percentage documented in the runbook

Read-Through Cache as Degradation Buffer

def get_user_profile(user_id):
    cache_key = f'profile:{user_id}'

    # Serve stale if DB is unavailable
    cached = redis.get(cache_key)

    try:
        profile = db.execute(
            "SELECT * FROM User WHERE id=%(id)s", {'id': user_id}
        ).first()
        # Refresh cache with longer TTL to buffer future DB outages
        redis.setex(cache_key, 3600, json.dumps(dict(profile)))  # row -> serializable dict
        return profile
    except DatabaseError:
        if cached:
            metrics.increment('profile.served_stale')
            return json.loads(cached)  # Serve stale during DB outage
        raise  # No cache, cannot degrade

# Set Cache-Control: stale-if-error=86400 on HTTP responses
# CDN serves the cached response for up to 24 hours if origin returns 5xx
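The same buffering can be made explicit with two TTLs: a short freshness window after which the value should be refreshed, and a long hard expiry that keeps a stale copy around as an outage buffer. This is a sketch under assumptions: the `store` dict stands in for Redis, and `FRESH_TTL`/`STALE_TTL` are illustrative names.

```python
import json
import time

FRESH_TTL = 300    # 5 minutes: attempt a refresh from the DB after this
STALE_TTL = 86400  # 24 hours: hard expiry of the stale copy

def cache_set(store, key, value):
    """Store the value with an embedded freshness deadline."""
    payload = json.dumps({'value': value, 'fresh_until': time.time() + FRESH_TTL})
    store[key] = payload
    # With Redis: redis.setex(key, STALE_TTL, payload) so the stale copy survives

def cache_get(store, key):
    """Return (value, is_fresh); (None, False) on a true miss."""
    raw = store.get(key)
    if raw is None:
        return None, False
    entry = json.loads(raw)
    is_fresh = time.time() < entry['fresh_until']
    return entry['value'], is_fresh
```

The caller refreshes when `is_fresh` is false, and if the refresh fails, serves the stale value anyway (incrementing a `served_stale` metric) until the hard expiry removes it.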

Degradation vs Error Monitoring

# Track degradation events to detect when tier 1 is failing at scale

METRICS TO MONITOR:
  recommendations.tier1.hit       → should be >95%
  recommendations.tier2.hit       → alarm if >5% (tier1 failing)
  recommendations.tier3.hit       → alarm if >2% (both tier1 and tier2 failing)
  recommendations.tier4.empty     → page alert (full degradation)

# Dashboard: stacked area chart of recommendation source distribution over time
# A sudden shift from tier1 to tier2/3 indicates a service degradation
# that may not trigger a hard error alarm

ALERT RULES:
  recommendations.tier4.empty rate > 1% for 5 minutes → PagerDuty page
  recommendations.tier3.hit rate > 5% for 10 minutes → Slack alert (warning)
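The alert rules above reduce to a share-of-traffic check over the tier counters. This sketch computes the source distribution and applies the thresholds from this design; the `evaluate_degradation` name and the alert tuple format are assumptions, and the delivery transport (PagerDuty page vs Slack warning) is left to the caller.

```python
def evaluate_degradation(counters):
    """Map raw tier counters to (severity, message) alerts.

    counters: dict like {'tier1.hit': 990, 'tier2.hit': 10}
    Returns (share_by_tier, list_of_alerts).
    """
    total = sum(counters.values()) or 1  # avoid division by zero
    share = {tier: count / total for tier, count in counters.items()}
    alerts = []
    if share.get('tier4.empty', 0) > 0.01:
        alerts.append(('page', 'recommendations fully degraded'))   # page on-call
    if share.get('tier3.hit', 0) > 0.05:
        alerts.append(('warn', 'tier1 and tier2 both failing'))     # Slack warning
    elif share.get('tier2.hit', 0) > 0.05:
        alerts.append(('warn', 'tier1 failing, serving cached recs'))
    return share, alerts
```

In production these thresholds would be evaluated over the sustained windows given above (5 and 10 minutes), not on instantaneous counters, so a brief blip does not page anyone.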

Key Interview Points

  • Define degradation tiers upfront, not during incidents: When the recommendation service goes down at 3am, you do not want to decide what to fall back to on the fly. Document tiers and test them in staging before they are needed.
  • Non-critical features should degrade silently: Recommendations going down should not affect the checkout flow. Use independent try/except per feature section. A failure in recommendations must never propagate to payment processing.
  • Feature flags are the surgical tool: Circuit breakers trip automatically; feature flags are manual overrides. During an incident, turning off the 'show_ads' flag is faster and safer than touching circuit breaker configuration.
  • Serve stale, never error: A 5-minute-old product list is better than a 500 error. Extend TTLs during degraded operations with stale-if-error and stale-while-revalidate. Users accept slightly stale data; they do not accept broken pages.
