Graceful Degradation Low-Level Design

Graceful degradation keeps a system functional when individual components fail by returning reduced-quality responses instead of errors. It is the implementation layer beneath circuit breakers and fallbacks, and it answers a concrete question: what do you actually return when the recommendation service is down? The topic comes up in system design interviews at Netflix, Amazon, and other companies operating complex distributed systems.

Degradation Tiers

Define explicit tiers of response quality per feature:

Feature: Product recommendations
  Tier 1 (full): personalized ML recommendations from recommendation service
  Tier 2 (degraded): cached recommendations from 1 hour ago (Redis)
  Tier 3 (minimal): top-selling items from database (static query)
  Tier 4 (none): empty recommendations section (hide the widget entirely)

Feature: User profile
  Tier 1: full profile with all computed fields
  Tier 2: basic profile from cache (may be slightly stale)
  Tier 3: minimal profile (name, avatar only) from fast DB query

Never return a 500 Internal Server Error to the user because a non-critical feature failed.
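
The tier lists above can also be written down as data, so the fallback order is explicit and easy to test. The following is a minimal sketch under assumptions: `Tier` and `serve_with_tiers` are hypothetical names, and the loaders are stand-ins that simulate a tier 1 outage and a tier 2 cache miss.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass(frozen=True)
class Tier:
    """One rung of the degradation ladder (hypothetical helper type)."""
    name: str
    load: Callable[[], Optional[Any]]  # returns None/empty or raises when unavailable


def serve_with_tiers(tiers: List[Tier], default: Any) -> dict:
    """Try each tier in order; return `default` instead of raising."""
    for tier in tiers:
        try:
            result = tier.load()
        except Exception:
            continue  # this tier failed; degrade to the next one
        if result:
            return {'source': tier.name, 'items': result}
    return {'source': 'unavailable', 'items': default}


def _service_down():
    raise TimeoutError('recommendation service timed out')


# Stand-in loaders: tier 1 is down, tier 2 has a cache miss, tier 3 answers.
recommendation_tiers = [
    Tier('personalized', _service_down),
    Tier('cached', lambda: None),
    Tier('popular', lambda: ['sku-1', 'sku-2']),
]

response = serve_with_tiers(recommendation_tiers, default=[])
print(response)
```

Keeping the tiers in a list means the order can be reviewed and tested as configuration, separate from the logic that walks it.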

Fallback Chain Implementation

import json
from redis.exceptions import RedisError  # assumes the redis-py client

def get_recommendations(user_id, limit=10):
    """Returns recommendations, degrading gracefully through tiers."""

    # Tier 1: personalized recommendations (timeout assumed to be in milliseconds)
    try:
        recs = recommendation_service.get(user_id, limit=limit, timeout=200)
        if recs:
            metrics.increment('recommendations.tier1.hit')
            return {'source': 'personalized', 'items': recs}
        metrics.increment('recommendations.tier1.miss')  # empty result counts as a miss
    except (TimeoutError, ServiceUnavailable):
        metrics.increment('recommendations.tier1.miss')

    # Tier 2: cached recommendations (assumed to be written elsewhere on tier 1 success)
    # A Redis outage must not break the chain, so treat it as a cache miss.
    try:
        cached = redis.get(f'recs:cached:{user_id}')
    except RedisError:
        cached = None
    if cached:
        metrics.increment('recommendations.tier2.hit')
        return {'source': 'cached', 'items': json.loads(cached)}

    # Tier 3: top-selling items (always available if DB is up)
    try:
        top_items = db.execute("""
            SELECT product_id FROM ProductStats
            ORDER BY sales_last_7d DESC LIMIT %(lim)s
        """, {'lim': limit}).fetchall()
        if top_items:
            metrics.increment('recommendations.tier3.hit')
            return {'source': 'popular', 'items': [r.product_id for r in top_items]}
    except DatabaseError:
        metrics.increment('recommendations.tier3.miss')

    # Tier 4: empty (hide widget)
    metrics.increment('recommendations.tier4.empty')
    return {'source': 'unavailable', 'items': []}

Feature Flags for Degradation Control

-- During an incident: disable expensive features to shed load

def handle_request(user_id):
    page = {}

    # Check feature flags for each section
    if feature_flag('show_recommendations', user_id):
        page['recommendations'] = get_recommendations(user_id)
    else:
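        # Disabled: reduce load on recommendation service.
        # Same shape as get_recommendations() so the renderer needs no special case.
        page['recommendations'] = {'source': 'disabled', 'items': []}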

    if feature_flag('show_social_feed', user_id):
        page['feed'] = get_social_feed(user_id)
    # else: omit the section entirely

    if feature_flag('show_ads', user_id):
        page['ads'] = get_ads(user_id)

    return page

-- Incident runbook: when DB hits 90% CPU:
--   1. Disable 'show_ads' (removes 30% of DB queries)
--   2. Disable 'show_recommendations' (removes another 20%)
--   3. Switch 'show_social_feed' to cached-only mode
-- Each kill switch has a known impact percentage documented in the runbook
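
The runbook steps above can be sketched as code. This is a minimal illustration, not a real flag system: `FLAGS`, `SHED_ORDER`, and `shed_load` are hypothetical names, the store is an in-memory dict, and the percentages are the documented impact figures from the runbook notes. Defaulting unknown flags to off is an assumption; some teams prefer the opposite default.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory flag store; a real system would read these from a
# flag service or config store that can be changed without a deploy.
FLAGS: Dict[str, bool] = {
    'show_recommendations': True,
    'show_social_feed': True,
    'show_ads': True,
}

# Runbook order: (flag, documented share of DB queries removed, in percent).
SHED_ORDER: List[Tuple[str, int]] = [
    ('show_ads', 30),
    ('show_recommendations', 20),
]


def feature_flag(name: str, user_id: int) -> bool:
    """Return the flag state; unknown flags default to off (an assumption).

    user_id is unused here and kept only to match the call sites above.
    """
    return FLAGS.get(name, False)


def shed_load(target_pct: int) -> List[str]:
    """Disable flags in runbook order until documented savings reach target_pct."""
    disabled: List[str] = []
    saved = 0
    for name, pct in SHED_ORDER:
        if saved >= target_pct:
            break
        FLAGS[name] = False
        disabled.append(name)
        saved += pct
    return disabled


disabled = shed_load(target_pct=25)
print(disabled)
```

Because each switch carries a documented impact, the on-call engineer can pick the smallest set of flags that covers the load they need to shed.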

Read-Through Cache as Degradation Buffer

def get_user_profile(user_id):
    cache_key = f'profile:{user_id}'

    try:
        profile = db.execute(
            "SELECT * FROM User WHERE id=%(id)s", {'id': user_id}
        ).first()
    except DatabaseError:
        # DB outage: fall back to the last cached copy, if any
        cached = redis.get(cache_key)
        if cached:
            metrics.increment('profile.served_stale')
            return json.loads(cached)  # Serve stale during DB outage
        raise  # No cache, cannot degrade

    if profile is not None:
        try:
            # Refresh cache with a 1-hour TTL to buffer future DB outages.
            # Assumes `profile` is JSON-serializable (e.g. a dict-like row).
            redis.setex(cache_key, 3600, json.dumps(profile))
        except Exception:
            # Cache write is best effort; never fail the request because of it
            metrics.increment('profile.cache_write_failed')
    return profile

-- Set Cache-Control: stale-if-error=86400 on HTTP responses
-- A CDN that honors stale-if-error serves the cached response for up to 24 hours if origin returns 5xx
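
The header mentioned above can be assembled with a small helper. This is a sketch under assumptions: `cache_control` is a hypothetical function, and the `public` directive is only appropriate for non-personalized responses such as the top-selling items list. Support for `stale-if-error` and `stale-while-revalidate` varies between CDNs and browsers, so check the behavior of the cache you actually use.

```python
def cache_control(max_age: int, stale_if_error: int,
                  stale_while_revalidate: int = 0) -> str:
    """Build a Cache-Control value for a shared-cacheable response.

    All durations are in seconds.
    """
    directives = ['public', f'max-age={max_age}']
    if stale_while_revalidate > 0:
        # Cache may serve the expired entry while it refetches in the background.
        directives.append(f'stale-while-revalidate={stale_while_revalidate}')
    if stale_if_error > 0:
        # Cache may serve the expired entry when the origin returns an error.
        directives.append(f'stale-if-error={stale_if_error}')
    return ', '.join(directives)


header = cache_control(max_age=300, stale_if_error=86400,
                       stale_while_revalidate=60)
print(header)
```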

Degradation vs Error Monitoring

-- Track degradation events to detect when tier 1 is failing at scale

METRICS TO MONITOR:
  recommendations.tier1.hit       → should be >95%
  recommendations.tier2.hit       → alarm if >5% (tier1 failing)
  recommendations.tier3.hit       → alarm if >2% (both tier1 and tier2 failing)
  recommendations.tier4.empty     → page alert (full degradation)

-- Dashboard: stacked area chart of recommendation source distribution over time
-- A sudden shift from tier1 to tier2/3 indicates a service degradation
-- that may not trigger a hard error alarm

ALERT RULES:
  recommendations.tier4.empty rate > 1% for 5 minutes → PagerDuty page
  recommendations.tier3.hit rate > 5% for 10 minutes → Slack alert (warning)
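
The thresholds in the metrics list can be checked against per-tier counters. The sketch below is illustrative only: `tier_shares`, `breached`, and the sample counts are hypothetical, and it evaluates a single window, leaving out the "for 5 minutes" duration logic that a real alerting system would provide.

```python
from typing import Dict, List

# Maximum acceptable share of responses per tier, taken from the metrics list.
THRESHOLDS: Dict[str, float] = {
    'recommendations.tier2.hit': 0.05,
    'recommendations.tier3.hit': 0.02,
    'recommendations.tier4.empty': 0.0,
}


def tier_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw per-tier counters into fractions of all responses."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: n / total for name, n in counts.items()}


def breached(counts: Dict[str, int]) -> List[str]:
    """Return the metrics whose share of responses exceeds its threshold."""
    shares = tier_shares(counts)
    return sorted(
        name for name, limit in THRESHOLDS.items()
        if shares.get(name, 0.0) > limit
    )


# Made-up sample window: 8% of responses came from the tier 2 cache.
window = {
    'recommendations.tier1.hit': 910,
    'recommendations.tier2.hit': 80,
    'recommendations.tier3.hit': 10,
    'recommendations.tier4.empty': 0,
}
alerts = breached(window)
print(alerts)
```

Measuring shares rather than raw counts keeps the alert meaningful as traffic rises and falls through the day.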

Key Interview Points

  • Define degradation tiers upfront, not during incidents: When the recommendation service goes down at 3am, you do not want to decide what to fall back to on the fly. Document tiers and test them in staging before they are needed.
  • Non-critical features should degrade silently: Recommendations going down should not affect the checkout flow. Use independent try/except per feature section. A failure in recommendations must never propagate to payment processing.
  • Feature flags are the surgical tool: Circuit breakers trip automatically; feature flags are manual overrides. During an incident, turning off the “show_ads” flag is faster and safer than touching circuit breaker configuration.
  • Serve stale, never error: A 5-minute-old product list is better than a 500 error. Let caches serve expired entries during degraded operation with stale-if-error and stale-while-revalidate. Users accept slightly stale data; they do not accept broken pages.
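
The "independent try/except per feature section" point can be sketched as follows. The names `render_section` and `build_page` are hypothetical, and the loaders are stand-ins that inject a failure into the recommendations section. The cart, as a critical feature, is deliberately loaded without the wrapper so its errors surface instead of being hidden.

```python
from typing import Any, Callable, Dict


def render_section(name: str, loader: Callable[[], Any], fallback: Any) -> Any:
    """Load one non-critical section; on any failure return its fallback."""
    try:
        return loader()
    except Exception as exc:
        # A real service would increment a metric here; print is a stand-in.
        print(f'section {name} degraded: {exc}')
        return fallback


def build_page(user_id: int) -> Dict[str, Any]:
    def broken_recommendations():
        # Injected fault: simulates the recommendation service being down.
        raise RuntimeError('recommendation service unavailable')

    def load_cart():
        return {'user_id': user_id, 'items': ['sku-1']}

    return {
        # Non-critical: isolated, degrades to an empty widget.
        'recommendations': render_section(
            'recommendations', broken_recommendations,
            {'source': 'unavailable', 'items': []},
        ),
        # Critical: not wrapped, so a failure here is reported, not masked.
        'cart': load_cart(),
    }


page = build_page(42)
print(page)
```

Running the same fault injection in staging is one way to confirm that a recommendations outage leaves the rest of the page intact.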
