Graceful Degradation — Low-Level Design
Graceful degradation keeps a system functional when individual components fail, returning reduced-quality responses rather than errors. It is the implementation layer beneath circuit breakers and fallbacks: what do you actually return when the recommendation service is down? This design question comes up in interviews at Netflix, Amazon, and other companies operating complex distributed systems.
Degradation Tiers
Define explicit tiers of response quality per feature:
Feature: Product recommendations
  Tier 1 (full): personalized ML recommendations from recommendation service
  Tier 2 (degraded): cached recommendations from 1 hour ago (Redis)
  Tier 3 (minimal): top-selling items from database (static query)
  Tier 4 (none): empty recommendations section (hide the widget entirely)
Feature: User profile
  Tier 1: full profile with all computed fields
  Tier 2: basic profile from cache (may be slightly stale)
  Tier 3: minimal profile (name, avatar only) from fast DB query
Never: return a 500 Internal Server Error to the user for non-critical features.
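The tier lists above can also be written down as data, so that every feature degrades through the same code path. The following is a minimal sketch, not a definitive implementation: `Tier`, `resolve`, and the sample profile fetchers are hypothetical names introduced here for illustration, and the outage and cache miss are simulated.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence


@dataclass(frozen=True)
class Tier:
    name: str                           # label reported to the caller, e.g. "cached"
    fetch: Callable[[], Optional[Any]]  # returns None (or raises) when unavailable


def resolve(tiers: Sequence[Tier], default: Any) -> dict:
    """Try each tier in order and return the first non-empty result."""
    for tier in tiers:
        try:
            value = tier.fetch()
        except Exception:
            # Broad catch is deliberate in this sketch: for a non-critical
            # feature, no tier failure should surface to the user.
            continue
        if value:
            return {"source": tier.name, "value": value}
    return {"source": "none", "value": default}


def _full_profile():
    raise TimeoutError("profile service unavailable")  # simulated outage


profile_tiers = [
    Tier("full", _full_profile),
    Tier("cached", lambda: None),  # simulated cache miss
    Tier("minimal", lambda: {"name": "Ada", "avatar": "ada.png"}),
]

result = resolve(profile_tiers, default={})
print(result["source"])  # prints "minimal"
```

Keeping the tiers in a list makes the order explicit and reviewable, which is the point of defining them before an incident rather than during one.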
Fallback Chain Implementation
import json

# recommendation_service, redis, db, and metrics are application-level clients;
# ServiceUnavailable, RedisError, and DatabaseError are their exception types.

def get_recommendations(user_id, limit=10):
    """Returns recommendations, degrading gracefully through tiers."""
    # Tier 1: personalized recommendations (timeout assumed to be in milliseconds)
    try:
        recs = recommendation_service.get(user_id, limit=limit, timeout=200)
        if recs:
            metrics.increment('recommendations.tier1.hit')
            return {'source': 'personalized', 'items': recs}
    except (TimeoutError, ServiceUnavailable):
        pass
    # Both an exception and an empty result count as a Tier 1 miss
    metrics.increment('recommendations.tier1.miss')
    # Tier 2: cached recommendations (a Redis outage must not break the chain)
    try:
        cached = redis.get(f'recs:cached:{user_id}')
        if cached:
            metrics.increment('recommendations.tier2.hit')
            return {'source': 'cached', 'items': json.loads(cached)}
    except RedisError:
        metrics.increment('recommendations.tier2.miss')
    # Tier 3: top-selling items (always available if DB is up)
    try:
        top_items = db.execute("""
            SELECT product_id FROM ProductStats
            ORDER BY sales_last_7d DESC LIMIT %(lim)s
        """, {'lim': limit}).fetchall()
        if top_items:
            metrics.increment('recommendations.tier3.hit')
            return {'source': 'popular', 'items': [r.product_id for r in top_items]}
    except DatabaseError:
        metrics.increment('recommendations.tier3.miss')
    # Tier 4: empty (hide widget)
    metrics.increment('recommendations.tier4.empty')
    return {'source': 'unavailable', 'items': []}
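The chain above reads `recs:cached:{user_id}` but does not show where that key is written. One plausible approach is to write it back on every successful Tier 1 call, so Tier 2 always holds the last good answer. The sketch below assumes that approach; `FakeCache` is an in-memory stand-in for Redis, and `fetch_with_write_back`, `healthy`, and `down` are hypothetical names used only for illustration.

```python
import json
import time


class FakeCache:
    """In-memory stand-in for Redis with setex/get semantics (illustration only)."""

    def __init__(self):
        self._data = {}

    def setex(self, key, ttl_seconds, value):
        self._data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # expired: behave like a cache miss
            return None
        return value


CACHE_TTL_SECONDS = 3600  # matches the "1 hour ago" Tier 2 described earlier


def fetch_with_write_back(cache, user_id, fetch_personalized):
    """Return personalized recs and refresh the Tier 2 cache; use the cache on failure."""
    key = f"recs:cached:{user_id}"
    try:
        recs = fetch_personalized(user_id)
    except (TimeoutError, ConnectionError):
        recs = None
    if recs:
        cache.setex(key, CACHE_TTL_SECONDS, json.dumps(recs))
        return {"source": "personalized", "items": recs}
    cached = cache.get(key)
    if cached:
        return {"source": "cached", "items": json.loads(cached)}
    return {"source": "unavailable", "items": []}


def healthy(user_id):
    return [101, 102, 103]


def down(user_id):
    raise TimeoutError("recommendation service timed out")  # simulated outage


cache = FakeCache()
first = fetch_with_write_back(cache, 42, healthy)   # populates the cache
second = fetch_with_write_back(cache, 42, down)     # served from the cache
print(first["source"], second["source"])  # prints "personalized cached"
```

A user who has never had a successful Tier 1 call has nothing cached, which is why the chain still needs the popular-items and empty tiers behind it.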
Feature Flags for Degradation Control
# During an incident: disable expensive features to shed load
def handle_request(user_id):
    page = {}
    # Check feature flags for each section
    if feature_flag('show_recommendations', user_id):
        page['recommendations'] = get_recommendations(user_id)
    else:
        # Disabled: reduce load on the recommendation service.
        # Same shape as get_recommendations() so the renderer needs no special case.
        page['recommendations'] = {'source': 'disabled', 'items': []}
    if feature_flag('show_social_feed', user_id):
        page['feed'] = get_social_feed(user_id)
    # else: omit the section entirely
    if feature_flag('show_ads', user_id):
        page['ads'] = get_ads(user_id)
    return page
# Incident runbook: when DB hits 90% CPU:
# 1. Disable 'show_ads' (removes 30% of DB queries)
# 2. Disable 'show_recommendations' (removes another 20%)
# 3. Switch 'show_social_feed' to cached-only mode
# Each kill switch has a known impact percentage documented in the runbook
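The runbook steps can be encoded next to the flags themselves, which keeps the order and the documented impact in one place. This is a sketch under assumptions: `Mode`, `KillSwitch`, `RUNBOOK`, `apply_runbook`, and `feed_section` are hypothetical names, and the third entry carries no impact figure because the runbook above does not give one. It also assumes a flag can hold three states, which the cached-only step implies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    ON = "on"
    CACHED_ONLY = "cached_only"
    OFF = "off"


@dataclass(frozen=True)
class KillSwitch:
    flag: str
    degraded_mode: Mode
    db_queries_removed_pct: Optional[int]  # documented estimate, None if unknown


# Ordered exactly as in the runbook above.
RUNBOOK = [
    KillSwitch("show_ads", Mode.OFF, 30),
    KillSwitch("show_recommendations", Mode.OFF, 20),
    KillSwitch("show_social_feed", Mode.CACHED_ONLY, None),
]


def apply_runbook(flags: dict, steps: int) -> dict:
    """Return a copy of `flags` with the first `steps` runbook entries applied."""
    updated = dict(flags)
    for switch in RUNBOOK[:steps]:
        updated[switch.flag] = switch.degraded_mode
    return updated


def feed_section(mode: Mode, fetch_live, fetch_cached):
    """Pick the data source for a section based on its flag mode."""
    if mode is Mode.ON:
        return fetch_live()
    if mode is Mode.CACHED_ONLY:
        return fetch_cached()
    return None  # OFF: omit the section entirely


flags = {switch.flag: Mode.ON for switch in RUNBOOK}
after_two = apply_runbook(flags, steps=2)
print(after_two["show_ads"].value, after_two["show_social_feed"].value)  # prints "off on"
```

`apply_runbook` returns a new dict rather than mutating the input, so the pre-incident state is still available when it is time to roll the flags back.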
Read-Through Cache as Degradation Buffer
def get_user_profile(user_id):
    cache_key = f'profile:{user_id}'
    try:
        profile = db.execute(
            "SELECT * FROM User WHERE id=%(id)s", {'id': user_id}
        ).first()
    except DatabaseError:
        # DB is unavailable: serve the last cached copy if there is one
        cached = redis.get(cache_key)
        if cached:
            metrics.increment('profile.served_stale')
            return json.loads(cached)  # Serve stale during DB outage
        raise  # No cache, cannot degrade
    if profile is not None:
        # Refresh the cache with a long TTL (1 hour) to buffer future DB outages.
        # Assumes the row is JSON-serializable.
        redis.setex(cache_key, 3600, json.dumps(profile))
    return profile
# Set Cache-Control: stale-if-error=86400 on HTTP responses
# A CDN that honors the directive serves the cached response for up to 24 hours if origin returns 5xx
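To make the header note concrete, here is a small sketch that assembles a `Cache-Control` value with the stale directives. `cache_control` is a hypothetical helper, the 300-second `max-age` and 60-second `stale-while-revalidate` are illustrative values, and only the 86400-second `stale-if-error` comes from the note above. Support for these directives varies between CDNs and browsers, so it is worth checking your provider's documentation.

```python
from typing import Optional


def cache_control(
    max_age: int,
    stale_while_revalidate: Optional[int] = None,
    stale_if_error: Optional[int] = None,
) -> str:
    """Build a Cache-Control header value; all durations are in seconds."""
    directives = [f"max-age={max_age}"]
    if stale_while_revalidate is not None:
        # Serve the stale copy while refetching in the background.
        directives.append(f"stale-while-revalidate={stale_while_revalidate}")
    if stale_if_error is not None:
        # Serve the stale copy when the origin responds with an error.
        directives.append(f"stale-if-error={stale_if_error}")
    return ", ".join(directives)


header = cache_control(300, stale_while_revalidate=60, stale_if_error=86400)
print(header)  # prints "max-age=300, stale-while-revalidate=60, stale-if-error=86400"
```

With these values a response is fresh for 5 minutes, and a cache that honors `stale-if-error` may keep serving it for up to 24 hours after that while the origin is failing.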
Degradation vs Error Monitoring
-- Track degradation events to detect when tier 1 is failing at scale
METRICS TO MONITOR:
recommendations.tier1.hit → should be >95%
recommendations.tier2.hit → alarm if >5% (tier1 failing)
recommendations.tier3.hit → alarm if >2% (both tier1 and tier2 failing)
recommendations.tier4.empty → page alert (full degradation)
-- Dashboard: stacked area chart of recommendation source distribution over time
-- A sudden shift from tier1 to tier2/3 indicates a service degradation
-- that may not trigger a hard error alarm
ALERT RULES:
recommendations.tier4.empty rate > 1% for 5 minutes → PagerDuty page
recommendations.tier3.hit rate > 5% for 10 minutes → Slack alert (warning)
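The two rules above combine a rate threshold with a duration. A minimal sketch of that evaluation follows, assuming the metrics backend can supply one rate per minute for each response source; `source_shares`, `sustained_above`, `ALERT_RULES`, and `evaluate` are hypothetical names, and the source labels match the `'source'` values returned by the fallback chain.

```python
from collections import Counter


def source_shares(sources):
    """Fraction of responses per source, e.g. for the stacked area chart."""
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {source: count / total for source, count in counts.items()}


def sustained_above(per_minute_rates, threshold, minutes):
    """True only if each of the last `minutes` samples exceeds `threshold`."""
    recent = per_minute_rates[-minutes:]
    return len(recent) == minutes and all(rate > threshold for rate in recent)


# (source, threshold, minutes, action), taken from the rules above.
ALERT_RULES = [
    ("unavailable", 0.01, 5, "page"),   # tier 4 empty > 1% for 5 minutes
    ("popular", 0.05, 10, "warn"),      # tier 3 hits > 5% for 10 minutes
]


def evaluate(rates_by_source):
    """Return the (source, action) pairs whose rule is currently firing."""
    fired = []
    for source, threshold, minutes, action in ALERT_RULES:
        if sustained_above(rates_by_source.get(source, []), threshold, minutes):
            fired.append((source, action))
    return fired


history = {
    "unavailable": [0.0, 0.02, 0.02, 0.03, 0.02, 0.02],  # above 1% for 5 minutes
    "popular": [0.06] * 6,                                # above 5%, but only 6 minutes
}
print(evaluate(history))  # prints "[('unavailable', 'page')]"
```

Requiring every sample in the window to exceed the threshold avoids paging on a single noisy minute, at the cost of reacting a few minutes later.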
Key Interview Points
- Define degradation tiers upfront, not during incidents: When the recommendation service goes down at 3am, you do not want to decide what to fall back to on the fly. Document tiers and test them in staging before they are needed.
- Non-critical features should degrade silently: Recommendations going down should not affect the checkout flow. Use independent try/except per feature section. A failure in recommendations must never propagate to payment processing.
- Feature flags are the surgical tool: Circuit breakers trip automatically; feature flags are manual overrides. During an incident, turning off the 'show_ads' flag is faster and safer than touching circuit breaker configuration.
- Serve stale, never error: A 5-minute-old product list is better than a 500 error. Extend TTLs during degraded operations with stale-if-error and stale-while-revalidate. Users accept slightly stale data; they do not accept broken pages.
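The "independent try/except per feature section" point can be sketched as a page builder that isolates each non-critical section. This is an illustration under assumptions: `build_page` and the sample loaders are hypothetical, and critical paths such as payment processing are assumed to be handled elsewhere, where errors should surface rather than be swallowed.

```python
import logging

logger = logging.getLogger("page")


def build_page(user_id, sections):
    """Load each non-critical section independently.

    `sections` maps a section name to a (loader, fallback) pair. A failing
    loader is replaced by its fallback and never affects the other sections.
    """
    page = {}
    for name, (loader, fallback) in sections.items():
        try:
            page[name] = loader(user_id)
        except Exception as exc:
            # Broad catch is intentional for non-critical sections only.
            logger.warning("section %s degraded: %r", name, exc)
            page[name] = fallback
    return page


def load_cart(user_id):
    return {"items": [7, 9]}


def load_recommendations(user_id):
    raise RuntimeError("recommendation service is down")  # simulated failure


page = build_page(
    user_id=42,
    sections={
        "cart": (load_cart, {"items": []}),
        "recommendations": (load_recommendations, {"source": "unavailable", "items": []}),
    },
)
print(page["cart"], page["recommendations"]["source"])
```

The recommendations failure is logged and replaced by the empty fallback, while the cart section renders normally.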