Link Preview Service Low-Level Design: SSRF Prevention, Open Graph Parsing, and Caching

A link preview service fetches URL metadata (title, description, image) to render rich embeds when users share links — familiar from Slack, Twitter, iMessage, and Notion. Core challenges: fetching external URLs safely (SSRF prevention), extracting Open Graph / Twitter Card metadata reliably, caching aggressively to avoid refetching on every view, and handling slow or malicious origin servers.

Core Data Model

CREATE TABLE LinkPreview (
    url_hash      CHAR(64) PRIMARY KEY,  -- SHA-256 of normalized URL
    url           TEXT NOT NULL,
    title         TEXT,
    description   TEXT,
    image_url     TEXT,
    site_name     TEXT,
    favicon_url   TEXT,
    content_type  TEXT,                  -- 'article', 'video', 'image', 'website'
    status        TEXT NOT NULL DEFAULT 'pending',  -- pending, ready, failed, blocked
    fetch_status  SMALLINT,              -- HTTP status code from origin
    fetched_at    TIMESTAMPTZ,
    expires_at    TIMESTAMPTZ NOT NULL,  -- cache TTL
    error_message TEXT
);
CREATE INDEX idx_preview_expires ON LinkPreview (expires_at) WHERE status = 'ready';

SSRF Prevention and URL Validation

import ipaddress, socket, urllib.parse

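BLOCKED_RANGES = [
    ipaddress.ip_network("0.0.0.0/8"),        # "this network"; 0.0.0.0 often reaches localhost
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("100.64.0.0/10"),    # carrier-grade NAT shared address space
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
    ipaddress.ip_network("127.0.0.0/8"),
    ipaddress.ip_network("169.254.0.0/16"),   # link-local (AWS metadata: 169.254.169.254)
    ipaddress.ip_network("::1/128"),
    ipaddress.ip_network("fc00::/7"),         # IPv6 unique local
    ipaddress.ip_network("fe80::/10"),        # IPv6 link-local
]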
ALLOWED_SCHEMES = {'http', 'https'}
MAX_REDIRECTS = 3
FETCH_TIMEOUT_SEC = 5

def is_ssrf_safe(url: str) -> tuple[bool, str]:
    """
    Returns (is_safe, reason).
    Blocks private IPs, localhost, link-local ranges, and non-http(s) schemes.
    """
    try:
        parsed = urllib.parse.urlparse(url)
    except Exception:
        return False, "Invalid URL"

    if parsed.scheme not in ALLOWED_SCHEMES:
        return False, f"Scheme {parsed.scheme} not allowed"

    hostname = parsed.hostname
    if not hostname:
        return False, "No hostname"

    # Resolve DNS to IP
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return False, "DNS resolution failed"

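    for info in infos:
        ip_str = info[4][0]
        try:
            ip = ipaddress.ip_address(ip_str)
        except ValueError:
            # Fail closed: an address we cannot parse is not an address we can vet
            return False, f"Unparseable address {ip_str}"
        # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so the IPv4 ranges apply to it
        if ip.version == 6 and ip.ipv4_mapped is not None:
            ip = ip.ipv4_mapped
        for blocked in BLOCKED_RANGES:
            if ip in blocked:
                return False, f"IP {ip} is in blocked range {blocked}"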

    return True, ""
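
The range check can be isolated from DNS resolution, which makes edge cases easy to test. The following is a minimal sketch, not the service's actual code: `is_blocked_ip` is a hypothetical helper, and its deny-list includes a few ranges (0.0.0.0/8, 100.64.0.0/10, fe80::/10) that are an assumption about what a deployment would want blocked. It also unwraps IPv4-mapped IPv6 addresses, which would otherwise never match an IPv4 network.

```python
import ipaddress

# Assumed deny-list for illustration; extend it to match your deployment.
DENY = [ipaddress.ip_network(n) for n in (
    "0.0.0.0/8", "10.0.0.0/8", "100.64.0.0/10", "127.0.0.0/8", "169.254.0.0/16",
    "172.16.0.0/12", "192.168.0.0/16", "::1/128", "fc00::/7", "fe80::/10",
)]

def is_blocked_ip(ip_str: str) -> bool:
    """Hypothetical helper: True if the address must not be fetched."""
    try:
        ip = ipaddress.ip_address(ip_str)
    except ValueError:
        return True  # fail closed on anything unparseable
    # ::ffff:a.b.c.d carries an IPv4 address inside IPv6; compare the IPv4 form.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return any(ip in net for net in DENY)

for candidate in ("169.254.169.254", "::ffff:127.0.0.1", "10.1.2.3", "8.8.8.8"):
    print(candidate, "blocked" if is_blocked_ip(candidate) else "allowed")
```

The first three candidates are reported as blocked and 8.8.8.8 as allowed. Keeping this function free of network calls means it can be unit-tested exhaustively, while DNS resolution stays in the caller.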

Fetching and Parsing Metadata

import hashlib, requests
from bs4 import BeautifulSoup
from datetime import datetime, timezone, timedelta
import psycopg2

PREVIEW_CACHE_TTL_HOURS = 24
MAX_RESPONSE_BYTES = 1 * 1024 * 1024  # 1 MiB — don't fetch entire large pages

def normalize_url(url: str) -> str:
    """Lowercase scheme+host, strip tracking params, sort query params."""
    parsed = urllib.parse.urlparse(url.strip())
    # Strip common tracking params
    TRACKING_PARAMS = {'utm_source','utm_medium','utm_campaign','utm_content','utm_term','fbclid','gclid'}
    qs = urllib.parse.parse_qs(parsed.query, keep_blank_values=True)
    qs_clean = {k: v for k, v in qs.items() if k not in TRACKING_PARAMS}
    clean_query = urllib.parse.urlencode(sorted(qs_clean.items()), doseq=True)
    normalized = parsed._replace(
        scheme=parsed.scheme.lower(),
        netloc=parsed.netloc.lower(),
        query=clean_query,
        fragment=""
    )
    return urllib.parse.urlunparse(normalized)

def fetch_link_preview(conn, url: str) -> dict:
    normalized = normalize_url(url)
    url_hash = hashlib.sha256(normalized.encode()).hexdigest()

    # Check cache first
    with conn.cursor() as cur:
        cur.execute(
            "SELECT title, description, image_url, site_name, status FROM LinkPreview WHERE url_hash = %s AND expires_at > NOW()",
            (url_hash,)
        )
        cached = cur.fetchone()
    if cached:
        return {"title": cached[0], "description": cached[1], "image_url": cached[2],
                "site_name": cached[3], "status": cached[4], "cached": True}

    # SSRF check
    safe, reason = is_ssrf_safe(normalized)
    if not safe:
        store_preview(conn, url_hash, normalized, status='blocked', error=reason)
        raise ValueError(f"URL blocked: {reason}")

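    # Fetch with timeout and size limit. Redirects are followed manually so that
    # every hop is SSRF-checked (requests.get() has no max_redirects argument).
    try:
        current = normalized
        for _ in range(MAX_REDIRECTS + 1):
            resp = requests.get(
                current,
                timeout=FETCH_TIMEOUT_SEC,
                headers={"User-Agent": "TechInterviewBot/1.0 (+https://techinterview.org)"},
                allow_redirects=False,
                stream=True,
            )
            if not resp.is_redirect:
                break
            current = urllib.parse.urljoin(current, resp.headers["Location"])
            resp.close()
            safe, reason = is_ssrf_safe(current)
            if not safe:
                store_preview(conn, url_hash, normalized, status='blocked', error=reason)
                raise ValueError(f"Redirect blocked: {reason}")
        else:
            raise requests.exceptions.TooManyRedirects(f"More than {MAX_REDIRECTS} redirects")
        # Read at most MAX_RESPONSE_BYTES of the body
        content = b""
        for chunk in resp.iter_content(chunk_size=8192):
            content += chunk
            if len(content) >= MAX_RESPONSE_BYTES:
                break
        resp.close()
        resp_content = content[:MAX_RESPONSE_BYTES].decode('utf-8', errors='replace')
    except requests.exceptions.RequestException as exc:
        # Covers timeouts, connection errors, and too many redirects
        store_preview(conn, url_hash, normalized, status='failed', error=str(exc))
        raise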

    # Parse Open Graph and Twitter Card tags
    soup = BeautifulSoup(resp_content, 'html.parser')

    def og(prop):
        tag = soup.find('meta', property=f'og:{prop}') or soup.find('meta', attrs={'name': f'twitter:{prop}'})
        return tag['content'].strip() if tag and tag.get('content') else None

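    # soup.title.string is None for an empty or nested <title>, so guard before strip()
    title = og('title') or (soup.title.string.strip() if soup.title and soup.title.string else None)
    description = og('description') or (soup.find('meta', attrs={'name': 'description'}) or {}).get('content') or None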
    image_url = og('image')
    site_name = og('site_name') or urllib.parse.urlparse(normalized).netloc

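    # Resolve relative og:image values against the page URL, then validate
    # (the image URL must also be SSRF-safe)
    if image_url:
        image_url = urllib.parse.urljoin(normalized, image_url)
        img_safe, _ = is_ssrf_safe(image_url)
        if not img_safe:
            image_url = None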

    preview = store_preview(conn, url_hash, normalized, status='ready',
                             title=title, description=description,
                             image_url=image_url, site_name=site_name,
                             fetch_status=resp.status_code)
    return preview

def store_preview(conn, url_hash, url, status, title=None, description=None,
                   image_url=None, site_name=None, fetch_status=None, error=None) -> dict:
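    # Failed fetches get a short TTL so they are retried after about an hour
    ttl_hours = 1 if status == 'failed' else PREVIEW_CACHE_TTL_HOURS
    expires_at = datetime.now(timezone.utc) + timedelta(hours=ttl_hours)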
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO LinkPreview (url_hash, url, title, description, image_url, site_name,
                status, fetch_status, fetched_at, expires_at, error_message)
            VALUES (%s,%s,%s,%s,%s,%s,%s,%s,NOW(),%s,%s)
            ON CONFLICT (url_hash) DO UPDATE SET
                title=EXCLUDED.title, description=EXCLUDED.description,
                image_url=EXCLUDED.image_url, site_name=EXCLUDED.site_name,
                status=EXCLUDED.status, fetch_status=EXCLUDED.fetch_status,
                fetched_at=NOW(), expires_at=EXCLUDED.expires_at,
                error_message=EXCLUDED.error_message
        """, (url_hash, url, title, description, image_url, site_name,
              status, fetch_status, expires_at, error))
    conn.commit()
    return {"title": title, "description": description, "image_url": image_url,
            "site_name": site_name, "status": status}

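As a quick check of the deduplication behaviour, the sketch below re-implements the normalization steps in a standalone helper and shows two variants of a shared link collapsing to one cache key. `cache_key` is a hypothetical name that does not appear in the listing above, and the tracking-parameter set is assumed to mirror the one in `normalize_url`.

```python
import hashlib
import urllib.parse

# Assumed tracking-parameter list, mirroring the one in normalize_url.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_content", "utm_term", "fbclid", "gclid"}

def cache_key(url: str) -> str:
    """Hypothetical helper: normalize a URL and return its SHA-256 hex digest."""
    parsed = urllib.parse.urlparse(url.strip())
    pairs = urllib.parse.parse_qsl(parsed.query, keep_blank_values=True)
    # Drop tracking params, then sort so parameter order does not affect the key.
    kept = sorted((k, v) for k, v in pairs if k not in TRACKING_PARAMS)
    normalized = parsed._replace(
        scheme=parsed.scheme.lower(),
        netloc=parsed.netloc.lower(),
        query=urllib.parse.urlencode(kept),
        fragment="",
    )
    return hashlib.sha256(urllib.parse.urlunparse(normalized).encode()).hexdigest()

a = cache_key("https://Example.com/post?id=1&utm_source=twitter#comments")
b = cache_key("HTTPS://example.com/post?id=1")
c = cache_key("https://example.com/post?id=2")
print(a == b, a == c)
```

Here `a` and `b` are equal while `c` differs, so the first two URLs share one LinkPreview row. Only the scheme and host are lowercased, because paths and query values can be case-sensitive on the origin.
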
Key Interview Points

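  • SSRF is the critical security issue: Without validation, an attacker submits URLs like http://169.254.169.254/latest/meta-data/ (AWS metadata endpoint) or http://localhost:6379/ (Redis). Your server fetches these internal endpoints and returns credentials or data. Defenses: (1) resolve DNS and check every resulting IP against blocked ranges before fetching; (2) use a dedicated egress proxy (Smokescreen, Squid with ACLs) that enforces allowlists; (3) validate every redirect hop, because a public URL that passes the first check can still redirect to a private IP. A resolve-then-fetch check also leaves a DNS rebinding window: the hostname can resolve to a different IP when the HTTP client looks it up again, so connect to the IP you validated or treat the egress proxy as the final gate.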
  • URL normalization for cache deduplication: https://example.com/post?id=1&utm_source=twitter and https://example.com/post?id=1 point to the same page. Normalize before hashing: lowercase scheme/host, sort query params, strip tracking params, remove fragment. This maximizes cache hit rate — critical because every cache miss is an outbound HTTP request.
  • Cache TTL tradeoffs: A 24-hour TTL is a reasonable default because og:image and titles rarely change. For news articles, a shorter TTL (1-2 hours) captures headline changes. For user-generated content, respect Cache-Control: max-age from the origin server. Never cache status='failed' permanently; retry after 1 hour.
  • Asynchronous fetch pattern: Don't block the user's message send waiting for the fetch. Return the message immediately with status='pending', fetch in a background job, and push a WebSocket update when the preview is ready. This decouples perceived latency (instant send) from actual fetch time (up to 5 seconds). Prefetch on paste in the client-side editor so the preview is usually ready by the time the user hits Send.
  • Image proxying: Serve og:image through your own proxy (proxy.example.com/img?url=…) rather than embedding third-party URLs. Benefits: (1) security — prevent mixed content warnings (http:// images on https:// pages); (2) privacy — the origin doesn’t see your users’ IP addresses; (3) reliability — cache the image so it persists even if origin deletes it. Apply same SSRF validation to proxied image URLs.
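
The Cache-Control point above can be sketched as a small pure function. This is an illustration under stated assumptions rather than a prescribed policy: `ttl_from_cache_control` is a hypothetical helper, and the one-hour floor and seven-day ceiling are assumed bounds chosen so that a hostile or misconfigured origin cannot force constant refetching or near-permanent caching.

```python
import re
from typing import Optional

DEFAULT_TTL_SEC = 24 * 3600      # the 24-hour default discussed above
MIN_TTL_SEC = 3600               # assumed floor
MAX_TTL_SEC = 7 * 24 * 3600      # assumed ceiling

def ttl_from_cache_control(header: Optional[str]) -> int:
    """Hypothetical helper: derive a preview TTL in seconds from Cache-Control."""
    if not header:
        return DEFAULT_TTL_SEC
    directives = [d.strip().lower() for d in header.split(",")]
    # The origin asked not to be cached; we still cache, but only for the floor.
    if "no-store" in directives or "no-cache" in directives:
        return MIN_TTL_SEC
    for directive in directives:
        match = re.fullmatch(r"max-age=(\d+)", directive)
        if match:
            # Clamp the origin's value into [MIN_TTL_SEC, MAX_TTL_SEC].
            return max(MIN_TTL_SEC, min(int(match.group(1)), MAX_TTL_SEC))
    return DEFAULT_TTL_SEC

print(ttl_from_cache_control("public, max-age=7200"))
```

A caller would pass `resp.headers.get("Cache-Control")` and use the result in place of a fixed TTL when computing `expires_at`.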

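The asynchronous fetch pattern can be illustrated with asyncio. This is a toy sketch: `fake_fetch_preview` stands in for the real fetch, the `updates` queue stands in for the WebSocket push, and all function names are hypothetical. A production system would more likely hand the job to a task queue than run it in the request process.

```python
import asyncio

async def fake_fetch_preview(url: str) -> dict:
    """Stand-in for the real fetch, which may take several seconds."""
    await asyncio.sleep(0.01)
    return {"url": url, "title": "Example", "status": "ready"}

async def send_message(text: str, url: str, updates: asyncio.Queue) -> dict:
    """Return the message at once; resolve the preview in a background task."""
    message = {"text": text, "preview": {"url": url, "status": "pending"}}

    async def resolve() -> None:
        try:
            preview = await fake_fetch_preview(url)
        except Exception as exc:
            preview = {"url": url, "status": "failed", "error": str(exc)}
        await updates.put(preview)  # in production: push over the WebSocket

    # Keep a reference so the task is not garbage-collected before it finishes.
    message["task"] = asyncio.create_task(resolve())
    return message

async def demo() -> tuple:
    updates: asyncio.Queue = asyncio.Queue()
    msg = await send_message("look at this", "https://example.com/post", updates)
    status_at_send = msg["preview"]["status"]   # what the sender sees immediately
    update = await updates.get()                # what arrives later
    return status_at_send, update["status"]

result = asyncio.run(demo())
print(result)
```

The sender sees status 'pending' as soon as `send_message` returns, and the 'ready' update arrives afterwards through the queue.
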
