Reading List Service: Core Schema
The reading list service saves articles for later reading with content extraction and offline availability.
ReadingItem Table
CREATE TABLE reading_items (
  id                     BIGSERIAL PRIMARY KEY,
  user_id                BIGINT NOT NULL,
  url                    TEXT NOT NULL,
  title                  TEXT,
  author                 TEXT,
  word_count             INT,
  estimated_read_minutes INT,
  status                 TEXT DEFAULT 'unread',  -- unread/reading/completed/archived
  progress_pct           INT DEFAULT 0,          -- 0-100
  tags                   JSONB DEFAULT '[]',     -- used by search and recommendations below
  content_s3_key         TEXT,                   -- extracted article content in S3
  saved_at               TIMESTAMPTZ DEFAULT now(),
  read_at                TIMESTAMPTZ,
  UNIQUE (user_id, url)
);
Content Extraction Pipeline
An async worker extracts the readable article body from each saved URL, stripping ads, navigation, and boilerplate.
Worker flow:
1. Dequeue item_id from extraction queue
2. HTTP GET url
3. Run Mozilla Readability algorithm on HTML
→ extract article body, title, author, word count
4. Strip remaining ads/nav/scripts
5. Store cleaned HTML in S3 (content_s3_key = users/{uid}/items/{id}.html)
6. UPDATE reading_items SET content_s3_key=$1, word_count=$2,
   estimated_read_minutes=CEIL($2 / 238.0) WHERE id=$3
   -- Note: the word-count parameter is reused rather than referencing the
   -- word_count column, because Postgres evaluates SET expressions against
   -- the pre-update row values.
Word count drives estimated_read_minutes using an average adult reading speed of 238 words per minute.
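The read-time computation in step 6 can be sketched in worker-side Python (a minimal sketch; the whitespace-token word count is an assumption, not part of the spec):

```python
import math
import re

WORDS_PER_MINUTE = 238  # average adult reading speed, per the note above

def word_count(article_text: str) -> int:
    # Count whitespace-separated tokens in the extracted article body.
    return len(re.findall(r"\S+", article_text))

def estimated_read_minutes(words: int) -> int:
    # Mirrors CEIL(word_count / 238.0) from step 6 of the worker flow.
    return math.ceil(words / WORDS_PER_MINUTE)
```

A 2,380-word article comes out at exactly 10 minutes; anything between 1 and 238 words rounds up to 1.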
Offline Reading
Mobile clients download extracted content over WiFi for offline access. A local tracking table prevents redundant downloads.
ContentDownload Table (client-side)
CREATE TABLE content_downloads (
  item_id       BIGINT PRIMARY KEY,
  downloaded_at TIMESTAMPTZ,
  local_path    TEXT,
  size_bytes    INT
);
The client syncs the list of items with content_s3_key set, compares against local downloads, and fetches missing content only on WiFi by default.
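The sync diff described above reduces to a small pure function (a sketch; the function and parameter names are hypothetical, not part of the client API):

```python
def items_to_download(server_items, downloaded_ids, on_wifi, allow_cellular=False):
    """Return the item_ids the client still needs to fetch.

    server_items: (item_id, content_s3_key) pairs from the sync endpoint;
                  a None key means extraction has not finished yet.
    downloaded_ids: item_ids already recorded in the content_downloads table.
    """
    if not (on_wifi or allow_cellular):
        return []  # WiFi-only by default
    return [item_id for item_id, s3_key in server_items
            if s3_key is not None and item_id not in downloaded_ids]
```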
Progress Tracking
The reading client sends scroll position updates to the server periodically to persist reading progress across devices.
-- Client sends every 30s while reading:
PATCH /api/items/{id}/progress
Body: { "progress_pct": 42 }
-- Server (completion at >= 95% folded into one statement):
UPDATE reading_items
SET progress_pct = $1,
    status  = CASE WHEN $1 >= 95 THEN 'completed' ELSE 'reading' END,
    read_at = CASE WHEN $1 >= 95 THEN now() ELSE read_at END
WHERE id = $2 AND user_id = $3;
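The handler's decision logic is easy to unit-test as a pure function (a sketch; clamping to 0-100 is an assumption implied by the range comment on progress_pct):

```python
COMPLETION_THRESHOLD = 95  # per the completion rule above

def progress_update(progress_pct: int) -> dict:
    # Returns the fields the PATCH handler writes, mirroring the SQL above.
    pct = max(0, min(100, progress_pct))  # clamp to the documented 0-100 range
    if pct >= COMPLETION_THRESHOLD:
        return {"progress_pct": pct, "status": "completed", "set_read_at": True}
    return {"progress_pct": pct, "status": "reading", "set_read_at": False}
```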
Re-Surface and Daily Digest
Items that remain unread for more than 7 days are surfaced in a daily digest email to re-engage the user.
-- Nightly job query:
SELECT * FROM reading_items
WHERE user_id=$1 AND status='unread'
AND saved_at < now() - INTERVAL '7 days'
ORDER BY saved_at ASC
LIMIT 5;
The digest email lists up to 5 stale unread items with title, estimated read time, and a direct link to the reader view.
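Rendering the digest body from the query's rows is then a one-liner per item (a sketch; the line format and reader-link shape are assumptions):

```python
def digest_lines(stale_items, limit=5):
    # stale_items: (title, estimated_read_minutes, reader_url) tuples,
    # oldest first, as returned by the nightly query.
    return [f"{title} ({minutes} min read): {url}"
            for title, minutes, url in stale_items[:limit]]
```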
Tagging and Full-Text Search
Tags and full-text search follow the same pattern as the bookmark service: a JSONB tags column with a GIN index, plus a generated tsvector column over title and author for keyword search.
CREATE INDEX idx_reading_items_tags ON reading_items USING GIN (tags);
ALTER TABLE reading_items ADD COLUMN fts tsvector
GENERATED ALWAYS AS (
to_tsvector('english', coalesce(title,'') || ' ' || coalesce(author,''))
) STORED;
CREATE INDEX idx_reading_items_fts ON reading_items USING GIN (fts);
Reading Statistics
Aggregate stats are computed on read from the reading_items table. For high-traffic deployments, a nightly rollup into a stats table reduces query cost.
SELECT
COUNT(*) FILTER (WHERE status='completed') AS articles_read,
SUM(word_count) FILTER (WHERE status='completed') AS total_words_read,
AVG(estimated_read_minutes) FILTER (WHERE status='completed') AS avg_read_minutes
FROM reading_items WHERE user_id=$1;
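The nightly rollup can compute the same aggregates in application code before writing them to a stats table (a sketch; the row shape is an assumption):

```python
def reading_stats(rows):
    # rows: (status, word_count, estimated_read_minutes) tuples for one user.
    done = [r for r in rows if r[0] == "completed"]
    return {
        "articles_read": len(done),
        "total_words_read": sum(r[1] for r in done),
        "avg_read_minutes": sum(r[2] for r in done) / len(done) if done else None,
    }
```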
Smart Recommendations
After a user completes an article, the system extracts its tags and finds unread items with overlapping tags, ranked by tag overlap count.
-- Find unread items sharing tags with recently completed items:
SELECT ri.id, ri.title, COUNT(*) AS tag_overlap
FROM reading_items ri
CROSS JOIN LATERAL jsonb_array_elements_text(ri.tags) AS t(tag)
WHERE ri.user_id=$1 AND ri.status='unread'
  AND t.tag IN (SELECT jsonb_array_elements_text(tags)
FROM reading_items
WHERE user_id=$1 AND status='completed'
AND read_at > now() - INTERVAL '30 days')
GROUP BY ri.id, ri.title
ORDER BY tag_overlap DESC
LIMIT 10;
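The same ranking in application code, useful for unit-testing the behavior (a sketch; the set-based overlap assumes tag arrays hold no duplicate entries):

```python
def recommend(unread_items, completed_tags, limit=10):
    # unread_items: (item_id, title, tags) tuples; completed_tags: the set of
    # tags seen on articles the user completed in the last 30 days.
    scored = [(item_id, title, len(set(tags) & completed_tags))
              for item_id, title, tags in unread_items]
    ranked = sorted((s for s in scored if s[2] > 0), key=lambda s: -s[2])
    return ranked[:limit]
```

Items with zero overlap are dropped rather than ranked last, matching the inner-join semantics of the SQL above.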
Auto-Archive
Items completed more than 90 days ago are automatically archived to reduce inbox clutter. A nightly batch job performs the transition.
UPDATE reading_items
SET status='archived'
WHERE status='completed'
AND read_at < now() - INTERVAL '90 days';