Reading List Service: Core Schema
The reading list service saves articles for later reading with content extraction and offline availability.
ReadingItem Table
CREATE TABLE reading_items (
  id                     BIGSERIAL PRIMARY KEY,
  user_id                BIGINT NOT NULL,
  url                    TEXT NOT NULL,
  title                  TEXT,
  author                 TEXT,
  word_count             INT,
  estimated_read_minutes INT,
  status                 TEXT DEFAULT 'unread',  -- unread/reading/completed/archived
  progress_pct           INT DEFAULT 0,          -- 0-100
  tags                   JSONB DEFAULT '[]',     -- used by search and recommendations below
  content_s3_key         TEXT,                   -- extracted article content in S3
  saved_at               TIMESTAMPTZ DEFAULT now(),
  read_at                TIMESTAMPTZ,
  UNIQUE (user_id, url)
);
Content Extraction Pipeline
An async worker extracts the readable article body from each saved URL, stripping ads, navigation, and boilerplate.
Worker flow:
1. Dequeue item_id from extraction queue
2. HTTP GET url
3. Run Mozilla Readability algorithm on HTML
→ extract article body, title, author, word count
4. Strip remaining ads/nav/scripts
5. Store cleaned HTML in S3 (content_s3_key = users/{uid}/items/{id}.html)
6. UPDATE reading_items SET content_s3_key=$1, word_count=$2,
   estimated_read_minutes=CEIL($2 / 238.0) WHERE id=$3
   -- Note: the word-count parameter is reused rather than referencing the
   -- word_count column, because Postgres evaluates SET expressions against
   -- the pre-update row values.
Word count drives estimated_read_minutes using an average adult reading speed of 238 words per minute.
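The read-time computation in step 6 can be sketched in worker-side Python (a minimal sketch; the whitespace-token word count is an assumption, not part of the spec):

```python
import math
import re

WORDS_PER_MINUTE = 238  # average adult reading speed, per the note above

def word_count(article_text: str) -> int:
    # Count whitespace-separated tokens in the extracted article body.
    return len(re.findall(r"\S+", article_text))

def estimated_read_minutes(words: int) -> int:
    # Mirrors CEIL(word_count / 238.0) from step 6 of the worker flow.
    return math.ceil(words / WORDS_PER_MINUTE)
```

A 2,380-word article comes out at exactly 10 minutes; anything between 1 and 238 words rounds up to 1.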
Offline Reading
Mobile clients download extracted content over WiFi for offline access. A local tracking table prevents redundant downloads.
ContentDownload Table (client-side)
CREATE TABLE content_downloads (
  item_id       BIGINT PRIMARY KEY,
  downloaded_at TIMESTAMPTZ,
  local_path    TEXT,
  size_bytes    INT
);
The client syncs the list of items with content_s3_key set, compares against local downloads, and fetches missing content only on WiFi by default.
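The sync diff described above reduces to a small pure function (a sketch; the function and parameter names are hypothetical, not part of the client API):

```python
def items_to_download(server_items, downloaded_ids, on_wifi, allow_cellular=False):
    """Return the item_ids the client still needs to fetch.

    server_items: (item_id, content_s3_key) pairs from the sync endpoint;
                  a None key means extraction has not finished yet.
    downloaded_ids: item_ids already recorded in the content_downloads table.
    """
    if not (on_wifi or allow_cellular):
        return []  # WiFi-only by default
    return [item_id for item_id, s3_key in server_items
            if s3_key is not None and item_id not in downloaded_ids]
```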
Progress Tracking
The reading client sends scroll position updates to the server periodically to persist reading progress across devices.
-- Client sends every 30s while reading:
PATCH /api/items/{id}/progress
Body: { "progress_pct": 42 }
-- Server (completion at >= 95% folded into one statement):
UPDATE reading_items
SET progress_pct = $1,
    status  = CASE WHEN $1 >= 95 THEN 'completed' ELSE 'reading' END,
    read_at = CASE WHEN $1 >= 95 THEN now() ELSE read_at END
WHERE id = $2 AND user_id = $3;
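The handler's decision logic is easy to unit-test as a pure function (a sketch; clamping to 0-100 is an assumption implied by the range comment on progress_pct):

```python
COMPLETION_THRESHOLD = 95  # per the completion rule above

def progress_update(progress_pct: int) -> dict:
    # Returns the fields the PATCH handler writes, mirroring the SQL above.
    pct = max(0, min(100, progress_pct))  # clamp to the documented 0-100 range
    if pct >= COMPLETION_THRESHOLD:
        return {"progress_pct": pct, "status": "completed", "set_read_at": True}
    return {"progress_pct": pct, "status": "reading", "set_read_at": False}
```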
Re-Surface and Daily Digest
Items that remain unread for more than 7 days are surfaced in a daily digest email to re-engage the user.
-- Nightly job query:
SELECT * FROM reading_items
WHERE user_id=$1 AND status='unread'
AND saved_at < now() - INTERVAL '7 days'
ORDER BY saved_at ASC
LIMIT 5;
The digest email lists up to 5 stale unread items with title, estimated read time, and a direct link to the reader view.
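Rendering the digest body from the query's rows is then a one-liner per item (a sketch; the line format and reader-link shape are assumptions):

```python
def digest_lines(stale_items, limit=5):
    # stale_items: (title, estimated_read_minutes, reader_url) tuples,
    # oldest first, as returned by the nightly query.
    return [f"{title} ({minutes} min read): {url}"
            for title, minutes, url in stale_items[:limit]]
```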
Tagging and Full-Text Search
Tags and full-text search follow the same pattern as the bookmark service: a JSONB tags column with a GIN index, plus a generated tsvector column over title and author for keyword search.
CREATE INDEX idx_reading_items_tags ON reading_items USING GIN (tags);
ALTER TABLE reading_items ADD COLUMN fts tsvector
GENERATED ALWAYS AS (
to_tsvector('english', coalesce(title,'') || ' ' || coalesce(author,''))
) STORED;
CREATE INDEX idx_reading_items_fts ON reading_items USING GIN (fts);
Reading Statistics
Aggregate stats are computed on read from the reading_items table. For high-traffic deployments, a nightly rollup into a stats table reduces query cost.
SELECT
COUNT(*) FILTER (WHERE status='completed') AS articles_read,
SUM(word_count) FILTER (WHERE status='completed') AS total_words_read,
AVG(estimated_read_minutes) FILTER (WHERE status='completed') AS avg_read_minutes
FROM reading_items WHERE user_id=$1;
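The nightly rollup can compute the same aggregates in application code before writing them to a stats table (a sketch; the row shape is an assumption):

```python
def reading_stats(rows):
    # rows: (status, word_count, estimated_read_minutes) tuples for one user.
    done = [r for r in rows if r[0] == "completed"]
    return {
        "articles_read": len(done),
        "total_words_read": sum(r[1] for r in done),
        "avg_read_minutes": sum(r[2] for r in done) / len(done) if done else None,
    }
```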
Smart Recommendations
After a user completes an article, the system extracts its tags and finds unread items with overlapping tags, ranked by tag overlap count.
-- Find unread items sharing tags with recently completed items:
SELECT ri.id, ri.title, COUNT(*) AS tag_overlap
FROM reading_items ri
CROSS JOIN LATERAL jsonb_array_elements_text(ri.tags) AS t(tag)
WHERE ri.user_id=$1 AND ri.status='unread'
  AND t.tag IN (SELECT jsonb_array_elements_text(tags)
FROM reading_items
WHERE user_id=$1 AND status='completed'
AND read_at > now() - INTERVAL '30 days')
GROUP BY ri.id, ri.title
ORDER BY tag_overlap DESC
LIMIT 10;
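The same ranking in application code, useful for unit-testing the behavior (a sketch; the set-based overlap assumes tag arrays hold no duplicate entries):

```python
def recommend(unread_items, completed_tags, limit=10):
    # unread_items: (item_id, title, tags) tuples; completed_tags: the set of
    # tags seen on articles the user completed in the last 30 days.
    scored = [(item_id, title, len(set(tags) & completed_tags))
              for item_id, title, tags in unread_items]
    ranked = sorted((s for s in scored if s[2] > 0), key=lambda s: -s[2])
    return ranked[:limit]
```

Items with zero overlap are dropped rather than ranked last, matching the inner-join semantics of the SQL above.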
Auto-Archive
Items completed more than 90 days ago are automatically archived to reduce inbox clutter. A nightly batch job performs the transition.
UPDATE reading_items
SET status='archived'
WHERE status='completed'
AND read_at < now() - INTERVAL '90 days';