Low Level Design: Sitemap Generator Service

Sitemap XML Format

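A sitemap is a urlset element containing one url entry per page. Each entry has a required loc (the page URL) and optional lastmod, changefreq, and priority fields.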
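As a minimal sketch of this format, the function below renders a list of entries into a urlset document using Python's standard xml.etree.ElementTree. The function name render_urlset and the tuple layout are assumptions made for illustration, not part of the original design.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def render_urlset(entries):
    """Render (loc, lastmod, changefreq, priority) tuples as sitemap XML bytes.

    render_urlset is a hypothetical helper name used only in this sketch.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, changefreq, priority in entries:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in text automatically.
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
        ET.SubElement(url, "changefreq").text = changefreq
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```

Building the tree with a library rather than string concatenation keeps URLs containing query strings correctly escaped.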

Sitemap Index

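A single sitemap file may contain at most 50,000 URLs and be at most 50MB uncompressed. For sites that exceed either limit, generate multiple sitemap files plus an index file that references them.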
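The index file can be sketched the same way as a regular sitemap: a sitemapindex root with one sitemap child per generated file. The helper name render_sitemap_index and the example URLs in the usage note are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def render_sitemap_index(file_urls, lastmod):
    """Render a sitemap index that references each sitemap file URL.

    render_sitemap_index is a hypothetical helper name; lastmod is applied
    to every file here for simplicity.
    """
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for file_url in file_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = file_url
        ET.SubElement(sitemap, "lastmod").text = lastmod
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)
```

For example, render_sitemap_index(["https://example.com/sitemap1.xml.gz", "https://example.com/sitemap2.xml.gz"], "2024-01-01") would produce an index with two sitemap entries.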

SitemapEntry Table

SitemapEntry (
  id,
  url,
  lastmod,
  changefreq: always/hourly/daily/weekly/monthly/yearly/never,
  priority DECIMAL 0.0-1.0,
  sitemap_file_id
)
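One way to mirror this table in application code is a small dataclass that validates changefreq and priority on construction. The field types, the string representation of lastmod, and the validation behaviour are assumptions for this sketch; the original only lists the column names.

```python
from dataclasses import dataclass
from typing import Optional

# The seven changefreq values listed in the table definition.
CHANGEFREQ_VALUES = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}


@dataclass
class SitemapEntry:
    id: int
    url: str
    lastmod: str  # assumed to be an ISO 8601 date string, e.g. "2024-01-01"
    changefreq: str
    priority: float
    sitemap_file_id: Optional[int] = None  # unset until assigned to a file

    def __post_init__(self):
        # Reject values outside the ranges the table definition allows.
        if self.changefreq not in CHANGEFREQ_VALUES:
            raise ValueError(f"invalid changefreq: {self.changefreq!r}")
        if not 0.0 <= self.priority <= 1.0:
            raise ValueError(f"priority out of range: {self.priority}")
```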

Generation Process

Query SitemapEntry ordered by priority DESC, batch the rows into files of at most 50,000 URLs each, render each batch as XML, gzip it, and upload it to S3 (sitemap1.xml.gz, sitemap2.xml.gz, and so on). Once all files are uploaded, generate sitemap-index.xml referencing them.
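The pipeline above can be sketched roughly as follows. The upload parameter is a stand-in for the S3 upload step (any callable taking a file name and bytes), and the row layout and function names are assumptions made for this example; rows are expected to arrive already sorted by priority DESC.

```python
import gzip
import xml.etree.ElementTree as ET
from itertools import islice

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000


def batched(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


def build_sitemap_files(rows, upload, max_urls=MAX_URLS_PER_FILE):
    """Batch rows, render each batch as gzipped XML, and hand it to upload.

    rows:   iterable of (url, lastmod, priority), pre-sorted by priority DESC.
    upload: hypothetical callable(name, data) standing in for the S3 upload.
    Returns the list of file names, which the index step can reference.
    """
    names = []
    for n, chunk in enumerate(batched(rows, max_urls), start=1):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url, lastmod, priority in chunk:
            node = ET.SubElement(urlset, "url")
            ET.SubElement(node, "loc").text = url
            ET.SubElement(node, "lastmod").text = lastmod
            ET.SubElement(node, "priority").text = f"{priority:.1f}"
        xml = ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
        name = f"sitemap{n}.xml.gz"
        upload(name, gzip.compress(xml))
        names.append(name)
    return names
```

Passing the upload step in as a callable keeps the batching logic testable without network access; in production it would wrap whatever S3 client the service uses.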

Incremental Update

On content publish or update, upsert the SitemapEntry row and regenerate only the sitemap file it belongs to (identified by sitemap_file_id) rather than rebuilding every file.
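A minimal sketch of the upsert step, using SQLite for illustration. The schema details (a UNIQUE constraint on url), the helper names, and the policy of placing new URLs in the first file with spare capacity are all assumptions; the original does not specify how new entries are assigned to files.

```python
import sqlite3


def create_schema(conn):
    # A UNIQUE constraint on url is assumed so the upsert can target it.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS SitemapEntry (
               id INTEGER PRIMARY KEY,
               url TEXT UNIQUE NOT NULL,
               lastmod TEXT,
               changefreq TEXT,
               priority REAL,
               sitemap_file_id INTEGER
           )"""
    )


def upsert_entry(conn, url, lastmod, changefreq, priority, max_urls=50_000):
    """Insert or update one entry; return the sitemap_file_id to regenerate."""
    row = conn.execute(
        "SELECT sitemap_file_id FROM SitemapEntry WHERE url = ?", (url,)
    ).fetchone()
    if row:
        file_id = row[0]  # existing URL: its file is the affected one
    else:
        # Assumed policy: reuse the first file that still has room.
        row = conn.execute(
            "SELECT sitemap_file_id FROM SitemapEntry "
            "GROUP BY sitemap_file_id HAVING COUNT(*) < ? "
            "ORDER BY sitemap_file_id LIMIT 1",
            (max_urls,),
        ).fetchone()
        if row:
            file_id = row[0]
        else:
            # Every file is full (or none exist yet): open the next one.
            file_id = conn.execute(
                "SELECT COALESCE(MAX(sitemap_file_id), 0) + 1 FROM SitemapEntry"
            ).fetchone()[0]
    conn.execute(
        "INSERT INTO SitemapEntry "
        "(url, lastmod, changefreq, priority, sitemap_file_id) "
        "VALUES (?, ?, ?, ?, ?) "
        "ON CONFLICT(url) DO UPDATE SET lastmod = excluded.lastmod, "
        "changefreq = excluded.changefreq, priority = excluded.priority",
        (url, lastmod, changefreq, priority, file_id),
    )
    conn.commit()
    return file_id
```

The caller would then regenerate only the file whose id is returned, leaving the other sitemap files untouched.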

Priority Rules

  • Homepage: 1.0
  • Category pages: 0.8
  • Content pages: 0.7
  • Tag pages: 0.5
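These rules map naturally onto a lookup table. In the sketch below the page-type keys and the function name priority_for are assumptions; unknown page types raise an error because the rules above do not define a default.

```python
# Priority per page type, taken from the rules listed above.
PRIORITY_BY_PAGE_TYPE = {
    "homepage": 1.0,
    "category": 0.8,
    "content": 0.7,
    "tag": 0.5,
}


def priority_for(page_type):
    """Return the sitemap priority for a page type (hypothetical helper)."""
    try:
        return PRIORITY_BY_PAGE_TYPE[page_type]
    except KeyError:
        # No default is specified in the rules, so fail loudly.
        raise ValueError(f"no priority rule for page type: {page_type!r}")
```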

Exclusions

Dynamic pages are excluded: search results, user-specific pages, and admin URLs.
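A simple way to enforce these exclusions is a path-prefix check applied before an entry is written. The prefixes /search, /account, and /admin below are hypothetical; a real site would substitute its own routes.

```python
from urllib.parse import urlsplit

# Hypothetical path prefixes for search, user-specific, and admin pages.
EXCLUDED_PREFIXES = ("/search", "/account", "/admin")


def is_excluded(url):
    """Return True if the URL path equals or sits under an excluded prefix."""
    path = urlsplit(url).path
    return any(
        path == prefix or path.startswith(prefix + "/")
        for prefix in EXCLUDED_PREFIXES
    )
```

Matching on prefix plus "/" avoids accidentally excluding unrelated paths such as /searchlight.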

robots.txt Integration

Automatically append a Sitemap: line containing the sitemap index URL to robots.txt so crawlers can discover it.
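Because generation runs repeatedly, the append step should be idempotent. The sketch below operates on the robots.txt contents as a string; the function name add_sitemap_line is an assumption, and reading or writing the actual file is left to the caller.

```python
def add_sitemap_line(robots_txt, sitemap_url):
    """Return robots.txt content with a Sitemap: line, added at most once."""
    line = f"Sitemap: {sitemap_url}"
    # Skip the append if an identical directive is already present.
    if any(existing.strip() == line for existing in robots_txt.splitlines()):
        return robots_txt
    # Make sure the new directive starts on its own line.
    separator = "" if robots_txt == "" or robots_txt.endswith("\n") else "\n"
    return f"{robots_txt}{separator}{line}\n"
```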

Search Engine Ping

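After generation, send a GET request to the Google and Bing ping URLs with the sitemap index URL as a parameter. Google announced the retirement of its sitemap ping endpoint in 2023, so confirm that each search engine still supports pings before relying on this step.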
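A ping is just a GET request with the sitemap URL passed as a query parameter. In the sketch below the endpoint is supplied by the caller, and the parameter name sitemap is an assumption; check each search engine's current documentation for the supported endpoint, if any.

```python
import urllib.request
from urllib.parse import urlencode


def build_ping_url(endpoint, sitemap_url):
    """Build the ping URL; `endpoint` is caller-supplied, not a known API."""
    # urlencode percent-encodes the sitemap URL so it is safe in a query string.
    return f"{endpoint}?{urlencode({'sitemap': sitemap_url})}"


def ping(endpoint, sitemap_url, timeout=10):
    """Send the GET request and return the HTTP status code."""
    with urllib.request.urlopen(
        build_ping_url(endpoint, sitemap_url), timeout=timeout
    ) as response:
        return response.status
```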

Schedule

Run a full regeneration weekly to catch any inconsistencies, plus an incremental update on every publish.

