Sports Data Feed Low-Level Design: Event Ingestion, Normalization, and Multi-Sport Schema

What Is a Sports Data Feed?

A sports data feed aggregates real-time and historical event data from multiple league-official and third-party data providers, normalizes heterogeneous provider schemas into a unified canonical model, and delivers low-latency structured data to downstream consumers: apps, fantasy platforms, sportsbooks, and analytics systems. The key design challenges are provider diversity, schema normalization, real-time delivery, and maintaining consistency across providers that may report the same event differently.

Requirements

Functional Requirements

  • Ingest event streams from multiple providers via push webhooks and pull polling for each supported sport.
  • Normalize events from different provider schemas into a unified canonical event schema.
  • Detect and reconcile duplicate or conflicting events from multiple providers for the same game.
  • Deliver normalized events to downstream consumers via Kafka topics and REST snapshot APIs.
  • Support at least football, basketball, baseball, and soccer with sport-specific event taxonomies.

Non-Functional Requirements

  • Normalized event available to consumers within 3 seconds of provider delivery.
  • Support 500 concurrent live games across all sports.
  • Normalization pipeline throughput of 100,000 events per minute.
  • Schema registry versioned so consumers can pin to a schema version without breaking on provider changes.

Data Model

The CanonicalEvent is the central normalized record: event ID (UUID), game ID, sport type, event type (from a per-sport taxonomy enum), timestamp, period, game clock, acting entity (team or player), target entity, coordinates (for spatial sports), and a raw payload blob storing the original provider JSON for debugging and re-normalization. The GameRoster links provider-specific player and team IDs to canonical entity IDs, enabling cross-provider entity resolution. The ProviderMapping table stores per-provider, per-sport mapping rules: field paths, enum translations, and unit conversions, loaded into memory at startup to drive the normalization pipeline without runtime database calls.
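The record shapes described above can be sketched as dataclasses. Field names here are illustrative assumptions (the source describes the fields but not exact identifiers):

```python
from dataclasses import dataclass, field
from typing import Optional
from uuid import UUID, uuid4

@dataclass
class CanonicalEvent:
    # Hypothetical field names; types follow the description in the text.
    event_id: UUID
    game_id: str
    sport: str                        # e.g. "basketball"
    event_type: str                   # value from the per-sport taxonomy enum
    timestamp_ms: int                 # event time, UTC milliseconds
    period: int
    game_clock: str                   # e.g. "07:42" remaining in the period
    actor_id: Optional[UUID] = None   # acting team or player (canonical ID)
    target_id: Optional[UUID] = None  # target entity, if any
    coords: Optional[tuple] = None    # (x, y) for spatial sports
    raw_payload: dict = field(default_factory=dict)  # original provider JSON

@dataclass
class ProviderMappingRule:
    # One per (provider, sport); loaded into memory at startup.
    provider_id: str
    sport: str
    field_paths: dict        # canonical field name -> path in provider payload
    enum_translations: dict  # provider enum value -> canonical event type
```

Keeping the raw provider JSON on every canonical event makes re-normalization possible when a mapping bug is found: replay the raw payloads through the fixed rules.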

Core Algorithms

Multi-Provider Ingestion

Each provider adapter is a lightweight service that accepts the provider-specific transport (webhook HTTP POST, WebSocket stream, or polled REST endpoint) and publishes raw events to a provider-specific Kafka topic. This decouples ingestion from normalization: adapters can be deployed, restarted, or swapped without affecting downstream stages. Each raw event is tagged with provider ID, ingestion timestamp, and a content hash for deduplication tracking.
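The adapter-side tagging step can be sketched as follows; canonicalizing the JSON before hashing ensures the same logical payload always yields the same content hash regardless of key order (function name is an assumption):

```python
import hashlib
import json
import time

def tag_raw_event(provider_id: str, payload: dict) -> dict:
    """Wrap a raw provider payload with the metadata an adapter attaches
    before publishing to the provider-specific Kafka topic."""
    # Sort keys so semantically identical payloads hash identically.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return {
        "provider_id": provider_id,
        "ingested_at_ms": int(time.time() * 1000),
        "content_hash": hashlib.sha256(body.encode()).hexdigest(),
        "payload": payload,
    }
```

Downstream stages can then drop exact retransmissions by content hash before any sport-specific logic runs.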

Normalization Pipeline

A normalization service consumes all provider topics and applies a two-phase transformation. Phase one maps provider field names and values to canonical fields using the ProviderMapping rules loaded from the registry — a simple JSON Path extraction and enum lookup with no database I/O. Phase two resolves entity IDs: provider-specific player and team identifiers are looked up in an in-memory cache of the GameRoster (refreshed at game start and on substitution events), producing canonical entity UUIDs. The resulting CanonicalEvent is published to a unified sports.events.canonical Kafka topic.
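A minimal sketch of the two phases, using a dotted-path extractor as a stand-in for real JSON Path evaluation (the mapping and roster shapes here are illustrative assumptions):

```python
def extract_path(payload: dict, path: str):
    """Minimal dotted-path lookup standing in for a JSON Path library."""
    node = payload
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def normalize(raw: dict, mapping: dict, roster: dict) -> dict:
    # Phase 1: map provider field paths to canonical fields and translate
    # enum values -- pure in-memory lookups, no database I/O.
    event = {canon: extract_path(raw, path)
             for canon, path in mapping["field_paths"].items()}
    event["event_type"] = mapping["enum_translations"].get(event["event_type"])
    # Phase 2: resolve the provider-specific entity ID to a canonical ID
    # via the in-memory GameRoster cache.
    event["actor_id"] = roster.get((mapping["provider_id"], event["actor_id"]))
    return event
```

Because both phases touch only in-process state, per-event latency is bounded by CPU work, not I/O, which is what makes the 100,000-events-per-minute target realistic on a small number of instances.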

Conflict Reconciliation

When two providers report the same logical event (same game, same game clock, same event type), the deduplication layer computes a composite key from (game ID, period, game clock rounded to 1 second, event type, acting entity ID) and checks a Redis set of recently seen keys with a 30-second TTL. The first event through sets the canonical record; subsequent events within the window trigger a reconciliation check that merges any additional attributes (e.g., provider B supplies coordinates that provider A omitted) and logs the merge for QA auditing.
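The composite-key windowing and attribute merge can be sketched with an in-memory stand-in for the Redis set (class and field names are assumptions; a real deployment would use Redis `SET ... NX EX 30` semantics):

```python
import time

class DedupWindow:
    """In-memory stand-in for the Redis set of recently seen composite keys."""

    def __init__(self, ttl_s: float = 30.0):
        self.ttl_s = ttl_s
        self.seen = {}  # composite key -> (expiry, canonical record)

    @staticmethod
    def key(event: dict) -> tuple:
        # Game clock rounded to 1 second so minor provider clock skew
        # still collapses to the same logical event.
        return (event["game_id"], event["period"], round(event["clock_s"]),
                event["event_type"], event["actor_id"])

    def ingest(self, event: dict, now: float = None):
        now = time.monotonic() if now is None else now
        k = self.key(event)
        entry = self.seen.get(k)
        if entry and entry[0] > now:
            # Reconciliation: merge attributes the first provider omitted.
            canonical = entry[1]
            for name, value in event.items():
                if canonical.get(name) is None and value is not None:
                    canonical[name] = value
            return canonical, "merged"
        self.seen[k] = (now + self.ttl_s, dict(event))
        return self.seen[k][1], "new"
```

First-writer-wins keeps the canonical record stable for consumers, while the merge path only ever fills gaps, never overwrites, so a slower secondary provider cannot flip values the primary already set.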

API Design

Downstream consumers interact via two interfaces. The streaming interface is a Kafka topic (sports.events.canonical) partitioned by game ID, guaranteeing ordering per game and allowing consumer groups to scale independently. The REST snapshot interface provides GET /v1/games/{id}/events?sport={type}&since={timestamp} for consumers that need historical replay or REST-based integration. A GET /v1/schema/{sport}/{version} endpoint exposes the JSON Schema definition for each sport and schema version from the schema registry, enabling consumer-side validation and safe schema evolution with backward compatibility guarantees.
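The per-game ordering guarantee follows from keying records by game ID: Kafka's default partitioner hashes the record key (with murmur2) so every event for one game lands on one partition. A sketch of that key-to-partition mapping, using SHA-1 purely as an illustrative stand-in for the real hash:

```python
import hashlib

def partition_for(game_id: str, num_partitions: int) -> int:
    """Deterministic key -> partition mapping. Kafka's default partitioner
    uses murmur2 on the record key; sha1 here is just a stand-in to show
    the property that matters: same key, same partition, every time."""
    digest = hashlib.sha1(game_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Because ordering is only guaranteed within a partition, any repartitioning step downstream must also key by game ID or the per-game ordering is lost.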

Scalability and Infrastructure

The normalization pipeline runs as stateless Kafka Streams processors, one topology per sport, scaling horizontally by adding partitions and processor instances. The ProviderMapping registry is stored in a configuration database with change-data-capture: updates propagate to all processor instances via a dedicated Kafka compacted topic, so mapping rule changes take effect within seconds without rolling restarts. The GameRoster cache is warmed from a pre-game data load pushed 30 minutes before kickoff and kept current via substitution events during play. Canonical events are also written to a columnar store (Apache Iceberg on S3) in near-real-time for analytics and ML training workloads, with partitioning by sport, date, and game ID for efficient query pruning.
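The compacted-topic propagation of mapping rules can be sketched as a registry that keeps only the latest value per key, which is exactly the retention guarantee log compaction provides (class shape and record fields are assumptions):

```python
class MappingRegistry:
    """Simulates a processor instance consuming the compacted mapping topic:
    for each (provider, sport) key, only the latest rules version is kept."""

    def __init__(self):
        self.rules = {}  # (provider_id, sport) -> mapping rules

    def apply(self, record: dict) -> None:
        key = (record["provider_id"], record["sport"])
        if record.get("rules") is None:
            # A null value is a tombstone: the mapping has been retired.
            self.rules.pop(key, None)
        else:
            self.rules[key] = record["rules"]
```

A newly started processor instance replays the compacted topic from the beginning to rebuild this state, so no separate bootstrap query against the configuration database is needed.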

Frequently Asked Questions

How do you handle multi-provider ingestion in a sports feed system?

Each data provider gets its own adapter that speaks the provider's protocol (REST polling, WebSocket, SFTP batch) and translates raw payloads into a canonical internal event format. Adapters run as independent services so a single provider outage doesn't block others. A provider-priority config allows the system to prefer the most authoritative source for a given sport or league and fall back to secondary providers automatically.

What does an event normalization pipeline look like in a sports feed?

Raw provider events pass through a multi-stage pipeline: (1) schema validation against a provider-specific JSON Schema, (2) entity resolution that maps provider team/player IDs to internal canonical IDs using a reference data service, (3) field normalization (timestamps to UTC, score format standardization, sport-specific stat renaming), and (4) deduplication based on a composite key of (game_id, event_type, provider_event_id). Only clean, deduplicated events are forwarded downstream.

How do you define a unified sport schema across multiple sports?

A base schema captures universal concepts: game (participants, venue, scheduled start, status), event (timestamp, sequence, type), and score (value, period). Sport-specific schemas extend the base with discriminated union fields (e.g., American football adds down/distance, basketball adds shot clock). Protobuf or Avro with a schema registry enforces compatibility and allows consumers to ignore unknown fields when new sports are added.

How do you achieve low-latency delivery in a sports feed architecture?

Low latency comes from minimizing processing hops: co-locating ingestion adapters in the same region as provider endpoints, using in-memory queues (Kafka with low-latency producer settings: linger.ms=0, acks=1) rather than disk-backed batch pipelines, and pushing events to edge PoPs via a CDN-integrated WebSocket or SSE layer. End-to-end P99 latency from provider event to client delivery can reach under 200ms with this architecture.
