Low Level Design: Customer Support Ticket System

A customer support ticket system coordinates incoming requests across channels, routes them to the right agents, enforces SLAs, and gives management visibility into the queue. The low-level design below covers each of these parts.

Ticket Schema

The central entity is the ticket:

tickets
-------
id                BIGINT PK
subject           VARCHAR(255)
description       TEXT
channel           ENUM('email','chat','web')
priority          ENUM('P1','P2','P3')
status            ENUM('new','open','pending','resolved','closed')
assignee_id       BIGINT FK → agents
requester_id      BIGINT FK → users
created_at        TIMESTAMP
first_response_at TIMESTAMP NULL
resolved_at       TIMESTAMP NULL
category          VARCHAR(100)

Every state transition is appended to a ticket_events table (ticket_id, event_type, actor_id, created_at, payload JSON) so the full audit trail is preserved without mutating the main row beyond status fields.

Routing Engine

When a ticket is created, the routing engine assigns it to an agent:

Skill-based routing: filter to agents whose skill tags include the ticket category.
Load balancing: among eligible agents, pick the one with the fewest open tickets.
Round-robin fallback: if several eligible agents tie for the lowest load, rotate assignment among them to avoid starvation.

Agent availability (online/offline/busy) is tracked in a Redis hash updated by the agent desktop application via heartbeat. The router only considers agents whose heartbeat is fresh (within 60 seconds).
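The routing steps and the heartbeat check above can be sketched in Python. This is a minimal in-memory sketch under stated assumptions, not a definitive implementation: the Agent class, its fields, and pick_agent are hypothetical names, a plain list stands in for the Redis hash, and the round-robin fallback is approximated by preferring the agent who was assigned least recently.

```python
import time
from dataclasses import dataclass

HEARTBEAT_TTL_SECONDS = 60  # freshness window from the text


@dataclass
class Agent:  # hypothetical in-memory stand-in for the agent record
    agent_id: int
    skills: set
    open_tickets: int
    last_heartbeat: float          # Unix timestamp of the latest heartbeat
    last_assigned_at: float = 0.0  # used only to break load ties


def pick_agent(agents, category, now=None):
    """Return the agent to assign, or None if nobody is eligible."""
    now = time.time() if now is None else now
    eligible = [
        a for a in agents
        if category in a.skills
        and now - a.last_heartbeat <= HEARTBEAT_TTL_SECONDS
    ]
    if not eligible:
        return None
    # Fewest open tickets first; on a tie, the least recently assigned agent.
    chosen = min(eligible, key=lambda a: (a.open_tickets, a.last_assigned_at))
    chosen.open_tickets += 1
    chosen.last_assigned_at = now
    return chosen
```

Returning None lets the caller decide what to do with an unroutable ticket, for example leaving it in an unassigned queue.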

SLA Rules and Breach Detection

Response time targets by priority:

P1 → first response within 1 hour
P2 → first response within 4 hours
P3 → first response within 24 hours

On ticket creation, a scheduled job entry is written to an sla_timers table: (ticket_id, due_at, type: first_response). A cron job runs every minute, queries for timers where due_at < NOW() and the corresponding ticket still has a NULL first_response_at, and marks those tickets as SLA-breached. A second timer tracks resolution SLA similarly.
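The timer write and the per-minute breach check could look roughly like the sketch below. The function names make_timer and find_breaches and the dict-based rows are assumptions for illustration; in practice the check would be a SQL query joining sla_timers to tickets.

```python
from datetime import datetime, timedelta

# First-response targets from the text, in hours.
FIRST_RESPONSE_HOURS = {"P1": 1, "P2": 4, "P3": 24}


def make_timer(ticket_id, priority, created_at):
    """Build the sla_timers row written when a ticket is created."""
    due_at = created_at + timedelta(hours=FIRST_RESPONSE_HOURS[priority])
    return {"ticket_id": ticket_id, "due_at": due_at, "type": "first_response"}


def find_breaches(timers, tickets, now):
    """Return ids of tickets whose first-response timer is overdue.

    timers: list of timer rows; tickets: dict of ticket_id -> ticket row.
    """
    breached = []
    for timer in timers:
        ticket = tickets[timer["ticket_id"]]
        if (timer["type"] == "first_response"
                and timer["due_at"] < now
                and ticket["first_response_at"] is None):
            breached.append(timer["ticket_id"])
    return breached
```

Passing now as a parameter rather than reading the clock inside the function keeps the check deterministic and easy to test.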

Escalation Rules

When the SLA breach job detects a breach, it fires an escalation workflow: the ticket priority is upgraded one level (P3 → P2, P2 → P1), the ticket is reassigned to a senior agent or team lead, and a notification is sent to the assigned agent and their manager. Escalation history is logged in ticket_events. P1 breaches trigger an immediate PagerDuty alert.
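As a rough illustration of the escalation workflow, the sketch below bumps the priority, reassigns the ticket, and returns the event and notifications to send. The escalate function, the payload keys, and the notification tuples are hypothetical; the "pager" entry is only a placeholder for whatever alerting integration is used, and the sketch assumes the notified agent is the new senior assignee.

```python
from datetime import datetime, timezone

# P1 is already the highest level, so it maps to itself.
NEXT_PRIORITY = {"P3": "P2", "P2": "P1", "P1": "P1"}


def escalate(ticket, senior_agent_id, manager_id, now=None):
    """Mutate the ticket in place and return (event, notifications)."""
    now = now or datetime.now(timezone.utc)
    old_priority = ticket["priority"]
    old_assignee = ticket["assignee_id"]
    ticket["priority"] = NEXT_PRIORITY[old_priority]
    ticket["assignee_id"] = senior_agent_id
    # Row to append to ticket_events so the escalation history is kept.
    event = {
        "ticket_id": ticket["id"],
        "event_type": "escalated",
        "created_at": now,
        "payload": {
            "from_priority": old_priority,
            "to_priority": ticket["priority"],
            "from_assignee": old_assignee,
            "to_assignee": senior_agent_id,
        },
    }
    notifications = [("agent", senior_agent_id), ("manager", manager_id)]
    if old_priority == "P1":
        # A breach on a P1 ticket also pages on-call (placeholder only).
        notifications.append(("pager", ticket["id"]))
    return event, notifications
```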

Agent Workflow

The ticket lifecycle from an agent perspective:

Claim: the agent opens the ticket and status moves from new to open. first_response_at is set when the agent posts the first reply.
Update: agent adds internal notes or public replies. Each reply is a row in ticket_comments.
Resolve: agent marks resolved. Status → resolved, resolved_at set.
Close: after a configurable hold period (e.g., 48h with no requester activity), status auto-advances to closed. Requester can reopen within that window.

Knowledge Base Integration

On ticket creation, the subject and first 200 characters of description are sent to a search service (Elasticsearch or a simple TF-IDF index over KB articles). The top 3 matching articles are returned and displayed to the requester in the ticket portal as suggested self-service answers, reducing unnecessary ticket volume. Agents see the same suggestions in their interface when composing replies.
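For the "simple TF-IDF index" option, a pure-Python sketch might look like the following. The function names, the tokenizer, and the smoothed IDF formula are illustrative choices, not part of the original design; a production system would more likely delegate this to the search service.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_articles(subject, description, articles, k=3):
    """Return up to k article ids, best match first.

    articles: dict of article_id -> article text.
    """
    # Query is the subject plus the first 200 characters of the description.
    query_terms = set(tokenize(subject + " " + description[:200]))
    docs = {aid: Counter(tokenize(text)) for aid, text in articles.items()}
    n = len(docs)
    # Document frequency: how many articles contain each term.
    df = Counter(term for counts in docs.values() for term in counts)
    scores = {}
    for aid, counts in docs.items():
        length = sum(counts.values()) or 1
        score = sum(
            (counts[t] / length) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t in query_terms if t in counts
        )
        if score > 0:
            scores[aid] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Articles that share no term with the query are dropped rather than padded into the top 3, so the portal can show fewer suggestions when nothing relevant exists.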

CSAT and Reporting

On ticket close, a CSAT survey is emailed to the requester with a one-click 1–5 rating link. The score is stored in csat_responses (ticket_id, score, comment, submitted_at).

The reporting dashboard aggregates: average CSAT by agent and team, median first response time vs. SLA target, median resolution time, ticket volume by category and channel, and SLA breach rate. Queries run against a read replica or a pre-aggregated reporting table refreshed hourly.
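The aggregates listed above can be computed with the standard library, as in the sketch below. It assumes each ticket row already carries a derived first_response_minutes value and an sla_breached flag; both field names, and the summarize function itself, are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean, median


def summarize(tickets, csat_responses):
    """Compute dashboard aggregates from ticket and CSAT rows (dicts)."""
    assignee = {t["id"]: t["assignee_id"] for t in tickets}
    scores_by_agent = defaultdict(list)
    for response in csat_responses:
        scores_by_agent[assignee[response["ticket_id"]]].append(response["score"])
    # Tickets with no first response yet are excluded from the median.
    response_minutes = [
        t["first_response_minutes"] for t in tickets
        if t["first_response_minutes"] is not None
    ]
    breached = sum(1 for t in tickets if t["sla_breached"])
    return {
        "avg_csat_by_agent": {a: mean(s) for a, s in scores_by_agent.items()},
        "median_first_response_minutes":
            median(response_minutes) if response_minutes else None,
        "sla_breach_rate": breached / len(tickets) if tickets else 0.0,
        "volume_by_category_channel":
            dict(Counter((t["category"], t["channel"]) for t in tickets)),
    }
```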

Frequently Asked Questions

Q: How do you model SLA tiers (P1, P2, P3) in a support ticket system?

A: Define an sla_policy table with columns (priority, first_response_minutes, resolution_minutes, business_hours_only). When a ticket is created, look up the matching policy and persist sla_breach_at = created_at + first_response_minutes on the ticket record, counting only working hours when business_hours_only is set. A background scheduler (e.g., a cron job every minute) queries for tickets where sla_breach_at < NOW() AND status NOT IN ('resolved', 'closed') and fires escalation events. The targets are set per policy: a strict P1 policy for critical outages typically targets a 15-minute response and 4-hour resolution, while P3 (low impact) may allow a 24-hour response and 5-day resolution.

Q: How do you implement skill-based routing for support tickets?

A: Maintain an agent_skills table (agent_id, skill_tag, proficiency_level) and tag each ticket with required skills extracted from category or keywords. On ticket creation, query for available agents (status = 'available', current_load < max_load) whose skill set is a superset of the ticket's required tags, then rank by proficiency and current queue depth. Use a weighted round-robin or least-connections algorithm as a tiebreaker. Publish the assignment event to a message queue so the chosen agent receives a real-time push notification.

Q: How do you design ticket escalation rules?

A: Model escalation as a finite set of rules: (priority, condition, action, target_tier). Conditions include SLA breach imminent (e.g., 80% of SLA window elapsed), no agent response after N minutes, or customer sentiment score below threshold. Store escalation history in a ticket_events table with timestamps. When a rule fires, update ticket.priority, reassign to a senior agent pool, and notify a manager via webhook. Cap escalation depth (e.g., max 3 levels) to avoid infinite loops.
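The "SLA breach imminent" condition and the depth cap can be expressed in a few lines. This sketch assumes each ticket row has created_at, due_at, and an escalation_level counter; those field names and the should_escalate function are hypothetical, and only one of the conditions named above is shown.

```python
from datetime import datetime

MAX_ESCALATION_DEPTH = 3   # cap from the text
IMMINENT_FRACTION = 0.8    # 80% of the SLA window elapsed


def sla_fraction_elapsed(created_at, due_at, now):
    """Fraction of the SLA window used so far (can exceed 1.0 after a breach)."""
    return (now - created_at) / (due_at - created_at)


def should_escalate(ticket, now):
    """True if the imminent-breach rule fires and the depth cap allows it."""
    if ticket["escalation_level"] >= MAX_ESCALATION_DEPTH:
        return False
    elapsed = sla_fraction_elapsed(ticket["created_at"], ticket["due_at"], now)
    return elapsed >= IMMINENT_FRACTION
```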

Q: How do you collect CSAT scores after ticket resolution?

A: When a ticket transitions to 'resolved', enqueue a delayed job (e.g., 2 hours later) that sends a CSAT survey email with a tokenised one-click rating link (1-5 stars). The token encodes (ticket_id, customer_id, expiry) and is signed with HMAC so it cannot be forged. The expiry bounds its lifetime, and accepting at most one response per ticket (for example, a unique constraint on csat_responses.ticket_id) prevents reuse. On click, persist the rating and optional comment to a csat_responses table. Aggregate scores per agent, team, and time window for reporting. Suppress the survey if the ticket was reopened before the delay fires.
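One way to build and verify such a token with Python's standard hmac module is sketched below. The token layout (base64 payload, a dot, base64 signature), the function names, and the SECRET_KEY value are assumptions for illustration; the key would come from configuration in a real deployment.

```python
import base64
import hashlib
import hmac
import time

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical; load from config


def make_token(ticket_id, customer_id, expires_at, key=SECRET_KEY):
    """Encode (ticket_id, customer_id, expiry) and append an HMAC-SHA256 signature."""
    payload = f"{ticket_id}.{customer_id}.{int(expires_at)}".encode()
    signature = hmac.new(key, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(signature).decode())


def verify_token(token, now=None, key=SECRET_KEY):
    """Return (ticket_id, customer_id) if the token is valid, else None."""
    now = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:
        return None  # malformed token
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None  # signature mismatch: forged or tampered
    ticket_id, customer_id, expires_at = (int(p) for p in payload.decode().split("."))
    if now > expires_at:
        return None  # expired
    return ticket_id, customer_id
```

hmac.compare_digest is used instead of == so that the comparison time does not leak how many leading bytes of the signature matched.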

Q: How do you model the ticket state machine?

A: Define six states: new, open, pending_customer, pending_internal, resolved, and closed. The main path runs new -> open -> resolved -> closed, with the two pending states used while the ticket waits on someone. Transitions are triggered by explicit events: agent_assigned (new->open), awaiting_customer_reply (open->pending_customer), customer_replied (pending_customer->open), resolution_posted (open->resolved), auto_close after 72h with no reply (resolved->closed). Enforce transitions server-side with a guard table listing valid (from_state, to_state) pairs; reject invalid transitions with an HTTP 422 error. Persist every state change to ticket_events for a full audit trail.
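The guard table can be sketched as a set of allowed pairs checked before every status change. The set below contains only the transitions named in the answer above; the pairs for pending_internal are not specified there, so they are left out rather than guessed. InvalidTransition and transition are hypothetical names, and an API layer would map the exception to the 422 response.

```python
from datetime import datetime, timezone

# Guard table: only these (from_state, to_state) pairs are allowed.
VALID_TRANSITIONS = {
    ("new", "open"),               # agent_assigned
    ("open", "pending_customer"),  # awaiting_customer_reply
    ("pending_customer", "open"),  # customer_replied
    ("open", "resolved"),          # resolution_posted
    ("resolved", "closed"),        # auto_close
}


class InvalidTransition(Exception):
    """Raised when a (from_state, to_state) pair is not in the guard table."""


def transition(ticket, to_state, actor_id, events):
    """Apply a status change and append an audit row, or raise InvalidTransition."""
    from_state = ticket["status"]
    if (from_state, to_state) not in VALID_TRANSITIONS:
        raise InvalidTransition(f"{from_state} -> {to_state} is not allowed")
    ticket["status"] = to_state
    # Stand-in for an INSERT into ticket_events.
    events.append({
        "ticket_id": ticket["id"],
        "event_type": "status_changed",
        "actor_id": actor_id,
        "created_at": datetime.now(timezone.utc),
        "payload": {"from": from_state, "to": to_state},
    })
```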