System Design: Online Judge — Code Execution, Sandboxing, Test Cases, and Scalable Evaluation

What Is an Online Judge?

An online judge (LeetCode, HackerRank, Codeforces) accepts code submissions in multiple languages, executes them against test cases, and returns results: Accepted, Wrong Answer, Time Limit Exceeded, Memory Limit Exceeded, Runtime Error, or Compilation Error. The core challenges: safe execution of untrusted code (sandboxing), scalability under submission spikes, and accurate verdict computation.

Code Execution and Sandboxing

Untrusted user code can: fork bomb the system, read sensitive files, make network calls, or consume unbounded memory. Isolation layers: (1) Container isolation: run each submission in a Docker container with no network access, read-only filesystem, and resource limits (CPU: 1 core, memory: 256MB). (2) Seccomp (Secure Computing Mode): whitelist only the system calls needed for computation (read, write, exit) — block fork, exec, socket, open. (3) Namespace isolation: separate PID, network, and mount namespaces. (4) Time limit enforcement: use SIGALRM or cgroups CPU quota to kill the process after the time limit (e.g., 2 seconds). (5) Memory limit: cgroups memory.limit_in_bytes kills the process on OOM. Multiple layers provide defense in depth — even if the container is compromised, seccomp prevents dangerous syscalls.
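
Layers (4) and (5) can be approximated with plain POSIX resource limits even before adding containers and seccomp. A minimal sketch (Linux/POSIX only; the limits and command are illustrative, and a production judge would stack this inside a container):

```python
import resource
import subprocess
import sys

def run_with_limits(cmd, cpu_seconds=2, memory_bytes=256 * 1024 * 1024):
    """Run cmd with kernel-enforced CPU-time and address-space limits.

    RLIMIT_CPU terminates the child once it has consumed the given CPU
    seconds; RLIMIT_AS makes oversized allocations fail instead of
    exhausting host memory. This is one layer only, not a full sandbox.
    """
    def apply_limits():  # runs in the child after fork, before exec
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))

    # Wall-clock timeout as a backstop for programs that sleep instead of spin.
    return subprocess.run(cmd, preexec_fn=apply_limits,
                          capture_output=True, timeout=cpu_seconds + 5)
```

An infinite loop submitted to this runner is killed by the kernel once it exceeds its CPU allotment; the returned object's negative `returncode` records the terminating signal.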

Execution Pipeline

Submission flow: user submits code -> API server validates the request (language supported, code length within limits) -> enqueue to a message queue (Kafka or SQS) per language -> judge worker picks up the job -> worker spins up a Docker container -> compiles (if compiled language) -> runs against each test case -> collects results -> sends verdict back via a result queue -> API server stores result and updates the user's submission history.
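
The worker side of this pipeline can be sketched in-process, with `queue.Queue` standing in for Kafka/SQS and `run_one` as a hypothetical callable that compiles and executes the submission against a single test case:

```python
import queue

def judge_worker(jobs, results, run_one):
    """Single judge worker loop: pull a job, judge each test case in order,
    stop at the first failure, push the verdict to the result queue.

    run_one(language, code, test_case) -> per-test verdict string
    (hypothetical stand-in for the container compile+run step).
    """
    while True:
        job = jobs.get()
        if job is None:          # sentinel: shut the worker down
            break
        verdict = "Accepted"
        for case in job["test_cases"]:
            v = run_one(job["language"], job["code"], case)
            if v != "Accepted":  # first failing test case decides the verdict
                verdict = v
                break
        results.put({"submission_id": job["submission_id"], "verdict": verdict})
```

Because the worker holds no state between jobs, any worker can serve any submission, which is what makes horizontal scaling straightforward.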

Test cases: each problem has N test cases (typically 50-200). Run them in order and stop at the first failure; return Accepted only if all pass. For efficiency: run a few lightweight test cases first (fast feedback) and heavy test cases last. Test cases are stored in object storage (S3); workers download them at job start. Cache popular problems' test cases in worker local storage.

Language Support

Interpreted languages (Python, JavaScript): compile step is skipped. Execute directly. Compiled languages (C++, Java, Go): compile first, report Compilation Error if it fails, then run the binary. Per-language containers: each language has a dedicated base Docker image with the compiler/runtime pre-installed (warm start). Container pooling: pre-warm N containers per language to avoid cold start overhead on each submission. Return containers to the pool after execution (reset the filesystem). Language-specific time limit adjustments: Python is 3x slower than C++ for the same algorithm — set per-language time limits (C++ 1s, Python 3s).
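
This per-language configuration is naturally a small table. A sketch (commands and the 3x Python factor are illustrative, matching the text, not measured values):

```python
# compile_cmd of None marks an interpreted language: skip the compile stage.
LANGUAGES = {
    "cpp":    {"compile_cmd": ["g++", "-O2", "-o", "main", "main.cpp"],
               "run_cmd": ["./main"], "time_factor": 1},
    "java":   {"compile_cmd": ["javac", "Main.java"],
               "run_cmd": ["java", "Main"], "time_factor": 2},
    "python": {"compile_cmd": None,
               "run_cmd": ["python3", "main.py"], "time_factor": 3},
}

def time_limit_seconds(language, base_seconds):
    """Scale the problem's base (C++) limit by the language's slowdown factor."""
    return base_seconds * LANGUAGES[language]["time_factor"]

def needs_compile(language):
    return LANGUAGES[language]["compile_cmd"] is not None
```

With a base limit of 1 second, `time_limit_seconds("python", 1)` yields the 3-second Python limit from the text, and `needs_compile` tells the worker whether to run the compile stage at all.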

Scalability

Contest mode: thousands of simultaneous submissions (start of a contest). Scale judge workers horizontally: auto-scale the worker pool based on queue depth. Separate queues per language — prevents a Python submission spike from delaying C++ submissions. Priority queue: submissions for paid users or during contests get higher priority. Judge worker isolation: each worker can only run one submission at a time (CPU-bound) — over-scheduling degrades performance for all. Typical sizing: 1 worker core = 10 submissions/minute. For 1000 submissions/minute: 100 worker cores minimum. Use spot instances for judge workers (70% cheaper, acceptable eviction rate with job re-enqueue).
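
The sizing arithmetic above generalizes to a one-line capacity formula (a back-of-envelope sketch; the headroom factor is an assumption to absorb bursts and spot evictions):

```python
import math

def required_worker_cores(submissions_per_minute, avg_judge_seconds, headroom=1.5):
    """Each single-core worker judges one submission at a time, so its
    throughput is 60 / avg_judge_seconds submissions per minute."""
    per_core_per_minute = 60 / avg_judge_seconds
    return math.ceil(submissions_per_minute * headroom / per_core_per_minute)
```

At the article's figure of 10 submissions/minute per core (a 6-second average judge time), 1000 submissions/minute with no headroom gives the quoted 100-core minimum; with 1.5x headroom, 150 cores.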

Result Delivery

Async results: submissions are processed asynchronously. The frontend polls or uses WebSocket to receive the verdict when ready. Client-side: show “Judging…” with progress updates. Server push: when the verdict is ready, push via WebSocket to the client’s browser. Store all submissions in a database: (submission_id, user_id, problem_id, language, code, verdict, runtime_ms, memory_mb, submitted_at). User can view their submission history and replay any submission.
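
The stored submission row and the polling path can be sketched as follows (field names follow the schema in the text; the in-memory `store` dict stands in for the database):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Submission:
    """Stored submission row. verdict starts as Pending and is updated by
    the judge worker, so the polling endpoint always has something to serve."""
    submission_id: int
    user_id: int
    problem_id: int
    language: str
    code: str
    verdict: str = "Pending"            # Pending -> Judging -> final verdict
    runtime_ms: Optional[int] = None
    memory_mb: Optional[float] = None
    submitted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

def poll_verdict(store, submission_id):
    """What a polling client sees until the WebSocket push arrives."""
    s = store[submission_id]
    return {"verdict": s.verdict, "runtime_ms": s.runtime_ms}
```

A WebSocket push sends the same payload; polling is simply the fallback transport.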

Interview Tips

  • Sandboxing is the core design challenge. Mention at least two isolation layers (Docker + seccomp) — single-layer isolation is insufficient for truly untrusted code.
  • The job queue + worker pool pattern is standard. Emphasize that workers are stateless (any worker can handle any submission) — this enables horizontal scaling.
  • Test case management: test cases are the intellectual property of the platform. Store encrypted in S3; workers decrypt locally. Do not transmit test case outputs to clients (prevents reverse-engineering).
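
For problems with several equally valid answers, a custom checker program replaces exact output comparison. A toy sketch, assuming (hypothetically) that the first token of the output is the claimed path length:

```python
def check_shortest_path(input_data, expected_output, user_output):
    """Toy checker: accept any answer whose claimed path length matches the
    reference length. A real checker would also parse input_data and verify
    the submitted path itself is valid; this sketch compares lengths only.
    """
    try:
        return int(user_output.split()[0]) == int(expected_output.split()[0])
    except (ValueError, IndexError):
        return False
```

The judge invokes the checker in place of a string compare; its OK/WRONG result feeds directly into the Accepted / Wrong Answer verdict.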

Frequently Asked Questions

How do you sandbox untrusted code execution safely?

Multiple isolation layers are required because any single layer can be bypassed. (1) Container: run code inside a Docker container with no network access (--network none), a read-only root filesystem, and a tmpfs /tmp; limit CPU and memory via cgroups. (2) Seccomp: whitelist only the system calls needed for computation (read, write, mmap, brk, futex, exit); block fork (prevents fork bombs), socket (prevents network access), execve (prevents spawning new processes), and open with write flags (prevents file modification). (3) Namespaces: a separate PID namespace hides host processes; a separate mount namespace hides the host filesystem. (4) User namespace: run as an unprivileged user (uid 1000) so that a process which appears as root inside the namespace holds no host privileges. Each layer independently blocks a different class of attack.

How do you implement time and memory limits for code execution?

Time limit: set RLIMIT_CPU so the kernel terminates the process after the allotted CPU seconds, or have the runner monitor the cgroup's CPU accounting and kill on overrun. Alternatively, set a SIGALRM timer in the runner (fires after, say, 2 real seconds); a wall-clock limit is simpler but unfair if the machine is under load, so a CPU-time limit is the fairer basis. Memory limit: cgroups memory.limit_in_bytes, set to the problem's limit (e.g., 256MB); on overrun the OOM killer delivers SIGKILL, and the runner maps that exit to Memory Limit Exceeded. Stack size: setrlimit RLIMIT_STACK (typically an 8MB default, which can be reduced). Combine the CPU-time limit with a wall-clock limit (whichever fires first) so an I/O-heavy program that waits without consuming CPU still terminates within the real-time expectation.

How do you handle test case management and prevent cheating?

Test cases are the platform's intellectual property. Storage: encrypted in S3 (AES-256); workers decrypt using a key held in AWS Secrets Manager, to which only judge workers (not API servers) have access. Transmission: test cases are never sent to clients, so even intercepted traffic reveals only ciphertext. Output checking: instead of comparing raw output strings, use a custom checker for problems with multiple valid outputs (e.g., shortest-path problems where several paths share the minimum length); the checker is a program that takes (input, expected_output, user_output) and returns OK or WRONG. Anti-leak: never include test case data in error messages. On Wrong Answer, show the first failing test case's input (not the expected output) to aid debugging without revealing answers; showing the expected output can be a premium-tier benefit.

How does an online judge handle concurrent submissions during a contest?

A contest with 10,000 participants submitting simultaneously creates a sharp load spike. Architecture: submissions land in a durable queue (Kafka or SQS) that absorbs the burst. Judge workers auto-scale on queue depth (e.g., the SQS ApproximateNumberOfMessages metric, or Kafka consumer lag). Each worker processes one submission at a time; judging is CPU-bound, so per-worker concurrency adds no value. Target: clear the backlog within 30 seconds for a good user experience. Separate queues by priority (contest submissions high, regular practice lower) and by language (Python, C++, Java), so a flood of slow Python submissions cannot delay fast C++ submissions. Worker sizing: a C++ submission takes roughly 2 seconds and a Python submission roughly 5; plan worker count accordingly, and pre-warm workers by scaling up 30 minutes before the contest.

How do you compute verdicts correctly for edge cases?

(1) Time Limit Exceeded vs. Accepted: measure CPU time, not wall clock, so a loaded machine does not hand TLE to a correct solution; read CPU time via getrusage() or /proc/<pid>/stat. (2) Memory Limit Exceeded: read peak RSS (Resident Set Size) from /proc/<pid>/status or cgroups memory.max_usage_in_bytes after execution. (3) Runtime Error: catch all non-zero exit codes and signal terminations (SIGSEGV, SIGABRT) and map them to Runtime Error. (4) Wrong Answer: compare output exactly or via a checker, stripping trailing whitespace and newlines first, since many correct solutions emit a trailing newline. (5) Output Limit Exceeded: cap output at 4MB and kill the process beyond the cap, so a program cannot write a terabyte to stdout. (6) Compilation Error: compile under its own timeout (30 seconds) and return the compiler's error message to the user.
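
The verdict precedence and the whitespace-tolerant output comparison can be sketched together (a simplification: the mapping attributes SIGKILL to the cgroup OOM killer, which a real runner would confirm from cgroup events):

```python
import signal

def outputs_match(expected, actual):
    """Compare line by line, ignoring trailing whitespace on each line and
    trailing blank lines, so a correct solution is not failed over a final
    newline."""
    norm = lambda s: [line.rstrip() for line in s.rstrip().splitlines()]
    return norm(expected) == norm(actual)

def compute_verdict(cpu_ms, peak_mb, exit_code, term_signal,
                    expected, actual, time_limit_ms, memory_limit_mb):
    """Resource verdicts outrank Runtime Error, which outranks the output
    check. term_signal is the signal that killed the process, or None."""
    if cpu_ms > time_limit_ms:
        return "Time Limit Exceeded"
    if peak_mb > memory_limit_mb or term_signal == signal.SIGKILL:
        return "Memory Limit Exceeded"   # simplification: SIGKILL == OOM kill
    if term_signal is not None or exit_code != 0:
        return "Runtime Error"
    return "Accepted" if outputs_match(expected, actual) else "Wrong Answer"
```

The precedence order matters: a program can both segfault and exceed the time limit, and the resource verdict should win so users see the actionable failure first.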
