What Is an Online Judge?
An online judge (LeetCode, HackerRank, Codeforces) accepts code submissions in multiple languages, executes them against test cases, and returns results: Accepted, Wrong Answer, Time Limit Exceeded, Memory Limit Exceeded, Runtime Error, or Compilation Error. The core challenges: safe execution of untrusted code (sandboxing), scalability under submission spikes, and accurate verdict computation.
Code Execution and Sandboxing
Untrusted user code can: fork bomb the system, read sensitive files, make network calls, or consume unbounded memory. Isolation layers: (1) Container isolation: run each submission in a Docker container with no network access, read-only filesystem, and resource limits (CPU: 1 core, memory: 256MB). (2) Seccomp (Secure Computing Mode): whitelist only the system calls needed for computation (read, write, exit) — block fork, exec, socket, open. (3) Namespace isolation: separate PID, network, and mount namespaces. (4) Time limit enforcement: use SIGALRM or cgroups CPU quota to kill the process after the time limit (e.g., 2 seconds). (5) Memory limit: cgroups memory.limit_in_bytes kills the process on OOM. Multiple layers provide defense in depth — even if the container is compromised, seccomp prevents dangerous syscalls.
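A minimal sketch of how the container layer above might be assembled as a `docker run` invocation. The image name, mount path, and entrypoint script are illustrative assumptions; the flags themselves are standard Docker CLI options.

```python
# Sketch: build a `docker run` argv that applies the container-layer
# protections described above: no network, read-only rootfs, CPU and
# memory caps, and a blunt fork-bomb guard via a pids limit.

def build_sandbox_cmd(image: str, workdir: str, time_limit_s: int) -> list:
    """Return a docker CLI argv that runs /sandbox/run.sh under limits."""
    return [
        "docker", "run", "--rm",
        "--network", "none",              # no network access
        "--read-only",                    # read-only root filesystem
        "--tmpfs", "/tmp:size=64m",       # writable scratch space only
        "--cpus", "1",                    # one CPU core
        "--memory", "256m",               # cgroup memory limit -> OOM kill
        "--pids-limit", "64",             # caps process count (fork bombs)
        "--security-opt", "no-new-privileges",
        "-v", f"{workdir}:/sandbox:ro",   # submission mounted read-only
        image,
        "timeout", str(time_limit_s), "/sandbox/run.sh",
    ]

cmd = build_sandbox_cmd("judge-python:3.12", "/srv/jobs/123", 2)
```

Seccomp and namespace setup would be applied inside the image or via a custom runtime; they are separate layers on top of these flags.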
Execution Pipeline
Submission flow: user submits code -> API server validates the submission (supported language, code length within limits) -> enqueue to a per-language message queue (Kafka or SQS) -> judge worker picks up the job -> worker spins up a Docker container -> compiles (if compiled language) -> runs against each test case -> collects results -> sends verdict back via a result queue -> API server stores result and updates the user’s submission history.
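The worker side of this flow can be sketched as a small loop. The in-process `queue.Queue` stands in for Kafka/SQS, and the injected `execute` callable stands in for the sandboxed runner; both names are illustrative.

```python
# Sketch of a judge worker: drain the job queue, run each test case in
# order, stop at the first failure, and push the verdict to a result queue.
import queue

def judge_worker(jobs: queue.Queue, results: queue.Queue, execute) -> None:
    """`execute(code, test)` returns (passed, verdict_detail)."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return                          # queue drained; worker idles
        verdict = "Accepted"
        for test in job["tests"]:
            passed, detail = execute(job["code"], test)
            if not passed:
                verdict = detail            # e.g. "Wrong Answer", "Time Limit Exceeded"
                break                       # stop at the first failing test
        results.put({"submission_id": job["submission_id"], "verdict": verdict})
```

Because the worker holds no state between jobs, any worker can process any submission, which is what makes horizontal scaling straightforward.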
Test cases: each problem has N test cases (typically 50-200). Run test cases in order and stop at the first failure, returning that verdict (or Accepted if all pass). For efficiency: run a few lightweight test cases first (fast feedback) and heavy test cases last. Test cases are stored in object storage (S3) — workers download them at job start. Cache popular problem test cases in worker local storage.
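The ordering and caching ideas above can be sketched as two small helpers. Using input size as the cost proxy, and the `download` callable as a stand-in for an S3 GET, are both assumptions for illustration.

```python
# Sketch: lightweight-first test ordering plus a worker-local cache that
# hits object storage only on a miss.
import os
import tempfile

def order_tests(tests):
    """Run small (assumed fast) test cases first for quick feedback."""
    return sorted(tests, key=lambda t: t["size_bytes"])

def fetch_tests(problem_id, download, cache_dir):
    """Return the local path of a problem's test bundle, calling the
    `download` stand-in (an object-storage GET) only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{problem_id}.bundle")
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(download(problem_id))
    return path
```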
Language Support
Interpreted languages (Python, JavaScript): compile step is skipped. Execute directly. Compiled languages (C++, Java, Go): compile first, report Compilation Error if it fails, then run the binary. Per-language containers: each language has a dedicated base Docker image with the compiler/runtime pre-installed (warm start). Container pooling: pre-warm N containers per language to avoid cold start overhead on each submission. Return containers to the pool after execution (reset the filesystem). Language-specific time limit adjustments: Python is 3x slower than C++ for the same algorithm — set per-language time limits (C++ 1s, Python 3s).
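A per-language configuration table makes these differences explicit. The C++ and Python time factors follow the text (C++ 1s, Python 3s); the image names, compile commands, and the Java factor are illustrative assumptions.

```python
# Sketch: per-language judge configuration with base image, optional
# compile step, and a time-limit multiplier relative to the C++ baseline.
LANGUAGES = {
    "cpp": {
        "image": "judge-cpp:13",
        "compile": ["g++", "-O2", "main.cpp", "-o", "main"],
        "time_factor": 1,
    },
    "java": {
        "image": "judge-java:21",
        "compile": ["javac", "Main.java"],
        "time_factor": 2,       # assumed; not stated in the text
    },
    "python": {
        "image": "judge-python:3.12",
        "compile": None,        # interpreted: skip the compile step
        "time_factor": 3,
    },
}

def time_limit_ms(language: str, base_ms: int = 1000) -> int:
    """Effective time limit for a problem whose C++ limit is base_ms."""
    return base_ms * LANGUAGES[language]["time_factor"]
```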
Scalability
Contest mode: thousands of simultaneous submissions (start of a contest). Scale judge workers horizontally: auto-scale the worker pool based on queue depth. Separate queues per language — prevents a Python submission spike from delaying C++ submissions. Priority queue: submissions for paid users or during contests get higher priority. Judge worker isolation: each worker can only run one submission at a time (CPU-bound) — over-scheduling degrades performance for all. Typical sizing: 1 worker core = 10 submissions/minute. For 1000 submissions/minute: 100 worker cores minimum. Use spot instances for judge workers (70% cheaper, acceptable eviction rate with job re-enqueue).
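The sizing rule above (1 worker core handles 10 submissions/minute) reduces to a one-line calculation; the `headroom` parameter is an added assumption to buy slack for spikes and spot-instance eviction.

```python
# Sketch: minimum judge worker cores for a target submission rate.
import math

def workers_needed(submissions_per_min: int,
                   per_core_per_min: int = 10,
                   headroom: float = 1.0) -> int:
    """Cores required; per_core_per_min follows the sizing rule above."""
    return math.ceil(submissions_per_min / per_core_per_min * headroom)

workers_needed(1000)                # -> 100, matching the text's minimum
```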
Result Delivery
Async results: submissions are processed asynchronously. The frontend polls or uses WebSocket to receive the verdict when ready. Client-side: show “Judging…” with progress updates. Server push: when the verdict is ready, push via WebSocket to the client’s browser. Store all submissions in a database: (submission_id, user_id, problem_id, language, code, verdict, runtime_ms, memory_mb, submitted_at). User can view their submission history and replay any submission.
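The submission record above maps directly onto a table. A minimal sketch using SQLite for illustration (production would use a server RDBMS); the index on (user_id, submitted_at) is an assumption to make the history view cheap.

```python
# Sketch: submission-history store matching the record described above.
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS submissions (
    submission_id TEXT PRIMARY KEY,
    user_id       TEXT NOT NULL,
    problem_id    TEXT NOT NULL,
    language      TEXT NOT NULL,
    code          TEXT NOT NULL,
    verdict       TEXT,            -- NULL while still judging
    runtime_ms    INTEGER,
    memory_mb     INTEGER,
    submitted_at  TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_user_history
    ON submissions (user_id, submitted_at);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.execute(
    "INSERT INTO submissions VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("s1", "u1", "p1", "python", "print(1)",
     "Accepted", 120, 14, "2024-01-01T00:00:00Z"),
)
```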
Interview Tips
- Sandboxing is the core design challenge. Mention at least two isolation layers (Docker + seccomp) — single-layer isolation is insufficient for truly untrusted code.
- The job queue + worker pool pattern is standard. Emphasize that workers are stateless (any worker can handle any submission) — this enables horizontal scaling.
- Test case management: test cases are the intellectual property of the platform. Store encrypted in S3; workers decrypt locally. Do not transmit test case outputs to clients (prevents reverse-engineering).
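Verdict comparison ties several of the points above together. A minimal sketch of Wrong Answer checking, assuming the common judge convention of ignoring trailing whitespace and trailing newlines, with a hook for a custom checker on problems that admit multiple valid outputs.

```python
# Sketch: output comparison for Wrong Answer checking. Trailing spaces on
# each line and trailing newlines are ignored before comparing.

def normalize(out: str) -> str:
    """Strip trailing whitespace per line and trailing blank lines."""
    return "\n".join(line.rstrip() for line in out.rstrip().split("\n"))

def check(expected: str, actual: str, checker=None) -> str:
    """Return a verdict string; `checker(expected, actual) -> bool` is an
    optional custom checker for multi-answer problems."""
    if checker is not None:
        return "Accepted" if checker(expected, actual) else "Wrong Answer"
    return "Accepted" if normalize(expected) == normalize(actual) else "Wrong Answer"
```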
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How do you sandbox untrusted code execution safely?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Multiple isolation layers are required because any single layer can be bypassed. Layer 1 — Container: run code inside a Docker container with no network access (--network none), a read-only root filesystem, and a tmpfs /tmp; CPU and memory are limited via cgroups. Layer 2 — Seccomp: whitelist only the system calls needed for computation: read, write, mmap, exit, brk, futex. Block fork (prevents fork bombs), socket (prevents network), execve (prevents spawning new processes), and open with write flags (prevents file modification). Layer 3 — Namespaces: separate PID namespace (the process cannot see host processes) and separate mount namespace (it cannot access the host filesystem). Layer 4 — User namespace: map the in-container user to an unprivileged host uid (e.g., 1000), so even a process that appears as root inside the namespace holds no host privileges. Each layer independently prevents different attack classes."
      }
    },
    {
      "@type": "Question",
      "name": "How do you implement time and memory limits for code execution?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Time limit: use setrlimit(RLIMIT_CPU) to cap CPU time — for a 2-second limit the kernel signals the process (SIGXCPU, then SIGKILL) once 2 CPU-seconds are consumed. Note that the cgroups CPU quota (cpu.cfs_quota_us) only throttles a process rather than killing it, so it bounds CPU share but still needs a watchdog. Alternatively: set a SIGALRM timer in the runner process (signal after 2 real seconds). Wall-clock limits are simpler but unfair if the machine is under load; CPU-time limits are more precise for fairness. Memory limit: cgroups memory.limit_in_bytes, set to the problem's memory limit (e.g., 256MB). When exceeded, the process receives SIGKILL (OOM killer); the runner detects the OOM kill and reports Memory Limit Exceeded. Stack size: setrlimit(RLIMIT_STACK) (typically 8MB by default, can be reduced). Combined: set both CPU and wall-clock limits (whichever fires first) to handle I/O-heavy programs that wait without consuming CPU but exceed the real-time expectation."
      }
    },
    {
      "@type": "Question",
      "name": "How do you handle test case management and prevent cheating?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Test cases are the platform's intellectual property. Storage: encrypted in S3 (AES-256). Workers decrypt using a key stored in AWS Secrets Manager — only judge workers have access, not the API servers. Transmission: test cases are never sent to clients; even if a client intercepts network traffic, they see only encrypted data. Output checking: instead of comparing raw output strings, use a custom checker for problems with multiple valid outputs (e.g., shortest-path problems where multiple paths have the same length). The checker is a program that takes (input, expected_output, user_output) and returns OK or WRONG. Anti-hack: do not include test cases in error messages. On Wrong Answer, show the first failing test case input (not the expected output) to help debugging without revealing answers. For premium users: show the expected output as a benefit."
      }
    },
    {
      "@type": "Question",
      "name": "How does an online judge handle concurrent submissions during a contest?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A contest with 10,000 participants submitting simultaneously creates significant load spikes. Architecture: submit to a durable queue such as Kafka, which absorbs the burst. Judge workers are auto-scaled based on queue depth (Kafka consumer lag, or the ApproximateNumberOfMessagesVisible CloudWatch metric if using SQS). Each worker processes one submission at a time (the work is CPU-bound, so per-worker concurrency adds no value). Target: process the backlog within 30 seconds for a good user experience. Separate queues by priority: contest submissions (high priority), regular practice (lower priority). Separate queues by language: Python queue, C++ queue, Java queue; this prevents a flood of slow Python submissions from delaying fast C++ submissions. Worker sizing: a C++ submission takes ~2 seconds and a Python submission ~5 seconds; plan worker count accordingly. Pre-warm workers before a contest: scale up 30 minutes early."
      }
    },
    {
      "@type": "Question",
      "name": "How do you compute verdicts correctly for edge cases?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Verdict accuracy: (1) Time Limit Exceeded vs Accepted: measure CPU time, not wall clock — a slow machine should not give TLE to a correct solution. Use getrusage() or /proc/<pid>/stat for CPU time measurement. (2) Memory Limit Exceeded: read peak RSS (Resident Set Size) from /proc/<pid>/status or cgroups memory.max_usage_in_bytes after execution. (3) Runtime Error: catch all non-zero exit codes and signal terminations (SIGSEGV, SIGABRT) and map them to Runtime Error. (4) Wrong Answer: compare output exactly (or via a checker). Strip trailing whitespace and newlines before comparison — many correct solutions differ only in trailing newlines. (5) Output Limit Exceeded: cap output at 4MB and kill the process if output exceeds the cap; this prevents a program from writing a terabyte to stdout. (6) Compilation Error: compile with a timeout (30 seconds); capture and return the compiler error message to the user."
      }
    }
  ]
}