Probability Distribution Functions: PMF, PDF, CDF, and the Common Distributions
Understanding probability distribution functions is foundational for quant trading interviews, ML / statistics interviews, and any role involving uncertainty modeling. The standard objects — PMF, PDF, CDF, expectation, variance — appear constantly. This guide covers the definitions, the relationships between them, and the standard distributions (normal, exponential, Poisson, geometric, binomial) that interviewers expect candidates to recognize and reason about.
Probability Mass Function (PMF) — Discrete
For a discrete random variable X taking values in a countable set, the PMF gives the probability of each outcome:
PMF: p(x) = P(X = x)
Properties:
- p(x) ≥ 0 for all x
- Σ p(x) = 1 (sum over all possible values)
Example: a fair die has PMF p(x) = 1/6 for x ∈ {1, 2, 3, 4, 5, 6}.
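As a quick illustration, here is one way to represent and validate that die PMF in code (a Python sketch; the dict representation is just for illustration):

from fractions import Fraction

# PMF of a fair die as a mapping from outcome to probability
die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}

assert all(p >= 0 for p in die_pmf.values())  # non-negativity
assert sum(die_pmf.values()) == 1             # total mass is 1
print(die_pmf[3])                             # P(X = 3) = 1/6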
Probability Density Function (PDF) — Continuous
For a continuous random variable, single points have probability 0; instead, the PDF gives a density:
PDF: f(x), where P(a ≤ X ≤ b) = ∫[a..b] f(x) dx
Properties:
- f(x) ≥ 0 for all x
- ∫[-∞..∞] f(x) dx = 1
- f(x) is NOT a probability — it can exceed 1
Example: Standard normal has PDF f(x) = (1/√(2π)) × exp(-x²/2). At x = 0, f(0) ≈ 0.399 — a density, not a probability.
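A small sketch (Python, illustrative) showing that PDF values are densities: the standard normal density at 0 is about 0.399, and a narrow uniform density happily exceeds 1:

import math

def std_normal_pdf(x):
    # f(x) = (1/sqrt(2π)) * exp(-x²/2)
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

print(std_normal_pdf(0))  # ≈ 0.3989, a density, not a probability

# Uniform on [0, 0.1] has f(x) = 1/(b - a) = 10 on its support,
# yet it still integrates to 10 × 0.1 = 1.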
Cumulative Distribution Function (CDF)
For both discrete and continuous random variables, the CDF gives the probability that X is at most some value:
CDF: F(x) = P(X ≤ x)
For discrete: F(x) = Σ p(y) for all y ≤ x.
For continuous: F(x) = ∫[-∞..x] f(t) dt.
Properties:
- F is non-decreasing
- F(-∞) = 0, F(∞) = 1
- F is right-continuous
The CDF is universal — works for any random variable, discrete or continuous, mixed or singular.
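Interval probabilities come from CDF differences: P(a < X ≤ b) = F(b) - F(a). A quick check with the exponential CDF, F(x) = 1 - e^(-λx) for x ≥ 0 (Python, illustrative):

import math

def exp_cdf(x, lam=1.0):
    # CDF of the exponential distribution with rate lam
    return 1 - math.exp(-lam * x) if x >= 0 else 0.0

# P(1 < X <= 2) = F(2) - F(1)
print(exp_cdf(2) - exp_cdf(1))  # ≈ 0.2325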
Expectation and Variance
Expectation (mean):
- Discrete: E[X] = Σ x × p(x)
- Continuous: E[X] = ∫ x × f(x) dx
Variance:
Var(X) = E[(X – E[X])²] = E[X²] – (E[X])²
Standard deviation: σ = √Var(X), in the same units as X.
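These definitions are easy to sanity-check on the fair die from earlier (Python, illustrative):

outcomes = range(1, 7)
mean = sum(x / 6 for x in outcomes)               # E[X] = 3.5
var = sum((x - mean) ** 2 / 6 for x in outcomes)  # Var(X) ≈ 2.9167

# The shortcut formula agrees: Var(X) = E[X²] - (E[X])²
ex2 = sum(x * x / 6 for x in outcomes)
assert abs(var - (ex2 - mean ** 2)) < 1e-12
print(mean, var)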
Common Distributions to Know
Bernoulli
Single trial with success probability p. PMF: p(1) = p, p(0) = 1-p. E[X] = p; Var(X) = p(1-p).
Use case: Coin flip, success/failure events.
Binomial
Sum of n independent Bernoulli trials. PMF: p(k) = C(n, k) × p^k × (1-p)^(n-k). E[X] = np; Var(X) = np(1-p).
Use case: Number of heads in n flips, conversion counts in A/B tests.
Geometric
Number of Bernoulli trials until first success. PMF: p(k) = (1-p)^(k-1) × p for k = 1, 2, …. E[X] = 1/p; Var(X) = (1-p)/p².
Use case: Number of trials until the first success (e.g., flips until the first head); the discrete memoryless waiting time.
Poisson
Number of events in fixed interval, rate λ. PMF: p(k) = λ^k × e^(-λ) / k!. E[X] = λ; Var(X) = λ.
Use case: Arrivals in queueing models, rare events, network packet counts.
Uniform (Continuous)
Equal density on [a, b]. PDF: f(x) = 1/(b-a) for a ≤ x ≤ b, else 0. E[X] = (a+b)/2; Var(X) = (b-a)²/12.
Use case: rand() in most languages, baseline for sampling.
Exponential
Time until first arrival in a Poisson process, rate λ. PDF: f(x) = λ × e^(-λx) for x ≥ 0. E[X] = 1/λ; Var(X) = 1/λ².
Use case: Time-until-event modeling, memoryless processes (network packet arrival, radioactive decay).
Normal (Gaussian)
Bell curve. PDF: f(x) = (1/(σ√(2π))) × exp(-(x-μ)²/(2σ²)). E[X] = μ; Var(X) = σ².
Use case: Most aggregate statistics (Central Limit Theorem), measurement errors, asset returns at long horizons.
Lognormal
X is lognormal if ln(X) is normal. Skewed right; common for asset prices, income distributions.
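Every mean/variance formula above can be checked empirically by sampling. A minimal Monte Carlo check for the exponential, using only the standard library (Python; the sample size is arbitrary):

import random
import statistics

lam = 2.0
samples = [random.expovariate(lam) for _ in range(100_000)]
print(statistics.mean(samples))      # ≈ 1/λ = 0.5
print(statistics.variance(samples))  # ≈ 1/λ² = 0.25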
Memorylessness Property
The geometric and exponential distributions are memoryless: P(X > s + t | X > s) = P(X > t). The waiting time doesn’t depend on how long you’ve already waited.
Memorylessness characterizes these two distributions: the exponential is the only memoryless continuous distribution, and the geometric is the only memoryless discrete one. In real-world processes, memorylessness rarely holds exactly, but the exponential is often a good approximation for short-horizon events.
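The memoryless identity is easy to verify numerically for the exponential, whose survival function is P(X > t) = e^(-λt) (Python; the values of s and t are illustrative):

import math

lam = 1.0

def surv(t):
    # Survival function: P(X > t) = e^(-λt)
    return math.exp(-lam * t)

s, t = 2.0, 1.5
print(surv(s + t) / surv(s))  # P(X > s + t | X > s) = e^(-1.5)
print(surv(t))                # P(X > t)             = e^(-1.5)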
Common Interview Problems
Computing expectation
“X is uniform on [0, 1]. What’s E[X²]?” Answer: ∫₀¹ x² dx = 1/3. Strong candidates do this without paper.
Variance under transformation
“If Var(X) = σ², what’s Var(aX + b)?” Answer: a²σ². Variance is invariant to constant shifts and scales by the square of the multiplier.
Sum of independent random variables
“X and Y are independent. Var(X + Y) = ?” Answer: Var(X) + Var(Y). Without independence, you’d need covariance terms.
Recognize the distribution
“You count events that occur independently at a constant average rate over a fixed interval. What distribution is the count?” Poisson, with λ equal to the expected count.
Compute a tail probability
“X is exponential with rate 1. What’s P(X > 2)?” Answer: e^(-2) ≈ 0.135. Use the survival function (1 – CDF).
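Both the E[X²] answer and the tail probability can be sanity-checked by simulation (Python; sample sizes are arbitrary):

import random

xs = [random.random() for _ in range(200_000)]
print(sum(x * x for x in xs) / len(xs))  # E[X²] for U(0,1) ≈ 1/3

ys = [random.expovariate(1.0) for _ in range(200_000)]
print(sum(y > 2 for y in ys) / len(ys))  # P(X > 2) ≈ e^(-2) ≈ 0.135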
Common Mistakes
- Treating PDF values as probabilities. f(x) is a density, not a probability. P(X = x) = 0 for any single point in a continuous distribution.
- Forgetting the area-under-curve = 1 constraint. A function isn’t a valid PDF unless it integrates to 1.
- Confusing variance and standard deviation. SD = √Var. SD is in the original units; variance is squared.
- Misapplying independence. Sum of variances applies only for independent random variables. With covariance, Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y).
- Mixing discrete and continuous. For a continuous distribution, including or excluding the endpoints of [a, b] doesn't change the probability, since single points carry zero mass; for a discrete distribution it can, so specify whether endpoints count. The CDF unifies them, but PMF/PDF behavior differs.
Frequently Asked Questions
What’s the practical difference between PMF and PDF?
PMF gives a probability at each outcome (discrete). PDF gives a density (continuous; integrate to get probability over an interval). The CDF unifies them. For interviews: know which type the problem implies, and use the right formulas.
Why is the normal distribution so important?
Central Limit Theorem: sums of many independent random variables converge to normal. This makes normal a good approximation for aggregate statistics across diverse domains. In quant finance specifically, normality is assumed in many models (Black-Scholes, mean-variance portfolio optimization) even when returns are leptokurtic in practice.
What does memorylessness mean and why does it matter?
Future behavior doesn’t depend on past history (given the current state). The exponential and geometric are the only distributions with this property (continuous and discrete, respectively). It matters in queueing theory and Markov chain analysis, where memorylessness simplifies the math considerably.
How do I quickly compute moments?
For known distributions, memorize the formulas (E[X] and Var(X)). For functions of random variables, use moment-generating functions (MGFs) or characteristic functions when the algebra gets ugly. Most interview problems can be solved with basic E[X] and Var(X) formulas plus linearity of expectation.
What books should I use for distribution prep?
Sheldon Ross’s Introduction to Probability Models is the classic text. For quant-flavored preparation, Zhou’s “green book” (A Practical Guide to Quantitative Finance Interviews) has many distribution-related problems. Stat 110 by Joe Blitzstein (Harvard, free online lectures) covers the same material at lecture pace. Most candidates over-prepare here; familiarity with the basic distributions and how to compute moments is sufficient.
See also: Expected Value • Conditional Probability • Random Walks and Stopping Times
💡 Strategies for Solving This Problem
Statistics and Sampling
Got this at Two Sigma in 2024. Tests understanding of probability distributions, sampling, and implementing statistical functions from scratch. Common in quantitative trading interviews.
The Problem
Implement a function that samples from a custom probability distribution given as discrete probabilities.
Example: Given array [0.1, 0.3, 0.4, 0.2], return index 0 with 10% probability, index 1 with 30%, index 2 with 40%, index 3 with 20%.
Approach 1: Linear Search
Generate a random number r in [0, 1). Walk through the array, accumulating probabilities until the running sum reaches r.
Algorithm (a runnable Python sketch of the idea):

import random

def sample_linear(probs):
    r = random.random()   # e.g. 0.65
    cum = 0.0             # running sum of probability mass
    for i, p in enumerate(probs):
        cum += p
        if cum >= r:
            return i
    return len(probs) - 1  # guard against floating-point undershoot

probs = [0.1, 0.3, 0.4, 0.2]
print(sample_linear(probs))
Example: r=0.65
- 0.1 < 0.65, continue
- 0.1 + 0.3 = 0.4 < 0.65, continue
- 0.4 + 0.4 = 0.8 >= 0.65, return index 2 ✓
Time: O(n) per sample
Approach 2: Binary Search with CDF (Optimal) ✓
Pre-compute cumulative distribution function (CDF), then use binary search.
CDF: [0.1, 0.4, 0.8, 1.0]
To sample:
- Generate r = random()
- Binary search CDF for first value >= r
- Return that index
Time: O(n) setup, O(log n) per sample
Much better when sampling multiple times from same distribution.
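A minimal sketch using the standard library's bisect module (Python; function names are illustrative):

import bisect
import random
from itertools import accumulate

def make_sampler(probs):
    cdf = list(accumulate(probs))  # e.g. [0.1, 0.4, 0.8, 1.0]
    def sample():
        r = random.random()
        i = bisect.bisect_right(cdf, r)  # first index with cdf[i] > r
        return min(i, len(cdf) - 1)      # guard against float rounding
    return sample

sampler = make_sampler([0.1, 0.3, 0.4, 0.2])
print(sampler())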
Approach 3: Alias Method (Advanced)
Pre-process into O(n) space structure that allows O(1) sampling. Complex but optimal for many samples.
Used in high-frequency systems where sampling speed is critical.
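A sketch of Vose's variant of the alias method (Python; illustrative, and it assumes the input already sums to 1):

import random

def build_alias(probs):
    # O(n) preprocessing: scale each probability by n, then pair every
    # under-full column with an "alias" that absorbs its leftover mass.
    n = len(probs)
    scaled = [p * n for p in probs]
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    prob, alias = [0.0] * n, [0] * n
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:  # leftovers are 1.0 up to rounding
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias):
    # O(1) per sample: uniform column, then a biased coin flip.
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

prob, alias = build_alias([0.1, 0.3, 0.4, 0.2])
print(alias_sample(prob, alias))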
Edge Cases
- Probabilities don't sum to 1: Normalize first (see the sketch after this list)
- Zero probabilities: these produce repeated CDF values; searching for the first value strictly greater than r skips them naturally
- Floating point errors: Use epsilon comparisons
- Empty array: Error or return null
- Single element: Always return 0
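A sketch of the normalization step from the first bullet (Python; raising on non-positive mass is one reasonable choice, not the only one):

def normalize(probs):
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must have positive total mass")
    return [p / total for p in probs]

print(normalize([1, 3, 4, 2]))  # [0.1, 0.3, 0.4, 0.2]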
At Two Sigma
I initially did linear search. The interviewer said "You'll sample millions of times. Can you do better?" I then did CDF with binary search. He asked about the space-time tradeoff and mentioned the alias method. We discussed when each approach is best.