
In the previous chapter, we defined discrete random variables and learned how to describe their behavior using Probability Mass Functions (PMFs), Cumulative Distribution Functions (CDFs), expected value, and variance. While we can define custom PMFs for any situation, several specific discrete distributions appear so frequently in practice that they have been studied extensively and given names.

These “common” distributions serve as powerful models for a wide variety of real-world processes. Understanding their properties and when to apply them is crucial for probabilistic modeling. In this chapter, we will explore nine fundamental discrete distributions: Bernoulli, Binomial, Geometric, Negative Binomial, Poisson, Hypergeometric, Discrete Uniform, Categorical, and Multinomial.

We’ll examine the scenarios each distribution models, their key characteristics (PMF, mean, variance), and how to work with them efficiently using Python’s scipy.stats library. This library provides tools to calculate probabilities (PMF, CDF), generate random samples, and more, significantly simplifying our practical work.

1. Bernoulli Distribution

The Bernoulli distribution models a single trial with two possible outcomes: “success” (1) or “failure” (0).

Concrete Example

Suppose you’re conducting a medical screening test for a disease in a high-risk population. Each test either shows positive or negative. From epidemiological data, you know that 30% of individuals in this population test positive.

We model this with a random variable X: X = 1 if the test is positive ("success") and X = 0 if it is negative ("failure").

The probabilities are P(X = 1) = 0.3 and P(X = 0) = 0.7.

The Bernoulli PMF

For any Bernoulli random variable with success probability p, the PMF is:

P(X=k) = \begin{cases} p & \text{if } k=1 \\ 1-p & \text{if } k=0 \\ 0 & \text{otherwise} \end{cases}

This can also be written compactly as:

P(X = k) = p^k (1-p)^{1-k} \text{ for } k \in \{0, 1\}

Expanding this for both cases to make it crystal clear:

When k = 1 (success):

\begin{align} P(X=1) &= p^1 (1-p)^{1-1} \\ &= p^1 (1-p)^0 \\ &= p \times 1 \\ &= p \end{align}

When k = 0 (failure):

\begin{align} P(X=0) &= p^0 (1-p)^{1-0} \\ &= p^0 (1-p)^1 \\ &= 1 \times (1-p) \\ &= 1-p \end{align}

Let’s verify this works for our example where p = 0.3:
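
A quick way to check these numbers is with scipy.stats.bernoulli. This is a minimal sketch (the variable names are ours, not from the chapter's earlier code):

from scipy import stats

bernoulli_rv = stats.bernoulli(p=0.3)
print(f"P(X=1) = {bernoulli_rv.pmf(1):.2f}")  # 0.30 (positive test)
print(f"P(X=0) = {bernoulli_rv.pmf(0):.2f}")  # 0.70 (negative test)
print(f"Mean = {bernoulli_rv.mean():.2f}, Variance = {bernoulli_rv.var():.2f}")  # 0.30, 0.21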

Key Characteristics

Mean: E[X] = p

Variance: Var(X) = p(1-p)

Standard Deviation: SD(X) = \sqrt{p(1-p)}

Visualizing the Distribution

Let’s visualize a Bernoulli distribution with p = 0.3 (our medical test example from above):


The PMF shows two bars: P(X=0) = 0.7 for a negative test and P(X=1) = 0.3 for a positive test. The red dashed line marks the mean (p = 0.3), and the orange shaded region shows mean ± 1 standard deviation.


The CDF is a step function: it starts at 0 for x < 0, jumps to 0.7 at x = 0 (the probability of outcome 0), stays flat at 0.7 between 0 and 1, then jumps to 1.0 at x = 1 (the cumulative probability of outcomes 0 and 1). The red dashed line marks the mean.

Note: Here, P(X ≤ 0) = P(X = 0) = 0.7 because X can’t take negative values; in general, “X ≤ 0” means “at or below 0”, not “exactly 0”.

Reading the PMF

Reading the CDF

Note on CDF visualization: The charts use where='post' in the step plot to create proper right-continuous step functions. This means the CDF jumps up at each value and includes that value in the cumulative probability.

Quick Check Questions

  1. A quality control inspector checks a single product. It’s either defective or not defective. Is this scenario well-modeled by a Bernoulli distribution? Why or why not?

  2. For a Bernoulli distribution with p = 0.3, what is P(X = 0)?

  3. A basketball player has a 75% free throw success rate. If we model a single free throw as a Bernoulli trial, what are the mean and variance?

  4. You roll a six-sided die once. Is this well-modeled by a Bernoulli distribution?

  5. True or False: A Bernoulli random variable can only take on the values 0 and 1.

2. Binomial Distribution

The Binomial distribution models the number of successes in a fixed number of independent Bernoulli trials, where each trial has the same probability of success.

Concrete Example

Suppose you flip a fair coin 10 times. Each flip is a Bernoulli trial with p = 0.5 (probability of heads). How many heads will you get?

We model this with a random variable X, the number of heads in the 10 flips. X can take the values 0, 1, ..., 10, and each flip is an independent Bernoulli trial with p = 0.5.

The Binomial PMF

For n independent trials with success probability p:

P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}

for k = 0, 1, \dots, n

where \binom{n}{k} = \frac{n!}{k!(n-k)!} is the binomial coefficient (number of ways to choose k successes from n trials).

Let’s verify this works for our coin flip example (n=10, p=0.5):

\begin{align} P(X=5) &= \binom{10}{5} p^5 (1-p)^{10-5} \\ &= \binom{10}{5} (0.5)^5 (1-0.5)^5 \\ &= \binom{10}{5} (0.5)^5 (0.5)^5 \\ &= 252 \times 0.03125 \times 0.03125 \\ &\approx 0.246 \quad \checkmark \end{align}
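
We can confirm the same value with scipy.stats.binom (a quick sketch; variable names are ours):

from scipy import stats

binom_rv = stats.binom(n=10, p=0.5)
print(f"P(X=5) = {binom_rv.pmf(5):.4f}")  # ≈ 0.2461
print(f"Mean = {binom_rv.mean():.1f}, Variance = {binom_rv.var():.2f}")  # 5.0, 2.50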

Key Characteristics

Mean: E[X] = np

Variance: Var(X) = np(1-p)

Standard Deviation: SD(X) = \sqrt{np(1-p)}

Visualizing the Distribution

Let’s visualize a Binomial distribution with n = 10 and p = 0.5 (our coin flip example):


The PMF shows the probability distribution for the number of heads in 10 coin flips. The distribution is symmetric around the mean (np = 5) since p = 0.5. The shaded region shows mean ± 1 standard deviation (\sqrt{np(1-p)} = \sqrt{2.5} \approx 1.58).


The CDF shows P(X ≤ k), the cumulative probability of getting k or fewer heads. The red dashed line marks the mean.

Quick Check Questions

  1. You roll a die 12 times and count how many times you get a 6. Is this a good fit for the Binomial distribution? Why or why not?

  2. For a Binomial distribution with n = 8 and p = 0.25, what is the expected value (mean)?

  3. A basketball player has a 70% free throw success rate. You watch her take 15 free throws. Does this scenario fit the Binomial distribution assumptions?

  4. For a Binomial(n=20, p=0.3) distribution, what is the variance?

  5. True or False: In a Binomial distribution, each trial must have the same probability of success.

3. Geometric Distribution

The Geometric distribution models the number of independent Bernoulli trials needed to get the first success.

Concrete Example

You’re shooting free throws until you make your first basket. Each shot has a 0.4 probability of success. How many shots will it take to make your first basket?

We model this with a random variable X, the number of shots needed to make the first basket. X can take the values 1, 2, 3, ..., and each shot is an independent Bernoulli trial with p = 0.4.

The Geometric PMF

For trials with success probability p:

P(X=k) = (1-p)^{k-1} p

for k = 1, 2, 3, \dots

This means k-1 failures followed by one success.

Let’s verify for our example (p=0.4):
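
A quick check with scipy.stats.geom (a sketch; scipy's geom uses the same "number of trials until the first success" convention as the PMF above, with support starting at k = 1):

from scipy import stats

geom_rv = stats.geom(p=0.4)
for k in [1, 2, 3, 5]:
    # (1-p)^(k-1) * p: k-1 misses followed by the first make
    print(f"P(X={k}) = {geom_rv.pmf(k):.4f}")
# P(X=1) = 0.4000, P(X=2) = 0.2400, P(X=3) = 0.1440, P(X=5) = 0.0518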

Visual example: Here’s how the geometric distribution works with p = 0.4 (our free throw example):


The diagram shows how the geometric distribution works: each additional failure before success makes the outcome less likely. The probability decreases exponentially - notice how P(X=1) = 0.4000 is much larger than P(X=5) = 0.0518.

Key Characteristics

Mean: E[X] = \frac{1}{p}

Variance: Var(X) = \frac{1-p}{p^2}

Standard Deviation: SD(X) = \frac{\sqrt{1-p}}{p}

Relationship to Other Distributions: The Geometric distribution is built from independent Bernoulli trials and is a special case of the Negative Binomial distribution with r = 1 (waiting for just one success instead of r successes).

Visualizing the Distribution

Let’s visualize a Geometric distribution with p = 0.4 (our free throw example):


The PMF shows exponentially decreasing probabilities - you’re most likely to succeed on the first few trials. The shaded region shows mean ± 1 standard deviation.


The CDF shows P(X ≤ k), approaching 1 as k increases (eventually you’ll succeed). The red dashed line marks the mean.

Quick Check Questions

  1. You flip a coin until you get your first Heads. What distribution models this and what is the parameter?

  2. For a Geometric distribution with p = 0.25, what is the expected value (mean)?

  3. You’re calling customer service and have a 20% chance each attempt of getting through. Should you model this with Geometric or Binomial?

  4. Which is more likely for a Geometric distribution with p = 0.5: success on the 1st trial or success on the 3rd trial?

  5. For a Geometric distribution, why does the variance equal (1-p)/p²?

4. Negative Binomial Distribution

The Negative Binomial distribution models the number of independent Bernoulli trials needed to achieve a fixed number of successes (r). It generalizes the Geometric distribution (where r = 1).

Concrete Example

You’re rolling a die until you get 3 sixes. Each roll has p = 1/6 probability of rolling a six. How many rolls will it take to get your 3rd six?

We model this with a random variable X, the number of rolls needed to get the 3rd six. X can take the values 3, 4, 5, ..., and each roll is an independent Bernoulli trial with p = 1/6.

The Negative Binomial PMF

For trials with success probability p and target r successes:

P(X=k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}

for k = r, r+1, r+2, \dots

Understanding the formula: This means r-1 successes in the first k-1 trials, and the k-th trial is the r-th success.

Visual breakdown: The following diagram shows how the negative binomial formula counts all possible sequences and combines their probabilities:


The diagram shows how the negative binomial formula works: we need exactly r-1 successes in the first k-1 trials (which can be arranged in \binom{k-1}{r-1} ways), and then the k-th trial must be a success. Each of the 6 sequences shown has the same probability, and we multiply by the number of sequences to get the total probability.

Now that we understand the formula and its visualization, let’s summarize the essential properties of the negative binomial distribution:

Key Characteristics

Mean: E[X] = \frac{r}{p}

Variance: Var(X) = \frac{r(1-p)}{p^2}

Standard Deviation: SD(X) = \frac{\sqrt{r(1-p)}}{p}

Visualizing the Distribution

Let’s visualize our die example: Negative Binomial distribution with r = 3 sixes and p = 1/6:


The PMF shows the distribution is centered around the expected value r/p = 3/(1/6) = 18 trials. You can see our calculated P(X=4) ≈ 0.0116 as a small bar near the left tail at k=4. The shaded region shows mean ± 1 standard deviation.


The CDF shows P(X ≤ k), the cumulative probability of getting 3 sixes within k rolls. At k=4, the CDF shows P(X ≤ 4) = P(X=3) + P(X=4) ≈ 0.0046 + 0.0116 ≈ 0.0162, which is the very low cumulative probability in the left tail. The red dashed line marks the mean (18 trials).
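
One practical note if you reproduce these numbers yourself: scipy.stats.nbinom counts the number of failures before the r-th success rather than the total number of trials, so you have to shift by r. A minimal sketch under that convention (variable names are ours):

from scipy import stats

r, p = 3, 1/6
nbinom_rv = stats.nbinom(n=r, p=p)  # counts failures before the 3rd six

k = 4  # total rolls
print(f"P(X={k}) = {nbinom_rv.pmf(k - r):.4f}")        # ≈ 0.0116
print(f"P(X<={k}) = {nbinom_rv.cdf(k - r):.4f}")       # ≈ 0.0162
print(f"E[total rolls] = {nbinom_rv.mean() + r:.1f}")  # r(1-p)/p + r = r/p = 18.0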

Quick Check Questions

  1. You flip a fair coin until you get 5 Heads. What distribution models this and what are the parameters?

  2. For a Negative Binomial distribution with r = 4 and p = 0.5, what is the expected value (mean)?

  3. A basketball player practices free throws until making 10 successful shots. Each shot has a 70% success rate. Which distribution and why?

  4. How is Negative Binomial related to Geometric distribution?

  5. For Negative Binomial, why is the variance r(1-p)/p²?

5. Poisson Distribution

The Poisson distribution models the number of events occurring in a fixed interval of time or space when events happen independently at a constant average rate.

Concrete Example

You receive an average of 4 customer calls per hour. How many calls will you get in the next hour?

We model this with a random variable X, the number of calls received in the next hour; X can take the values 0, 1, 2, ....

The average rate is λ = 4 calls/hour.

The Poisson PMF

For events occurring at average rate λ:

P(X=k) = \frac{e^{-\lambda} \lambda^k}{k!} \quad \text{for } k = 0, 1, 2, \dots

where e \approx 2.71828 is Euler’s number.

Let’s verify for our example (λ=4):
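
A quick check with scipy.stats.poisson (a sketch; variable names are ours):

from scipy import stats

poisson_rv = stats.poisson(mu=4)
print(f"P(X=4) = {poisson_rv.pmf(4):.4f}")   # ≈ 0.1954
print(f"P(X=0) = {poisson_rv.pmf(0):.4f}")   # e^(-4) ≈ 0.0183
print(f"P(X<=6) = {poisson_rv.cdf(6):.4f}")  # ≈ 0.8893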

Algorithm visualization: The following diagram visualizes the Poisson formula as competing forces, building intuition for why each component exists:


The diagram above visualizes the Poisson distribution (λ = 4) using a three forces metaphor that explains how each component of the formula P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!} shapes the probability distribution:

The Three Forces:

  1. The Driver (Numerator: \lambda^k) — Shown in the orange box, this force pushes probabilities UP exponentially. As k increases, the numerator grows rapidly when λ is large. This represents the “raw likelihood” of k events based on the rate λ.

  2. The Brake (Denominator: k!) — Shown in the green box, this force pulls probabilities DOWN super-exponentially. Factorial growth eventually crushes the numerator for large k values, causing the rapid decay in the right tail of the distribution.

  3. The Scaler (Constant: e^{-\lambda}) — Shown in the blue box, this is a fixed dampening factor that normalizes the distribution to ensure all probabilities sum to 1.

Reading the Diagram:

This “tug-of-war” between the driver and brake forces creates the characteristic bell-like shape of the Poisson distribution, with the peak occurring where these forces are balanced.

Key Characteristics

Mean: E[X] = \lambda

Variance: Var(X) = \lambda

Standard Deviation: SD(X) = \sqrt{\lambda}

Note: Mean and variance are equal in a Poisson distribution, so the standard deviation is simply the square root of λ.

Relationship to Other Distributions: The Poisson distribution is an approximation to the Binomial distribution when n is large, p is small, and λ = np is moderate. Rule of thumb: use the Poisson approximation when n ≥ 20 and p ≤ 0.05.

Visualizing the Distribution

Let’s visualize a Poisson distribution with λ = 4 (our call center example):


The PMF shows the distribution centered around λ = 4 with reasonable probability for nearby values. The shaded region shows mean ± 1 standard deviation (\sqrt{4} = 2).


The CDF shows P(X ≤ k), useful for questions like “What’s the probability of 6 or fewer calls?” The red dashed line marks the mean.

Quick Check Questions

  1. A call center receives an average of 12 calls per hour. What distribution models the number of calls in one hour and what is the parameter?

  2. For a Poisson distribution with λ = 7, what are the mean and variance?

  3. You count the number of typos on a random page of a book. The average is 2 typos per page. Which distribution?

  4. True or False: In a Poisson distribution, the mean can be different from the variance.

  5. When can Poisson approximate Binomial?

6. Hypergeometric Distribution

The Hypergeometric distribution models the number of successes in a sample drawn without replacement from a finite population. This is different from Binomial, which assumes sampling with replacement (or infinite population).

Concrete Example

You draw 5 cards from a standard deck of 52 cards. How many Aces will you get?

We model this with a random variable X, the number of Aces among the 5 cards drawn. Here the population size is N = 52, the number of successes in the population is K = 4 (Aces), and the sample size is n = 5.

The Hypergeometric PMF

For sampling without replacement:

P(X=k) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}}

This is: (ways to choose k successes from K) × (ways to choose n-k failures from N-K) / (total ways to choose n items from N).
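
In scipy.stats.hypergeom the parameters are named differently (M = population size, n = number of successes in the population, N = sample size), which is easy to trip over. A minimal sketch for the card example:

from scipy import stats

# 52 cards in the population, 4 Aces, draw 5 without replacement
hyper_rv = stats.hypergeom(M=52, n=4, N=5)
print(f"P(0 Aces) = {hyper_rv.pmf(0):.4f}")  # ≈ 0.6588
print(f"P(1 Ace)  = {hyper_rv.pmf(1):.4f}")  # ≈ 0.2995
print(f"E[Aces]   = {hyper_rv.mean():.4f}")  # n*K/N = 5*4/52 ≈ 0.3846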

Key Characteristics

Mean: E[X] = n \frac{K}{N}

Variance: Var(X) = n \frac{K}{N} \left(1 - \frac{K}{N}\right) \left(\frac{N-n}{N-1}\right)

Standard Deviation: SD(X) = \sqrt{n \frac{K}{N} \left(1 - \frac{K}{N}\right) \left(\frac{N-n}{N-1}\right)}

The term \frac{N-n}{N-1} is the finite population correction factor. As N \to \infty, this approaches 1, and Hypergeometric → Binomial with p = K/N.

Visualizing the Distribution

Let’s visualize a Hypergeometric distribution with N=52, K=4, n=5 (our card example):

The PMF shows that you are most likely to get 0 Aces (about 0.66 probability), and less likely to get 1 or 2. The red dashed line marks the mean, and the orange shaded region shows mean ± 1 standard deviation.

The CDF shows P(X ≤ k), useful for questions like “What’s the probability of getting at most 1 Ace?” The red dashed line marks the mean.

Quick Check Questions

  1. You draw 7 cards from a deck of 52. You want to know how many hearts you get. What distribution models this and what are the parameters?

  2. For a Hypergeometric distribution with N=50, K=10, n=5, what is the expected value (mean)?

  3. A quality inspector randomly selects 10 products from a batch of 100 (where 15 are defective) without replacement. Which distribution?

  4. What’s the key difference between Binomial and Hypergeometric distributions?

  5. When can Hypergeometric be approximated by Binomial?

7. Discrete Uniform Distribution

The Discrete Uniform distribution models selecting one outcome from a finite set where all outcomes are equally likely.

Concrete Example

Suppose you roll a fair six-sided die. Each face (1, 2, 3, 4, 5, 6) has an equal probability of appearing.

We model this with a random variable X, the face that comes up. X takes each value in {1, 2, 3, 4, 5, 6} with probability 1/6.

The Discrete Uniform PMF

For a Discrete Uniform distribution on the integers from a to b (inclusive):

P(X=k) = \begin{cases} \frac{1}{b-a+1} & \text{if } k \in \{a, a+1, \ldots, b\} \\ 0 & \text{otherwise} \end{cases}

For our die example with a = 1 and b = 6, each face has probability 1/(6 - 1 + 1) = 1/6.
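
In scipy the discrete uniform is called randint, and its upper bound is exclusive, so a fair die is randint(1, 7). A small sketch:

from scipy import stats

die_rv = stats.randint(low=1, high=7)  # support {1, 2, ..., 6}; high is exclusive

print(f"P(X=3) = {die_rv.pmf(3):.4f}")   # 1/6 ≈ 0.1667
print(f"Mean = {die_rv.mean():.2f}")     # (1+6)/2 = 3.50
print(f"Variance = {die_rv.var():.4f}")  # ((6-1+1)^2 - 1)/12 ≈ 2.9167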

Key Characteristics

Mean: E[X] = \frac{a+b}{2}

Variance: Var(X) = \frac{(b-a+1)^2 - 1}{12}

Standard Deviation: SD(X) = \sqrt{\frac{(b-a+1)^2 - 1}{12}}

Relationship to Other Distributions: The Discrete Uniform distribution is a special case of the Categorical distribution where all k categories have equal probability p_i = 1/k. If outcomes aren’t equally likely, use Categorical instead.

Visualizing the Distribution

Let’s visualize a Discrete Uniform distribution for a fair die (a = 1, b = 6):

The PMF shows six equal bars, each with probability 1/6, representing the fair die. The shaded region shows mean ± 1 standard deviation.

The CDF increases in equal steps of 1/6 at each value, reaching 1.0 at the maximum value. The red dashed line marks the mean.

Quick Check Questions

  1. You randomly select a card from a standard deck (52 cards). If X represents the card number (1-13, where 1=Ace, 11=Jack, 12=Queen, 13=King), what distribution models this and what are the parameters?

  2. For a Discrete Uniform distribution with a = 5 and b = 15, what is the probability of getting exactly 10?

  3. What is the mean of a Discrete Uniform distribution on the integers from 1 to 100?

  4. You’re modeling the outcome of rolling a fair six-sided die. Should you use Discrete Uniform or Categorical distribution?

  5. For a Discrete Uniform distribution on integers from a to b, why is the variance equal to \frac{(b-a)(b-a+2)}{12}?

8. Categorical Distribution

The Categorical distribution models a single trial with multiple possible outcomes (more than 2), where each outcome has its own probability. It’s the generalization of the Bernoulli distribution to more than two categories.

Concrete Example

Suppose you’re rolling a loaded six-sided die where the faces have different probabilities:

We model this with a random variable X that takes the value of the face rolled (1 through 6), where face i comes up with probability p_i.

The Categorical PMF

For a Categorical distribution with k possible outcomes and probabilities p_1, p_2, \ldots, p_k where \sum_{i=1}^k p_i = 1:

P(X=i) = p_i \quad \text{for } i = 1, 2, \ldots, k

For our loaded die example:

Key Characteristics

Mean: E[X] = \sum_{i=1}^k i \cdot p_i (weighted average of outcomes)

Variance: Var(X) = \sum_{i=1}^k i^2 \cdot p_i - \left(\sum_{i=1}^k i \cdot p_i\right)^2

Relationship to Other Distributions: Categorical generalizes Bernoulli (when k = 2), and the Discrete Uniform is the special case of Categorical in which all p_i are equal. For multiple trials, use the Multinomial distribution instead.
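
scipy has no distribution literally named "categorical", but any finite PMF can be built with stats.rv_discrete. The probabilities below are purely illustrative stand-ins (the loaded-die values used above are not reproduced here); the point is just the mechanics:

import numpy as np
from scipy import stats

faces = np.arange(1, 7)
probs = np.array([0.25, 0.15, 0.15, 0.15, 0.15, 0.15])  # hypothetical loaded-die probabilities (sum to 1)

loaded_die = stats.rv_discrete(name='loaded_die', values=(faces, probs))
print(f"P(X=1) = {loaded_die.pmf(1):.2f}")
print(f"Mean = {loaded_die.mean():.3f}")      # sum_i i * p_i
print(f"Variance = {loaded_die.var():.3f}")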

Visualizing the Distribution

Let’s visualize our loaded die Categorical distribution:

The PMF shows the different probabilities for each face of the loaded die.

The CDF increases by different amounts at each value, reflecting the varying probabilities.

Quick Check Questions

  1. A traffic light can be red (50%), yellow (10%), or green (40%). What distribution models the color when you arrive at an intersection?

  2. For a Categorical distribution with 4 equally likely outcomes, what is P(X = 2)?

  3. How is the Categorical distribution related to the Bernoulli distribution?

  4. You’re observing a single customer’s choice from a menu with 5 items having probabilities [0.3, 0.25, 0.2, 0.15, 0.1]. Should you use Categorical or Multinomial distribution?

  5. When can you model a Categorical distribution as a Discrete Uniform distribution?

9. Multinomial Distribution

The Multinomial distribution models performing a fixed number of independent trials where each trial has multiple possible outcomes (more than 2), and we count how many times each outcome occurs. It’s the generalization of the Binomial distribution to more than two categories.

Concrete Example

Suppose you roll a fair six-sided die 20 times. We want to know how many times each face (1, 2, 3, 4, 5, 6) appears.

We model this with a random vector \mathbf{X} = (X_1, X_2, X_3, X_4, X_5, X_6), where X_i is the number of times face i appears in the 20 rolls.

For a fair die, each face has probability p_i = 1/6.

The Multinomial PMF

For n independent trials with k possible outcomes and probabilities p_1, p_2, \ldots, p_k where \sum_{i=1}^k p_i = 1:

P(X_1=x_1, X_2=x_2, \ldots, X_k=x_k) = \frac{n!}{x_1! x_2! \cdots x_k!} \, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}

where x_1 + x_2 + \cdots + x_k = n.

The term \frac{n!}{x_1! x_2! \cdots x_k!} is the multinomial coefficient (see Chapter 3: Permutations of Identical Objects).

For our die example, the probability of getting exactly (3, 4, 2, 5, 4, 2) of each face:

\begin{align} P(X_1=3, X_2=4, X_3=2, X_4=5, X_5=4, X_6=2) &= \frac{20!}{3! \, 4! \, 2! \, 5! \, 4! \, 2!} \left(\frac{1}{6}\right)^{20} \\ &\approx 1.47 \times 10^{12} \times 2.74 \times 10^{-16} \\ &\approx 0.0004 \end{align}
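
scipy.stats.multinomial computes this directly, which is handy given how unwieldy the factorials get (a sketch; variable names are ours):

from scipy import stats

multi_rv = stats.multinomial(n=20, p=[1/6] * 6)
prob = multi_rv.pmf([3, 4, 2, 5, 4, 2])
print(f"P(counts = (3,4,2,5,4,2)) = {prob:.6f}")  # ≈ 0.000401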

Key Characteristics

Mean for each category: E[X_i] = n p_i

Variance for each category: Var(X_i) = n p_i (1-p_i)

Relationship to Other Distributions: Multinomial generalizes Binomial (when k = 2) and Categorical (a single trial becomes multiple trials). Each individual category count X_i follows a Binomial distribution with parameters (n, p_i).

Visualizing the Distribution

Multinomial distributions are challenging to visualize since they involve multiple variables. Let’s look at a simple case with k = 3 categories:

The marginal distribution of any single category in a Multinomial distribution is actually a Binomial distribution! Here, Category 1 follows Binomial(n=15, p=1/3).

Connecting to our die example: We simplified to 3 categories for easier visualization, but the same principle applies to our 6-sided die example (n=20 rolls). Each face count would follow Binomial(n=20, p=1/6). The histogram would be similar but centered around 20/6 ≈ 3.33 instead of 15/3 = 5.
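
A quick simulation sketch of that marginal-is-Binomial fact for the 3-category case (the seed and sample size are arbitrary choices of ours):

from scipy import stats

samples = stats.multinomial(n=15, p=[1/3, 1/3, 1/3]).rvs(size=10000, random_state=42)
category_1_counts = samples[:, 0]  # count of category 1 in each of the 10,000 experiments

print(f"Simulated mean of category 1 counts: {category_1_counts.mean():.2f}")  # ≈ 5
print(f"Binomial(15, 1/3) mean:              {stats.binom(n=15, p=1/3).mean():.2f}")  # 5.00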

Quick Check Questions

  1. You flip a fair coin 30 times and count heads and tails. What distribution models the counts?

  2. For a Multinomial distribution with n=100 trials and k=4 equally likely categories, what is the expected count for any one category?

  3. How is the Multinomial distribution related to the Binomial distribution?

  4. You roll a die 100 times and count how many times each face (1-6) appears. Should you use Categorical or Multinomial distribution?

  5. In a Multinomial distribution, what is the relationship between the individual category counts X₁, X₂, ..., Xₖ?

10. Relationships Between Distributions

Understanding the connections between these distributions can deepen insight and provide useful approximations.

  1. Bernoulli as a special case of Binomial: A Binomial distribution with n = 1 trial (Binomial(1, p)) is equivalent to a Bernoulli distribution (Bernoulli(p)).

  2. Geometric as a special case of Negative Binomial: A Negative Binomial distribution modeling the number of trials until the first success (r = 1) (NegativeBinomial(1, p)) is equivalent to a Geometric distribution (Geometric(p)).

  3. Binomial Approximation to Hypergeometric: If the population size N is much larger than the sample size n (e.g., N > 20n), then drawing without replacement (Hypergeometric) is very similar to drawing with replacement. In this case, the Hypergeometric(N, K, n) distribution can be well-approximated by the Binomial(n, p = K/N) distribution. The finite population correction factor \frac{N-n}{N-1} approaches 1.

  4. Poisson Approximation to Binomial: If the number of trials n in a Binomial distribution is large, and the success probability p is small, such that the mean λ = np is moderate, then the Binomial(n, p) distribution can be well-approximated by the Poisson(λ = np) distribution. This is useful because the Poisson PMF is often easier to compute than the Binomial PMF when n is large. A common rule of thumb is to use this approximation if n ≥ 20 and p ≤ 0.05, or n ≥ 100 and np ≤ 10.

Example: Poisson approximation to Binomial. Consider Binomial(n = 1000, p = 0.005). Here n is large and p is small. The mean is λ = np = 1000 × 0.005 = 5. We can approximate this with Poisson(λ = 5).

Let’s compare the PMF values of both distributions to see how well the Poisson approximation works in practice.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Setup distributions
n_binom_approx = 1000
p_binom_approx = 0.005
lambda_approx = n_binom_approx * p_binom_approx

binom_rv_approx = stats.binom(n=n_binom_approx, p=p_binom_approx)
poisson_rv_approx = stats.poisson(mu=lambda_approx)

# Compare PMFs
k_vals_compare = np.arange(0, 15)
binom_pmf = binom_rv_approx.pmf(k_vals_compare)
poisson_pmf = poisson_rv_approx.pmf(k_vals_compare)

print(f"Comparing Binomial(n={n_binom_approx}, p={p_binom_approx}) and Poisson(lambda={lambda_approx:.1f})")
print("k\tBinomial P(X=k)\tPoisson P(X=k)\tDifference")
for k, bp, pp in zip(k_vals_compare, binom_pmf, poisson_pmf):
    print(f"{k}\t{bp:.6f}\t{pp:.6f}\t{abs(bp-pp):.6f}")
Comparing Binomial(n=1000, p=0.005) and Poisson(lambda=5.0)
k	Binomial P(X=k)	Poisson P(X=k)	Difference
0	0.006654	0.006738	0.000084
1	0.033437	0.033690	0.000253
2	0.083929	0.084224	0.000296
3	0.140303	0.140374	0.000071
4	0.175731	0.175467	0.000264
5	0.175908	0.175467	0.000440
6	0.146590	0.146223	0.000367
7	0.104602	0.104445	0.000157
8	0.065245	0.065278	0.000033
9	0.036138	0.036266	0.000128
10	0.017996	0.018133	0.000137
11	0.008139	0.008242	0.000103
12	0.003371	0.003434	0.000063
13	0.001287	0.001321	0.000034
14	0.000456	0.000472	0.000016

The table compares the Binomial(1000, 0.005) distribution with the Poisson(5.0) approximation. The PMF values are nearly identical, demonstrating that when n is large and p is small, the Poisson provides an excellent and computationally simpler approximation to the Binomial.

  5. Categorical as generalization of Bernoulli: A Categorical distribution with k = 2 categories (Categorical(p_1, p_2) where p_1 + p_2 = 1) is equivalent to a Bernoulli distribution (Bernoulli(p_1)). Categorical extends Bernoulli to handle more than two outcomes in a single trial.

  6. Multinomial as generalization of Binomial: A Multinomial distribution with k = 2 categories (Multinomial(n, p_1, p_2) where p_1 + p_2 = 1) is equivalent to a Binomial distribution (Binomial(n, p_1)). Multinomial extends Binomial to count outcomes across more than two categories.

  7. Discrete Uniform as special case of Categorical: A Categorical distribution where all k probabilities are equal (p_1 = p_2 = \cdots = p_k = \frac{1}{k}) is a Discrete Uniform distribution on k values. This represents maximum uncertainty about a single trial’s outcome.

  8. Marginal distributions of Multinomial are Binomial: If (X_1, X_2, \ldots, X_k) \sim Multinomial(n, p_1, p_2, \ldots, p_k), then each individual count X_i follows a Binomial distribution: X_i \sim Binomial(n, p_i). This makes sense because we’re just counting successes (category i) vs. failures (all other categories) across n trials.

Summary

In this chapter, we explored nine fundamental discrete probability distributions: Bernoulli, Binomial, Geometric, Negative Binomial, Poisson, Hypergeometric, Discrete Uniform, Categorical, and Multinomial.

We learned the scenarios each distribution models, their parameters, PMFs, means, and variances. Critically, we saw how to leverage scipy.stats functions (pmf, cdf, rvs, mean, var, std, sf) to perform calculations, generate simulations, and visualize these distributions. We also discussed important relationships, such as special cases (Bernoulli as Binomial with n = 1, Geometric as Negative Binomial with r = 1), generalizations (Categorical and Multinomial), and approximations (Poisson for Binomial, Binomial for Hypergeometric).

Mastering these distributions provides a powerful toolkit for modeling various random phenomena encountered in data analysis, science, engineering, and business. In the next chapters, we will transition to continuous random variables and their corresponding common distributions.

Decision Tree: Choosing the Right Distribution

Use this decision tree to help identify which distribution fits your scenario:

Key Questions to Ask:

  1. How many trials? Single → Bernoulli/Categorical/Discrete Uniform. Fixed number → Binomial/Multinomial/Hypergeometric. Variable → Geometric/Negative Binomial.

  2. How many outcomes per trial? Two → Bernoulli/Binomial/Geometric/Negative Binomial. More than two → Categorical/Multinomial/Discrete Uniform.

  3. With or without replacement? With replacement (or infinite population) → Binomial. Without replacement (finite population) → Hypergeometric.

  4. What are you counting? Successes in fixed trials → Binomial/Multinomial. Trials until success → Geometric/Negative Binomial. Events in interval → Poisson.

  5. Are probabilities equal? Yes → Discrete Uniform. No → Categorical.

Example Applications:

Exploring Additional Distributions

While this chapter covers nine fundamental discrete distributions, many other distributions exist for specialized scenarios. Here’s how to learn about distributions beyond this chapter:

How to Approach Learning a New Distribution:

When you encounter a new distribution, follow these steps:

  1. Understand the Scenario: What real-world process does it model? What makes it different from distributions you already know?

  2. Identify the Parameters: What values define the distribution? (like n and p for Binomial, λ for Poisson)

  3. Study the PMF (or PDF for continuous): How are probabilities calculated? What’s the formula?

    • PMF = Probability Mass Function (discrete distributions, like those in this chapter)

    • PDF = Probability Density Function (continuous distributions, covered in Chapters 8-9)

  4. Learn Key Properties: What are the mean and variance? Are there special characteristics?

  5. Explore Relationships: How does it relate to distributions you already know? Is it a special case or generalization of something familiar?

  6. See Examples: Find concrete examples and visualizations to build intuition.

  7. Practice with Code: Use scipy.stats or similar libraries to work with the distribution hands-on.

Key Resources for Learning About Other Distributions:

  1. Wikipedia - Each distribution has a comprehensive article with a standardized format:

    • Definition and scenario

    • Parameters and support (possible values)

    • PMF formula (discrete) or PDF formula (continuous)

    • Mean, variance, and other properties

    • Relationships to other distributions

    • Examples and applications

    • Search for: “[Distribution name] distribution” (e.g., “Beta-Binomial distribution”)

  2. SciPy Documentation - Python’s scipy.stats module includes 100+ distributions:

    • Complete reference: https://docs.scipy.org/doc/scipy/reference/stats.html

    • Each distribution has: PMF (discrete) or PDF (continuous), CDF, mean, variance, random sampling

    • Includes code examples showing how to use each distribution

    • Discrete distributions: bernoulli, binom, geom, hypergeom, poisson, nbinom, randint, and many more

  3. Interactive Distribution Explorers:

    • Search for “distribution explorer” or “probability distribution visualizer”

    • These tools let you adjust parameters and see how distributions change

    • Helps build intuition about distribution behavior

  4. Classic Textbooks:

    • Introduction to Probability by Bertsekas & Tsitsiklis

    • A First Course in Probability by Sheldon Ross

    • Probability and Statistics by DeGroot & Schervish

    • These provide rigorous treatment with proofs and derivations

  5. Online Resources:

    • NIST Engineering Statistics Handbook: Comprehensive reference for common distributions

    • Wolfram MathWorld: Mathematical encyclopedia with detailed distribution information

    • Stack Exchange (Cross Validated): Q&A site for statistics questions

Examples of Other Discrete Distributions:

Here are some distributions you might encounter that we didn’t cover in detail:

Finding the Right Distribution:

If you have data or a scenario and need to find which distribution fits:

  1. Identify the process: Single trial? Fixed trials? Waiting time? Events in interval?

  2. Check the support: What values can the random variable take? (e.g., 0/1, non-negative integers, finite range)

  3. Consider the parameters: What aspects of the process can vary? (success probability, rate, sample size, etc.)

  4. Use the decision tree (see the Decision Tree section above) to narrow down candidates

  5. Test candidate distributions using visualizations and goodness-of-fit tests

  6. Consult domain literature: See what distributions are commonly used in your field

Exercises

  1. Customer Arrivals: The average number of customers arriving at a small cafe is 10 per hour. Assume arrivals follow a Poisson distribution. a. What is the probability that exactly 8 customers arrive in a given hour? b. What is the probability that 12 or fewer customers arrive in a given hour? c. What is the probability that more than 15 customers arrive in a given hour? d. Simulate 1000 hours of customer arrivals and plot a histogram of the results. Compare it to the theoretical PMF.
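
    a) The number of arrivals in an hour follows a Poisson distribution with λ = 10. The setup cell is not shown above, so here is a minimal sketch that defines the objects used in parts (b)-(d) and answers (a):

    lambda_cafe = 10
    cafe_rv = stats.poisson(mu=lambda_cafe)

    prob_exactly_8 = cafe_rv.pmf(8)
    print(f"P(Exactly 8 customers) = {prob_exactly_8:.4f}")
    # P(Exactly 8 customers) ≈ 0.1126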

    b) The probability of 12 or fewer customers:

    prob_12_or_fewer = cafe_rv.cdf(12)
    print(f"P(12 or fewer customers) = {prob_12_or_fewer:.4f}")
    P(12 or fewer customers) = 0.7916
    

    c) The probability of more than 15 customers:

    prob_over_15 = cafe_rv.sf(15)
    print(f"P(More than 15 customers) = {prob_over_15:.4f}")
    P(More than 15 customers) = 0.0487
    

    d) Simulation and visualization:

    n_sim_hours = 1000
    sim_arrivals = cafe_rv.rvs(size=n_sim_hours)
    
    plt.figure(figsize=(10, 4))
    max_observed = np.max(sim_arrivals)
    bins = np.arange(0, max_observed + 2) - 0.5
    plt.hist(sim_arrivals, bins=bins, density=True, alpha=0.6, color='lightgreen', edgecolor='black', label='Simulated Arrivals')
    
    # Overlay theoretical PMF
    k_vals_cafe = np.arange(0, max_observed + 1)
    pmf_cafe = cafe_rv.pmf(k_vals_cafe)
    plt.plot(k_vals_cafe, pmf_cafe, 'ro-', linewidth=2, markersize=6, label='Theoretical PMF')
    
    plt.title(f'Simulated Customer Arrivals vs Poisson PMF (lambda={lambda_cafe})')
    plt.xlabel('Number of Customers per Hour')
    plt.ylabel('Probability / Density')
    plt.legend()
    plt.grid(axis='y', linestyle='--', alpha=0.6)
    plt.xlim(-0.5, max_observed + 1.5)
    plt.show()

    The histogram closely matches the theoretical PMF, confirming the Poisson model.

  2. Quality Control: A batch contains 50 items, of which 5 are defective. You randomly sample 8 items without replacement. a. What distribution models the number of defective items in your sample? State the parameters. b. What is the probability that exactly 1 item in your sample is defective? c. What is the probability that at most 2 items in your sample are defective? d. What is the expected number of defective items in your sample?
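
    a) The number of defective items in the sample follows a Hypergeometric distribution with N = 50, K = 5, n = 8. The setup cell is not shown above, so here is a minimal sketch defining the object used in parts (b)-(d) (note scipy's parameter names M, n, N):

    qc_rv = stats.hypergeom(M=50, n=5, N=8)  # population of 50, 5 defective, sample of 8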

    b) Probability of exactly 1 defective item:

    prob_1_defective = qc_rv.pmf(1)
    print(f"P(Exactly 1 defective in sample) = {prob_1_defective:.4f}")
    P(Exactly 1 defective in sample) = 0.4226
    

    c) Probability of at most 2 defective items:

    prob_at_most_2 = qc_rv.cdf(2)
    print(f"P(At most 2 defectives in sample) = {prob_at_most_2:.4f}")
    P(At most 2 defectives in sample) = 0.9758
    

    d) Expected number of defective items:

    expected_defective = qc_rv.mean()
    print(f"Expected number of defectives in sample = {expected_defective:.4f}")
    # Theoretical: E[X] = n * (K/N) = 8 * (5/50) = 0.8
    Expected number of defectives in sample = 0.8000
    
  3. Website Success: A new website feature has a 3% chance of being used by a visitor (p=0.03p=0.03). Assume visitors are independent. a. If 100 visitors come to the site, what is the probability that exactly 3 visitors use the feature? What distribution applies? b. What is the probability that 5 or fewer visitors use the feature out of 100? c. What is the expected number of users out of 100 visitors? d. A developer tests the feature repeatedly until the first user successfully uses it. What is the probability that the first success occurs on the 20th visitor? What distribution applies? e. What is the expected number of visitors needed to see the first success? f. How many visitors are expected until the 5th user is observed? What distribution applies?
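
    a) With 100 independent visitors each using the feature with probability 0.03, the number of users follows a Binomial(n=100, p=0.03) distribution. The setup cell is not shown above, so here is a minimal sketch defining the objects used in the later parts and answering (a):

    p_ws = 0.03
    ws_binom_rv = stats.binom(n=100, p=p_ws)

    prob_exactly_3 = ws_binom_rv.pmf(3)
    print(f"P(Exactly 3 users) = {prob_exactly_3:.4f}")
    # P(Exactly 3 users) ≈ 0.2275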

    b) Probability of 5 or fewer users:

    prob_5_or_fewer = ws_binom_rv.cdf(5)
    print(f"P(5 or fewer users) = {prob_5_or_fewer:.4f}")
    P(5 or fewer users) = 0.9192
    

    c) Expected number of users:

    expected_users = ws_binom_rv.mean()
    print(f"Expected number of users = {expected_users:.2f}")
    # Theoretical: E[X] = n*p = 100 * 0.03 = 3
    Expected number of users = 3.00
    

    d) This follows a Geometric distribution. The probability that the first success occurs on trial 20:

    ws_geom_rv = stats.geom(p=p_ws)
    prob_first_on_20 = ws_geom_rv.pmf(20)  # scipy's geom counts total trials, so pmf(20) = (0.97)^19 * 0.03
    print(f"Distribution: Geometric(p={p_ws})")
    print(f"P(First success on trial 20) = {prob_first_on_20:.4f}")
    Distribution: Geometric(p=0.03)
    P(First success on trial 20) = 0.0168
    

    e) Expected number of visitors until first success:

    expected_trials_geom = 1 / p_ws
    print(f"Expected visitors until first success = {expected_trials_geom:.2f}")
    # Theoretical: E[X] = 1/p = 1/0.03 ≈ 33.33
    Expected visitors until first success = 33.33
    

    f) This follows a Negative Binomial distribution with r=5r=5 successes:

    r_ws = 5
    expected_trials_nbinom = r_ws / p_ws
    print(f"Distribution: Negative Binomial(r={r_ws}, p={p_ws})")
    print(f"Expected visitors until 5th success = {expected_trials_nbinom:.2f}")
    # Theoretical: E[X] = r/p = 5/0.03 ≈ 166.67
    Distribution: Negative Binomial(r=5, p=0.03)
    Expected visitors until 5th success = 166.67