
Introduction: The Ubiquitous Bell Curve

In the previous chapter, we explored the Law of Large Numbers (LLN), which tells us that the average of a large number of independent and identically distributed (IID) random variables converges to the expected value. That result concerns the value the average approaches. But what about the distribution of that average? Does it have a predictable shape?

Enter the Central Limit Theorem (CLT), often considered one of the most remarkable and useful results in probability theory. In essence, the CLT states that the distribution of the sum (or average) of a large number of IID random variables tends towards a Normal (Gaussian) distribution, regardless of the shape of the original distribution from which the variables are drawn, provided the original distribution has a finite variance.

This is profound! It explains why the Normal distribution appears so frequently in nature and in data analysis. Many real-world phenomena can be thought of as the sum or average of many small, independent effects (e.g., measurement errors, height influenced by many genetic and environmental factors), and the CLT predicts that these resulting phenomena will be approximately Normally distributed.

In this chapter, we will:

  1. Formally state the Central Limit Theorem.

  2. Understand the concept of convergence in distribution.

  3. Discuss the conditions required for the CLT to hold and its limitations.

  4. Explore key applications, particularly the Normal approximation to other distributions.

  5. Use Python simulations to visualize the CLT in action and apply it to problems.

Let’s start by stating the theorem more formally.

Statement and Intuition

The Central Limit Theorem (Lindeberg–Lévy CLT):

Let $X_1, X_2, \dots, X_n$ be a sequence of $n$ independent and identically distributed (IID) random variables, each having a finite expected value $\mu$ and a finite non-zero variance $\sigma^2$. Let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ be the sample mean. As $n \to \infty$, the distribution of the standardized sample mean converges to a standard Normal distribution:

$$Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1)$$

where $\xrightarrow{d}$ denotes convergence in distribution.

Intuition: Imagine you have any distribution for a single random variable $X$ (as long as its variance isn’t infinite). This could be a Uniform distribution (like rolling a fair die), an Exponential distribution (like waiting times), or something completely irregular. Now, take a sample of $n$ values from this distribution and calculate their average, $\bar{X}_n$. Repeat this process many times, collecting many sample averages. The CLT tells us that if $n$ is sufficiently large, the histogram of these collected sample averages will look like a bell curve (a Normal distribution).

Furthermore, the theorem specifies the parameters of this Normal distribution: for large $n$, the sample mean $\bar{X}_n$ is approximately $N(\mu, \sigma^2/n)$, i.e., it is centered at the population mean $\mu$ and its standard deviation shrinks like $\sigma/\sqrt{n}$.

Example: Think about the average weight of 30 randomly chosen apples. Individual apple weights might follow some distribution (maybe skewed, maybe multimodal), but the CLT suggests that the distribution of the average weight calculated from samples of 30 apples will be approximately Normal. The more apples we average ($n=50$, $n=100$), the closer the distribution of the average weight will be to a perfect Normal distribution.
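
To make the scaling concrete, here is a minimal numerical sketch of the apple example. The population mean of 150 g and standard deviation of 20 g are assumed, purely illustrative values; the point is only that the CLT-predicted spread of the average, $\sigma/\sqrt{n}$, shrinks as $n$ grows.

# A minimal sketch of how the spread of the average shrinks with n
import numpy as np

# Hypothetical population parameters for individual apple weights
# (illustrative values only; not taken from any real data)
mu_apple = 150.0     # assumed mean weight in grams
sigma_apple = 20.0   # assumed standard deviation in grams

# CLT prediction: the average of n apples is approximately N(mu, sigma^2 / n)
for n in [30, 50, 100]:
    se = sigma_apple / np.sqrt(n)   # standard deviation of the sample mean
    print(f"n={n:3d}: average weight ~ N({mu_apple:.0f}, {se**2:.2f}), std dev = {se:.2f} g")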

Convergence in Distribution

The CLT uses the concept of convergence in distribution. This is different from convergence in probability, which we saw with the Weak Law of Large Numbers (WLLN). Convergence in distribution means that the cumulative distribution function (CDF) of $Z_n$ converges pointwise to the CDF of the limit: $\lim_{n \to \infty} P(Z_n \le z) = \Phi(z)$ for every $z$, where $\Phi$ is the standard Normal CDF.

This means that for large $n$, we can approximate the probability $P(\bar{X}_n \le x)$ by using the Normal distribution $N(\mu, \sigma^2/n)$. Specifically:

$$P(\bar{X}_n \le x) \approx \Phi\left(\frac{x - \mu}{\sigma/\sqrt{n}}\right)$$

Similarly, for the sum $S_n = \sum_{i=1}^{n} X_i$, we know $E[S_n] = n\mu$ and $Var(S_n) = n\sigma^2$. The CLT implies:

$$\frac{S_n - n\mu}{\sqrt{n}\,\sigma} \xrightarrow{d} N(0, 1)$$

So, we can approximate the distribution of the sum $S_n$ using $N(n\mu, n\sigma^2)$.
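
As a quick sketch of how these approximations are used in practice, the snippet below evaluates them with scipy.stats.norm. The Exponential(1) population, the sample size $n=50$, and the cut-off values 1.1 and 55 are assumptions chosen only for illustration.

import numpy as np
import scipy.stats as stats

# Assumed population: Exponential with rate 1, so mu = 1 and sigma = 1 (illustrative choice)
mu, sigma = 1.0, 1.0
n = 50  # assumed sample size

# CLT approximation for the sample mean: X_bar ~ N(mu, sigma^2/n), so
# P(X_bar <= 1.1) ~ Phi((1.1 - mu) / (sigma / sqrt(n)))
p_mean = stats.norm.cdf(1.1, loc=mu, scale=sigma / np.sqrt(n))

# CLT approximation for the sum: S_n ~ N(n*mu, n*sigma^2), so P(S_n <= 55)
p_sum = stats.norm.cdf(55, loc=n * mu, scale=np.sqrt(n) * sigma)

print(f"Approximate P(sample mean <= 1.1): {p_mean:.4f}")
print(f"Approximate P(sum of {n} values <= 55): {p_sum:.4f}")

Both numbers come out the same here because the two events are equivalent ($\bar{X}_n \le 1.1$ exactly when $S_n \le 55$ for $n = 50$), which serves as a handy consistency check on the two forms of the approximation.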

Conditions and Limitations

The power of the CLT is immense, but it’s crucial to understand its requirements:

  1. Independent and Identically Distributed (IID) Variables: The standard version of the CLT assumes the random variables $X_i$ are independent and drawn from the same distribution. There are extensions (like the Lyapunov CLT or Lindeberg-Feller CLT) that relax these conditions, particularly the identical-distribution part, but they require more complex conditions. For many practical applications, the IID assumption is a reasonable starting point.

  2. Finite Variance ($\sigma^2 < \infty$): The original distribution must have a finite variance. If the variance is infinite (e.g., the Cauchy distribution), the sample mean does not converge to a Normal distribution. The sample mean’s distribution might still converge, but to a different type of stable distribution (a simulation sketch contrasting this behavior appears after this list).

  3. Sufficiently Large Sample Size ($n$): The theorem states convergence as $n \to \infty$. In practice, “sufficiently large” depends heavily on the shape of the original distribution.

    • If the original distribution is already symmetric and close to Normal, $n$ can be quite small (even $n < 10$).

    • If the original distribution is highly skewed (like the Exponential), $n$ might need to be larger (e.g., $n \ge 30$ or $n \ge 50$) for the Normal approximation to be reasonably accurate.

    • There’s no single magic number for $n$, but $n = 30$ is a commonly cited, often overly simplistic, rule of thumb. Visualization (like we’ll do in the hands-on section) is key.

  4. Applies to Sums or Averages: The CLT specifically describes the distribution of the sum or average of random variables, not the distribution of the individual variables themselves.
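
To see why the finite-variance condition (item 2 above) matters, here is a minimal simulation sketch contrasting the Exponential(1) distribution with the standard Cauchy. The sample sizes, the number of simulations, and the use of the interquartile range as the spread measure are choices made only for this illustration.

import numpy as np

rng = np.random.default_rng(0)
num_sims = 10000

# Compare how the spread of the sample mean behaves as n grows
for n in [10, 100, 1000]:
    # Exponential(1): finite variance, so sample means concentrate (CLT applies)
    exp_means = rng.exponential(scale=1.0, size=(num_sims, n)).mean(axis=1)
    # Standard Cauchy: infinite variance; the mean of n Cauchy samples is again standard Cauchy
    cauchy_means = rng.standard_cauchy(size=(num_sims, n)).mean(axis=1)

    # Interquartile range (IQR): a spread measure that is well-defined even for the Cauchy
    exp_iqr = np.subtract(*np.percentile(exp_means, [75, 25]))
    cauchy_iqr = np.subtract(*np.percentile(cauchy_means, [75, 25]))
    print(f"n={n:4d}: IQR of Exponential means = {exp_iqr:.3f}, "
          f"IQR of Cauchy means = {cauchy_iqr:.3f}")

The Exponential IQR shrinks roughly like $1/\sqrt{n}$, as the CLT predicts, while the Cauchy IQR stays essentially constant: no amount of averaging produces a Normal limit when the variance is infinite.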

Applications: The Normal Approximation

One of the most frequent applications of the CLT is approximating probabilities for distributions that are difficult to calculate directly, especially sums of random variables.

Approximating the Binomial Distribution: Recall that a Binomial random variable $X \sim Binomial(n, p)$ can be seen as the sum of $n$ independent Bernoulli($p$) random variables: $X = \sum_{i=1}^{n} Y_i$, where $Y_i \sim Bernoulli(p)$. Each $Y_i$ has mean $\mu = p$ and variance $\sigma^2 = p(1-p)$. By the CLT, if $n$ is large enough, the sum $X$ will be approximately Normally distributed: $X \approx N(np, np(1-p))$.

A common rule of thumb for this approximation to be adequate is $np \ge 5$ and $n(1-p) \ge 5$. Some sources use $np \ge 10$ and $n(1-p) \ge 10$ for better accuracy.

Continuity Correction: When approximating a discrete distribution (like the Binomial) with a continuous one (Normal), accuracy is improved by using a continuity correction. Since a Normal variable can take any value, while a Binomial variable only takes integer values, we adjust the interval by half a unit: for example, $P(X \ge k)$ is approximated by $P(Y \ge k - 0.5)$ and $P(X \le k)$ by $P(Y \le k + 0.5)$.

Example: What is the probability of getting more than 60 heads in 100 flips of a fair coin? Here, $X \sim Binomial(n=100, p=0.5)$, with mean $\mu = np = 100 \times 0.5 = 50$, variance $\sigma^2 = np(1-p) = 100 \times 0.5 \times 0.5 = 25$, and standard deviation $\sigma = \sqrt{25} = 5$. Since $np = 50 \ge 5$ and $n(1-p) = 50 \ge 5$, we can use the Normal approximation $Y \sim N(50, 25)$. We want $P(X > 60)$, which is the same as $P(X \ge 61)$. Using the continuity correction, we approximate this by $P(Y \ge 61 - 0.5) = P(Y \ge 60.5)$. Standardize the value: $Z = \frac{60.5 - 50}{5} = \frac{10.5}{5} = 2.1$. So we need $P(Z \ge 2.1)$, where $Z \sim N(0, 1)$. Using a standard Normal table or scipy.stats: $P(Z \ge 2.1) = 1 - P(Z < 2.1) \approx 1 - 0.9821 = 0.0179$. The probability is approximately 1.79%.

Other applications include the Normal approximation to the Poisson distribution (explored in the exercises) and the justification of many statistical procedures that rely on the approximate Normality of sample means.

Hands-on: Simulating the Central Limit Theorem

Let’s use Python to see the CLT in action. We’ll simulate taking averages from a non-Normal distribution (the Exponential distribution) and see how the distribution of these averages becomes Normal as the sample size $n$ increases.

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Set parameters for the simulation
# We'll use the Exponential distribution
# Lambda (rate parameter) for Exponential
lambda_param = 1.0 
# Theoretical mean (mu) = 1/lambda
theo_mean = 1 / lambda_param
# Theoretical variance (sigma^2) = 1/lambda^2
theo_var = 1 / (lambda_param**2)
# Theoretical standard deviation (sigma)
theo_std_dev = np.sqrt(theo_var) 

# Number of simulations (number of sample means to generate)
num_simulations = 10000 

# Sample sizes (n) to test
sample_sizes = [1, 2, 5, 10, 30, 100] 

print(f"Original Distribution: Exponential(lambda={lambda_param})")
print(f"Theoretical Mean (mu): {theo_mean:.4f}")
print(f"Theoretical Variance (sigma^2): {theo_var:.4f}")
print(f"Theoretical Std Dev (sigma): {theo_std_dev:.4f}\n")

# Create subplots
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
# Flatten axes array for easy iteration
axes = axes.ravel() 

# --- Simulation Loop ---
for i, n in enumerate(sample_sizes):
    # Store the means of samples
    sample_means = [] 
    for _ in range(num_simulations):
        # 1. Draw n samples from the Exponential distribution
        samples = np.random.exponential(scale=1/lambda_param, size=n)
        # 2. Calculate the mean of these n samples
        current_mean = np.mean(samples)
        # 3. Store the mean
        sample_means.append(current_mean)
    
    # --- Plotting ---
    ax = axes[i]
    # Plot histogram of the sample means
    ax.hist(sample_means, bins=50, density=True, alpha=0.7, label='Sample Means Hist')
    
    # Calculate theoretical mean and std dev for the sample mean distribution (CLT prediction)
    clt_mean = theo_mean
    clt_std_dev = theo_std_dev / np.sqrt(n)
    
    # Generate points for the theoretical Normal PDF curve
    x_values = np.linspace(min(sample_means), max(sample_means), 200)
    clt_pdf = stats.norm.pdf(x_values, loc=clt_mean, scale=clt_std_dev)
    
    # Plot the theoretical Normal PDF
    ax.plot(x_values, clt_pdf, 'r-', lw=2, label='CLT Normal PDF')
    
    ax.set_title(f'Distribution of Sample Means (n={n})')
    ax.set_xlabel('Sample Mean Value')
    ax.set_ylabel('Density')
    ax.legend()

plt.tight_layout()
plt.show()
Original Distribution: Exponential(lambda=1.0)
Theoretical Mean (mu): 1.0000
Theoretical Variance (sigma^2): 1.0000
Theoretical Std Dev (sigma): 1.0000

[Figure: 2×3 grid of histograms of the sample means for n = 1, 2, 5, 10, 30, and 100, each overlaid with the CLT-predicted Normal PDF]

Observation: Notice how the histogram of sample means starts highly skewed (similar to the original Exponential distribution) when $n=1$. As the sample size $n$ increases (from 2, 5, 10, 30, to 100), the shape of the histogram becomes increasingly symmetric and bell-shaped, closely matching the red curve representing the Normal distribution predicted by the CLT ($N(\mu, \sigma^2/n)$). For $n=30$ and especially $n=100$, the fit is remarkably good, even though the original data came from a very non-Normal, skewed Exponential distribution.

Using Q-Q Plots for Visual Assessment

A Quantile-Quantile (Q-Q) plot is another excellent tool to visually check if a dataset follows a particular distribution (in this case, Normal). It plots the quantiles of the sample data against the theoretical quantiles of the target distribution. If the data follows the distribution, the points should fall approximately along a straight line.

Let’s generate Q-Q plots of our sample means for different values of $n$.

fig_qq, axes_qq = plt.subplots(2, 3, figsize=(15, 10))
axes_qq = axes_qq.ravel()

# --- Simulation and Q-Q Plot Loop ---
for i, n in enumerate(sample_sizes):
    # Regenerate sample means for clarity (could reuse from above)
    sample_means_qq = []
    for _ in range(num_simulations):
        samples = np.random.exponential(scale=1/lambda_param, size=n)
        current_mean = np.mean(samples)
        sample_means_qq.append(current_mean)
        
    ax_qq = axes_qq[i]
    # Create Q-Q plot against the Normal distribution
    # stats.probplot generates the plot directly
    stats.probplot(sample_means_qq, dist="norm", plot=ax_qq) 
    
    # Calculate theoretical parameters for title
    clt_mean = theo_mean
    clt_std_dev = theo_std_dev / np.sqrt(n)
    
    ax_qq.set_title(f'Q-Q Plot vs Normal (n={n})\nMean={clt_mean:.2f}, SD={clt_std_dev:.2f}')

plt.tight_layout()
plt.show()
[Figure: 2×3 grid of Q-Q plots of the sample means against the Normal distribution for n = 1, 2, 5, 10, 30, and 100]

Observation: The Q-Q plots reinforce our findings. For small $n$ (like $n=1$ or $n=2$), the points deviate significantly from the straight red line, especially at the tails, indicating non-Normality. As $n$ increases, the points align much more closely with the line, showing that the distribution of sample means is increasingly well-approximated by a Normal distribution. For $n=30$ and $n=100$, the points lie almost perfectly on the line.

Hands-on: Normal Approximation to Binomial

Let’s verify the coin flip example calculation ($P(X > 60)$ for $X \sim Binomial(100, 0.5)$) using Python.

# Parameters for Binomial
n_binom = 100
p_binom = 0.5

# Calculate exact probability using Binomial CDF/SF
# P(X > 60) = P(X >= 61) = 1 - P(X <= 60)
# Using Survival Function (SF = 1 - CDF) is often more numerically stable for upper tail
exact_prob = stats.binom.sf(k=60, n=n_binom, p=p_binom)

# Calculate parameters for Normal approximation
mu_approx = n_binom * p_binom
var_approx = n_binom * p_binom * (1 - p_binom)
sigma_approx = np.sqrt(var_approx)

# Calculate approximated probability WITHOUT continuity correction
# P(Y > 60) where Y ~ N(mu_approx, var_approx)
approx_prob_no_cc = stats.norm.sf(x=60, loc=mu_approx, scale=sigma_approx)

# Calculate approximated probability WITH continuity correction
# P(Y >= 60.5) where Y ~ N(mu_approx, var_approx)
# We want P(X > 60) which is P(X >= 61). The continuity correction is P(Y >= 61 - 0.5) = P(Y >= 60.5)
approx_prob_cc = stats.norm.sf(x=60.5, loc=mu_approx, scale=sigma_approx)

print(f"Binomial Parameters: n={n_binom}, p={p_binom}")
print(f"Normal Approx Params: mean={mu_approx}, std_dev={sigma_approx:.4f}\n")

print(f"Exact Probability P(X > 60) using Binomial: {exact_prob:.6f}")
print(f"Approx Probability P(Y > 60) (No CC):       {approx_prob_no_cc:.6f}")
print(f"Approx Probability P(Y >= 60.5) (With CC):  {approx_prob_cc:.6f}")

print(f"\nError (No CC): {abs(approx_prob_no_cc - exact_prob):.6f}")
print(f"Error (With CC): {abs(approx_prob_cc - exact_prob):.6f}")
Binomial Parameters: n=100, p=0.5
Normal Approx Params: mean=50.0, std_dev=5.0000

Exact Probability P(X > 60) using Binomial: 0.017600
Approx Probability P(Y > 60) (No CC):       0.022750
Approx Probability P(Y >= 60.5) (With CC):  0.017864

Error (No CC): 0.005150
Error (With CC): 0.000264

Observation: As expected, the Normal approximation with continuity correction (0.017864) is significantly closer to the exact Binomial probability (0.017600) than the approximation without it (0.022750). This highlights the importance of the continuity correction when approximating a discrete distribution with a continuous one.

Summary

The Central Limit Theorem is a cornerstone of probability and statistics. It tells us that under fairly general conditions (IID variables, finite variance), the distribution of the sample mean (or sum) converges to a Normal distribution as the sample size grows large. This convergence is in distribution, meaning the shape of the standardized distribution approaches the standard Normal bell curve.

This theorem is incredibly practical: it lets us approximate probabilities for sums and averages (such as Binomial and Poisson probabilities) using the Normal distribution, and it underlies many standard statistical methods built on sample means.

Through simulation, we visually confirmed how the distribution of sample means from an Exponential distribution becomes increasingly Normal as the sample size $n$ increases, validating the CLT’s prediction. We also demonstrated the effectiveness and importance of the continuity correction when using the Normal distribution to approximate Binomial probabilities.

Understanding the CLT empowers you to make approximations, understand statistical methods, and appreciate the surprising emergence of order (the Normal distribution) from the combination of many independent random events.

Exercises

  1. Simulating from Uniform: Repeat the simulation and plotting steps (histograms and Q-Q plots) from the “Hands-on: Simulating the Central Limit Theorem” section, but this time draw samples from a Uniform(0, 1) distribution instead of an Exponential distribution. Does the convergence to Normality appear faster or slower compared to the Exponential case? Why might this be?

    • Hint: The Uniform(0, 1) distribution has mean $\mu = 0.5$ and variance $\sigma^2 = 1/12$. Consider the symmetry of the Uniform distribution compared to the Exponential.

  2. Poisson Approximation: The number of calls arriving at a call center follows a Poisson distribution with an average rate of $\lambda = 30$ calls per hour. We are interested in the probability of receiving fewer than 25 calls in a given hour.

    • Calculate the exact probability using scipy.stats.poisson.

    • Use the Normal approximation to the Poisson distribution (recall for Poisson($\lambda$), $\mu = \lambda$ and $\sigma^2 = \lambda$) to estimate this probability. Remember to use the continuity correction. Compare the approximation to the exact value. (Is the approximation valid here? Check the rule of thumb $\lambda \ge 5$ or $\lambda \ge 10$.)

  3. Limitations Question: Suppose you are averaging samples from a Cauchy distribution. Why does the standard Central Limit Theorem not apply in this case? What key condition is violated?