
In the previous chapter, we explored conditional probability – how the probability of an event changes given that another event has occurred. Now, we’ll delve into one of the most powerful and widely applicable results stemming from conditional probability: Bayes’ Theorem. This theorem provides a formal way to update our beliefs (probabilities) in light of new evidence. We will also formally define and explore the concept of independence between events, a crucial idea for simplifying probability calculations.

Learning Objectives:

  • Derive Bayes’ Theorem from the definition of conditional probability and interpret its components.

  • Update prior beliefs to posterior beliefs in light of new evidence.

  • Apply Bayes’ Theorem to practical problems such as diagnostic testing.

  • Define independence of events and distinguish it from mutual exclusivity.

  • Explain conditional independence and how hidden contexts can create or mask dependence.

1. Bayes’ Theorem: Derivation and Interpretation

Bayes’ Theorem provides a way to “reverse” conditional probabilities. If we know P(B|A), Bayes’ Theorem helps us find P(A|B). It’s named after Reverend Thomas Bayes (1701-1761), who first provided an equation that allows new evidence to update beliefs.

Derivation:

Recall the definition of conditional probability:

  1. P(A|B) = \frac{P(A \cap B)}{P(B)}, provided P(B) > 0.

  2. P(B|A) = \frac{P(B \cap A)}{P(A)}, provided P(A) > 0.

Since P(A \cap B) = P(B \cap A), we can rearrange these equations:

  1. P(A \cap B) = P(A|B) P(B)

  2. P(B \cap A) = P(B|A) P(A)

Setting them equal gives:

P(A|B) P(B) = P(B|A) P(A)

Dividing by P(B) (assuming P(B) > 0), we get Bayes’ Theorem:

P(A|B) = \frac{P(B|A) P(A)}{P(B)}

Interpretation:

Let’s think of A as an event or hypothesis we are interested in (e.g., “a patient has a specific disease,” “a coin is biased”) and B as new evidence or data observed (e.g., “the patient tested positive,” “we observed 8 heads in 10 flips”).

In this language, P(A) is the prior (our belief before seeing the evidence), P(B|A) is the likelihood of the evidence under the hypothesis, and P(A|B) is the posterior (our updated belief). The denominator P(B) is the overall probability of the evidence, which can be expanded using the Law of Total Probability:

P(B) = P(B\mid A)P(A) + P(B\mid A^c)P(A^c).
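To see the update mechanically, here is a minimal Python sketch (the function name bayes_posterior and the example numbers are illustrative, not from the text). It takes a prior P(A), the likelihoods P(B|A) and P(B|A^c), expands P(B) with the Law of Total Probability, and returns the posterior P(A|B):

```python
def bayes_posterior(prior, likelihood, likelihood_complement):
    """Return P(A|B) given P(A), P(B|A), and P(B|A^c).

    The evidence P(B) is expanded with the Law of Total Probability:
    P(B) = P(B|A)P(A) + P(B|A^c)P(A^c).
    """
    evidence = likelihood * prior + likelihood_complement * (1 - prior)
    return likelihood * prior / evidence

# Illustrative numbers only: P(A) = 0.30, P(B|A) = 0.90, P(B|A^c) = 0.20
print(bayes_posterior(prior=0.30, likelihood=0.90, likelihood_complement=0.20))  # ≈ 0.659
```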

1.1 Visual intuition: Bayes’ Theorem (area model)

You can read Bayes’ theorem directly from the picture below as an area ratio:

By the definition of conditional probability:

P(A\mid B) = \frac{P(A\cap B)}{P(B)}.

Now rewrite the overlap using the multiplication rule:

P(A\cap B) = P(B\mid A)\,P(A),

which gives the compact “Bayes form”:

P(A\mid B) = \frac{P(B\mid A)\,P(A)}{P(B)}.

To connect this directly to the area model, expand the denominator by splitting B into the part inside A and the part inside A^c:

\begin{align*} P(B) &= P(B\cap A)+P(B\cap A^c) \\ &= P(B\mid A)P(A)+P(B\mid A^c)P(A^c). \end{align*}

Substitute into the Bayes form:

\begin{align*} P(A\mid B) &= \frac{P(B\mid A)P(A)}{P(B\mid A)P(A)+P(B\mid A^c)P(A^c)}. \end{align*}
Area model: P(A\mid B) is “the share of the shaded B region that falls inside the A strip”.

How to read the diagram

2. Updating Beliefs: Prior and Posterior Probabilities

The core idea of Bayesian thinking is updating beliefs. We start with a prior belief, gather data (evidence), and update our belief to a posterior. This posterior can then become the prior for the next piece of evidence.

Example: Imagine you have a website and you’re testing a new ad banner.

After observing the visitor’s browsing history, your belief that the ad is effective increased from 30% (prior) to 60% (posterior).

3. Applications: The Diagnostic Test Example

One of the most classic and intuitive applications of Bayes’ Theorem is in interpreting the results of medical diagnostic tests.

Scenario: A disease affects 1% of the population. A diagnostic test is 95% accurate in the following sense: it returns a positive result for 95% of people who have the disease, and a (false) positive result for 5% of people who do not.

Question: If a randomly selected person tests positive, what is the probability they actually have the disease?

Let’s define the events:

  • D = “the person has the disease”, D^c = “the person does not have the disease”

  • Pos = “the person tests positive”

What we know:

  • P(D) = 0.01 (prevalence), so P(D^c) = 0.99

  • P(Pos|D) = 0.95 (sensitivity, or true positive rate)

  • P(Pos|D^c) = 0.05 (false positive rate)

What we want to find: P(D|Pos) (the probability of having the disease given a positive test result).

Apply Bayes’ Theorem:

P(D|Pos) = \frac{P(Pos|D) P(D)}{P(Pos)}

We need to find P(Pos). Use the Law of Total Probability:

\begin{align*} P(\text{Pos}) &= P(\text{Pos}|D)P(D) + P(\text{Pos}|D^c)P(D^c) \\ &= (0.95)(0.01) + (0.05)(0.99) \\ &= 0.0095 + 0.0495 \\ &= 0.0590 \end{align*}

Now substitute into Bayes’ Theorem:

\begin{align*} P(D|Pos) &= \frac{(0.95)(0.01)}{0.0590} \\ &= \frac{0.0095}{0.0590} \\ &\approx 0.161 \end{align*}

Interpretation: Even with a positive test result from a 95% accurate test, the probability of actually having the disease is only about 16.1%! This seems counter-intuitive but highlights the strong influence of the low prior probability (prevalence) of the disease. Most positive tests come from the large group of healthy people who receive a false positive, rather than the small group of sick people who receive a true positive.
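As a quick check, the same arithmetic can be reproduced in a few lines of Python (a sketch using the numbers stated above):

```python
prevalence = 0.01           # P(D): 1% of the population has the disease
sensitivity = 0.95          # P(Pos | D): true positive rate
false_positive_rate = 0.05  # P(Pos | D^c)

# Law of Total Probability for the denominator
p_pos = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# Bayes' Theorem
p_disease_given_pos = sensitivity * prevalence / p_pos

print(f"P(Pos)     = {p_pos:.4f}")                # 0.0590
print(f"P(D | Pos) = {p_disease_given_pos:.3f}")  # ≈ 0.161
```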

4. Independence of Events

Two events A and B are said to be independent if the occurrence (or non-occurrence) of one event does not affect the probability of the other event occurring.

In other words, two events A and B are independent if knowing whether one event happened tells you nothing about whether the other event will happen: their probabilities are not linked.

4.1. Formal Definition

The formal mathematical definition is that events A and B are independent if and only if: P(A \cap B) = P(A) P(B)

4.2. Alternative Definition (using conditional probability)

If P(B) > 0, A and B are independent if and only if:

P(A|B) = P(A)

Similarly, if P(A) > 0, independence means:

P(B|A) = P(B)

This definition aligns with the intuition: knowing B occurred doesn’t change the probability of A.
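To make the check concrete, here is a small enumeration sketch (my own example, not from the text): on a fair die, take A = “the roll is even” and B = “the roll is at most 2”. The product rule holds, so these two events are independent:

```python
from fractions import Fraction

outcomes = range(1, 7)                   # fair six-sided die
A = {s for s in outcomes if s % 2 == 0}  # "the roll is even"
B = {s for s in outcomes if s <= 2}      # "the roll is at most 2"

def prob(event):
    return Fraction(len(event), 6)       # equally likely outcomes

print(prob(A & B))                       # 1/6
print(prob(A) * prob(B))                 # 1/2 * 1/3 = 1/6, so A and B are independent
```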

Important Note: Do not confuse independence with mutual exclusivity. If A and B are mutually exclusive and both have positive probability, they cannot be independent: knowing that A occurred tells you that B did not, so P(A \cap B) = 0 \neq P(A)P(B).

5. Conditional Independence

Sometimes two events appear related overall (in the same experiment), but become independent once we condition on a relevant context C.

Think of C as a context switch: if you fix the context, A and B stop giving each other information.


5.1. Notation and Definition

Before we explore conditional independence, we need to understand how to work with conditional probabilities involving multiple conditions.

Conditioning on Multiple Events

When we write P(A \mid B, C), we mean the probability of event A given that both events B and C have occurred. This is equivalent to conditioning on the intersection:

P(A \mid B, C) = P(A \mid B \cap C)

The comma in the conditioning clause is simply a convenient shorthand for the intersection. Both notations are used interchangeably in probability and statistics.

Formal Definition of Conditional Independence

Before we dive into the formal definition, recall that we’ve already seen independence in Section 4. Conditional independence is a related but distinct concept: it’s about independence that holds within a specific context, even though the events might be dependent overall when contexts are mixed.

We use the symbol \perp (read “is independent of”). We also use the symbol \Longleftrightarrow (if and only if) to indicate that both statements are equivalent—each implies the other.

A \perp B \mid C \quad\Longleftrightarrow\quad P(A \cap B \mid C) = P(A \mid C)\,P(B \mid C), \quad \text{provided } P(C) > 0.

How to read it: “Within the world where C is known to be true, A and B behave like independent events.”

Visual representation: Conditional Independence

The Venn diagram below illustrates conditional independence. When we condition on event C having occurred, we restrict our attention to the region C. Within that region, events A and B are independent, meaning the overlap of A and B within C equals what we’d expect from the product of their conditional probabilities.

Three-panel visualization of the conditional independence formula. Left panel: P(A \mid C) highlights all regions in both A and C. Middle panel: P(B \mid C) highlights all regions in both B and C. Right panel: P(A \cap B \mid C) highlights only the region in all three sets. The formula states these proportions satisfy: P(A \cap B \mid C) = P(A \mid C) \times P(B \mid C).

Key observation from the three panels:

The three panels above show how each term in the conditional independence formula corresponds to a different region within C.

Breaking down the formula: P(A \cap B \mid C) = P(A \mid C) \times P(B \mid C)

The independence relationship: Conditional independence means that when we restrict our view to region C, these proportions satisfy the multiplication rule. The proportion in both A and B (right panel) equals the product of the individual proportions (left panel × middle panel). This is the visual embodiment of P(A \cap B \mid C) = P(A \mid C) P(B \mid C).

This is different from looking at A and B in the entire sample space, where they might be dependent. Conditional independence means they become independent once we fix the context C.




5.2. A visual mini-example: two flips of a randomly chosen coin

To make conditional independence concrete, we’ll use a simple example.

We have two coins:

Pick a coin uniformly at random, then flip it twice.

Let:

  • C = “the fair coin was chosen” (so C^c = “the biased coin was chosen”)

  • H_1 = “the first flip is Heads”

  • H_2 = “the second flip is Heads”

Part 1: Independence within each context

First, let’s see what happens when we know which coin we have. The key insight is that once you fix the context (know the coin), the two flips become independent.

What to notice:

If you fix the coin (you know C or C^c), then the two flips are independent: knowing H_1 doesn’t change the probability of H_2. Mathematically:

P(H_2\mid H_1, C) = P(H_2\mid C) \quad\text{and}\quad P(H_2\mid H_1, C^c) = P(H_2\mid C^c)

This means the joint probability factorizes (splits into a product) within each context:

P(H_1\cap H_2\mid C) = P(H_1\mid C)\,P(H_2\mid C)

and similarly for C^c. Let’s visualize this:

Conditional independence within each context. Each panel fixes the coin type. Within a panel, the shaded overlap represents P(H_1\cap H_2\mid \text{context}), and the strip dimensions show P(H_1\mid \text{context}) and P(H_2\mid \text{context}).

Numerical verification:

For the fair coin (left panel):

  • P(H_1\mid C) = P(H_2\mid C) = 0.5

  • P(H_1\cap H_2\mid C) = 0.5 \times 0.5 = 0.25

For the biased coin (right panel):

  • P(H_1\mid C^c) = P(H_2\mid C^c) = 0.75

  • P(H_1\cap H_2\mid C^c) = 0.75 \times 0.75 = 0.5625

In both panels, the joint probability equals the product of the marginals. This is what independence looks like.
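The same numbers can be read off an exact enumeration of the whole experiment. The sketch below (assuming the setup above: each coin chosen with probability 0.5, heads probabilities 0.5 and 0.75) builds the joint distribution over (coin, flip 1, flip 2) and confirms the factorization within each context:

```python
from itertools import product

P_COIN = {"fair": 0.5, "biased": 0.5}    # P(C) and P(C^c)
P_HEADS = {"fair": 0.5, "biased": 0.75}  # P(heads | coin)

# Joint probability of every outcome (coin, flip1, flip2); the flips are
# generated independently once the coin is fixed (the modeling assumption).
joint = {}
for coin, f1, f2 in product(P_COIN, "HT", "HT"):
    p1 = P_HEADS[coin] if f1 == "H" else 1 - P_HEADS[coin]
    p2 = P_HEADS[coin] if f2 == "H" else 1 - P_HEADS[coin]
    joint[(coin, f1, f2)] = P_COIN[coin] * p1 * p2

def prob(pred):
    """Total probability of the outcomes satisfying the predicate."""
    return sum(p for outcome, p in joint.items() if pred(outcome))

for coin in P_COIN:
    p_c = prob(lambda o: o[0] == coin)
    p_h1 = prob(lambda o: o[0] == coin and o[1] == "H") / p_c
    p_h2 = prob(lambda o: o[0] == coin and o[2] == "H") / p_c
    p_both = prob(lambda o: o[0] == coin and o[1] == "H" and o[2] == "H") / p_c
    print(f"{coin}: P(H1∩H2|coin) = {p_both:.4f}, product = {p_h1 * p_h2:.4f}")
# fair:   0.2500 vs 0.2500
# biased: 0.5625 vs 0.5625
```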


Part 2: What happens when the context is hidden (mixing)

Now comes the surprising part: when we don’t know which coin was chosen, the flips are no longer independent!

Why dependence emerges:

If you don’t know the coin, then observing H_1 gives you information about which coin you probably have. For example, seeing Heads on the first flip makes the biased coin more likely, which in turn makes Heads on the second flip more likely.

Mathematical setup:

To find the overall probability of both flips being heads when we don’t know which coin was chosen, we apply the Law of Total Probability using the partition \{C, C^c\}:

\begin{align*} P(H_1\cap H_2) &= P(H_1\cap H_2\mid C)P(C) \\ &\quad + P(H_1\cap H_2\mid C^c)P(C^c) \end{align*}

This is the same principle we used earlier for single events (like P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)), but now applied to the intersection H_1 \cap H_2. We’re splitting the joint event into two mutually exclusive cases (fair coin vs. biased coin) and adding their weighted probabilities.

Let’s visualize how mixing the two contexts creates dependence:

The mixing effect. When we don’t know which coin was chosen, we must combine the two contexts (fair and biased) using their probabilities as weights.

Understanding the calculation:

When the context is hidden, we use the Law of Total Probability to combine both scenarios, weighting each by how likely it is to occur (note that P(C) + P(C^c) = 1):

\begin{align*} P(H_1\cap H_2) &= P(H_1\cap H_2\mid C)P(C) \\ &\quad + P(H_1\cap H_2\mid C^c)P(C^c) \end{align*}

Numerical verification:

Recall from our setup that we choose each coin with equal probability, so P(C) = P(C^c) = 0.5. The fair coin gives heads with probability 0.5, and the biased coin gives heads with probability 0.75. Since each flip has the same probability regardless of whether it’s first or second, we have P(H_1\mid C) = P(H_2\mid C) = 0.5 and P(H_1\mid C^c) = P(H_2\mid C^c) = 0.75.

Now let’s calculate the individual probabilities:

\begin{align*} P(H_1) &= P(H_1\mid C)P(C) + P(H_1\mid C^c)P(C^c) \\ &= (0.50)(0.50) + (0.75)(0.50) \\ &= 0.625 \end{align*}
P(H_2) = 0.625 \quad \text{(by the same calculation)}

For the intersection, we combine two ideas:

  1. Law of Total Probability (shown above) gives us the structure:

    \begin{align*} P(H_1\cap H_2) &= P(H_1\cap H_2\mid C)P(C) \\ &\quad + P(H_1\cap H_2\mid C^c)P(C^c) \end{align*}
  2. Conditional independence (from Part 1) lets us factorize within each context:

    For the fair coin:

    \begin{align*} P(H_1\cap H_2\mid C) &= P(H_1\mid C) \times P(H_2\mid C) \\ &= 0.5 \times 0.5 \\ &= 0.25 \end{align*}

    For the biased coin:

    \begin{align*} P(H_1\cap H_2\mid C^c) &= P(H_1\mid C^c) \times P(H_2\mid C^c) \\ &= 0.75 \times 0.75 \\ &= 0.5625 \end{align*}
  3. Putting it together:

    \begin{align*} P(H_1\cap H_2) &= P(H_1\cap H_2\mid C)P(C) + P(H_1\cap H_2\mid C^c)P(C^c) \\ &= (0.25)(0.50) + (0.5625)(0.50) \\ &= 0.125 + 0.28125 \\ &= 0.40625 \end{align*}

Now let’s check for independence:

P(H_1\cap H_2) = 0.40625
\begin{align*} P(H_1) \times P(H_2) &= 0.625 \times 0.625 \\ &= 0.390625 \end{align*}

Since 0.40625 \neq 0.390625, the joint probability does not equal the product. This means the events are dependent when the context is hidden.

Update check (alternative verification):

We can also verify dependence by checking whether observing H_1 updates our belief about H_2:

\begin{align*} P(H_2\mid H_1) &= \frac{P(H_1\cap H_2)}{P(H_1)} \\ &= \frac{0.40625}{0.625} \\ &= 0.65 \end{align*}

But P(H_2) = 0.625. Since P(H_2\mid H_1) = 0.65 \neq 0.625 = P(H_2), observing H_1 does update our belief about H_2, confirming they are dependent.
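As a final sanity check (a simulation sketch, not part of the original text), a short Monte Carlo run of the mixture reproduces these numbers to within sampling error:

```python
import random

random.seed(0)
N = 1_000_000
p_heads = {"fair": 0.5, "biased": 0.75}

n_h1 = n_h2 = n_both = 0
for _ in range(N):
    coin = "fair" if random.random() < 0.5 else "biased"  # P(C) = P(C^c) = 0.5
    h1 = random.random() < p_heads[coin]
    h2 = random.random() < p_heads[coin]
    n_h1 += h1
    n_h2 += h2
    n_both += h1 and h2

print(f"P(H1∩H2)    ≈ {n_both / N:.4f}  (exact 0.40625)")
print(f"P(H1)·P(H2) ≈ {(n_h1 / N) * (n_h2 / N):.4f}  (exact 0.390625)")
print(f"P(H2|H1)    ≈ {n_both / n_h1:.4f}  (exact 0.65)")
```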

The key insight: The flips are independent within each context, but dependent overall. This is because observing H_1 changes our belief about which coin we have, which in turn affects our belief about H_2.


Connecting back to the general principle:

Our coin example illustrates the key insight from Section 5.1: the flips are independent within each context (given C or C^c), yet dependent once the coin choice is hidden.

This demonstrates that conditional independence does not imply unconditional independence. The dependence emerges when we mix contexts (average over the hidden variable C). This pattern appears everywhere in statistics and data analysis: relationships that disappear within subgroups but appear in the overall data, or vice versa.


5.3. Key takeaways and real-world applications

The core insight in one sentence:

Conditioning on C “locks in the context”: given C, events A and B don’t update each other. When C is hidden, mixing contexts can create dependence (or mask independence).

Why this matters in practice:

Conditional independence is the idea behind controlling for confounders in real experiments and data analysis:

  1. Medical research: An apparent relationship between a treatment and outcome might weaken, disappear, or even reverse once you control for age, sex, or baseline severity.

  2. Data analysis: Many “false discoveries” come from ignoring hidden grouping variables. Mixing data from different batches, sites, or time periods can create spurious correlations that look like real effects.

  3. Machine learning: Understanding when features are conditionally independent given others is crucial for building accurate models and avoiding confounding.

The practical lesson: Always ask “what context am I in?” When analyzing relationships between variables, consider whether there’s a hidden factor C that, once accounted for, changes the picture entirely. This is one of the most important concepts for moving from probability theory to real-world statistical reasoning.


Chapter Summary

In this chapter we derived Bayes’ Theorem from the definition of conditional probability, used it to update prior beliefs into posterior beliefs (as in the diagnostic test example), formally defined independence of events, and saw how conditional independence can hold within a context yet break down when contexts are mixed.

In the next part of the book, we will shift our focus from events to Random Variables – numerical outcomes of random phenomena – and explore their distributions. This will allow us to model and analyze probabilistic situations in a more structured way.

Exercises

  1. Two urns (Bayes): You pick an urn at random:

    • U_1 with probability 0.6 (contains 3 red, 2 blue)

    • U_2 with probability 0.4 (contains 1 red, 4 blue)

    You draw one ball and it is red. What is P(U_1\mid R)?

  2. Diagnostic test (posterior probability): A disease has prevalence P(D)=0.005 (0.5%). A test has:

    • Sensitivity P(\text{Pos}\mid D)=0.98

    • False positive rate P(\text{Pos}\mid D^c)=0.03

    If someone tests positive, what is P(D\mid \text{Pos})?

  3. Spam filter (Bayes): Suppose 20% of emails are spam:

    • P(S)=0.20

    • The word “FREE” appears in 50% of spam emails: P(F\mid S)=0.50

    • The word “FREE” appears in 2% of non-spam emails: P(F\mid S^c)=0.02

    If an email contains “FREE”, what is P(S\mid F)?

  4. Are these events independent? Roll a fair six-sided die.

    • A = “the roll is even” = {2, 4, 6}

    • B = “the roll is prime” = {2, 3, 5}

    Are A and B independent?

  5. Mutually exclusive vs independent: Roll a fair six-sided die.

    • A = “the roll is 1”

    • B = “the roll is 2”

    Are A and B independent?

  6. Conditional independence (coin mixture): You choose a coin:

    • Fair with probability P(C)=0.4 (so P(H\mid C)=0.5)

    • Biased with probability P(C^c)=0.6 (so P(H\mid C^c)=0.8)

    Then you flip it twice. Let H_1 be “first flip is Heads” and H_2 be “second flip is Heads”.

    1. Compute P(H_2) and P(H_2\mid H_1) and decide whether H_1 and H_2 are independent overall.

    2. Show that H_1 \perp H_2 \mid C.