In the previous chapter, we explored conditional probability – how the probability of an event changes given that another event has occurred. Now, we’ll delve into one of the most powerful and widely applicable results stemming from conditional probability: Bayes’ Theorem. This theorem provides a formal way to update our beliefs (probabilities) in light of new evidence. We will also formally define and explore the concept of independence between events, a crucial idea for simplifying probability calculations.
Bayes’ Theorem provides a way to “reverse” conditional probabilities. If we know P(B∣A), Bayes’ Theorem helps us find P(A∣B). It’s named after Reverend Thomas Bayes (1701-1761), who first provided an equation that allows new evidence to update beliefs.
Derivation:
Recall the definition of conditional probability:
P(A∣B) = P(A∩B) / P(B), provided P(B)>0.
P(B∣A) = P(B∩A) / P(A), provided P(A)>0.
Since P(A∩B)=P(B∩A), we can rearrange these equations:
P(A∩B)=P(A∣B)P(B)
P(B∩A)=P(B∣A)P(A)
Setting them equal gives:
P(A∣B)P(B)=P(B∣A)P(A)
Dividing both sides by P(B) (assuming P(B)>0), we get Bayes’ Theorem:

P(A∣B) = P(B∣A)P(A) / P(B)
Let’s think of A as an event or hypothesis we are interested in (e.g., “a patient has a specific disease,” “a coin is biased”) and B as new evidence or data observed (e.g., “the patient tested positive,” “we observed 8 heads in 10 flips”).
P(A): Prior probability — our belief about A before seeing the evidence B.
P(B∣A): Likelihood — the probability of observing the evidence B given that A is true.
P(B): Probability of the evidence — the overall probability of observing B, regardless of whether A is true or not. Using the Law of Total Probability with the partition {A,Ac}: P(B) = P(B∣A)P(A) + P(B∣Ac)P(Ac).
P(A∣B): Posterior probability — our updated belief about A after observing the evidence B.
Bayes’ Theorem tells us how to update our prior belief P(A) to a posterior belief P(A∣B) based on the likelihood of the evidence P(B∣A) and the overall probability of the evidence P(B).
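To make the update rule concrete, here is a minimal Python sketch (my own addition, not part of the chapter's code) that computes a posterior from a prior, a likelihood, and the likelihood under the complement; the function name bayes_update and the illustrative numbers are arbitrary:

def bayes_update(prior, likelihood, likelihood_given_not):
    """Posterior P(A|B) from P(A), P(B|A), and P(B|Ac)."""
    # P(B) via the Law of Total Probability over the partition {A, Ac}
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    # Bayes' Theorem: P(A|B) = P(B|A) P(A) / P(B)
    return likelihood * prior / evidence

# Illustrative (made-up) numbers: prior 0.5, P(B|A) = 0.8, P(B|Ac) = 0.3
print(bayes_update(0.5, 0.8, 0.3))  # ≈ 0.727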
In a Venn-diagram picture, read P(A∣B) as: “given that we are in the B region, what fraction of that region lies inside A?”
2. Updating Beliefs: Prior and Posterior Probabilities
The core idea of Bayesian thinking is updating beliefs. We start with a prior belief, gather data (evidence), and update our belief to a posterior. This posterior can then become the prior for the next piece of evidence.
Example: Imagine you have a website and you’re testing a new ad banner.
Hypothesis (A): The new ad banner is effective (e.g., has a click-through rate > 5%).
Prior ( P(A) ): Based on previous ad campaigns, you might initially believe there’s a 30% chance the new ad is effective. So, P(A)=0.30.
Evidence (B): You observe a visitor’s browsing history (e.g., they previously visited related product pages).
Likelihood ( P(B∣A) ): The probability that a visitor has this browsing history given the ad is effective. Perhaps effective ads are better targeted, so this might be high, say P(B∣A)=0.70.
Likelihood ( P(B∣Ac) ): The probability that a visitor has this browsing history given the ad is not effective. This might be lower, say P(B∣Ac)=0.20.
Probability of Evidence ( P(B) ): Using the Law of Total Probability: P(B) = P(B∣A)P(A) + P(B∣Ac)P(Ac) = (0.70)(0.30) + (0.20)(0.70) = 0.21 + 0.14 = 0.35. Bayes’ Theorem then gives the posterior P(A∣B) = P(B∣A)P(A)/P(B) = 0.21/0.35 = 0.60: the browsing-history evidence raises our belief that the ad is effective from 30% to 60%. (A short code check of this arithmetic follows this list.)
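Here is a small Python check of the calculation above (a sketch I'm adding; it simply re-does the arithmetic with the stated numbers):

# Numbers from the ad-banner example above
prior = 0.30            # P(A): the ad is effective
p_b_given_a = 0.70      # P(B|A): browsing history given effective
p_b_given_not_a = 0.20  # P(B|Ac): browsing history given not effective

# Law of Total Probability, then Bayes' Theorem
p_b = p_b_given_a * prior + p_b_given_not_a * (1 - prior)  # 0.35
posterior = p_b_given_a * prior / p_b                      # 0.60
print(p_b, posterior)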
One of the most classic and intuitive applications of Bayes’ Theorem is in interpreting the results of medical diagnostic tests.
Scenario:
A certain disease affects 1% of the population. (Prevalence)
A test for the disease has 95% accuracy:
If a person has the disease, the test correctly identifies it 95% of the time. (Sensitivity)
If a person does not have the disease, the test correctly identifies it 95% of the time. (Specificity)
Sensitivity and Specificity
Looking at the origins and definitions of the words “sensitivity” and “specificity” can definitely help reinforce their meanings in this context.
Sensitivity:
Origin: Comes from the Latin word sentire, meaning “to feel” or “to perceive.”
General Meaning: The quality or condition of being sensitive; responsiveness to stimuli.
Connection to the Test: Think of the test as needing to “feel” or “perceive” the presence of the disease. A highly sensitive test has a strong ability to detect the disease when it is actually there. It’s responsive to the “stimulus” of the disease. If the disease is present, a sensitive test is likely to react (give a positive result). This aligns with its technical meaning of correctly identifying true positives.
Specificity:
Origin: Comes from the Latin word specificus, derived from species (meaning “kind” or “sort”) and facere (meaning “to make”). Essentially, “making of a particular kind.”
General Meaning: The quality of being specific; restricted to a particular item, condition, or effect; being precise or exact.
Connection to the Test: Think of the test as being designed for one specific target – the disease. A highly specific test is precise and only reacts to that particular target. It does not react to other things (like the absence of the disease or other conditions). It correctly identifies individuals who do not have the specific target disease (giving a negative result). This aligns with its technical meaning of correctly identifying true negatives.
How it Helps Understanding:
Sensitivity: Relates to the test’s ability to sense or detect the disease if it’s present. High sensitivity means good detection.
Specificity: Relates to the test being specific or precise to only the disease in question. High specificity means the test only flags the specific condition it’s looking for and avoids flagging healthy people.
So, the origins help frame the concepts: sensitivity is about detection power, while specificity is about precision and target accuracy.
Question: If a randomly selected person tests positive, what is the probability they actually have the disease?
Let’s define the events:
D: The person has the disease.
Dc: The person does not have the disease.
Pos: The person tests positive.
Neg: The person tests negative.
What we know:
P(D)=0.01 (Prior probability of having the disease - Prevalence)
P(Dc)=1−P(D)=0.99
P(Pos∣D)=0.95 (Probability of testing positive given you have the disease - Sensitivity)
P(Neg∣D)=1−P(Pos∣D)=0.05 (False Negative Rate)
P(Neg∣Dc)=0.95 (Probability of testing negative given you don’t have the disease - Specificity)
P(Pos∣Dc)=1−P(Neg∣Dc)=0.05 (False Positive Rate)
What we want to find: P(D∣Pos), the probability of having the disease given a positive test result.
Apply Bayes’ Theorem:
P(D∣Pos) = P(Pos∣D)P(D) / P(Pos)
We need to find P(Pos). Using the Law of Total Probability: P(Pos) = P(Pos∣D)P(D) + P(Pos∣Dc)P(Dc) = (0.95)(0.01) + (0.05)(0.99) = 0.0095 + 0.0495 = 0.059. Substituting into Bayes’ Theorem: P(D∣Pos) = 0.0095 / 0.059 ≈ 0.161.
Interpretation: Even with a positive test result from a 95% accurate test, the probability of actually having the disease is only about 16.1%! This seems counter-intuitive but highlights the strong influence of the low prior probability (prevalence) of the disease. Most positive tests come from the large group of healthy people who receive a false positive, rather than the small group of sick people who receive a true positive.
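We can sanity-check this result with a quick Monte Carlo simulation (a sketch I'm adding, using NumPy; the seed and sample size are arbitrary):

import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
n = 1_000_000

# Disease status: prevalence 1%
has_disease = rng.random(n) < 0.01
# Test result: 95% sensitivity for the sick, 5% false-positive rate for the healthy
tests_positive = np.where(has_disease,
                          rng.random(n) < 0.95,
                          rng.random(n) < 0.05)

# P(D | Pos): fraction of positive testers who actually have the disease
print(has_disease[tests_positive].mean())  # ≈ 0.16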
Two events A and B are said to be independent if the occurrence (or non-occurrence) of one event does not affect the probability of the other event occurring.
In other words, two events A and B are independent if knowing whether one event happened tells you nothing about whether the other event will happen; their probabilities are not linked.
Formally, events A and B are independent if and only if:
P(A∩B)=P(A)P(B)
Explanation
Events A and B are independent if and only if the probability that both events happen is equal to the product of their individual probabilities.
Mathematically:
P(A∩B)=P(A)×P(B)
P(A∩B) means “the probability of both A AND B occurring” (the intersection of A and B).
P(A) is the probability of event A occurring.
P(B) is the probability of event B occurring.
Why does this formula capture independence?
Think about it this way: If the events truly don’t influence each other, the chance of them both happening should just be a simple multiplication of their individual chances. If there was some influence (dependence), this multiplication wouldn’t accurately reflect the combined probability.
Example: Flipping a Fair Coin Twice 🪙
Let’s consider flipping a fair coin two times.
Event A: Getting heads (H) on the first flip.
Event B: Getting heads (H) on the second flip.
We want to know if these two events are independent.
Calculate P(A):
The probability of getting heads on a single flip of a fair coin is 1/2.
So, P(A) = 1/2.
Calculate P(B):
The outcome of the second flip is not affected by the first flip. The coin has no memory. So, the probability of getting heads on the second flip is also 21.
So, P(B) = 1/2.
Calculate P(A∩B):
This is the probability of getting heads on the first flip AND heads on the second flip (HH).
The possible outcomes when flipping a coin twice are: HH, HT, TH, TT. There are 4 equally likely outcomes.
Only one of these outcomes is HH.
So, P(A∩B) = 1/4.
Check the Independence Formula:
Now we check if P(A∩B)=P(A)×P(B).
P(A)×P(B) = 1/2 × 1/2 = 1/4
We already found that P(A∩B) = 1/4.
Conclusion:
Since P(A∩B) = P(A)×P(B) (because 1/4 = 1/4), the events A (heads on the first flip) and B (heads on the second flip) are independent.
This makes intuitive sense: the result of the first coin flip doesn’t change the probability of getting heads or tails on the second flip.
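As a quick numerical check (a sketch added here, not part of the original example), we can simulate many pairs of fair-coin flips and compare the estimated P(A∩B) with P(A)×P(B):

import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n = 1_000_000

first_is_heads = rng.random(n) < 0.5   # event A on each trial
second_is_heads = rng.random(n) < 0.5  # event B on each trial

p_a = first_is_heads.mean()
p_b = second_is_heads.mean()
p_both = (first_is_heads & second_is_heads).mean()

print(p_both, p_a * p_b)  # both should be close to 0.25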
4.2. Alternative Definition (using conditional probability)
If P(B)>0, A and B are independent if and only if:
P(A∣B)=P(A)
Similarly, if P(A)>0, independence means:
P(B∣A)=P(B)
This definition aligns with the intuition: knowing B occurred doesn’t change the probability of A.
Important Note: Do not confuse independence with mutual exclusivity.
Mutually exclusive events cannot happen together (A∩B=∅, so P(A∩B)=0).
Independent events can happen together, but one doesn’t affect the other’s probability.
If two events A and B have non-zero probabilities, they cannot be both mutually exclusive and independent. If they were mutually exclusive, P(A∩B)=0. If they were independent, P(A∩B)=P(A)P(B)>0. This is a contradiction.
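A tiny exact example (added here as an illustration, using a fair six-sided die and Python's fractions module) makes the contrast explicit:

from fractions import Fraction

outcomes = set(range(1, 7))  # faces of a fair die, all equally likely

def prob(event):
    """Exact probability of an event (a subset of the die faces)."""
    return Fraction(len(event & outcomes), len(outcomes))

# Mutually exclusive events (even vs. odd): cannot happen together, NOT independent
A, B = {2, 4, 6}, {1, 3, 5}
print(prob(A & B), prob(A) * prob(B))  # 0 vs 1/4

# Independent events ("roll <= 2" and "even"): can happen together
A, B = {1, 2}, {2, 4, 6}
print(prob(A & B), prob(A) * prob(B))  # 1/6 vs 1/6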
When we write P(A∣B,C), we mean the probability of event A given that both events B and C have occurred. This is equivalent to conditioning on the intersection: P(A∣B,C) = P(A∣B∩C).
The comma in the conditioning clause is simply a convenient shorthand for the intersection. Both notations are used interchangeably in probability and statistics.
Example: Coin Flips
Consider flipping a coin twice after choosing which coin to use:
Let H1 = “first flip is heads”
Let H2 = “second flip is heads”
Let C = “we chose the fair coin”
Then P(H2∣H1,C) means: “What is the probability the second flip is heads, given that the first flip was heads AND we chose the fair coin?”
For a fair coin, knowing the first flip doesn’t help predict the second flip, so: P(H2∣H1,C) = P(H2∣C).
This equation says: “Given we have the fair coin, learning about the first flip gives us no additional information about the second flip.” This is an example of conditional independence, which we’ll explore in detail below.
Before we dive into the formal definition, recall that we’ve already seen independence in Section 4. Conditional independence is a related but distinct concept: it’s about independence that holds within a specific context, even though the events might be dependent overall when contexts are mixed.
We use the symbol ⊥ (read “is independent of”). We also use the symbol ⟺ (if and only if) to indicate that both statements are equivalent—each implies the other.
The Venn diagram below illustrates conditional independence. When we condition on event C having occurred, we restrict our attention to the region C. Within that region, events A and B are independent, meaning the overlap of A and B within C equals what we’d expect from the product of their conditional probabilities.
from pathlib import Path

import matplotlib.pyplot as plt
from matplotlib_venn import venn3, venn3_circles

# Create figure with three panels side by side
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 5))

# Function to create a Venn diagram with specific highlighting
def create_venn_panel(ax, highlight_mode, title_text):
    """
    highlight_mode: 'A_and_C', 'B_and_C', or 'A_and_B_and_C'
    """
    # Create three-set Venn diagram
    v = venn3(subsets=(1, 1, 1, 1, 1, 1, 1), set_labels=('', '', ''), ax=ax)

    # Default: all regions in C are light, regions outside C are very light
    if v.get_patch_by_id('100'):  # A only
        v.get_patch_by_id('100').set_color('#f5f5f5')
        v.get_patch_by_id('100').set_alpha(0.5)
    if v.get_patch_by_id('010'):  # B only
        v.get_patch_by_id('010').set_color('#f5f5f5')
        v.get_patch_by_id('010').set_alpha(0.5)
    if v.get_patch_by_id('110'):  # A ∩ B only (not in C)
        v.get_patch_by_id('110').set_color('#e0e0e0')
        v.get_patch_by_id('110').set_alpha(0.4)

    # Color C regions based on highlight mode
    if highlight_mode == 'A_and_C':
        # Highlight all A ∩ C regions
        if v.get_patch_by_id('001'):  # C only
            v.get_patch_by_id('001').set_color('#ffe0b2')
            v.get_patch_by_id('001').set_alpha(0.5)
        if v.get_patch_by_id('011'):  # B ∩ C (not in A)
            v.get_patch_by_id('011').set_color('#ffe0b2')
            v.get_patch_by_id('011').set_alpha(0.5)
        if v.get_patch_by_id('101'):  # A ∩ C (not in B) - HIGHLIGHT
            v.get_patch_by_id('101').set_color('#ff9800')
            v.get_patch_by_id('101').set_alpha(0.85)
        if v.get_patch_by_id('111'):  # A ∩ B ∩ C - HIGHLIGHT
            v.get_patch_by_id('111').set_color('#ff9800')
            v.get_patch_by_id('111').set_alpha(0.85)
    elif highlight_mode == 'B_and_C':
        # Highlight all B ∩ C regions
        if v.get_patch_by_id('001'):  # C only
            v.get_patch_by_id('001').set_color('#ffe0b2')
            v.get_patch_by_id('001').set_alpha(0.5)
        if v.get_patch_by_id('101'):  # A ∩ C (not in B)
            v.get_patch_by_id('101').set_color('#ffe0b2')
            v.get_patch_by_id('101').set_alpha(0.5)
        if v.get_patch_by_id('011'):  # B ∩ C (not in A) - HIGHLIGHT
            v.get_patch_by_id('011').set_color('#ff9800')
            v.get_patch_by_id('011').set_alpha(0.85)
        if v.get_patch_by_id('111'):  # A ∩ B ∩ C - HIGHLIGHT
            v.get_patch_by_id('111').set_color('#ff9800')
            v.get_patch_by_id('111').set_alpha(0.85)
    else:  # 'A_and_B_and_C'
        # Highlight only the center region
        if v.get_patch_by_id('001'):  # C only
            v.get_patch_by_id('001').set_color('#ffe0b2')
            v.get_patch_by_id('001').set_alpha(0.5)
        if v.get_patch_by_id('101'):  # A ∩ C (not in B)
            v.get_patch_by_id('101').set_color('#ffe0b2')
            v.get_patch_by_id('101').set_alpha(0.5)
        if v.get_patch_by_id('011'):  # B ∩ C (not in A)
            v.get_patch_by_id('011').set_color('#ffe0b2')
            v.get_patch_by_id('011').set_alpha(0.5)
        if v.get_patch_by_id('111'):  # A ∩ B ∩ C - HIGHLIGHT
            v.get_patch_by_id('111').set_color('#ff6d00')
            v.get_patch_by_id('111').set_alpha(0.9)

    # Draw circles
    venn3_circles(subsets=(1, 1, 1, 1, 1, 1, 1), linestyle='solid', linewidth=2, ax=ax)

    # Add set labels
    label_A = v.get_label_by_id('A')
    if label_A:
        label_A.set_text('A')
        label_A.set_fontsize(16)
    label_B = v.get_label_by_id('B')
    if label_B:
        label_B.set_text('B')
        label_B.set_fontsize(16)
    label_C = v.get_label_by_id('C')
    if label_C:
        label_C.set_text('C')
        label_C.set_fontsize(16)

    # Add title below the diagram
    ax.text(0.5, -0.15, title_text,
            transform=ax.transAxes,
            fontsize=13, ha='center', va='top',
            fontweight='bold')
    return v

# Panel 1: P(A|C)
create_venn_panel(ax1, 'A_and_C', 'P(A | C)\nProportion of C that is in A')
# Panel 2: P(B|C)
create_venn_panel(ax2, 'B_and_C', 'P(B | C)\nProportion of C that is in B')
# Panel 3: P(A∩B|C)
create_venn_panel(ax3, 'A_and_B_and_C', 'P(A ∩ B | C)\nProportion of C in both A and B')

# Add overall title
fig.suptitle('Conditional Independence Formula Components: P(A ∩ B | C) = P(A | C) × P(B | C)',
             fontsize=16, fontweight='bold', y=0.98)

plt.tight_layout()
fig.savefig("venn-conditional-independence.svg", format="svg", bbox_inches="tight", pad_inches=0.3)
Three-panel visualization of the conditional independence formula. Left panel: P(A∣C) highlights all regions in both A and C. Middle panel: P(B∣C) highlights all regions in both B and C. Right panel: P(A∩B∣C) highlights only the region in all three sets. The formula states these proportions satisfy: P(A∩B∣C)=P(A∣C)×P(B∣C).
Key observation from the three panels:
The three panels above show how each term in the conditional independence formula corresponds to different regions within C:
Breaking down the formula: P(A∩B∣C) = P(A∣C) × P(B∣C)
Left panel - P(A∣C): Shows the proportion of region C that lies in A
The dark orange regions represent all parts of A that overlap with C
Middle panel - P(B∣C): Shows the proportion of region C that lies in B
The dark orange regions represent all parts of B that overlap with C
Right panel - P(A∩B∣C): Shows the proportion of region C that lies in both A and B
The dark orange region is the central intersection of all three sets
The independence relationship: Conditional independence means that when we restrict our view to region C, these proportions satisfy the multiplication rule. The proportion in both A and B (right panel) equals the product of the individual proportions (left panel × middle panel). This is the visual embodiment of P(A∩B∣C)=P(A∣C)P(B∣C).
This is different from looking at A and B in the entire sample space, where they might be dependent. Conditional independence means they become independent once we fix the context C.
A more intuitive equivalent check (optional, but useful)
If P(B∩C)>0, conditional independence can equivalently be checked as P(A∣B,C) = P(A∣C): once C is known, also learning B does not change the probability of A.
First, let’s see what happens when we know which coin we have. The key insight is that once you fix the context (know the coin), the two flips become independent.
What to notice:
If you fix the coin (you know C or Cc), then the two flips are independent: knowing H1 doesn’t change the probability of H2. Mathematically: P(H2∣H1,C) = P(H2∣C) and P(H1∩H2∣C) = P(H1∣C)P(H2∣C), and likewise with Cc in place of C.
Conditional independence within each context. Each panel fixes the coin type. Within a panel, the shaded overlap represents P(H1∩H2∣context), and the strip dimensions show P(H1∣context) and P(H2∣context).
In both panels, the joint probability equals the product of the marginals. This is what independence looks like.
Part 2: What happens when the context is hidden (mixing)
Now comes the surprising part: when we don’t know which coin was chosen, the flips are no longer independent!
Why dependence emerges:
If you don’t know the coin, then observing H1 gives you information about which coin you probably have. For example:
Seeing Heads on the first flip makes the biased coin more likely
This makes Heads on the second flip more likely
So H1 and H2 are dependent when the context is hidden
Mathematical setup:
To find the overall probability of both flips being heads when we don’t know which coin was chosen, we apply the Law of Total Probability using the partition {C,Cc}: P(H1∩H2) = P(H1∩H2∣C)P(C) + P(H1∩H2∣Cc)P(Cc).
This is the same principle we used earlier for single events (like P(B)=P(B∣A)P(A)+P(B∣Ac)P(Ac)), but now applied to the intersection H1∩H2. We’re splitting the joint event into two mutually exclusive cases (fair coin vs. biased coin) and adding their weighted probabilities.
Let’s visualize how mixing the two contexts creates dependence:
The mixing effect. When we don’t know which coin was chosen, we must combine the two contexts (fair and biased) using their probabilities as weights.
Understanding the calculation:
When the context is hidden, we use the Law of Total Probability to combine both scenarios, weighting each by how likely it is to occur (note that P(C)+P(Cc)=1). Because the flips are independent within each context, this becomes P(H1∩H2) = P(H1∣C)P(H2∣C)P(C) + P(H1∣Cc)P(H2∣Cc)P(Cc).
Recall from our setup that we choose each coin with equal probability, so P(C)=P(Cc)=0.5. The fair coin gives heads with probability 0.5, and the biased coin gives heads with probability 0.75. Since each flip has the same probability regardless of whether it’s first or second, we have P(H1∣C)=P(H2∣C)=0.5 and P(H1∣Cc)=P(H2∣Cc)=0.75.
Plugging in the numbers: P(H1∩H2) = (0.5×0.5)(0.5) + (0.75×0.75)(0.5) = 0.125 + 0.28125 = 0.40625, and by the same reasoning P(H1) = (0.5)(0.5) + (0.75)(0.5) = 0.625. Therefore P(H2∣H1) = P(H1∩H2)/P(H1) = 0.40625/0.625 = 0.65. But P(H2) = 0.625. Since P(H2∣H1) = 0.65 ≠ 0.625 = P(H2), observing H1 does update our belief about H2, confirming they are dependent.
The key insight: The flips are independent within each context, but dependent overall. This is because observing H1 changes our belief about which coin we have, which in turn affects our belief about H2.
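A short simulation sketch (my own addition, using NumPy and the probabilities stated above: P(C)=0.5, fair coin heads probability 0.5, biased coin heads probability 0.75) shows both effects at once:

import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed for reproducibility
n = 1_000_000

# Choose a coin for each trial: C = fair (heads prob 0.5), Cc = biased (heads prob 0.75)
fair = rng.random(n) < 0.5
p_heads = np.where(fair, 0.5, 0.75)

# Flip the chosen coin twice
h1 = rng.random(n) < p_heads
h2 = rng.random(n) < p_heads

# Context hidden: the flips are dependent
print(h2.mean(), h2[h1].mean())               # ≈ 0.625 vs ≈ 0.65

# Context fixed (fair coin known): the flips are independent
print(h2[fair].mean(), h2[fair & h1].mean())  # both ≈ 0.5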
Summary of all three scenarios
The calculations above showed what happens when the context is hidden (Scenario 1). Here’s a complete summary of all three cases:
Scenario 1: Context hidden (we do NOT know which coin). P(H1∩H2) = 0.40625, while P(H1)×P(H2) = 0.625×0.625 ≈ 0.391, so the events are dependent.
Scenario 2: Fair coin known (C). Since P(H1∩H2∣C) = 0.25 = 0.5×0.5 = P(H1∣C)×P(H2∣C), the events are independent within this context.
Scenario 3: Biased coin known (Cc). Since P(H1∩H2∣Cc) = 0.5625 = 0.75×0.75 = P(H1∣Cc)×P(H2∣Cc), the events are independent within this context.
Conclusion: H1 and H2 are conditionally independent given which coin was chosen (H1⊥H2∣C), but not independent when the coin is unknown.
Connecting back to the general principle:
Our coin example perfectly illustrates the key insight from Section 5.1:
We have H1⊥H2∣C (the flips are conditionally independent given the coin)
But we do NOT have H1⊥H2 (the flips are dependent overall when the coin is unknown)
This demonstrates that conditional independence does not imply unconditional independence. The dependence emerges when we mix contexts (average over the hidden variable C). This pattern appears everywhere in statistics and data analysis: relationships that disappear within subgroups but appear in the overall data, or vice versa.
Conditioning on C “locks in the context”—given C, events A and B don’t update each other. When C is hidden, mixing contexts can create dependence (or mask independence).
Why this matters in practice:
Conditional independence is the idea behind controlling for confounders in real experiments and data analysis:
Medical research: An apparent relationship between a treatment and outcome might weaken, disappear, or even reverse once you control for age, sex, or baseline severity.
Data analysis: Many “false discoveries” come from ignoring hidden grouping variables. Mixing data from different batches, sites, or time periods can create spurious correlations that look like real effects.
Machine learning: Understanding when features are conditionally independent given others is crucial for building accurate models and avoiding confounding.
The practical lesson: Always ask “what context am I in?” When analyzing relationships between variables, consider whether there’s a hidden factor C that, once accounted for, changes the picture entirely. This is one of the most important concepts for moving from probability theory to real-world statistical reasoning.
Bayes’ Theorem, P(A∣B) = P(B∣A)P(A) / P(B), provides a fundamental rule for updating probabilities (beliefs) based on new evidence.
It relates the posterior probability P(A∣B) to the prior probability P(A) and the likelihood P(B∣A).
The term P(B) acts as a normalizing constant and can often be calculated using the Law of Total Probability.
Bayes’ Theorem is crucial in fields like medical diagnosis, machine learning (spam filtering, classification), and scientific reasoning.
Two events A and B are independent if P(A∩B)=P(A)P(B), or equivalently, P(A∣B)=P(A) (assuming P(B)>0). The occurrence of one does not change the probability of the other.
Events A and B are conditionally independent given C if P(A∩B∣C)=P(A∣C)P(B∣C). They become independent once the outcome of C is known.
Simulation is a valuable tool for building intuition about Bayes’ Theorem and independence by observing frequencies in generated data.
In the next part of the book, we will shift our focus from events to Random Variables – numerical outcomes of random phenomena – and explore their distributions. This will allow us to model and analyze probabilistic situations in a more structured way.