Appendix E: Summary of Formulas
This appendix provides a summary of the key formulas introduced in Chapters 2–15.
Chapter 2: The Language of Probability: Sets, Sample Spaces, and Events

Axioms of Probability

Let $S$ be a sample space, and let $P(A)$ denote the probability of an event $A$.

Non-negativity: for any event $A$, $P(A) \ge 0$.

Normalization: the probability of the entire sample space is $P(S) = 1$.

Additivity for disjoint events: if $A_1, A_2, A_3, \dots$ is a sequence of mutually exclusive (disjoint) events (i.e., $A_i \cap A_j = \emptyset$ for all $i \ne j$), then the probability of their union is the sum of their individual probabilities:

$$P(A_1 \cup A_2 \cup A_3 \cup \cdots) = P(A_1) + P(A_2) + P(A_3) + \cdots$$

Basic Probability Rules

Probability range: for any event $A$, $0 \le P(A) \le 1$.

Complement rule: the probability that event $A$ does not occur is $P(A^c) = 1 - P(A)$.

Addition rule (general): for any two events $A$ and $B$ (not necessarily disjoint), the probability that $A$ or $B$ (or both) occurs is:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Empirical Probability

The empirical probability of an event $A$ is estimated from simulations:

$$P_{\text{empirical}}(A) = \frac{\text{number of times event } A \text{ occurred}}{\text{total number of trials}}$$

Chapter 3: Counting Techniques: Permutations and Combinations

The Multiplication Principle

If a procedure can be broken down into a sequence of $k$ steps, with $n_1$ ways for the first step, $n_2$ for the second, $\dots$, $n_k$ for the $k$-th step, then the total number of ways to perform the entire procedure is:
$$\text{Total ways} = n_1 \times n_2 \times \cdots \times n_k$$

Permutations (Order Matters)

Permutations without repetition: the number of permutations of $n$ distinct objects taken $k$ at a time is

$$P(n, k) = \frac{n!}{(n-k)!}$$

Permutations with repetition (multinomial coefficients): the number of distinct permutations of $n$ objects with $n_1$ identical objects of type 1, $n_2$ of type 2, $\dots$, $n_k$ of type $k$ (where $n_1 + n_2 + \cdots + n_k = n$) is

$$\frac{n!}{n_1! \, n_2! \cdots n_k!}$$

Combinations (Order Doesn't Matter)

Combinations without repetition: the number of combinations of $n$ distinct objects taken $k$ at a time (also "$n$ choose $k$") is

$$C(n, k) = \binom{n}{k} = \frac{n!}{k!(n-k)!}$$

Combinations with repetition: the number of combinations with repetition of $n$ types of objects taken $k$ at a time is

$$\binom{n+k-1}{k} = \frac{(n+k-1)!}{k!(n-1)!}$$

Probability with Equally Likely Outcomes

The probability of an event $E$ when all outcomes in the sample space $S$ are equally likely:

$$P(E) = \frac{\text{number of outcomes favorable to } E}{\text{total number of possible outcomes in } S} = \frac{|E|}{|S|}$$

Chapter 4: Conditional Probability

Definition of Conditional Probability

For any two events $A$ and $B$ from a sample space $S$, where $P(B) > 0$, the conditional probability of $A$ given $B$ is defined as:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

The Multiplication Rule for Conditional Probability

Rearranging the definition of conditional probability gives:

$$P(A \cap B) = P(A \mid B) \, P(B)$$

Similarly, if $P(A) > 0$:

$$P(A \cap B) = P(B \mid A) \, P(A)$$

For three events $A, B, C$:

$$P(A \cap B \cap C) = P(C \mid A \cap B) \, P(B \mid A) \, P(A)$$

The Law of Total Probability

Let $B_1, B_2, \dots, B_n$ be a partition of the sample space $S$. Then, for any event $A$ in $S$:

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i) \, P(B_i)$$

Expanded form:

$$P(A) = P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2) + \cdots + P(A \mid B_n)P(B_n)$$

Chapter 5: Bayes' Theorem and Independence

Bayes' Theorem

Provides a way to "reverse" conditional probabilities. If $P(B) > 0$:
$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$

where $P(B)$ can often be calculated using the Law of Total Probability (e.g., with the partition $\{A, A^c\}$):

$$P(B) = P(B \mid A)P(A) + P(B \mid A^c)P(A^c)$$

Independence of Events

Formal definition: events $A$ and $B$ are independent if and only if

$$P(A \cap B) = P(A) \, P(B)$$

Alternative definition (using conditional probability): if $P(B) > 0$, then $A$ and $B$ are independent if and only if $P(A \mid B) = P(A)$. Similarly, if $P(A) > 0$, independence means $P(B \mid A) = P(B)$.
Conditional Independence

Notation: $A \perp B \mid C$ means "$A$ and $B$ are conditionally independent given $C$."

Definition (with $P(C) > 0$):

$$A \perp B \mid C \iff P(A \cap B \mid C) = P(A \mid C) \, P(B \mid C)$$

Equivalent "no extra information" form: if $P(B \cap C) > 0$, then

$$A \perp B \mid C \iff P(A \mid B \cap C) = P(A \mid C)$$

Likewise, if $P(A \cap C) > 0$, then

$$P(B \mid A \cap C) = P(B \mid C)$$

Chapter 6: Discrete Random Variables

Probability Mass Function (PMF)

For a discrete random variable $X$, the PMF $p_X(x)$ is:
$$p_X(x) = P(X = x)$$

Properties of a PMF:

$p_X(x) \ge 0$ for all possible values $x$.

$\sum_x p_X(x) = 1$ (sum over all possible values $x$).
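These two properties are easy to verify numerically. A minimal sketch in Python (the fair-die PMF is an assumed example, not from the text):

```python
# Hypothetical example: the PMF of a fair six-sided die, p_X(x) = 1/6 for
# x in {1, ..., 6}, used to check the two PMF properties above.
import math

pmf = {x: 1 / 6 for x in range(1, 7)}

# Property 1: p_X(x) >= 0 for every value x.
nonnegative = all(p >= 0 for p in pmf.values())

# Property 2: the probabilities sum to 1 (up to floating-point rounding).
total = sum(pmf.values())

# The probability of an event is the sum of the PMF over its outcomes,
# e.g. P(X is even) = p_X(2) + p_X(4) + p_X(6) = 1/2.
p_even = sum(p for x, p in pmf.items() if x % 2 == 0)
```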
Cumulative Distribution Function (CDF)

For a random variable $X$, the CDF $F_X(x)$ is:

$$F_X(x) = P(X \le x)$$

For a discrete random variable $X$:

$$F_X(x) = \sum_{k \le x} p_X(k)$$

Properties of a CDF:

$0 \le F_X(x) \le 1$ for all $x$.

If $a < b$, then $F_X(a) \le F_X(b)$ (non-decreasing).

$\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to +\infty} F_X(x) = 1$.

$P(X > x) = 1 - F_X(x)$.

$P(a < X \le b) = F_X(b) - F_X(a)$ for $a < b$.

$P(X = x) = F_X(x) - \lim_{y \to x^-} F_X(y)$ (for a discrete RV, this is the jump at $x$).
Expected Value (Mean)

For a discrete random variable $X$:

$$E[X] = \mu_X = \sum_x x \cdot p_X(x)$$

Variance

For a random variable $X$ with mean $\mu_X$:

$$\operatorname{Var}(X) = \sigma_X^2 = E[(X - \mu_X)^2]$$

For a discrete random variable $X$:

$$\operatorname{Var}(X) = \sum_x (x - \mu_X)^2 \cdot p_X(x)$$

Computational formula for variance:

$$\operatorname{Var}(X) = E[X^2] - (E[X])^2$$

where $E[X^2]$ for a discrete random variable is:

$$E[X^2] = \sum_x x^2 \cdot p_X(x)$$

Standard Deviation

The positive square root of the variance:

$$SD(X) = \sigma_X = \sqrt{\operatorname{Var}(X)}$$

Functions of a Random Variable

If $Y = g(X)$:

PMF of $Y$ (for discrete $X$):

$$p_Y(y) = P(Y = y) = P(g(X) = y) = \sum_{x:\, g(x) = y} p_X(x)$$

Expected value of $Y = g(X)$ (LOTUS, the Law of the Unconscious Statistician), for a discrete random variable $X$:

$$E[Y] = E[g(X)] = \sum_x g(x) \cdot p_X(x)$$

Chapter 7: Common Discrete Distributions

Bernoulli Distribution

Models a single trial with two outcomes (success = 1, failure = 0).
Parameter: $p$ (probability of success).

PMF: $P(X = 1) = p$, $P(X = 0) = 1 - p$.

Mean: $E[X] = p$. Variance: $\operatorname{Var}(X) = p(1-p)$.

Binomial Distribution

Models the number of successes in $n$ independent Bernoulli trials.

Parameters: $n$ (number of trials), $p$ (probability of success on each trial).

PMF: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$ for $k = 0, 1, \dots, n$.

Mean: $E[X] = np$. Variance: $\operatorname{Var}(X) = np(1-p)$.
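A hedged sketch of the standard binomial PMF, built directly from the counting formula $\binom{n}{k} p^k (1-p)^{n-k}$ (the coin-flip numbers are an assumed example):

```python
# Assumed example: the standard binomial PMF P(X = k) = C(n, k) p^k (1-p)^(n-k),
# computed from math.comb.
import math

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 2 heads in 4 fair coin flips: C(4,2)/2^4 = 6/16 = 0.375.
p_two_heads = binom_pmf(2, 4, 0.5)

# The PMF sums to 1 over k = 0, ..., n, and its mean is np.
total = sum(binom_pmf(k, 4, 0.5) for k in range(5))
mean = sum(k * binom_pmf(k, 4, 0.5) for k in range(5))
```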
Geometric Distribution

Models the number of trials $k$ needed to get the first success.

Parameter: $p$ (probability of success on each trial).

PMF (for $X =$ trial number of first success):

$$P(X = k) = (1-p)^{k-1} p \quad \text{for } k = 1, 2, 3, \dots$$

Mean (trial number of first success):

$$E[X] = \frac{1}{p}$$

Variance (trial number of first success):

$$\operatorname{Var}(X) = \frac{1-p}{p^2}$$

Negative Binomial Distribution

Models the number of trials $k$ needed to achieve $r$ successes.
Parameters: $r$ (target number of successes), $p$ (probability of success on each trial).

PMF (for $X =$ trial number of the $r$-th success):

$$P(X = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r} \quad \text{for } k = r, r+1, r+2, \dots$$

Mean (trial number of the $r$-th success):

$$E[X] = \frac{r}{p}$$

Variance (trial number of the $r$-th success):

$$\operatorname{Var}(X) = \frac{r(1-p)}{p^2}$$

Poisson Distribution

Models the number of events occurring in a fixed interval of time or space.

Parameter: $\lambda$ (average number of events in the interval).

PMF: $P(X = k) = \dfrac{\lambda^k e^{-\lambda}}{k!}$ for $k = 0, 1, 2, \dots$

Mean: $E[X] = \lambda$. Variance: $\operatorname{Var}(X) = \lambda$.
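A hedged numerical check of the standard Poisson PMF $P(X = k) = \lambda^k e^{-\lambda}/k!$ (the rate $\lambda = 3$ is an assumed example):

```python
# Assumed example: verifying that the Poisson PMF sums to 1 and has mean λ.
import math

def poisson_pmf(k: int, lam: float) -> float:
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 3.0
# Truncating the infinite sum at k = 100 is safe here: for λ = 3 the tail
# mass beyond k = 100 is negligible.
total = sum(poisson_pmf(k, lam) for k in range(101))
mean = sum(k * poisson_pmf(k, lam) for k in range(101))
```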
Hypergeometric Distribution

Models the number of successes in a sample of size $n$ drawn without replacement from a finite population of size $N$ containing $K$ successes.

Parameters: $N$ (population size), $K$ (total successes in population), $n$ (sample size).

PMF:

$$P(X = k) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}}$$

for $k$ such that $\max(0, n - (N - K)) \le k \le \min(n, K)$.

Mean:

$$E[X] = n \frac{K}{N}$$

Variance:

$$\operatorname{Var}(X) = n \frac{K}{N} \left(1 - \frac{K}{N}\right) \left(\frac{N-n}{N-1}\right)$$

Chapter 8: Continuous Random Variables

Probability Density Function (PDF)

For a continuous random variable $X$, the PDF $f_X(x)$ describes the relative likelihood of $X$.
Properties of a PDF:

$f_X(x) \ge 0$ for all $x$.

$\int_{-\infty}^{\infty} f_X(x) \, dx = 1$ (total area under the curve is 1).

$P(a \le X \le b) = \int_a^b f_X(x) \, dx$.

For any specific value $c$: $P(X = c) = \int_c^c f_X(x) \, dx = 0$.
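These integral properties can be checked numerically. A minimal sketch, using the density $f(x) = 2e^{-2x}$ for $x \ge 0$ as an assumed example:

```python
# Assumed example: checking the PDF properties numerically for f(x) = 2 e^{-2x}
# (x >= 0) with a midpoint-rule integral.
import math

def f(x: float) -> float:
    return 2.0 * math.exp(-2.0 * x)

def integrate(func, a: float, b: float, n: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return sum(func(a + (i + 0.5) * h) for i in range(n)) * h

total_area = integrate(f, 0.0, 20.0)   # ≈ 1; the tail beyond x = 20 is tiny
p_interval = integrate(f, 0.5, 1.5)    # P(0.5 <= X <= 1.5) = e^{-1} - e^{-3}
```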
Cumulative Distribution Function (CDF)

For a continuous random variable $X$, the CDF $F_X(x)$ is:

$$F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(t) \, dt$$

Properties of a CDF:

$F_X(x)$ is non-decreasing.

$\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$.

$P(a < X \le b) = F_X(b) - F_X(a)$.

$f_X(x) = \frac{d}{dx} F_X(x)$ (where the derivative exists).
Expected Value (Mean)

For a continuous random variable $X$:

$$E[X] = \mu = \int_{-\infty}^{\infty} x f_X(x) \, dx$$

Variance

For a continuous random variable $X$ with mean $\mu$:

$$\operatorname{Var}(X) = \sigma^2 = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x) \, dx$$

Computational formula:

$$\operatorname{Var}(X) = E[X^2] - (E[X])^2$$

where

$$E[X^2] = \int_{-\infty}^{\infty} x^2 f_X(x) \, dx$$

Standard Deviation

The positive square root of the variance:

$$\sigma = \sqrt{\operatorname{Var}(X)}$$

Percentiles and Quantiles

The $p$-th percentile $x_p$ is the value such that $F_X(x_p) = P(X \le x_p) = p$.

The quantile function $Q(p)$ is the inverse of the CDF:

$$Q(p) = F_X^{-1}(p) = x_p$$

Functions of a Continuous Random Variable

If $Y = g(X)$:

CDF of $Y$:

$$F_Y(y) = P(Y \le y) = P(g(X) \le y)$$

PDF of $Y$ (change of variables formula): if $g(x)$ is monotonic with inverse $x = g^{-1}(y)$, then:

$$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{dx}{dy} \right|$$

Expected value of $Y = g(X)$ (LOTUS):

$$E[Y] = E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) \, dx$$

Chapter 9: Common Continuous Distributions

1. Uniform (Continuous) Distribution

$X \sim U(a, b)$
PDF (Probability Density Function):

$$f(x; a, b) = \begin{cases} \frac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$

CDF (Cumulative Distribution Function):

$$F(x; a, b) = P(X \le x) = \begin{cases} 0 & \text{for } x < a \\ \frac{x-a}{b-a} & \text{for } a \le x \le b \\ 1 & \text{for } x > b \end{cases}$$

Expected Value: $E[X] = \dfrac{a+b}{2}$

Variance: $\operatorname{Var}(X) = \dfrac{(b-a)^2}{12}$
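These uniform formulas can be sanity-checked by simulation. A sketch using the standard-library sampler `random.uniform` (the interval $[2, 5]$ is an assumed example; results are approximate):

```python
# Assumed example: simulating Uniform(2, 5) and comparing sample statistics
# to the closed-form mean (a+b)/2, variance (b-a)^2/12, and CDF (x-a)/(b-a).
import random

random.seed(0)
a, b = 2.0, 5.0
n = 100_000
samples = [random.uniform(a, b) for _ in range(n)]

sample_mean = sum(samples) / n  # ≈ (a + b)/2 = 3.5
sample_var = sum((x - sample_mean) ** 2 for x in samples) / n  # ≈ 0.75
# Empirical CDF at x = 3 vs the formula (3 - a)/(b - a) = 1/3.
ecdf_at_3 = sum(x <= 3.0 for x in samples) / n
```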
2. Exponential Distribution

$T \sim \text{Exp}(\lambda)$

PDF (Probability Density Function):

$$f(t; \lambda) = \begin{cases} \lambda e^{-\lambda t} & \text{for } t \ge 0 \\ 0 & \text{for } t < 0 \end{cases}$$

CDF (Cumulative Distribution Function):

$$F(t; \lambda) = P(T \le t) = \begin{cases} 1 - e^{-\lambda t} & \text{for } t \ge 0 \\ 0 & \text{for } t < 0 \end{cases}$$

Survival Function: $P(T > t) = 1 - F(t) = e^{-\lambda t}$

Expected Value: $E[T] = \dfrac{1}{\lambda}$

Variance: $\operatorname{Var}(T) = \dfrac{1}{\lambda^2}$

Memoryless Property: $P(T > s + t \mid T > s) = P(T > t)$ for any $s, t \ge 0$.
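With the survival function $P(T > t) = e^{-\lambda t}$, the memoryless property is an algebraic identity: $P(T > s+t)/P(T > s) = e^{-\lambda(s+t)}/e^{-\lambda s} = e^{-\lambda t}$. A small check (the values of $\lambda$, $s$, $t$ are assumed examples):

```python
# Assumed example: the memoryless property of the exponential distribution,
# checked via the survival function e^{-λt}.
import math

lam = 0.5

def survival(t: float) -> float:
    return math.exp(-lam * t)

s, t = 2.0, 3.0
conditional = survival(s + t) / survival(s)   # P(T > s+t | T > s)
unconditional = survival(t)                   # P(T > t)

mean = 1 / lam         # E[T] = 1/λ
variance = 1 / lam**2  # Var(T) = 1/λ²
```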
3. Normal (Gaussian) Distribution

$X \sim N(\mu, \sigma^2)$

PDF (Probability Density Function):

$$f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Expected Value: $E[X] = \mu$

Variance: $\operatorname{Var}(X) = \sigma^2$
4. Gamma Distribution

$X \sim \text{Gamma}(k, \lambda)$ (using shape $k$ and rate $\lambda$) or $X \sim \text{Gamma}(k, \theta)$ (using shape $k$ and scale $\theta = 1/\lambda$).

The Gamma function is $\Gamma(k) = \int_0^\infty x^{k-1} e^{-x} \, dx$. For positive integers $k$, $\Gamma(k) = (k-1)!$.

PDF (Probability Density Function), using shape $k$ and rate $\lambda$:

$$f(x; k, \lambda) = \frac{\lambda^k x^{k-1} e^{-\lambda x}}{\Gamma(k)} \quad \text{for } x \ge 0$$

Using shape $k$ and scale $\theta = 1/\lambda$:

$$f(x; k, \theta) = \frac{1}{\Gamma(k)\theta^k} x^{k-1} e^{-x/\theta} \quad \text{for } x \ge 0$$

Expected Value: $E[X] = \dfrac{k}{\lambda} = k\theta$

Variance: $\operatorname{Var}(X) = \dfrac{k}{\lambda^2} = k\theta^2$
5. Beta Distribution

$X \sim \text{Beta}(\alpha, \beta)$

The Beta function is $B(\alpha, \beta) = \int_0^1 t^{\alpha-1}(1-t)^{\beta-1} \, dt = \dfrac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$.

PDF (Probability Density Function):

$$f(x; \alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha-1} (1-x)^{\beta-1} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1}$$

for $0 \le x \le 1$.

Expected Value: $E[X] = \dfrac{\alpha}{\alpha+\beta}$

Variance: $\operatorname{Var}(X) = \dfrac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$
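A hedged numerical check of the Beta mean and variance formulas, using the Gamma-function form of $B(\alpha, \beta)$ and a midpoint-rule integral (the parameters $\alpha = 2$, $\beta = 3$ are an assumed example):

```python
# Assumed example: Beta(2, 3) mean and variance via numerical integration.
import math

alpha, beta_ = 2.0, 3.0
B = math.gamma(alpha) * math.gamma(beta_) / math.gamma(alpha + beta_)

def pdf(x: float) -> float:
    return x ** (alpha - 1) * (1 - x) ** (beta_ - 1) / B

n = 100_000
h = 1.0 / n
xs = [(i + 0.5) * h for i in range(n)]          # midpoint grid on [0, 1]
mean = sum(x * pdf(x) for x in xs) * h           # α/(α+β) = 0.4
second_moment = sum(x * x * pdf(x) for x in xs) * h
variance = second_moment - mean**2               # αβ/((α+β)²(α+β+1)) = 0.04
```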
Chapter 10: Joint Distributions

Joint Probability Mass Functions (PMFs)

For two discrete random variables $X$ and $Y$:

Joint PMF definition:

$$p_{X,Y}(x, y) = P(X = x, Y = y)$$

Conditions:

$p_{X,Y}(x, y) \ge 0$ for all $(x, y)$.

$\sum_x \sum_y p_{X,Y}(x, y) = 1$.
Joint Probability Density Functions (PDFs)

For two continuous random variables $X$ and $Y$, the joint PDF $f_{X,Y}(x, y)$ satisfies $f_{X,Y}(x, y) \ge 0$ and integrates to 1 over the plane, and it gives probabilities as double integrals:

$$P((X, Y) \in A) = \iint_A f_{X,Y}(x, y) \, dx \, dy$$
Marginal Distributions

Marginal PMF of $X$ (discrete):

$$p_X(x) = P(X = x) = \sum_y P(X = x, Y = y) = \sum_y p_{X,Y}(x, y)$$

Marginal PMF of $Y$ (discrete):

$$p_Y(y) = P(Y = y) = \sum_x P(X = x, Y = y) = \sum_x p_{X,Y}(x, y)$$

Marginal PDF of $X$ (continuous):

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy$$

Marginal PDF of $Y$ (continuous):

$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dx$$

Conditional Distributions

Conditional PMF of $Y$ given $X = x$ (discrete), provided $p_X(x) > 0$:

$$p_{Y|X}(y|x) = P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P(X = x)} = \frac{p_{X,Y}(x, y)}{p_X(x)}$$

Conditional PDF of $Y$ given $X = x$ (continuous), provided $f_X(x) > 0$:

$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)}$$
Joint Cumulative Distribution Functions (CDFs)

Joint CDF definition:

$$F_{X,Y}(x, y) = P(X \le x, Y \le y)$$

Discrete case:

$$F_{X,Y}(x, y) = \sum_{x_i \le x} \sum_{y_j \le y} p_{X,Y}(x_i, y_j)$$

Continuous case:

$$F_{X,Y}(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(u, v) \, dv \, du$$

Properties:

$0 \le F_{X,Y}(x, y) \le 1$.

$F_{X,Y}(x, y)$ is non-decreasing in both $x$ and $y$.

$\lim_{x \to \infty,\, y \to \infty} F_{X,Y}(x, y) = 1$.

$\lim_{x \to -\infty} F_{X,Y}(x, y) = 0$ and $\lim_{y \to -\infty} F_{X,Y}(x, y) = 0$.
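The joint, marginal, and conditional definitions above can be illustrated with a small table. A sketch using a toy joint PMF (the numbers are an assumed example, not from the text):

```python
# Assumed example: a joint PMF for two dependent binary variables, showing
# marginalization (sum over the other variable) and conditioning.
joint = {
    (0, 0): 0.3, (0, 1): 0.2,
    (1, 0): 0.1, (1, 1): 0.4,
}

# Marginal PMF of X: p_X(x) = Σ_y p_{X,Y}(x, y).
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

# Conditional PMF of Y given X = 1: p_{Y|X}(y|1) = p_{X,Y}(1, y) / p_X(1).
p_y_given_x1 = {y: joint[(1, y)] / p_x[1] for y in (0, 1)}
```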
Chapter 11: Independence, Covariance, and Correlation

Independence of Random Variables

Two random variables $X$ and $Y$ are independent if for any sets $A$ and $B$:

$$P(X \in A, Y \in B) = P(X \in A) \, P(Y \in B)$$

This is equivalent to the joint distribution factoring into the product of the marginals: $p_{X,Y}(x, y) = p_X(x) \, p_Y(y)$ for all $(x, y)$ in the discrete case, and $f_{X,Y}(x, y) = f_X(x) \, f_Y(y)$ in the continuous case.
Covariance

The covariance between two random variables $X$ and $Y$:
Definition:

$$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$$

Computational formula:

$$\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y]$$

Properties:

$\mathrm{Cov}(X, X) = \mathrm{Var}(X)$

$\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$

$\mathrm{Cov}(aX + b, cY + d) = ac \, \mathrm{Cov}(X, Y)$

$\mathrm{Cov}(X + Y, Z) = \mathrm{Cov}(X, Z) + \mathrm{Cov}(Y, Z)$

If $X$ and $Y$ are independent, then $\mathrm{Cov}(X, Y) = 0$.
Correlation Coefficient

The Pearson correlation coefficient between two random variables $X$ and $Y$:

Definition:

$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$

Properties:

$-1 \le \rho(X, Y) \le 1$

$\rho(aX + b, cY + d) = \mathrm{sign}(ac) \, \rho(X, Y)$ (assuming $a \ne 0$, $c \ne 0$)
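Covariance and correlation can be computed directly from a joint PMF via the computational formula $\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y]$. A sketch (the joint PMF is an assumed example):

```python
# Assumed example: covariance and correlation for a toy joint PMF of two
# positively associated binary variables.
import math

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def expect(g):
    """E[g(X, Y)] under the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

ex, ey = expect(lambda x, y: x), expect(lambda x, y: y)
cov = expect(lambda x, y: x * y) - ex * ey       # E[XY] - E[X]E[Y] = 0.15
var_x = expect(lambda x, y: x * x) - ex**2       # 0.25
var_y = expect(lambda x, y: y * y) - ey**2       # 0.25
rho = cov / math.sqrt(var_x * var_y)             # 0.15 / 0.25 = 0.6
```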
Variance of Sums of Random Variables

For any two random variables $X$ and $Y$, and constants $a$ and $b$:

General formula:

$$\mathrm{Var}(aX + bY) = a^2 \mathrm{Var}(X) + b^2 \mathrm{Var}(Y) + 2ab \, \mathrm{Cov}(X, Y)$$

Sum of variables ($a = 1$, $b = 1$):

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2 \, \mathrm{Cov}(X, Y)$$

Difference of variables ($a = 1$, $b = -1$):

$$\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2 \, \mathrm{Cov}(X, Y)$$

If $X$ and $Y$ are independent ($\mathrm{Cov}(X, Y) = 0$):

$$\mathrm{Var}(aX + bY) = a^2 \mathrm{Var}(X) + b^2 \mathrm{Var}(Y)$$

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$

$$\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$

Extension to multiple variables ($X_1, X_2, \dots, X_n$):

$$\mathrm{Var}\left(\sum_{i=1}^n a_i X_i\right) = \sum_{i=1}^n a_i^2 \mathrm{Var}(X_i) + \sum_{i \ne j} a_i a_j \mathrm{Cov}(X_i, X_j)$$

or, equivalently,

$$\mathrm{Var}\left(\sum_{i=1}^n a_i X_i\right) = \sum_{i=1}^n a_i^2 \mathrm{Var}(X_i) + 2 \sum_{i < j} a_i a_j \mathrm{Cov}(X_i, X_j)$$

If all $X_i$ are independent:

$$\mathrm{Var}\left(\sum_{i=1}^n a_i X_i\right) = \sum_{i=1}^n a_i^2 \mathrm{Var}(X_i)$$

Chapter 12: Functions of Multiple Random Variables

Sums of Independent Random Variables (Convolution)

Let $X$ and $Y$ be two random variables, and let $Z = X + Y$.
Discrete case (PMF of $Z$):

$$P(Z = z) = \sum_k P(X = k, Y = z - k)$$

If $X$ and $Y$ are independent:

$$P(Z = z) = \sum_k P(X = k) \, P(Y = z - k)$$

This is the discrete convolution of the PMFs.

Continuous case (PDF of $Z$):

$$f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z - x) \, dx$$

If $X$ and $Y$ are independent:

$$f_Z(z) = \int_{-\infty}^{\infty} f_X(x) f_Y(z - x) \, dx = (f_X * f_Y)(z)$$

This is the convolution of the PDFs.
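The discrete convolution formula can be applied directly. A minimal sketch for the sum of two independent fair dice (an assumed example):

```python
# Assumed example: discrete convolution of two fair-die PMFs gives the PMF of
# the sum Z = X + Y via P(Z = z) = Σ_k P(X = k) P(Y = z - k).
die = {k: 1 / 6 for k in range(1, 7)}

pmf_sum: dict[int, float] = {}
for x, px in die.items():
    for y, py in die.items():
        pmf_sum[x + y] = pmf_sum.get(x + y, 0.0) + px * py

p_seven = pmf_sum[7]           # 6/36 = 1/6, the most likely total
total = sum(pmf_sum.values())  # the result is itself a PMF, so this is ≈ 1
```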
Transformations of Two Random Variables

Suppose $Y_1 = g_1(X_1, X_2)$ and $Y_2 = g_2(X_1, X_2)$ are transformations of random variables $X_1, X_2$, and these transformations are invertible, so that $X_1 = h_1(Y_1, Y_2)$ and $X_2 = h_2(Y_1, Y_2)$.

Joint PDF of $Y_1, Y_2$:

$$f_{Y_1,Y_2}(y_1, y_2) = f_{X_1,X_2}(h_1(y_1, y_2), h_2(y_1, y_2)) \, |J|$$

where $|J|$ is the absolute value of the determinant of the Jacobian matrix.

Jacobian determinant $J$:

$$J = \det \begin{pmatrix} \frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} \\ \frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} \end{pmatrix}$$

Order Statistics

Let $X_1, X_2, \dots, X_n$ be $n$ independent and identically distributed (i.i.d.) random variables with CDF $F_X(x)$ and PDF $f_X(x)$. Let $X_{(1)}, X_{(2)}, \dots, X_{(n)}$ be the order statistics (sorted values).
CDF of the maximum ($Y_n = X_{(n)}$):

$$F_{Y_n}(y) = P(X_{(n)} \le y) = [F_X(y)]^n$$

PDF of the maximum ($Y_n = X_{(n)}$):

$$f_{Y_n}(y) = n [F_X(y)]^{n-1} f_X(y)$$

CDF of the minimum ($Y_1 = X_{(1)}$):

$$F_{Y_1}(y) = P(X_{(1)} \le y) = 1 - [1 - F_X(y)]^n$$

PDF of the minimum ($Y_1 = X_{(1)}$):

$$f_{Y_1}(y) = n [1 - F_X(y)]^{n-1} f_X(y)$$

PDF of the $k$-th order statistic ($Y_k = X_{(k)}$):

$$f_{Y_k}(y) = \frac{n!}{(k-1)!(n-k)!} [F_X(y)]^{k-1} [1 - F_X(y)]^{n-k} f_X(y)$$

Chapter 13: The Law of Large Numbers (LLN)

Chebyshev's Inequality

For a random variable $X$ with mean $\mu$ and finite variance $\sigma^2$, and any $\epsilon > 0$:

$$P(|X - \mu| \ge \epsilon) \le \frac{\sigma^2}{\epsilon^2}$$
Weak Law of Large Numbers (WLLN)

Let $X_1, X_2, \dots, X_n$ be a sequence of i.i.d. random variables with common mean $E[X_i] = \mu$ and common finite variance $\mathrm{Var}(X_i) = \sigma^2$, and let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ be the sample mean. Then, for any $\epsilon > 0$:

$$\lim_{n \to \infty} P(|\bar{X}_n - \mu| \ge \epsilon) = 0$$

Strong Law of Large Numbers (SLLN)

For a sequence of i.i.d. random variables $X_1, X_2, \dots, X_n$ with common mean $E[X_i] = \mu$, the sample mean converges to $\mu$ almost surely:

$$P\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1$$
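The LLN is easy to see by simulation: the sample mean of fair-coin flips settles near $\mu = 0.5$ as $n$ grows. A sketch (seeded for repeatability; an assumed example):

```python
# Assumed example: sample means of Bernoulli(0.5) trials for increasing n,
# illustrating convergence toward μ = 0.5.
import random

random.seed(42)

def sample_mean(n: int) -> float:
    # Each trial is a fair-coin flip: random.random() < 0.5 is True/1 w.p. 1/2.
    return sum(random.random() < 0.5 for _ in range(n)) / n

means = {n: sample_mean(n) for n in (100, 10_000, 100_000)}
```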
Chapter 14: The Central Limit Theorem (CLT) ¶ Statement of CLT (Lindeberg-Lévy CLT):
Let X 1 , X 2 , … , X n X_1, X_2, \dots, X_n X 1 , X 2 , … , X n be i.i.d. random variables with mean μ \mu μ and variance σ 2 \sigma^2 σ 2 . Let X ˉ n \bar{X}_n X ˉ n be the sample mean.
Z n = X ˉ n − μ σ / n → d N ( 0 , 1 ) Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1) Z n = σ / n X ˉ n − μ d N ( 0 , 1 ) (where → d \xrightarrow{d} d denotes convergence in distribution)
Convergence in Distribution (for Z n Z_n Z n ):
lim n → ∞ P ( Z n ≤ z ) = Φ ( z ) \lim_{n \to \infty} P(Z_n \le z) = \Phi(z) n → ∞ lim P ( Z n ≤ z ) = Φ ( z ) where Φ ( z ) \Phi(z) Φ ( z ) is the CDF of the standard Normal distribution N ( 0 , 1 ) N(0, 1) N ( 0 , 1 ) .
Approximation for Sample Mean $\bar{X}_n$:
$P(\bar{X}_n \le x) \approx \Phi\left(\frac{x - \mu}{\sigma/\sqrt{n}}\right)$
CLT for Sums ($S_n = \sum_{i=1}^{n} X_i$):
$E[S_n] = n\mu$, $\mathrm{Var}(S_n) = n\sigma^2$.
$\frac{S_n - n\mu}{\sqrt{n}\,\sigma} \xrightarrow{d} N(0, 1)$
Normal Approximation to Binomial Distribution:
For $X \sim \text{Binomial}(n, p)$:
Mean: $E[X] = np$
Variance: $\mathrm{Var}(X) = np(1-p)$
Approximation: $X \approx N(np, np(1-p))$ (common rule of thumb: $np \ge 5$ and $n(1-p) \ge 5$).
Continuity Correction (for approximating a discrete random variable $X$ by its continuous Normal approximation $Y$):
To approximate $P(X \le k)$, use $P(Y \le k + 0.5)$.
To approximate $P(X \ge k)$, use $P(Y \ge k - 0.5)$.
To approximate $P(X = k)$, use $P(k - 0.5 \le Y \le k + 0.5)$.
To approximate $P(a \le X \le b)$, use $P(a - 0.5 \le Y \le b + 0.5)$.
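The continuity correction can be compared against the exact Binomial CDF; a minimal Python sketch (the parameters $n = 40$, $p = 0.5$ are illustrative):

```python
import math

# Normal approximation to Binomial(n=40, p=0.5) with continuity correction:
# P(X <= k) ~ Phi((k + 0.5 - np) / sqrt(np(1-p))).
n, p = 40, 0.5
mu, sd = n * p, math.sqrt(n * p * (1 - p))

def phi(z):  # standard Normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def binom_cdf(k):  # exact P(X <= k)
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

for k in (15, 20, 25):
    exact = binom_cdf(k)
    approx = phi((k + 0.5 - mu) / sd)
    print(f"k={k}: exact {exact:.4f}, approx {approx:.4f}")
    assert abs(exact - approx) < 0.01
```

Dropping the $+0.5$ term noticeably worsens the approximation, which is the point of the correction.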
Chapter 15: Introduction to Bayesian Inference ¶ Bayes’ Theorem for Distributions:
$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$ Where:
$p(\theta \mid D)$ is the posterior probability of parameter $\theta$ given data $D$.
$p(D \mid \theta)$ is the likelihood of data $D$ given parameter $\theta$.
$p(\theta)$ is the prior probability of parameter $\theta$.
$p(D)$ is the evidence (or marginal likelihood of the data).
Proportionality Form:
$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$ Evidence Calculation:
For continuous $\theta$: $p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$
For discrete $\theta$: $p(D) = \sum_{\theta} p(D \mid \theta)\, p(\theta)$
Beta-Binomial Conjugate Prior Update:
If the prior is $\text{Beta}(\alpha_{prior}, \beta_{prior})$ and the data is $k$ successes in $n$ trials (Binomial likelihood):
The posterior is $\text{Beta}(\alpha_{posterior}, \beta_{posterior}) = \text{Beta}(\alpha_{prior} + k, \beta_{prior} + n - k)$.
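The update rule is a one-line computation; a minimal Python sketch (the prior and data values are illustrative, and `update_beta` is a hypothetical helper name):

```python
# Beta-Binomial conjugate update: a Beta(a, b) prior plus k successes in
# n trials yields a Beta(a + k, b + n - k) posterior.
def update_beta(a_prior, b_prior, k, n):
    return a_prior + k, b_prior + (n - k)

# Example: uniform prior Beta(1, 1), then observe 7 successes in 10 trials.
a_post, b_post = update_beta(1, 1, 7, 10)
print(a_post, b_post)              # Beta(8, 4) posterior
print(a_post / (a_post + b_post))  # posterior mean = 8/12
```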
Point Estimates from Posterior:
Maximum a Posteriori (MAP) Estimate:
$\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(\theta \mid D)$
For a Beta$(\alpha, \beta)$ posterior (if $\alpha > 1$, $\beta > 1$):
$\hat{\theta}_{MAP} = \frac{\alpha - 1}{\alpha + \beta - 2}$
Posterior Mean:
$\hat{\theta}_{Mean} = E[\theta \mid D] = \int \theta\, p(\theta \mid D)\, d\theta$
For a Beta$(\alpha, \beta)$ posterior:
$\hat{\theta}_{Mean} = \frac{\alpha}{\alpha + \beta}$
Credible Interval:
An interval $[L, U]$ such that:
$P(L \le \theta \le U \mid D) = \int_L^U p(\theta \mid D)\, d\theta = 1 - \gamma$ (where $1 - \gamma$ is the credibility level, e.g., 95%)
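A credible interval can be approximated by sampling from the posterior and taking sample quantiles; a minimal Python sketch (the Beta(8, 4) posterior and the 95% level, i.e. $\gamma = 0.05$, are illustrative choices):

```python
import random

# Monte Carlo credible interval: draw from a Beta(8, 4) posterior
# (e.g. a Beta(1, 1) prior updated with 7 successes in 10 trials) and
# take the 2.5% and 97.5% sample quantiles for a 95% interval.
random.seed(7)
N = 200_000
draws = sorted(random.betavariate(8, 4) for _ in range(N))
L, U = draws[int(0.025 * N)], draws[int(0.975 * N)]
print(f"95% credible interval for theta: [{L:.3f}, {U:.3f}]")
assert L < 8 / 12 < U  # the posterior mean lies inside the interval
```

With a library that exposes the Beta quantile function, the interval could instead be computed exactly from the inverse CDF.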
Chapter 16: Introduction to Markov Chains ¶ Transition Probability (from state $i$ to state $j$):
$P_{ij} = P(X_{t+1} = s_j \mid X_t = s_i)$
n-Step Transition Probability:
The $(i, j)$-th entry of the matrix $P^n$ (the transition matrix $P$ raised to the power $n$):
$P^{(n)}_{ij} = P(X_{t+n} = s_j \mid X_t = s_i) = (P^n)_{ij}$
Stationary Distribution ($\pi$):
A row vector $\pi = [\pi_1, \pi_2, \dots, \pi_k]$ such that:
$\pi P = \pi$
and
$\sum_{j=1}^{k} \pi_j = 1$
Chapter 17: Monte Carlo Methods ¶ Estimating Probability $P(A)$:
$P(A) \approx \frac{N_A}{N}$ (where $N_A$ is the number of times event $A$ occurred in $N$ simulations)
Estimating Expected Value $E[g(X)]$:
$E[g(X)] \approx \frac{1}{N} \sum_{i=1}^{N} g(X_i)$ (where $X_i$ are samples from the distribution of $X$)
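Both estimators are straightforward to implement; a minimal Python sketch (the event, the function $g$, and the sample size are illustrative choices):

```python
import random

# Monte Carlo estimates: P(A) ~ N_A / N and E[g(X)] ~ average of g(X_i).
# Example: X ~ Uniform(0, 1), A = {X > 0.8} (so P(A) = 0.2), and
# g(x) = x^2 (so E[g(X)] = 1/3).
random.seed(3)
N = 100_000
xs = [random.random() for _ in range(N)]

p_hat = sum(x > 0.8 for x in xs) / N
e_hat = sum(x**2 for x in xs) / N
print(p_hat, e_hat)
assert abs(p_hat - 0.2) < 0.01
assert abs(e_hat - 1 / 3) < 0.01
```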
Monte Carlo Integration (Hit-or-Miss for Area):
$\text{Area}(A) \approx \text{Area}(B) \times \frac{N_{hit}}{N}$
Monte Carlo Integration (Using Expected Values for $I = \int_a^b g(x)\, dx$):
If $X \sim \text{Uniform}(a, b)$:
$I \approx (b - a) \times \frac{1}{N} \sum_{i=1}^{N} g(X_i)$
Inverse Transform Method for Generating Random Variables:
If $U \sim \text{Uniform}(0, 1)$, then $X = F^{-1}(U)$ has CDF $F(x)$.
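As a concrete example, the Exponential distribution has a closed-form inverse CDF; a minimal Python sketch (the rate $\lambda = 2$ and sample size are illustrative choices):

```python
import random
import math

# Inverse transform method: for Exponential(lam), F(x) = 1 - exp(-lam*x),
# so F^{-1}(u) = -ln(1 - u) / lam. Feeding U ~ Uniform(0, 1) through
# F^{-1} yields Exponential(lam) samples.
random.seed(5)
lam = 2.0
N = 100_000
samples = [-math.log(1 - random.random()) / lam for _ in range(N)]

mean = sum(samples) / N
print(mean)                 # should be close to 1/lam = 0.5
assert abs(mean - 1 / lam) < 0.01
```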
Acceptance-Rejection Method:
To sample from a target PDF $f(x)$ using a proposal PDF $g(x)$ where $f(x) \le c \cdot g(x)$:
Sample $y$ from $g(x)$.
Sample $u$ from $\text{Uniform}(0, 1)$.
Accept $y$ if $u \le \frac{f(y)}{c \cdot g(y)}$.
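The three steps above can be sketched in Python; as an illustrative example, take the target $f(x) = 2x$ on $[0, 1]$ with a Uniform(0, 1) proposal and $c = 2$:

```python
import random

# Acceptance-rejection: sample from target f(x) = 2x on [0, 1] using a
# Uniform(0, 1) proposal g(x) = 1 with envelope constant c = 2, so that
# f(x) <= c * g(x). The acceptance test u <= f(y)/(c*g(y)) reduces to u <= y.
random.seed(11)

def sample_f():
    while True:
        y = random.random()   # step 1: draw y from g
        u = random.random()   # step 2: draw u from Uniform(0, 1)
        if u <= y:            # step 3: accept if u <= f(y)/(c*g(y))
            return y

N = 100_000
mean = sum(sample_f() for _ in range(N)) / N
print(mean)                   # E[X] for f(x) = 2x is 2/3
assert abs(mean - 2 / 3) < 0.01
```

On average one draw in $c = 2$ is accepted, so tighter envelopes make the method more efficient.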
Buffon’s Needle Problem ($L \le D$):
Probability of needle crossing a line:
$P(\text{cross}) = \frac{2L}{\pi D}$
For $L = 1$, $D = 2$:
$\pi \approx \frac{1}{P(\text{cross})}$
Chapter 18: (Optional) Further Explorations ¶ Entropy $H(X)$ (for a discrete random variable $X$ with PMF $p(x)$):
$H(X) = -\sum_{x} p(x) \log_b p(x)$ (base $b$ is often 2 for bits, or $e$ for nats)
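Entropy is a direct translation of the formula; a minimal Python sketch (the `entropy` helper name and the example PMFs are illustrative):

```python
import math

# Entropy in base b: H(X) = -sum p(x) log_b p(x), skipping zero-probability
# outcomes (their contribution is 0 by convention).
def entropy(pmf, base=2):
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1 bit
print(entropy([0.9, 0.1]))   # biased coin: less than 1 bit
assert abs(entropy([0.5, 0.5]) - 1.0) < 1e-12
assert entropy([0.9, 0.1]) < 1.0
```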
Kullback-Leibler (KL) Divergence (for discrete distributions $P$ and $Q$):
$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log_b \frac{P(x)}{Q(x)}$
Geometric Brownian Motion (GBM) $S(t)$:
Stochastic Differential Equation: $dS(t) = \mu S(t)\, dt + \sigma S(t)\, dW(t)$
Solution: $S(t) = S(0) \exp\left(\left(\mu - \frac{\sigma^2}{2}\right)t + \sigma W(t)\right)$
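The closed-form solution makes GBM easy to simulate; a minimal Python sketch (parameter values are illustrative; it uses the facts that $W(T) \sim N(0, T)$ and that $E[S(T)] = S(0)e^{\mu T}$ under GBM):

```python
import random
import math

# Simulate GBM terminal values via the exact solution
# S(T) = S(0) * exp((mu - sigma^2/2) * T + sigma * W(T)), W(T) ~ N(0, T),
# and check the Monte Carlo mean against E[S(T)] = S(0) * exp(mu * T).
random.seed(4)
S0, mu, sigma, T = 100.0, 0.05, 0.2, 1.0
N = 200_000

def terminal():
    w = random.gauss(0, math.sqrt(T))
    return S0 * math.exp((mu - sigma**2 / 2) * T + sigma * w)

est = sum(terminal() for _ in range(N)) / N
print(est, S0 * math.exp(mu * T))
assert abs(est - S0 * math.exp(mu * T)) / (S0 * math.exp(mu * T)) < 0.01
```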
Probability Generating Function (PGF) $G_X(z)$ (for a non-negative integer-valued RV $X$):
$G_X(z) = E[z^X] = \sum_{k=0}^{\infty} P(X = k)\, z^k$
Moment Generating Function (MGF) $M_X(t)$:
$M_X(t) = E[e^{tX}]$ and $E[X^n] = M_X^{(n)}(0)$ (the $n$-th derivative evaluated at $t = 0$)
For independent $X, Y$: $M_{X+Y}(t) = M_X(t)\, M_Y(t)$
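The product property can be checked empirically; a minimal Python sketch (independent fair die rolls and $t = 0.3$ are illustrative choices):

```python
import random
import math

# Empirical check that M_{X+Y}(t) = M_X(t) * M_Y(t) for independent X, Y.
# X and Y are independent fair die rolls; M_{X+Y}(t) is estimated as the
# sample average of exp(t * (X + Y)) and compared to the exact product.
random.seed(9)
t = 0.3
N = 200_000
xs = [random.randint(1, 6) for _ in range(N)]
ys = [random.randint(1, 6) for _ in range(N)]

def mgf_exact(t):  # exact MGF of one fair die: (1/6) * sum_k exp(t*k)
    return sum(math.exp(t * k) for k in range(1, 7)) / 6

m_sum = sum(math.exp(t * (x + y)) for x, y in zip(xs, ys)) / N
print(m_sum, mgf_exact(t) ** 2)
assert abs(m_sum - mgf_exact(t) ** 2) / mgf_exact(t) ** 2 < 0.02
```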