Probability for Machine Learning¶
Probability is the mathematics of uncertainty. It gives us a precise language for talking about things we don't know for sure.
Machine learning models deal with uncertainty constantly:
- Is this email spam or not? A spam filter assigns a probability like 0.92 ("I'm 92% sure it's spam").
- What digit is in this image? A digit recognizer might say: 80% chance it's a 7, 15% chance it's a 1, 5% everything else.
- Will this customer churn? A model estimates the probability of leaving.
Every prediction a model makes is, at its core, a probability estimate. Understanding probability isn't optional for machine learning — it's foundational.
Prerequisites: High school algebra is all you need. We assume no prior knowledge of probability or statistics. Every concept will be explained from scratch, with intuition first and formulas second.
Let's get started!
# Standard imports for numerical computing and visualization
import numpy as np # numerical operations and random sampling
import matplotlib.pyplot as plt # plotting and visualization
# Set a random seed for reproducibility — so you get the same results each time
np.random.seed(42)
# Make plots look a bit nicer
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['figure.dpi'] = 100
1. What is Probability?¶
You already use probability in everyday life, even if you don't realize it:
- "There's a 50% chance of heads when flipping a coin."
- "There's a 30% chance of rain tomorrow."
- "A spam filter says there's a 95% probability this email is spam."
Probability takes this intuitive notion of "how likely is something?" and makes it precise and mathematical.
Key Vocabulary¶
Before we write any formulas, let's define some words:
| Term | Definition | Example |
|---|---|---|
| Experiment | An action with uncertain outcomes | Flipping a coin, rolling a die |
| Sample Space $\Omega$ | The set of ALL possible outcomes | $\Omega = \{H, T\}$ for a coin; $\Omega = \{1, 2, 3, 4, 5, 6\}$ for a die |
| Event | A specific outcome or set of outcomes | $A$ = "getting heads"; $B$ = "rolling an even number" = $\{2, 4, 6\}$ |
| Probability $P(A)$ | A number between 0 and 1 representing how likely event $A$ is | $P(\text{heads}) = 0.5$ |
Think of it this way: the sample space is a bag containing every possible result. An event is you reaching in and pulling out a specific result (or set of results). The probability tells you how likely you are to pull that one out.
The Three Axioms of Probability¶
All of probability theory rests on just three simple rules (called axioms):
Axiom 1: Probabilities are between 0 and 1
$$0 \leq P(A) \leq 1$$
- $P(A) = 0$ means event $A$ is impossible (it can never happen)
- $P(A) = 1$ means event $A$ is certain (it will definitely happen)
- Everything else is somewhere in between
Axiom 2: Something must happen
$$P(\Omega) = 1$$
The probability of the entire sample space is 1. In plain English: some outcome must occur. When you flip a coin, you WILL get either heads or tails — there's no "nothing happens" option.
Axiom 3: Addition rule for mutually exclusive events
If events $A$ and $B$ cannot both happen at the same time (they are "mutually exclusive"), then:
$$P(A \text{ or } B) = P(A) + P(B)$$
For example, when rolling a die, "rolling a 2" and "rolling a 5" can't happen simultaneously. So:
$$P(\text{2 or 5}) = P(\text{2}) + P(\text{5}) = \dfrac{1}{6} + \dfrac{1}{6} = \dfrac{2}{6} = \dfrac{1}{3}$$
That's it! Every probability result you'll ever encounter is built on just these three rules.
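Before moving on, here's a quick sanity check of Axiom 3 in code: simulate many die rolls and compare the observed frequency of "2 or 5" against the sum of the individual frequencies. (A minimal sketch — the seed and sample size are arbitrary choices; a local random generator is used so the snippet is self-contained.)

```python
import numpy as np

rng = np.random.default_rng(0)              # local generator, seeded arbitrarily
rolls = rng.integers(1, 7, size=100_000)    # 100,000 fair die rolls (values 1..6)

p2 = np.mean(rolls == 2)                          # estimated P(2)
p5 = np.mean(rolls == 5)                          # estimated P(5)
p2_or_5 = np.mean((rolls == 2) | (rolls == 5))    # estimated P(2 or 5)

print(f"P(2) + P(5) = {p2 + p5:.4f}")
print(f"P(2 or 5)   = {p2_or_5:.4f}")
print(f"Theory: 1/3 = {1/3:.4f}")
```

Because "rolled a 2" and "rolled a 5" are mutually exclusive, the two estimates agree exactly — the counts simply add.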
# --- Simulating Coin Flips ---
# np.random.choice picks randomly from the given list
coin_flips = np.random.choice(['H', 'T'], size=10000) # flip a coin 10,000 times
# Count how many heads we got
num_heads = np.sum(coin_flips == 'H')
proportion_heads = num_heads / len(coin_flips)
print("=== Coin Flip Simulation (10,000 flips) ===")
print(f"Number of Heads: {num_heads}")
print(f"Number of Tails: {len(coin_flips) - num_heads}")
print(f"Proportion of Heads: {proportion_heads:.4f}")
print(f"Expected proportion: 0.5000")
print(f"Notice how close the simulated proportion is to 0.5!")
print("\n")
# --- Simulating Die Rolls ---
die_rolls = np.random.choice([1, 2, 3, 4, 5, 6], size=10000) # roll a die 10,000 times
print("=== Die Roll Simulation (10,000 rolls) ===")
for face in range(1, 7):
count = np.sum(die_rolls == face) # count occurrences of each face
proportion = count / len(die_rolls) # compute proportion
print(f"Face {face}: appeared {count:>4d} times, proportion = {proportion:.4f} (expected: {1/6:.4f})")
# Visualization: The Law of Large Numbers for coin flips
# As we flip more coins, the running proportion of heads converges to 0.5
n_flips = 1000
flips = np.random.choice([0, 1], size=n_flips) # 0 = Tails, 1 = Heads
# Compute the running proportion: after flip 1, after flip 2, ..., after flip n
running_proportion = np.cumsum(flips) / np.arange(1, n_flips + 1)
# Plot the running proportion over time
plt.figure(figsize=(10, 5))
plt.plot(range(1, n_flips + 1), running_proportion, color='steelblue', linewidth=1.0,
label='Running proportion of Heads')
plt.axhline(y=0.5, color='red', linestyle='--', linewidth=1.5, label='True probability (0.5)')
plt.xlabel('Number of flips', fontsize=12)
plt.ylabel('Proportion of Heads', fontsize=12)
plt.title('Coin Flips: Running Proportion of Heads Converges to 0.5', fontsize=14)
plt.legend(fontsize=11)
plt.ylim(0, 1) # probability is between 0 and 1
plt.grid(alpha=0.3) # light grid for readability
plt.tight_layout()
plt.show()
print("Notice: early on the proportion bounces around wildly.")
print("But as the number of flips grows, it settles closer and closer to 0.5.")
print("This is the FREQUENTIST interpretation: probability is the long-run frequency.")
2. Conditional Probability¶
The Puzzle That Tricks Almost Everyone¶
Suppose:
- 1% of the population has a certain disease.
- A diagnostic test is 95% accurate: if you HAVE the disease, the test correctly says "positive" 95% of the time.
- But the test also has a 10% false positive rate: if you DON'T have the disease, the test incorrectly says "positive" 10% of the time.
You test positive. What's the probability you actually have the disease?
Most people instinctively answer "about 95%" — but the real answer is shockingly low: only about 8.8%! We'll see why in Section 3 when we learn Bayes' theorem. But first, we need to understand conditional probability.
The Formula¶
Conditional probability answers the question: "Given that B has already happened, what's the probability of A?"
$$P(A | B) = \dfrac{P(A \cap B)}{P(B)}$$
Let's break down every symbol:
| Symbol | What it means |
|---|---|
| $P(A \mid B)$ | Probability of $A$ given that $B$ has happened. The vertical bar "$\mid$" is read as "given". |
| $P(A \cap B)$ | Probability that both $A$ and $B$ happen. The $\cap$ symbol means "and" (intersection). |
| $P(B)$ | Probability of $B$ happening (must be greater than 0, since we're told $B$ happened). |
Think of it this way: once we know $B$ happened, our "world" shrinks from all possible outcomes to just the outcomes where $B$ is true. Within that smaller world, we ask: how much of it also has $A$?
Worked Example: Drawing Cards¶
A standard deck has 52 cards. You draw one card.
- Let $A$ = "the card is a King"
- Let $B$ = "the card is a face card" (Jack, Queen, or King)
Question: Given that you drew a face card, what's the probability it's a King?
Step 1: Find $P(A \cap B)$. A King IS a face card, so $A \cap B$ = "the card is a King" = $A$. $$P(A \cap B) = P(\text{King}) = \dfrac{4}{52} = \dfrac{1}{13}$$
Step 2: Find $P(B)$. There are 12 face cards (4 Jacks + 4 Queens + 4 Kings). $$P(B) = \dfrac{12}{52} = \dfrac{3}{13}$$
Step 3: Apply the formula. $$P(A|B) = \dfrac{P(A \cap B)}{P(B)} = \dfrac{1/13}{3/13} = \dfrac{1}{3} \approx 0.333$$
This makes intuitive sense! Among the face cards (Jack, Queen, King), Kings make up one-third.
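The card calculation can also be checked by simulation: draw many cards, keep only the face-card draws, and measure how often they are Kings. (A minimal sketch — the 0–51 card encoding and the seed are arbitrary choices made here for illustration.)

```python
import numpy as np

rng = np.random.default_rng(1)               # local generator, seeded arbitrarily
# Encode a deck as 0..51; rank = card % 13, with ranks 10, 11, 12
# standing in for Jack, Queen, King.
draws = rng.integers(0, 52, size=200_000)    # draw one card, 200,000 times (with replacement)
ranks = draws % 13

is_king = ranks == 12
is_face = ranks >= 10                        # Jack, Queen, or King

# P(King | face card) = P(King and face) / P(face)
p_king_given_face = np.mean(is_king & is_face) / np.mean(is_face)
print(f"Simulated P(King | face card) = {p_king_given_face:.4f} (theory: 1/3 = {1/3:.4f})")
```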
Independence¶
Two events are independent if knowing that one happened doesn't change the probability of the other.
Mathematically, $A$ and $B$ are independent if and only if:
$$P(A \cap B) = P(A) \cdot P(B)$$
This also means $P(A|B) = P(A)$ — learning that $B$ happened gives you no new information about $A$.
Example of independence: Flipping a coin and rolling a die. Getting heads doesn't affect what number you roll. So:
$$P(\text{Heads AND roll 6}) = P(\text{Heads}) \cdot P(\text{roll 6}) = \dfrac{1}{2} \cdot \dfrac{1}{6} = \dfrac{1}{12}$$
Counter-example (NOT independent): Drawing cards without replacement.
- $P(\text{1st card is Ace}) = \dfrac{4}{52}$
- If the 1st card WAS an Ace: $P(\text{2nd card is Ace}) = \dfrac{3}{51}$ (only 3 aces left among 51 cards)
- If the 1st card was NOT an Ace: $P(\text{2nd card is Ace}) = \dfrac{4}{51}$
The probability of the 2nd draw depends on the 1st draw. These events are not independent.
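The without-replacement numbers above can be verified by dealing two distinct cards many times and conditioning on the first card. (A minimal sketch — the encoding, seed, and the index-shift trick for sampling the second card are illustration choices.)

```python
import numpy as np

rng = np.random.default_rng(2)               # local generator, seeded arbitrarily
n_trials = 200_000

# Deal two distinct cards: pick the 1st uniformly from 52, then the 2nd
# uniformly from the remaining 51 (shift indices at or past the 1st card up by one).
firsts = rng.integers(0, 52, size=n_trials)
seconds = rng.integers(0, 51, size=n_trials)
seconds[seconds >= firsts] += 1

first_is_ace = firsts % 13 == 0              # rank 0 stands in for Ace
second_is_ace = seconds % 13 == 0

p_given_ace = np.mean(second_is_ace[first_is_ace])
p_given_not = np.mean(second_is_ace[~first_is_ace])
print(f"P(2nd Ace | 1st Ace)     = {p_given_ace:.4f} (theory: 3/51 = {3/51:.4f})")
print(f"P(2nd Ace | 1st not Ace) = {p_given_not:.4f} (theory: 4/51 = {4/51:.4f})")
```

The two conditional frequencies come out clearly different, confirming the draws are not independent.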
Why this matters for ML: Many models assume features are independent (e.g., Naive Bayes). This assumption is often wrong, but it simplifies the math enormously and works surprisingly well in practice.
3. Bayes' Theorem — Updating Beliefs with Evidence¶
Bayes' theorem is one of the most powerful ideas in all of mathematics. It tells us how to update our beliefs when we get new evidence.
Think of it this way: you start with some initial belief about how likely something is (your prior). Then you observe some evidence. Bayes' theorem tells you exactly how to combine your prior belief with the evidence to get an updated belief (your posterior).
The Formula¶
$$P(A | B) = \dfrac{P(B | A) \cdot P(A)}{P(B)}$$
Let's use the disease testing example from Section 2 to understand every term:
- $A$ = "person has the disease"
- $B$ = "person tests positive"
| Symbol | Name | Meaning | Value |
|---|---|---|---|
| $P(A \mid B)$ | Posterior | Probability of having the disease given a positive test. This is what we want to find. | ??? |
| $P(B \mid A)$ | Likelihood | Probability of testing positive if you have the disease. This is the test's sensitivity. | $0.95$ |
| $P(A)$ | Prior | Probability of having the disease before any test — the prevalence in the population. | $0.01$ |
| $P(B)$ | Evidence | Probability of testing positive overall (whether or not you have the disease). | Need to calculate |
Step 1: Calculate $P(B)$ using the Law of Total Probability¶
A person can test positive in two ways:
- They have the disease AND the test correctly detects it
- They don't have the disease AND the test incorrectly flags them
$$P(B) = P(B|A) \cdot P(A) + P(B|\text{not } A) \cdot P(\text{not } A)$$
Plugging in our numbers:
$$P(B) = 0.95 \times 0.01 + 0.10 \times 0.99$$ $$= 0.0095 + 0.099$$ $$= 0.1085$$
So about 10.85% of people will test positive overall.
Step 2: Apply Bayes' Theorem¶
$$P(A|B) = \dfrac{P(B|A) \cdot P(A)}{P(B)} = \dfrac{0.95 \times 0.01}{0.1085} = \dfrac{0.0095}{0.1085} \approx 0.0876$$
The Answer: Only about 8.76%!¶
Despite the test being 95% accurate, if you test positive, there's less than a 9% chance you actually have the disease.
Why Is the Answer So Low?¶
The disease is very rare (only 1% of people have it). This means:
- Out of 10,000 people: ~100 have the disease, ~9,900 don't.
- Of the 100 with the disease: 95 test positive (true positives).
- Of the 9,900 without the disease: 990 test positive (false positives!).
- Total positive tests: $95 + 990 = 1085$.
- Of those 1085 positive tests, only 95 actually have the disease: $\dfrac{95}{1085} \approx 8.76\%$.
The false positives overwhelm the true positives because there are so many more healthy people than sick people.
ML connection: This is exactly the problem of class imbalance. When one class is rare (fraud, disease, defects), accuracy alone is misleading. A model that always predicts "no fraud" is 99.9% accurate but completely useless!
# === Bayes' Theorem: Analytical Calculation ===
# Given values
p_disease = 0.01 # P(A): prior probability of having the disease
p_no_disease = 1 - p_disease # P(not A)
p_positive_given_disease = 0.95 # P(B|A): sensitivity (true positive rate)
p_positive_given_no_disease = 0.10 # P(B|not A): false positive rate
# Step 1: Calculate P(B) using the law of total probability
p_positive = (p_positive_given_disease * p_disease +
p_positive_given_no_disease * p_no_disease)
# Step 2: Apply Bayes' theorem
p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive
print("=== Analytical Calculation (Bayes' Theorem) ===")
print(f"P(disease) = {p_disease}")
print(f"P(positive | disease) = {p_positive_given_disease}")
print(f"P(positive | no disease) = {p_positive_given_no_disease}")
print(f"P(positive) [total prob] = {p_positive:.4f}")
print(f"P(disease | positive) = {p_disease_given_positive:.4f}")
print(f"\nOnly {p_disease_given_positive*100:.2f}% of people who test positive actually have the disease!")
print("\n")
# === Simulation with 1,000,000 people ===
n_people = 1_000_000
# Step 1: Determine who has the disease (1 = disease, 0 = no disease)
has_disease = np.random.choice([1, 0], size=n_people, p=[p_disease, p_no_disease])
# Step 2: Determine test results based on disease status
# For each person, generate a random number and compare to the appropriate rate
random_vals = np.random.random(n_people) # uniform random numbers between 0 and 1
# A person tests positive if:
# - They have the disease AND random < 0.95 (true positive), OR
# - They don't have the disease AND random < 0.10 (false positive)
tests_positive = np.where(
has_disease == 1,
random_vals < p_positive_given_disease, # sick people: 95% chance of positive
random_vals < p_positive_given_no_disease # healthy people: 10% chance of positive
)
# Step 3: Among those who tested positive, what fraction actually has the disease?
positive_mask = tests_positive # boolean mask: True where the test was positive
true_positives = np.sum(has_disease[positive_mask] == 1) # tested positive AND has disease
total_positives = np.sum(positive_mask) # total who tested positive
simulated_probability = true_positives / total_positives
print("=== Simulation (1,000,000 people) ===")
print(f"People with disease: {np.sum(has_disease):>8d}")
print(f"People without disease: {np.sum(has_disease == 0):>8d}")
print(f"Total positive tests: {total_positives:>8d}")
print(f"True positives: {true_positives:>8d}")
print(f"False positives: {total_positives - true_positives:>8d}")
print(f"\nSimulated P(disease | positive) = {simulated_probability:.4f}")
print(f"Analytical P(disease | positive) = {p_disease_given_positive:.4f}")
print(f"\nThe simulation confirms our analytical result!")
# Visualization: Breaking down what happens to 10,000 people
# Numbers for 10,000 people
total = 10000
sick = int(total * p_disease) # 100 have the disease
healthy = total - sick # 9900 are healthy
true_pos = int(sick * p_positive_given_disease) # 95 sick people test positive
false_neg = sick - true_pos # 5 sick people test negative
false_pos = int(healthy * p_positive_given_no_disease) # 990 healthy people test positive
true_neg = healthy - false_pos # 8910 healthy people test negative
# Create a bar chart showing the breakdown
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left plot: Population breakdown
categories = ['Have Disease\n(100)', 'No Disease\n(9,900)']
counts = [sick, healthy]
colors = ['#e74c3c', '#2ecc71']
axes[0].bar(categories, counts, color=colors, edgecolor='black', linewidth=0.5)
axes[0].set_title('Population Breakdown (10,000 people)', fontsize=13)
axes[0].set_ylabel('Number of people', fontsize=11)
for i, v in enumerate(counts):
axes[0].text(i, v + 100, str(v), ha='center', fontweight='bold', fontsize=12)
# Right plot: Test results breakdown
categories_2 = ['True Positive\n(Sick + Pos)', 'False Positive\n(Healthy + Pos)',
'False Negative\n(Sick + Neg)', 'True Negative\n(Healthy + Neg)']
counts_2 = [true_pos, false_pos, false_neg, true_neg]
colors_2 = ['#e74c3c', '#e67e22', '#f39c12', '#2ecc71']
bars = axes[1].bar(categories_2, counts_2, color=colors_2, edgecolor='black', linewidth=0.5)
axes[1].set_title('Test Results Breakdown', fontsize=13)
axes[1].set_ylabel('Number of people', fontsize=11)
axes[1].tick_params(axis='x', labelsize=9)
for i, v in enumerate(counts_2):
axes[1].text(i, v + 100, str(v), ha='center', fontweight='bold', fontsize=11)
plt.suptitle(f'Out of {true_pos + false_pos} positive tests, only {true_pos} truly have the disease '
f'({true_pos/(true_pos+false_pos)*100:.1f}%)',
fontsize=13, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
print(f"Key insight: {false_pos} false positives vastly outnumber {true_pos} true positives.")
print(f"This is why P(disease | positive test) is only ~{true_pos/(true_pos+false_pos)*100:.1f}%, not 95%.")
4. Random Variables and Expected Value¶
What is a Random Variable?¶
A random variable is simply a variable whose value depends on the outcome of a random experiment. It's a way to assign numbers to outcomes.
Think of it this way: when you roll a die, the outcome is a random variable. We typically call it $X$. Before you roll, you don't know what $X$ will be. After you roll, $X$ takes a specific value like 3 or 5.
There are two types:
Discrete random variables take on countable values (you can list them):
- Die roll: $X \in \{1, 2, 3, 4, 5, 6\}$
- Coin flip: $X \in \{0, 1\}$ (where 0 = Tails, 1 = Heads)
- Number of emails you receive today: $X \in \{0, 1, 2, 3, \ldots\}$
Continuous random variables can take any value in a range:
- Height: $X \in \mathbb{R}^+$ (could be 5.2 feet, 5.73 feet, etc.)
- Temperature: $X \in \mathbb{R}$ (could be 72.3°F, 72.31°F, etc.)
- Time to complete a task: $X \in [0, \infty)$
Notation convention: We use capital letters ($X$, $Y$, $Z$) for random variables and lowercase letters ($x$, $y$, $z$) for specific values they take. So "$P(X = x)$" means "the probability that the random variable $X$ takes the specific value $x$."
Expected Value (The Average You'd Expect)¶
The expected value $\mathbb{E}[X]$ is the average value you'd get if you repeated an experiment infinitely many times.
Think of it this way: it's the "center of gravity" of the probability distribution. If you put weights on a number line according to how probable each value is, the expected value is where the number line would balance.
For a Discrete Random Variable¶
$$\mathbb{E}[X] = \sum_{i} x_i \cdot P(X = x_i)$$
Let's unpack this notation:
- $\sum_{i}$: the Greek letter sigma means "add up" — we sum over all possible values
- $x_i$: each possible value the variable can take
- $P(X = x_i)$: the probability of that value occurring
- We multiply each value by its probability, then add them all up
In plain English: "Take each possible value, multiply it by how likely it is, and add everything together."
Worked Example: Expected Value of a Fair Die Roll¶
Each face (1 through 6) has probability $\dfrac{1}{6}$:
$$\mathbb{E}[X] = 1 \cdot \dfrac{1}{6} + 2 \cdot \dfrac{1}{6} + 3 \cdot \dfrac{1}{6} + 4 \cdot \dfrac{1}{6} + 5 \cdot \dfrac{1}{6} + 6 \cdot \dfrac{1}{6}$$
$$= \dfrac{1 + 2 + 3 + 4 + 5 + 6}{6} = \dfrac{21}{6} = 3.5$$
Notice that 3.5 is not even a possible outcome! You can never roll a 3.5. The expected value doesn't have to be a value the variable can take — it's the long-run average. If you rolled a die thousands of times and averaged all the results, you'd get very close to 3.5.
For a Continuous Random Variable¶
$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx$$
where $f(x)$ is the probability density function (PDF). Don't worry about the integral notation if you haven't seen it — just know it's the continuous version of "adding everything up." Instead of summing over a list of values, we're summing over a smooth curve.
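To make the integral concrete, here is a minimal numerical sketch: approximate $\mathbb{E}[X]$ for a Gaussian by summing $x \cdot f(x) \cdot dx$ over a fine grid — a Riemann sum standing in for the integral. (The particular mean, standard deviation, and grid are arbitrary choices.)

```python
import numpy as np

mu, sigma = 5.0, 1.5
# Fine grid covering essentially all of the probability mass
x = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 200_001)
dx = x[1] - x[0]

# Gaussian PDF evaluated on the grid
f = (1 / np.sqrt(2 * np.pi * sigma**2)) * np.exp(-(x - mu)**2 / (2 * sigma**2))

area = np.sum(f) * dx      # total probability: should be ~1
e_x = np.sum(x * f) * dx   # Riemann-sum approximation of E[X]
print(f"Area under PDF = {area:.6f} (should be 1)")
print(f"E[X]           = {e_x:.6f} (true mean: {mu})")
```

The sum recovers the mean almost exactly — "adding everything up" over a smooth curve is precisely what the integral does.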
Variance and Standard Deviation — How Spread Out Are the Values?¶
The expected value tells us the center, but it doesn't tell us how spread out the values are. Two distributions can have the same mean but look very different.
Variance¶
The variance measures how much the values spread out from the mean:
$$\mathrm{Var}(X) = \mathbb{E}[(X - \mu)^2]$$
where $\mu = \mathbb{E}[X]$ is the mean.
Think of it this way: for each possible value, we measure how far it is from the mean (that's $X - \mu$). We square that distance (to make negatives positive and to penalize large deviations more). Then we take the average of all those squared distances.
There's a useful shortcut formula (equivalent but often easier to compute):
$$\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$
In words: "the average of the squares minus the square of the average."
Standard Deviation¶
The standard deviation is simply the square root of the variance:
$$\sigma = \sqrt{\mathrm{Var}(X)}$$
Why do we need both? Variance is in squared units (if $X$ is in meters, variance is in meters$^2$). Standard deviation puts us back in the original units, making it easier to interpret.
- Small standard deviation: values cluster tightly around the mean
- Large standard deviation: values are spread far from the mean
Worked Example: Variance of a Fair Die Roll¶
We already know $\mu = \mathbb{E}[X] = 3.5$. Let's use the shortcut formula.
First, compute $\mathbb{E}[X^2]$:
$$\mathbb{E}[X^2] = 1^2 \cdot \dfrac{1}{6} + 2^2 \cdot \dfrac{1}{6} + 3^2 \cdot \dfrac{1}{6} + 4^2 \cdot \dfrac{1}{6} + 5^2 \cdot \dfrac{1}{6} + 6^2 \cdot \dfrac{1}{6}$$
$$= \dfrac{1 + 4 + 9 + 16 + 25 + 36}{6} = \dfrac{91}{6} \approx 15.167$$
Then:
$$\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \dfrac{91}{6} - (3.5)^2 = \dfrac{91}{6} - \dfrac{49}{4} = \dfrac{182 - 147}{12} = \dfrac{35}{12} \approx 2.917$$
$$\sigma = \sqrt{\dfrac{35}{12}} \approx 1.708$$
# === Expected Value, Variance, Std of a Fair Die Roll ===
# --- Analytical (by hand, using the formulas) ---
values = np.array([1, 2, 3, 4, 5, 6]) # possible outcomes
probs = np.array([1/6] * 6) # each has equal probability 1/6
# E[X] = sum of x_i * P(x_i)
expected_value = np.sum(values * probs)
# E[X^2] = sum of x_i^2 * P(x_i)
expected_x_squared = np.sum(values**2 * probs)
# Var(X) = E[X^2] - (E[X])^2
variance = expected_x_squared - expected_value**2
# Std = sqrt(Var)
std_dev = np.sqrt(variance)
print("=== Analytical Results ===")
print(f"E[X] = {expected_value:.4f} (expected: 3.5)")
print(f"E[X^2] = {expected_x_squared:.4f} (expected: 91/6 = 15.1667)")
print(f"Var(X) = {variance:.4f} (expected: 35/12 = 2.9167)")
print(f"Std(X) = {std_dev:.4f} (expected: sqrt(35/12) = 1.7078)")
print("\n")
# --- Simulation ---
n_rolls = 100_000
die_rolls = np.random.choice(values, size=n_rolls) # roll the die 100,000 times
sim_mean = np.mean(die_rolls) # sample mean
sim_var = np.var(die_rolls) # sample variance
sim_std = np.std(die_rolls) # sample standard deviation
print(f"=== Simulation Results ({n_rolls:,} rolls) ===")
print(f"Sample Mean = {sim_mean:.4f}")
print(f"Sample Variance = {sim_var:.4f}")
print(f"Sample Std Dev = {sim_std:.4f}")
print(f"\nThe simulation results match the analytical formulas closely!")
# Visualization: Law of Large Numbers — running mean converging to 3.5
n_rolls = 10000
die_rolls = np.random.choice([1, 2, 3, 4, 5, 6], size=n_rolls) # roll 10,000 times
# Compute running mean: mean after 1 roll, after 2 rolls, ..., after n rolls
running_mean = np.cumsum(die_rolls) / np.arange(1, n_rolls + 1)
# Plot on a log scale for the x-axis to see convergence clearly
plt.figure(figsize=(10, 5))
plt.semilogx(range(1, n_rolls + 1), running_mean, color='steelblue', linewidth=0.8,
label='Running mean of die rolls')
plt.axhline(y=3.5, color='red', linestyle='--', linewidth=2, label='E[X] = 3.5 (true mean)')
plt.xlabel('Number of rolls (log scale)', fontsize=12)
plt.ylabel('Running mean', fontsize=12)
plt.title('Law of Large Numbers: Running Mean Converges to E[X] = 3.5', fontsize=14)
plt.legend(fontsize=11)
plt.ylim(1, 6)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print("The Law of Large Numbers guarantees that the sample mean")
print("converges to the true expected value as the sample size grows.")
print("This is why simulations work — with enough samples, we can")
print("approximate any probability or expected value.")
5. Probability Distributions¶
Bernoulli Distribution — The Coin Flip¶
The Bernoulli distribution is the simplest possible distribution. It models a single experiment with exactly two outcomes: success or failure.
Think of it as a single coin flip (though the coin doesn't have to be fair):
- Success (value 1) happens with probability $p$
- Failure (value 0) happens with probability $1 - p$
Written compactly:
$$P(X = 1) = p \qquad P(X = 0) = 1 - p$$
Its mean and variance are beautifully simple:
$$\mathbb{E}[X] = p \qquad \mathrm{Var}(X) = p(1-p)$$
Notice that variance is maximized when $p = 0.5$ (a fair coin is the most uncertain!) and is zero when $p = 0$ or $p = 1$ (no uncertainty at all).
ML connection: Every binary classification problem is a Bernoulli trial:
- Spam or not spam? ($p$ = probability of spam)
- Cat or dog? ($p$ = probability of cat)
- Click or no click? ($p$ = probability of clicking)
Logistic regression, for example, directly models the parameter $p$ of a Bernoulli distribution.
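The Bernoulli mean and variance formulas above can be checked directly with simulated trials. (A minimal sketch — $p = 0.3$, the seed, and the sample size are arbitrary choices.)

```python
import numpy as np

rng = np.random.default_rng(3)                  # local generator, seeded arbitrarily
p = 0.3
samples = (rng.random(100_000) < p).astype(float)  # Bernoulli(p): 1 with probability p

print(f"Sample mean     = {samples.mean():.4f} (theory: p = {p})")
print(f"Sample variance = {samples.var():.4f} (theory: p(1-p) = {p * (1 - p):.4f})")

# Variance p(1-p) is maximized at p = 0.5 — the fair coin is the most uncertain
ps = np.linspace(0, 1, 101)
print(f"p maximizing p(1-p): {ps[np.argmax(ps * (1 - ps))]:.2f}")
```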
Binomial Distribution — Counting Successes¶
What if you repeat a Bernoulli trial multiple times and count the number of successes? That count follows a binomial distribution.
Example: Flip a coin 10 times. How many heads do you get? That number (0 through 10) follows a binomial distribution.
The Formula¶
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$
Let's break down every part:
| Symbol | Meaning |
|---|---|
| $k$ | The number of successes we're asking about |
| $n$ | The total number of trials |
| $p$ | Probability of success on each individual trial |
| $\binom{n}{k}$ | "$n$ choose $k$" — the number of different ways to arrange $k$ successes among $n$ trials |
| $p^k$ | Probability of getting exactly $k$ successes (each with probability $p$) |
| $(1-p)^{n-k}$ | Probability of getting exactly $n-k$ failures (each with probability $1-p$) |
The binomial coefficient $\binom{n}{k}$ (read "$n$ choose $k$") is:
$$\binom{n}{k} = \dfrac{n!}{k!(n-k)!}$$
where $n! = n \times (n-1) \times (n-2) \times \cdots \times 1$ is the factorial.
Worked Example: 7 Heads in 10 Coin Flips¶
What's the probability of getting exactly 7 heads in 10 flips of a fair coin ($p = 0.5$)?
Step 1: Compute the binomial coefficient. $$\binom{10}{7} = \dfrac{10!}{7! \cdot 3!} = \dfrac{10 \times 9 \times 8}{3 \times 2 \times 1} = \dfrac{720}{6} = 120$$
This means there are 120 different ways to arrange 7 heads among 10 flips.
Step 2: Compute $p^k = (0.5)^7 = 0.0078125$
Step 3: Compute $(1-p)^{n-k} = (0.5)^3 = 0.125$
Step 4: Multiply everything together. $$P(X = 7) = 120 \times 0.0078125 \times 0.125 = 120 \times 0.000977 = 0.1172$$
So there's about an 11.7% chance of getting exactly 7 heads in 10 fair coin flips.
The mean and variance of the binomial are:
$$\mathbb{E}[X] = np \qquad \mathrm{Var}(X) = np(1-p)$$
# === Binomial Distribution: PMF for n=10, p=0.5 ===
from math import comb # comb(n, k) computes the binomial coefficient
n = 10 # number of trials
p = 0.5 # probability of success on each trial
print(f"=== Binomial PMF: n={n}, p={p} ===")
print(f"{'k':>3s} {'P(X=k)':>10s}")
print("-" * 16)
# Calculate P(X = k) for each possible value of k
pmf_values = []
for k in range(n + 1):
# The binomial PMF formula
binom_coeff = comb(n, k) # n choose k
prob = binom_coeff * (p**k) * ((1-p)**(n-k)) # full formula
pmf_values.append(prob)
print(f"{k:>3d} {prob:>10.4f}")
print(f"\nSum of all probabilities: {sum(pmf_values):.4f} (should be 1.0)")
print("\n")
# --- Verify mean and variance with simulation ---
n_experiments = 100_000
# Each experiment: flip a coin 10 times, count number of heads
simulated_counts = np.random.binomial(n=n, p=p, size=n_experiments)
print(f"=== Verifying Mean and Variance ({n_experiments:,} experiments) ===")
print(f"Analytical Mean: E[X] = n*p = {n*p:.2f}")
print(f"Simulated Mean: {np.mean(simulated_counts):.4f}")
print(f"Analytical Variance: Var(X) = n*p*(1-p) = {n*p*(1-p):.2f}")
print(f"Simulated Variance: {np.var(simulated_counts):.4f}")
# Visualization: Binomial PMF for different values of p
n = 10 # number of trials (same for all)
p_values = [0.2, 0.5, 0.8] # three different success probabilities
k_vals = np.arange(0, n + 1) # possible number of successes: 0 to 10
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for idx, p_val in enumerate(p_values):
# Compute PMF for each k
pmf = [comb(n, k) * (p_val**k) * ((1-p_val)**(n-k)) for k in k_vals]
# Bar chart of the PMF
axes[idx].bar(k_vals, pmf, color='steelblue', edgecolor='black', linewidth=0.5, alpha=0.8)
axes[idx].set_xlabel('k (number of successes)', fontsize=11)
axes[idx].set_title(f'Binomial(n={n}, p={p_val})', fontsize=12)
axes[idx].set_xticks(k_vals)
# Add mean line
mean_val = n * p_val
axes[idx].axvline(x=mean_val, color='red', linestyle='--', linewidth=1.5,
label=f'Mean = {mean_val:.1f}')
axes[idx].legend(fontsize=10)
axes[0].set_ylabel('P(X = k)', fontsize=11)
plt.suptitle('Binomial Distribution: How p Affects the Shape', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
print("Notice how:")
print("- p=0.2: skewed right (most likely to get few successes)")
print("- p=0.5: symmetric (centered at n/2 = 5)")
print("- p=0.8: skewed left (most likely to get many successes)")
The Gaussian (Normal) Distribution — The Most Important Distribution¶
If you only learn one probability distribution, make it the Gaussian (also called the Normal distribution). It's arguably the most important distribution in all of statistics and machine learning.
Why Is It So Important?¶
Many natural phenomena follow it. Heights of people, test scores, measurement errors, blood pressure readings — they all tend to form a bell-shaped curve.
The Central Limit Theorem. When you average many random numbers together, the result tends to follow a Gaussian — regardless of what the original distribution looked like. This is one of the most remarkable results in all of mathematics.
Many ML algorithms assume it. Linear regression assumes Gaussian errors. Gaussian Naive Bayes, Gaussian Mixture Models, and many other methods are built directly on this distribution.
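The Central Limit Theorem can be seen in a few lines: average uniform random numbers — which individually are flat, not bell-shaped — and the averages cluster into a Gaussian shape. (A minimal sketch; the sample sizes and seed are arbitrary choices.)

```python
import numpy as np

rng = np.random.default_rng(4)                   # local generator, seeded arbitrarily
# 50,000 experiments; each averages 30 uniform(0, 1) numbers
averages = rng.random((50_000, 30)).mean(axis=1)

# A uniform(0, 1) variable has mean 1/2 and variance 1/12,
# so the average of 30 should be approximately Normal(0.5, sqrt((1/12)/30))
print(f"Mean of averages = {averages.mean():.4f} (theory: 0.5)")
print(f"Std of averages  = {averages.std():.4f} (theory: {np.sqrt(1/12/30):.4f})")

# Crude normality check: roughly 68% of a Gaussian lies within 1 std dev
within_1sd = np.mean(np.abs(averages - averages.mean()) < averages.std())
print(f"Fraction within 1 std dev = {within_1sd:.4f} (Gaussian: ~0.68)")
```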
The Probability Density Function (PDF)¶
$$f(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\dfrac{(x - \mu)^2}{2\sigma^2}\right)$$
This looks intimidating, but let's break down every piece:
| Symbol | Name | What It Does |
|---|---|---|
| $\mu$ (mu) | Mean | The center of the bell curve. The peak is at $x = \mu$. |
| $\sigma$ (sigma) | Standard deviation | Controls the width. Larger $\sigma$ = wider, flatter bell. |
| $\sigma^2$ | Variance | The square of the standard deviation. |
| $\pi$ | Pi | The familiar constant $\approx 3.14159$ |
| $e$ | Euler's number | The mathematical constant $\approx 2.71828$ |
| $\exp(\cdot)$ | Exponential | Shorthand for $e^{(\cdot)}$ |
| $\dfrac{1}{\sqrt{2\pi\sigma^2}}$ | Normalization | Makes the total area under the curve equal to 1 |
| $(x - \mu)^2$ | Squared distance from mean | Values far from $\mu$ get smaller $f(x)$ |
The standard normal distribution is the special case where $\mu = 0$ and $\sigma = 1$.
The 68-95-99.7 Rule¶
One of the most useful facts about the Gaussian:
- ~68% of values fall within 1 standard deviation of the mean: $[\mu - \sigma, \mu + \sigma]$
- ~95% of values fall within 2 standard deviations: $[\mu - 2\sigma, \mu + 2\sigma]$
- ~99.7% of values fall within 3 standard deviations: $[\mu - 3\sigma, \mu + 3\sigma]$
This means that values more than 3 standard deviations from the mean are extremely rare (only 0.3% of the time). This is often used to detect outliers.
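The 68-95-99.7 rule is easy to verify empirically with standard normal samples. (A minimal sketch — the seed and sample size are arbitrary choices.)

```python
import numpy as np

rng = np.random.default_rng(5)              # local generator, seeded arbitrarily
z = rng.normal(0, 1, size=1_000_000)        # standard normal samples

for k in [1, 2, 3]:
    frac = np.mean(np.abs(z) < k)           # fraction within k standard deviations
    print(f"Within {k} std dev(s): {frac:.4f}")
```

The printed fractions land very close to 0.68, 0.95, and 0.997.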
# Generate Gaussian samples and overlay the analytical PDF
mu = 5.0 # mean
sigma = 1.5 # standard deviation
n_samples = 10000
# Generate random samples from a Gaussian distribution
samples = np.random.normal(loc=mu, scale=sigma, size=n_samples)
# Create x values for plotting the analytical PDF
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 300) # range: mean +/- 4 std devs
# The Gaussian PDF formula
pdf = (1 / np.sqrt(2 * np.pi * sigma**2)) * np.exp(-(x - mu)**2 / (2 * sigma**2))
# Plot histogram of samples overlaid with the analytical PDF
plt.figure(figsize=(10, 5))
plt.hist(samples, bins=50, density=True, alpha=0.6, color='steelblue',
edgecolor='black', linewidth=0.3, label=f'Histogram ({n_samples:,} samples)')
plt.plot(x, pdf, 'r-', linewidth=2.5, label=f'Analytical PDF: N({mu}, {sigma}$^2$)')
plt.axvline(x=mu, color='darkred', linestyle='--', linewidth=1.5, label=f'Mean = {mu}')
plt.xlabel('x', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Gaussian Distribution: Histogram vs Analytical PDF', fontsize=14)
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Sample mean: {np.mean(samples):.3f} (true: {mu})")
print(f"Sample std: {np.std(samples):.3f} (true: {sigma})")
print(f"The histogram closely matches the smooth PDF curve!")
# Two subplots: effect of changing mean vs changing std dev
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
x = np.linspace(-5, 12, 500) # x-axis range
# --- Left plot: Different means, same std dev ---
sigma_fixed = 1.0
means = [0, 2, 4]
colors_means = ['#e74c3c', '#2ecc71', '#3498db']
for mu_val, color in zip(means, colors_means):
    # Compute PDF for each mean
    pdf_val = (1 / np.sqrt(2 * np.pi * sigma_fixed**2)) * \
              np.exp(-(x - mu_val)**2 / (2 * sigma_fixed**2))
    axes[0].plot(x, pdf_val, linewidth=2.5, color=color,
                 label=f'$\\mu={mu_val}$, $\\sigma={sigma_fixed}$')
axes[0].set_xlabel('x', fontsize=12)
axes[0].set_ylabel('f(x)', fontsize=12)
axes[0].set_title('Changing the Mean $\\mu$ (shifts the curve)', fontsize=13)
axes[0].legend(fontsize=11)
axes[0].grid(alpha=0.3)
# --- Right plot: Same mean, different std devs ---
mu_fixed = 3.0
sigmas = [0.5, 1.0, 2.0]
colors_sigmas = ['#e74c3c', '#2ecc71', '#3498db']
for sigma_val, color in zip(sigmas, colors_sigmas):
    # Compute PDF for each sigma
    pdf_val = (1 / np.sqrt(2 * np.pi * sigma_val**2)) * \
              np.exp(-(x - mu_fixed)**2 / (2 * sigma_val**2))
    axes[1].plot(x, pdf_val, linewidth=2.5, color=color,
                 label=f'$\\mu={mu_fixed}$, $\\sigma={sigma_val}$')
axes[1].set_xlabel('x', fontsize=12)
axes[1].set_ylabel('f(x)', fontsize=12)
axes[1].set_title('Changing $\\sigma$ (stretches/compresses the curve)', fontsize=13)
axes[1].legend(fontsize=11)
axes[1].grid(alpha=0.3)
plt.suptitle('How Mean and Standard Deviation Shape the Gaussian', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
print("Left: Changing mu shifts the curve left or right (the center moves).")
print("Right: Changing sigma makes the curve wider (more spread) or narrower (more concentrated).")
print(" Notice that wider curves are shorter — the total area must always be 1.")
The Multivariate Gaussian¶
So far, our Gaussian has described a single variable (e.g., height). But in machine learning, we usually have multiple features (e.g., height AND weight AND age). The multivariate Gaussian extends the bell curve to multiple dimensions.
The PDF for a $d$-dimensional multivariate Gaussian is:
$$f(\mathbf{x}) = \dfrac{1}{(2\pi)^{d/2}|\mathbf{\Sigma}|^{1/2}} \exp\left(-\dfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)$$
Don't worry about memorizing this formula! The key things to understand are:
| Symbol | What It Is |
|--------|------------|
| $\mathbf{x}$ | A vector of $d$ features (e.g., $[\text{height}, \text{weight}]$) |
| $\boldsymbol{\mu}$ | The mean vector — the center of the distribution in $d$-dimensional space |
| $\mathbf{\Sigma}$ | The covariance matrix — a $d \times d$ matrix that captures how the features vary and relate |
| $\lvert\mathbf{\Sigma}\rvert$ | The determinant of $\mathbf{\Sigma}$ (a scalar that measures the "volume" of the distribution) |
| $\mathbf{\Sigma}^{-1}$ | The inverse of the covariance matrix |
| $d$ | The number of dimensions (features) |
The covariance matrix $\mathbf{\Sigma}$ is the key new concept:
- Diagonal entries $\Sigma_{ii}$: the variance of each individual feature (how much it spreads)
- Off-diagonal entries $\Sigma_{ij}$: the covariance between features $i$ and $j$ (how they co-vary)
- Positive covariance: features increase together
- Negative covariance: when one increases, the other decreases
- Zero covariance: the features are uncorrelated (for jointly Gaussian features, this actually implies independence — though not for distributions in general)
If $\mathbf{\Sigma} = \mathbf{I}$ (the identity matrix), all features are independent and equally spread — the distribution looks like a circle in 2D.
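To make the formula less abstract, here's a small sketch (the helper names `mvn_pdf` and `gaussian_1d` are ours) that implements the multivariate PDF directly with NumPy and checks the independence claim: when $\mathbf{\Sigma} = \mathbf{I}$, the joint density is just the product of independent 1D standard normal densities.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density, computed straight from the formula."""
    d = len(mu)
    diff = x - mu
    norm_const = 1 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    exponent = -0.5 * diff @ np.linalg.inv(Sigma) @ diff
    return norm_const * np.exp(exponent)

def gaussian_1d(x, mu, sigma):
    """1D Gaussian density, for comparison."""
    return (1 / np.sqrt(2 * np.pi * sigma**2)) * np.exp(-(x - mu)**2 / (2 * sigma**2))

x = np.array([0.5, -1.0])
mu = np.zeros(2)

# With Sigma = I, the joint density factorizes into a product of 1D densities
joint = mvn_pdf(x, mu, np.eye(2))
product = gaussian_1d(x[0], 0, 1) * gaussian_1d(x[1], 0, 1)
print(f"Joint density:       {joint:.6f}")
print(f"Product of 1D PDFs:  {product:.6f}")  # same value — independence
```

With a non-diagonal $\mathbf{\Sigma}$, the two numbers would differ: the off-diagonal covariance couples the features together.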
# 2D Multivariate Gaussians with different covariance matrices
n_samples = 500 # number of samples to draw
mean = [0, 0] # mean vector (center at origin)
# Three different covariance matrices
cov_matrices = [
np.array([[1, 0], [0, 1]]), # Identity: independent, circular
np.array([[1, 0.8], [0.8, 1]]), # Positive correlation: tilted right
np.array([[1, -0.8], [-0.8, 1]]) # Negative correlation: tilted left
]
titles = [
'$\\Sigma = I$ (Independent)\nCircular shape',
'Positive Correlation ($\\rho=0.8$)\nTilted right',
'Negative Correlation ($\\rho=-0.8$)\nTilted left'
]
fig, axes = plt.subplots(1, 3, figsize=(15, 4.5))
for idx, (cov, title) in enumerate(zip(cov_matrices, titles)):
    # Generate 2D samples from the multivariate Gaussian
    samples_2d = np.random.multivariate_normal(mean, cov, size=n_samples)
    # Scatter plot of the samples
    axes[idx].scatter(samples_2d[:, 0], samples_2d[:, 1], alpha=0.4, s=10, color='steelblue')
    axes[idx].set_xlabel('$x_1$', fontsize=12)
    axes[idx].set_ylabel('$x_2$', fontsize=12)
    axes[idx].set_title(title, fontsize=11)
    axes[idx].set_xlim(-4, 4)
    axes[idx].set_ylim(-4, 4)
    axes[idx].set_aspect('equal')  # equal scaling so circles look like circles
    axes[idx].grid(alpha=0.3)
    # Mark the mean
    axes[idx].plot(0, 0, 'r+', markersize=15, markeredgewidth=2)
plt.suptitle('Multivariate Gaussian: How Covariance Affects Shape', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
print("Left: Identity covariance -> features are independent -> circular cloud.")
print("Middle: Positive covariance -> when x1 is high, x2 tends to be high too -> tilted right.")
print("Right: Negative covariance -> when x1 is high, x2 tends to be low -> tilted left.")
6. Feature Standardization — A Direct Application¶
Let's put everything we've learned to practical use. One of the most common data preprocessing steps in machine learning is feature standardization (also called z-score normalization).
The Problem¶
Different features often live on very different scales:
- Age: 0–100
- Salary: 30,000–200,000
- Temperature: -20 to 45
Many ML algorithms (gradient descent, k-nearest neighbors, SVMs) are sensitive to feature scales. If salary has values in the tens of thousands while age is in the tens, the algorithm will be dominated by salary — even if age is equally important.
The Solution: Z-Score Standardization¶
Transform each feature so it has mean 0 and standard deviation 1:
$$x' = \dfrac{x - \mu}{\sigma}$$
where:
- $x$ is the original value
- $\mu$ is the mean of that feature
- $\sigma$ is the standard deviation of that feature
- $x'$ is the standardized value
After this transformation:
- The new mean is 0
- The new standard deviation is 1
- The values represent "how many standard deviations away from the mean" — this is exactly the z-score
Think of it this way: if your height has a z-score of +2.0, you're 2 standard deviations above average — you're tall! A z-score of -1.5 means 1.5 standard deviations below average.
Notice how this connects directly to what we learned about mean ($\mu$) and standard deviation ($\sigma$) earlier!
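As a single-number example (the height figures here are illustrative, not real population statistics): if heights have mean 175 cm and standard deviation 7.5 cm, then a 190 cm person is 15 cm, or exactly two standard deviations, above the mean.

```python
height, mu, sigma = 190.0, 175.0, 7.5  # illustrative numbers
z = (height - mu) / sigma               # z-score: (x - mu) / sigma
print(f"z-score: {z:.1f}")              # → 2.0
```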
# === Feature Standardization in Practice ===
# Generate some raw data: imagine these are salaries (in thousands)
np.random.seed(42)
raw_data = np.random.normal(loc=65, scale=15, size=1000) # mean ~65k, std ~15k
# Print original statistics
print("=== Before Standardization ===")
print(f"Mean: {np.mean(raw_data):.2f}")
print(f"Std: {np.std(raw_data):.2f}")
print(f"Min: {np.min(raw_data):.2f}")
print(f"Max: {np.max(raw_data):.2f}")
# Apply z-score standardization: x' = (x - mu) / sigma
mu = np.mean(raw_data) # compute the mean
sigma = np.std(raw_data) # compute the standard deviation
standardized_data = (raw_data - mu) / sigma # apply the formula
print("\n=== After Standardization ===")
print(f"Mean: {np.mean(standardized_data):.6f} (should be ~0)")
print(f"Std: {np.std(standardized_data):.6f} (should be ~1)")
print(f"Min: {np.min(standardized_data):.2f}")
print(f"Max: {np.max(standardized_data):.2f}")
# --- Before/After Histograms ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Before standardization
axes[0].hist(raw_data, bins=40, color='steelblue', edgecolor='black',
linewidth=0.3, alpha=0.7)
axes[0].axvline(x=np.mean(raw_data), color='red', linestyle='--', linewidth=2,
label=f'Mean = {np.mean(raw_data):.1f}')
axes[0].set_xlabel('Salary (thousands)', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_title('Before Standardization', fontsize=13)
axes[0].legend(fontsize=11)
# After standardization
axes[1].hist(standardized_data, bins=40, color='#2ecc71', edgecolor='black',
linewidth=0.3, alpha=0.7)
axes[1].axvline(x=0, color='red', linestyle='--', linewidth=2,
label=f'Mean = 0')
axes[1].set_xlabel('Standardized value (z-score)', fontsize=12)
axes[1].set_ylabel('Count', fontsize=12)
axes[1].set_title('After Standardization', fontsize=13)
axes[1].legend(fontsize=11)
plt.suptitle('Feature Standardization: Z-Score Normalization', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
print("\nThe shape of the distribution stays the same — we've only shifted and rescaled it.")
print("Now the data is centered at 0 with unit variance, ready for ML algorithms.")
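One practical caveat: in a real ML pipeline, $\mu$ and $\sigma$ should be computed on the training split only and then reused to transform the test split; otherwise statistics from the test data leak into preprocessing. (This is the fit/transform pattern that tools like scikit-learn's `StandardScaler` formalize.) A minimal NumPy sketch, with made-up salary data:

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=65, scale=15, size=800)  # training salaries (thousands)
test = rng.normal(loc=65, scale=15, size=200)   # held-out test salaries

# "Fit": compute mu and sigma on the TRAINING data only
mu, sigma = train.mean(), train.std()

# "Transform": apply the SAME mu and sigma to both splits
train_z = (train - mu) / sigma
test_z = (test - mu) / sigma

print(f"Train mean/std: {train_z.mean():.3f} / {train_z.std():.3f}")  # ~0 / ~1 exactly
print(f"Test  mean/std: {test_z.mean():.3f} / {test_z.std():.3f}")    # close, not exact
```

The test statistics won't be exactly 0 and 1, and that's correct: the test set is standardized with the training set's parameters, just as future unseen data would be.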
Check Your Understanding¶
Now it's your turn! Try these exercises to solidify your understanding. Solutions are provided below each task, but try to work through them yourself first.
Task 1: Bayes' Theorem in a Factory¶
A factory produces light bulbs. Here are the facts:
- 5% of bulbs are defective ($P(D) = 0.05$)
- The inspector catches 90% of defective bulbs ($P(F|D) = 0.90$, where $F$ = "flagged")
- The inspector also incorrectly flags 3% of good bulbs ($P(F|\text{not } D) = 0.03$)
Question: If a bulb is flagged by the inspector, what's the probability it's actually defective?
Try it yourself first! Use Bayes' theorem: $$P(D|F) = \dfrac{P(F|D) \cdot P(D)}{P(F)}$$
Remember to compute $P(F)$ using the law of total probability.
# === Task 1 Solution: Bayes' Theorem for Factory Inspection ===
# --- Analytical Solution ---
p_defective = 0.05 # P(D): probability a bulb is defective
p_good = 1 - p_defective # P(not D): probability a bulb is good
p_flagged_given_defective = 0.90 # P(F|D): inspector catches defective bulbs
p_flagged_given_good = 0.03 # P(F|not D): inspector incorrectly flags good bulbs
# Step 1: P(F) using law of total probability
p_flagged = (p_flagged_given_defective * p_defective +
p_flagged_given_good * p_good)
# Step 2: Bayes' theorem
p_defective_given_flagged = (p_flagged_given_defective * p_defective) / p_flagged
print("=== Analytical Solution ===")
print(f"P(Defective) = {p_defective}")
print(f"P(Flagged | Defective) = {p_flagged_given_defective}")
print(f"P(Flagged | Good) = {p_flagged_given_good}")
print(f"P(Flagged) [total prob] = {p_flagged:.4f}")
print(f"")
print(f"P(Defective | Flagged) = {p_defective_given_flagged:.4f}")
print(f" = {p_defective_given_flagged*100:.2f}%")
print("\n")
# --- Simulation ---
np.random.seed(42)
n_bulbs = 1_000_000
# Determine which bulbs are defective
is_defective = np.random.random(n_bulbs) < p_defective # True = defective
# Determine which bulbs get flagged
random_vals = np.random.random(n_bulbs)
is_flagged = np.where(
is_defective,
random_vals < p_flagged_given_defective, # defective bulbs: 90% flagged
random_vals < p_flagged_given_good # good bulbs: 3% flagged
)
# Among flagged bulbs, what fraction is actually defective?
flagged_and_defective = np.sum(is_defective & is_flagged)
total_flagged = np.sum(is_flagged)
sim_result = flagged_and_defective / total_flagged
print(f"=== Simulation ({n_bulbs:,} bulbs) ===")
print(f"Total defective: {np.sum(is_defective):>8,}")
print(f"Total flagged: {total_flagged:>8,}")
print(f"Flagged AND defective: {flagged_and_defective:>8,}")
print(f"")
print(f"Simulated P(D|F) = {sim_result:.4f} ({sim_result*100:.2f}%)")
print(f"Analytical P(D|F) = {p_defective_given_flagged:.4f} ({p_defective_given_flagged*100:.2f}%)")
print(f"\nThe simulation confirms the analytical result!")
Task 2: Verify the 68-95-99.7 Rule¶
Generate 1000 samples from a Gaussian distribution with $\mu = 10$ and $\sigma = 3$.
Then count what percentage of samples fall within:
- 1 standard deviation of the mean: $[10 - 3, 10 + 3] = [7, 13]$
- 2 standard deviations: $[10 - 6, 10 + 6] = [4, 16]$
- 3 standard deviations: $[10 - 9, 10 + 9] = [1, 19]$
Compare your results to the expected 68%, 95%, and 99.7%.
# === Task 2 Solution: Verify the 68-95-99.7 Rule ===
mu = 10 # mean
sigma = 3 # standard deviation
n_samples = 1000
# Generate samples
np.random.seed(42)
samples = np.random.normal(loc=mu, scale=sigma, size=n_samples)
# Check what fraction falls within 1, 2, and 3 standard deviations
for n_std in [1, 2, 3]:
    lower = mu - n_std * sigma  # lower bound
    upper = mu + n_std * sigma  # upper bound
    # Count how many samples are within the range
    within = np.sum((samples >= lower) & (samples <= upper))
    percentage = within / n_samples * 100
    # Expected percentages from the rule
    expected = {1: 68, 2: 95, 3: 99.7}[n_std]
    print(f"Within {n_std} std dev [{lower:.0f}, {upper:.0f}]: "
          f"{within}/{n_samples} = {percentage:.1f}% (expected ~{expected}%)")
print("\n")
# --- Visualization ---
plt.figure(figsize=(10, 5))
# Histogram of samples
plt.hist(samples, bins=40, density=True, alpha=0.6, color='steelblue',
edgecolor='black', linewidth=0.3, label='Samples')
# Overlay the PDF
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 300)
pdf = (1 / np.sqrt(2 * np.pi * sigma**2)) * np.exp(-(x - mu)**2 / (2 * sigma**2))
plt.plot(x, pdf, 'r-', linewidth=2, label='PDF')
# Shade the regions for 1, 2, 3 std devs
colors_shade = ['#e74c3c', '#f39c12', '#2ecc71']  # for 1, 2, 3 std devs
alphas = [0.35, 0.25, 0.15]                        # innermost band most opaque
# Draw the 3-sigma band first so the narrower bands sit on top
for n_std, color, alpha in zip([3, 2, 1], reversed(colors_shade), reversed(alphas)):
    lower = mu - n_std * sigma
    upper = mu + n_std * sigma
    mask = (x >= lower) & (x <= upper)
    plt.fill_between(x[mask], pdf[mask], color=color, alpha=alpha,
                     label=f'{n_std}$\\sigma$: [{lower:.0f}, {upper:.0f}]')
plt.xlabel('x', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title(f'The 68-95-99.7 Rule: N($\\mu$={mu}, $\\sigma$={sigma})', fontsize=14)
plt.legend(fontsize=10)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print("The 68-95-99.7 rule is approximate, but our simulation matches it closely.")
print("With more samples (try 1,000,000!), the match would be even better.")
Summary¶
Let's recap the key concepts we covered:
| Concept | Key Idea | ML Connection |
|---|---|---|
| Probability | Quantifies uncertainty on a 0-to-1 scale | Every model prediction is a probability |
| Conditional Probability | How one event affects another: $P(A \mid B)$ | Features inform predictions |
| Bayes' Theorem | Update beliefs with evidence | Naive Bayes classifier, Bayesian ML |
| Expected Value | The long-run average: $\mathbb{E}[X]$ | Loss functions, reward in RL |
| Variance / Std Dev | How spread out values are | Regularization, uncertainty estimation |
| Bernoulli / Binomial | Coin flips and counting successes | Binary classification |
| Gaussian Distribution | The bell curve — central to statistics | Assumed by many ML algorithms |
| Multivariate Gaussian | Bell curve in multiple dimensions | Gaussian Mixture Models, Bayesian methods |
| Feature Standardization | Z-score: $(x - \mu) / \sigma$ | Essential preprocessing step |
These concepts form the probabilistic foundation for understanding machine learning. Nearly every ML algorithm either explicitly uses probability (Bayesian methods, logistic regression, neural networks with softmax) or implicitly relies on probabilistic assumptions (linear regression, SVMs).
What's next? With these fundamentals in place, you're ready to explore:
- Statistics: hypothesis testing, confidence intervals, maximum likelihood estimation
- Information theory: entropy, cross-entropy, KL divergence (used in neural network loss functions)
- Probabilistic models: Naive Bayes, Gaussian Mixture Models, Bayesian neural networks