Statistics for Machine Learning¶
Statistics is the science of learning from data. While probability theory (covered in the previous notebook) tells us how to reason about uncertainty and randomness, statistics gives us the tools to actually work with real data — to summarize it, find patterns, and draw conclusions.
In machine learning, statistics shows up everywhere:
- Understanding your data before building any model (descriptive statistics)
- Finding relationships between features (covariance, correlation)
- Fitting models to data (maximum likelihood estimation)
- Defining loss functions that tell your model how wrong it is (cross-entropy, KL divergence)
We'll cover two major branches:
- Descriptive statistics — summarizing and visualizing data (mean, variance, correlation)
- Inferential statistics — drawing conclusions from data and fitting models (MLE, information theory)
Prerequisites¶
This notebook assumes you've already read the Probability notebook. We'll freely use concepts like expected value ($\mathbb{E}[X]$), variance ($\mathrm{Var}(X)$), and probability distributions without re-deriving them. If any of those feel unfamiliar, go back and review that notebook first.
Let's get started!
import numpy as np # numerical computing
import matplotlib.pyplot as plt # plotting and visualization
1. Descriptive Statistics — Summarizing Data¶
Imagine you've just been handed a spreadsheet with 10,000 rows of customer data. You can't look at every single number — you need a way to summarize all that information into a few key numbers. That's exactly what descriptive statistics does.
We'll start with two fundamental questions:
- Where is the center of the data? (Measures of central tendency)
- How spread out is the data? (Measures of spread)
Measures of Central Tendency¶
These tell us where the "center" of our data is. Think of it this way: if you had to describe a whole dataset with a single number, what would you pick?
Mean (Average)¶
The mean is the most familiar measure: add up all values and divide by the count.
$$\bar{x} = \dfrac{1}{n} \sum_{i=1}^{n} x_i$$
Let's break down every symbol:
- $\bar{x}$ (read "x-bar"): the sample mean — our answer
- $n$: the number of data points
- $x_i$: the $i$-th data point (e.g., $x_1$ is the first value, $x_2$ is the second, etc.)
- $\sum_{i=1}^{n}$: "sum up" — add together all values from $x_1$ through $x_n$
Worked example: Given the data $\{2, 4, 4, 4, 5, 5, 7, 9\}$:
$$\bar{x} = \dfrac{2 + 4 + 4 + 4 + 5 + 5 + 7 + 9}{8} = \dfrac{40}{8} = 5$$
Median¶
The median is the middle value when data is sorted. If there's an even number of values, take the average of the two middle ones.
Think of it this way: line up all your data points from smallest to largest, then walk to the exact middle.
For $\{2, 4, 4, 4, 5, 5, 7, 9\}$ (8 values, so we average positions 4 and 5):
- Position 4: value = 4
- Position 5: value = 5
- Median = $\dfrac{4 + 5}{2} = 4.5$
Mode¶
The mode is the value that appears most often. In our example, mode = 4 (appears 3 times, more than any other value).
When to Use Which?¶
| Measure | Best for | Sensitive to outliers? |
|---|---|---|
| Mean | Symmetric data without outliers | Yes — one extreme value can pull it far |
| Median | Skewed data or data with outliers (income, house prices) | No — outliers barely affect it |
| Mode | Categorical data (most popular color, most common category) | No |
The classic example: if Bill Gates walks into a coffee shop, the mean income in the room skyrockets, but the median barely changes. That's why median household income is more informative than mean household income.
# Our sample data
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
# --- Compute mean, median, mode ---
mean = np.mean(data) # sum of all values / count
median = np.median(data) # middle value when sorted
# NumPy doesn't have a mode function, so we compute it manually
values, counts = np.unique(data, return_counts=True) # count occurrences of each unique value
mode = values[np.argmax(counts)] # find the value with the highest count
print(f"Data: {data}")
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
# --- Now let's see what happens when we add an outlier ---
data_with_outlier = np.append(data, 100) # add an extreme value
mean_outlier = np.mean(data_with_outlier)
median_outlier = np.median(data_with_outlier)
print(f"\n--- After adding outlier (100) ---")
print(f"Data: {data_with_outlier}")
print(f"Mean: {mean} → {mean_outlier:.2f} (shifted by {mean_outlier - mean:.2f}!)")
print(f"Median: {median} → {median_outlier} (barely moved!)")
print(f"\nNotice: the mean jumped from 5 to {mean_outlier:.2f}, but the median only moved from 4.5 to {median_outlier}.")
print("This is why the median is called a 'robust' statistic — it resists outliers.")
Measures of Spread¶
Knowing the center isn't enough. Two datasets can have the exact same mean but look completely different. Measures of spread tell us how spread out the data is around the center.
Range¶
The simplest measure: $\text{Range} = \text{max} - \text{min}$
For $\{2, 4, 4, 4, 5, 5, 7, 9\}$: Range = $9 - 2 = 7$
The range is easy to compute but very sensitive to outliers (a single extreme value changes it dramatically).
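As a quick check, using the same sample as above:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
data_range = np.max(data) - np.min(data)  # 9 - 2 = 7
print(f"Range: {data_range}")

# A single outlier changes the range dramatically
data_outlier = np.append(data, 100)
print(f"Range with outlier: {np.max(data_outlier) - np.min(data_outlier)}")  # 100 - 2 = 98
```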
Variance¶
Variance is the average squared distance from the mean. Think of it this way: for each data point, ask "how far is this from the mean?" Square that distance (to make everything positive), then average all those squared distances.
For a population (when you have ALL the data):
$$\sigma^2 = \dfrac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$
- $\sigma^2$ ("sigma squared"): the population variance
- $N$: total number of data points in the population
- $\mu$: the population mean
- $(x_i - \mu)^2$: squared distance of each point from the mean
For a sample (a subset of the data):
$$s^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
- $s^2$: the sample variance
- $n$: number of data points in the sample
- $\bar{x}$: the sample mean
Why $n-1$ instead of $n$? This is called Bessel's correction. When we compute the sample mean $\bar{x}$, we've already "used up" one piece of information from the data. Dividing by $n$ would systematically underestimate the true population variance. Dividing by $n-1$ corrects this bias, giving us an unbiased estimate.
In NumPy:
- `np.var(data, ddof=0)` — population variance (divides by $N$)
- `np.var(data, ddof=1)` — sample variance (divides by $n - 1$)
- `ddof` stands for "delta degrees of freedom"
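To see Bessel's correction at work, here's a small simulation (the seed, sample size, and trial count are arbitrary choices): draw many small samples from a population with known variance $\sigma^2 = 4$, and average the two estimators across trials.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0  # population is N(0, 2²), so σ² = 4

n = 5              # small samples make the bias visible
trials = 100_000
samples = rng.normal(0, 2, size=(trials, n))

var_n = samples.var(axis=1, ddof=0).mean()   # divide by n
var_n1 = samples.var(axis=1, ddof=1).mean()  # divide by n-1

print(f"True variance:          {true_var}")
print(f"Mean of ÷n estimates:   {var_n:.3f}")   # ≈ 3.2, biased low by a factor (n-1)/n
print(f"Mean of ÷(n-1) estimates: {var_n1:.3f}")  # ≈ 4.0, unbiased
```

Dividing by $n$ lands near $\sigma^2 (n-1)/n = 3.2$ on average, while dividing by $n-1$ recovers the true value.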
Standard Deviation¶
The standard deviation is simply the square root of the variance:
$$\sigma = \sqrt{\sigma^2} \quad \text{(population)} \qquad s = \sqrt{s^2} \quad \text{(sample)}$$
Why take the square root? Because variance is in squared units (if your data is in meters, variance is in meters$^2$). The standard deviation brings us back to the same units as the original data, making it much easier to interpret.
Worked Example¶
Let's compute variance and standard deviation step by step for $\{2, 4, 6\}$.
Step 1: Find the mean. $$\bar{x} = \dfrac{2 + 4 + 6}{3} = \dfrac{12}{3} = 4$$
Step 2: Find each point's difference from the mean.
- $x_1 - \bar{x} = 2 - 4 = -2$
- $x_2 - \bar{x} = 4 - 4 = 0$
- $x_3 - \bar{x} = 6 - 4 = 2$
Step 3: Square those differences.
- $(-2)^2 = 4$
- $(0)^2 = 0$
- $(2)^2 = 4$
Step 4: Average the squared differences.
- Population variance (divide by $n = 3$): $\sigma^2 = \dfrac{4 + 0 + 4}{3} = \dfrac{8}{3} \approx 2.67$
- Sample variance (divide by $n - 1 = 2$): $s^2 = \dfrac{4 + 0 + 4}{2} = \dfrac{8}{2} = 4$
Step 5: Take the square root.
- Population std dev: $\sigma = \sqrt{2.67} \approx 1.63$
- Sample std dev: $s = \sqrt{4} = 2$
# --- Manual step-by-step computation ---
data = np.array([2, 4, 6])
# Step 1: Compute the mean
mean = np.sum(data) / len(data)
print(f"Data: {data}")
print(f"Step 1 — Mean: {mean}")
# Step 2: Differences from the mean
differences = data - mean
print(f"Step 2 — Differences from mean: {differences}")
# Step 3: Squared differences
squared_diffs = differences ** 2
print(f"Step 3 — Squared differences: {squared_diffs}")
# Step 4: Average the squared differences
pop_variance = np.sum(squared_diffs) / len(data) # divide by n (population)
sample_variance = np.sum(squared_diffs) / (len(data) - 1) # divide by n-1 (sample)
print(f"Step 4 — Population variance (÷{len(data)}): {pop_variance:.4f}")
print(f" Sample variance (÷{len(data)-1}): {sample_variance:.4f}")
# Step 5: Square root for standard deviation
pop_std = np.sqrt(pop_variance)
sample_std = np.sqrt(sample_variance)
print(f"Step 5 — Population std dev: {pop_std:.4f}")
print(f" Sample std dev: {sample_std:.4f}")
# --- Verify with NumPy ---
print(f"\n--- Verification with NumPy ---")
print(f"np.var(data, ddof=0) = {np.var(data, ddof=0):.4f} (population, matches {pop_variance:.4f})")
print(f"np.var(data, ddof=1) = {np.var(data, ddof=1):.4f} (sample, matches {sample_variance:.4f})")
print(f"np.std(data, ddof=0) = {np.std(data, ddof=0):.4f} (population, matches {pop_std:.4f})")
print(f"np.std(data, ddof=1) = {np.std(data, ddof=1):.4f} (sample, matches {sample_std:.4f})")
# Visualization: Two datasets with the same mean but different spreads
np.random.seed(42)
# Both datasets have mean ≈ 50, but very different standard deviations
narrow_data = np.random.normal(loc=50, scale=5, size=1000) # std = 5 (tightly clustered)
wide_data = np.random.normal(loc=50, scale=20, size=1000) # std = 20 (widely spread)
fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
# Left plot: narrow spread
axes[0].hist(narrow_data, bins=30, color='steelblue', edgecolor='white', alpha=0.8)
axes[0].axvline(np.mean(narrow_data), color='red', linestyle='--', linewidth=2, label=f'Mean = {np.mean(narrow_data):.1f}')
axes[0].set_title(f'Narrow Spread (std = {np.std(narrow_data):.1f})', fontsize=13)
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Count')
axes[0].legend(fontsize=11)
axes[0].set_xlim(-20, 120)
# Right plot: wide spread
axes[1].hist(wide_data, bins=30, color='coral', edgecolor='white', alpha=0.8)
axes[1].axvline(np.mean(wide_data), color='red', linestyle='--', linewidth=2, label=f'Mean = {np.mean(wide_data):.1f}')
axes[1].set_title(f'Wide Spread (std = {np.std(wide_data):.1f})', fontsize=13)
axes[1].set_xlabel('Value')
axes[1].legend(fontsize=11)
axes[1].set_xlim(-20, 120)
plt.suptitle('Same Mean, Different Spreads', fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
print(f"Narrow: mean = {np.mean(narrow_data):.2f}, std = {np.std(narrow_data):.2f}")
print(f"Wide: mean = {np.mean(wide_data):.2f}, std = {np.std(wide_data):.2f}")
print("\nNotice: both datasets have the same mean, but very different shapes!")
print("The standard deviation captures this difference.")
2. Covariance and Correlation — How Features Relate¶
So far we've looked at one variable at a time. But in machine learning, we almost always have multiple features. We need to understand how they relate to each other.
For example:
- Do taller people tend to weigh more? (height and weight)
- Do students who study more get higher grades? (study hours and test scores)
- Does increasing the price decrease demand? (price and demand)
Covariance — Do Two Variables Move Together?¶
When we have two features (like height and weight), we want to know: when one goes up, does the other tend to go up too?
Covariance measures this joint variability:
$$\mathrm{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)]$$
For a sample of data:
$$\mathrm{Cov}(X, Y) = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
Let's build the intuition for why this formula works. For each data point $(x_i, y_i)$, we compute the product $(x_i - \bar{x})(y_i - \bar{y})$:
| $X$ relative to $\bar{x}$ | $Y$ relative to $\bar{y}$ | Product | Contribution |
|---|---|---|---|
| Above (positive) | Above (positive) | Positive | Pushes covariance up |
| Below (negative) | Below (negative) | Positive | Pushes covariance up |
| Above (positive) | Below (negative) | Negative | Pushes covariance down |
| Below (negative) | Above (positive) | Negative | Pushes covariance down |
So:
- Positive covariance: $X$ and $Y$ tend to move together (height and weight)
- Negative covariance: $X$ and $Y$ tend to move in opposite directions (price and demand)
- Zero covariance: no consistent linear relationship
The Problem with Covariance¶
Covariance tells us the direction of the relationship, but its magnitude depends on the scales of the variables. If you measure height in centimeters vs. meters, you get completely different covariance values, even though the relationship hasn't changed. This makes covariance hard to interpret on its own.
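A quick sketch of the scale problem: the same synthetic height data measured in centimeters and then in meters gives covariances with weight that differ by a factor of 100, even though the relationship is identical.

```python
import numpy as np

rng = np.random.default_rng(42)
height_cm = rng.normal(170, 8, 500)                 # heights in cm
weight = 0.8 * height_cm + rng.normal(0, 5, 500)    # correlated weights
height_m = height_cm / 100                          # same data, different units

cov_cm = np.cov(height_cm, weight)[0, 1]
cov_m = np.cov(height_m, weight)[0, 1]
corr_cm = np.corrcoef(height_cm, weight)[0, 1]
corr_m = np.corrcoef(height_m, weight)[0, 1]

print(f"Cov (cm): {cov_cm:.2f}   Cov (m): {cov_m:.4f}")    # differ by a factor of 100
print(f"Corr (cm): {corr_cm:.4f} Corr (m): {corr_m:.4f}")  # identical
```

Note that the correlation is unaffected by the unit change, which is exactly the fix described next.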
Pearson Correlation — Standardized Covariance¶
The Pearson correlation coefficient fixes the scaling problem by dividing covariance by the product of the standard deviations:
$$\rho_{X,Y} = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
Let's break this down:
- $\rho_{X,Y}$ (the Greek letter "rho"): the correlation coefficient
- $\mathrm{Cov}(X, Y)$: the covariance we just learned
- $\sigma_X$: standard deviation of $X$
- $\sigma_Y$: standard deviation of $Y$
Think of it this way: dividing by $\sigma_X \sigma_Y$ is like converting both variables to a common scale, so the result is always between $-1$ and $1$:
| Value of $\rho$ | Interpretation |
|:---:|:---|
| $\rho = 1$ | Perfect positive linear relationship — points lie exactly on an upward line |
| $\rho = -1$ | Perfect negative linear relationship — points lie exactly on a downward line |
| $\rho = 0$ | No linear relationship (but there might be a non-linear one!) |
| $\|\rho\| > 0.7$ | Strong correlation |
| $0.3 < \|\rho\| < 0.7$ | Moderate correlation |
| $\|\rho\| < 0.3$ | Weak correlation |
Important caveat: Correlation measures linear relationships only. Two variables can have $\rho = 0$ and still be strongly related in a non-linear way (like $Y = X^2$).
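Here's a quick sketch of that caveat: $Y = X^2$ with $X$ symmetric about zero is completely determined by $X$, yet its Pearson correlation with $X$ comes out near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 10_000)  # symmetric around 0
y = x ** 2                    # perfectly dependent on x, but not linearly

rho = np.corrcoef(x, y)[0, 1]
print(f"ρ(X, X²) = {rho:.3f}")  # close to 0 despite perfect dependence
```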
The Covariance Matrix¶
When you have many features (say $d$ features), you can compute the covariance between every pair and organize them in a $d \times d$ matrix:
$$\mathbf{C} = \dfrac{1}{n-1}(\mathbf{X} - \bar{\mathbf{X}})^T(\mathbf{X} - \bar{\mathbf{X}})$$
where:
- $\mathbf{X}$ is the $n \times d$ data matrix ($n$ samples, $d$ features)
- $\bar{\mathbf{X}}$ is a matrix where each column is filled with that column's mean
- The superscript $T$ means transpose
The structure of the covariance matrix is:
$$\mathbf{C} = \begin{pmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_d) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) & \cdots & \mathrm{Cov}(X_2, X_d) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_d, X_1) & \mathrm{Cov}(X_d, X_2) & \cdots & \mathrm{Var}(X_d) \end{pmatrix}$$
Notice three important properties:
- Diagonal entries = variance of each feature ($\mathrm{Cov}(X_i, X_i) = \mathrm{Var}(X_i)$)
- Off-diagonal entries = covariance between pairs of features
- The matrix is always symmetric: $\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$
The covariance matrix is foundational in ML — it's central to Principal Component Analysis (PCA), Gaussian distributions, and linear discriminant analysis.
# Generate correlated data: height (cm) and weight (kg)
np.random.seed(42)
n = 100
# Heights: mean 170 cm, std 8 cm
height = np.random.normal(170, 8, n)
# Weights: correlated with height (taller people tend to weigh more)
# weight ≈ 0.8 * height + noise
weight = 0.8 * height + np.random.normal(0, 5, n)
# --- Manual computation of covariance ---
mean_h = np.mean(height) # mean of heights
mean_w = np.mean(weight) # mean of weights
# Differences from their respective means
diff_h = height - mean_h
diff_w = weight - mean_w
# Covariance: average of the products of differences
cov_manual = np.sum(diff_h * diff_w) / (n - 1)
# Standard deviations
std_h = np.std(height, ddof=1) # sample std
std_w = np.std(weight, ddof=1) # sample std
# Correlation: covariance divided by product of standard deviations
corr_manual = cov_manual / (std_h * std_w)
print(f"=== Manual Computation ===")
print(f"Covariance(height, weight) = {cov_manual:.4f}")
print(f"Correlation(height, weight) = {corr_manual:.4f}")
# --- Verify with NumPy ---
# np.cov returns a 2x2 matrix; [0,1] or [1,0] gives the covariance between the two
cov_numpy = np.cov(height, weight)[0, 1]
# np.corrcoef also returns a 2x2 matrix
corr_numpy = np.corrcoef(height, weight)[0, 1]
print(f"\n=== NumPy Verification ===")
print(f"np.cov: {cov_numpy:.4f} (matches manual: {np.isclose(cov_manual, cov_numpy)})")
print(f"np.corrcoef: {corr_numpy:.4f} (matches manual: {np.isclose(corr_manual, corr_numpy)})")
print(f"\nInterpretation: ρ = {corr_manual:.2f} → strong positive correlation.")
print("Taller people do tend to weigh more in this data!")
# Scatter plots showing positive, negative, and no correlation
np.random.seed(42)
n = 200
# Generate three types of relationships
x = np.random.normal(0, 1, n)
y_positive = 0.9 * x + np.random.normal(0, 0.4, n) # strong positive correlation
y_negative = -0.7 * x + np.random.normal(0, 0.5, n) # moderate negative correlation
y_none = np.random.normal(0, 1, n) # no correlation
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
datasets = [
(x, y_positive, 'Positive Correlation', 'steelblue'),
(x, y_negative, 'Negative Correlation', 'coral'),
(x, y_none, 'No Correlation', 'mediumseagreen'),
]
for ax, (xi, yi, title, color) in zip(axes, datasets):
rho = np.corrcoef(xi, yi)[0, 1] # compute Pearson correlation
ax.scatter(xi, yi, alpha=0.5, s=20, color=color)
ax.set_title(f"{title}\nρ = {rho:.2f}", fontsize=13)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.axhline(0, color='gray', linewidth=0.5)
ax.axvline(0, color='gray', linewidth=0.5)
plt.suptitle('Correlation: How Two Variables Relate', fontsize=15, fontweight='bold', y=1.03)
plt.tight_layout()
plt.show()
# Compute the covariance matrix manually and compare with np.cov
np.random.seed(42)
# Create a dataset with 3 features and 5 samples
# Each row is a sample, each column is a feature
X = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[2, 4, 6],
[5, 3, 7]
], dtype=float)
n_samples, n_features = X.shape
print(f"Data matrix X ({n_samples} samples × {n_features} features):")
print(X)
# Step 1: Compute the mean of each feature (column)
means = np.mean(X, axis=0) # axis=0 → average down the rows → one mean per column
print(f"\nMean of each feature: {means}")
# Step 2: Center the data (subtract column means)
X_centered = X - means # broadcasting: each row has the means subtracted
print(f"\nCentered data (X - mean):")
print(X_centered)
# Step 3: Compute covariance matrix using the formula: C = (1/(n-1)) * X_centered^T @ X_centered
cov_manual = (1 / (n_samples - 1)) * (X_centered.T @ X_centered)
print(f"\nCovariance matrix (manual):")
print(np.round(cov_manual, 4))
# Step 4: Verify with NumPy (rowvar=False tells np.cov that columns are variables)
cov_numpy = np.cov(X, rowvar=False)
print(f"\nCovariance matrix (np.cov):")
print(np.round(cov_numpy, 4))
# Check they match
print(f"\nMatrices match: {np.allclose(cov_manual, cov_numpy)}")
# Observe the properties
print(f"\n--- Properties ---")
print(f"Diagonal (variances): {np.diag(cov_manual).round(4)}")
print(f"Symmetric? C[0,1] = {cov_manual[0,1]:.4f}, C[1,0] = {cov_manual[1,0]:.4f} → equal!")
3. Maximum Likelihood Estimation — Finding the Best Parameters¶
Now we shift from describing data to learning from data. Given some observations, how do we find the best parameters for a model?
This is one of the most important ideas in statistics and machine learning.
The Intuition — A Coin Flipping Example¶
Suppose you flip a coin 10 times and get 7 heads and 3 tails. What's the probability of heads, $p$?
Your intuition probably says $p = 7/10 = 0.7$. But why? Let's think about it more carefully.
Imagine testing different values of $p$:
- If $p = 0.5$: the probability of getting exactly 7 heads out of 10 flips is $\binom{10}{7} (0.5)^7 (0.5)^3 = 0.117$
- If $p = 0.7$: the probability is $\binom{10}{7} (0.7)^7 (0.3)^3 = 0.267$
- If $p = 0.9$: the probability is $\binom{10}{7} (0.9)^7 (0.1)^3 = 0.057$
Notice that $p = 0.7$ makes the observed data most probable. That's the key idea!
Maximum Likelihood Estimation (MLE) formalizes this intuition: find the parameter values that make the observed data most likely to have occurred.
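The three likelihood values above can be reproduced with a few lines, and a simple grid search confirms that the likelihood peaks at $p = 0.7$ (a sketch using the binomial formula directly):

```python
import numpy as np
from math import comb

heads, flips = 7, 10

def likelihood(p):
    """P(exactly 7 heads in 10 flips | p) under a binomial model."""
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

for p in [0.5, 0.7, 0.9]:
    print(f"p = {p}: likelihood = {likelihood(p):.3f}")

# Sweep a fine grid of p values: the maximum lands at p = heads/flips = 0.7
grid = np.linspace(0.01, 0.99, 981)
p_best = grid[np.argmax([likelihood(p) for p in grid])]
print(f"Grid-search MLE: {p_best:.2f}")
```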
The Formal Framework¶
Given observed data $x_1, x_2, \ldots, x_n$ and a probability model with parameter(s) $\theta$, the likelihood function measures how probable the data is for a given parameter value:
$$\mathcal{L}(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
Let's break this down:
- $\mathcal{L}(\theta)$: the likelihood — a function of the parameter $\theta$
- $\prod_{i=1}^{n}$: like $\sum$ but for multiplication — multiply together $n$ terms
- $f(x_i \mid \theta)$: the probability (or density) of observing $x_i$, given parameter $\theta$
We assume data points are independent, so the joint probability is the product of individual probabilities.
Why Use the Log-Likelihood?¶
Multiplying many small probabilities together leads to tiny numbers that computers can't handle well (underflow). Instead, we take the logarithm, which turns products into sums:
$$\ell(\theta) = \ln \mathcal{L}(\theta) = \sum_{i=1}^{n} \ln f(x_i \mid \theta)$$
Since $\ln$ is a strictly increasing function, maximizing $\ell(\theta)$ gives the same answer as maximizing $\mathcal{L}(\theta)$.
The MLE estimate $\hat{\theta}$ is:
$$\hat{\theta} = \arg\max_{\theta} \ell(\theta)$$
Read this as: "$\hat{\theta}$ is the value of $\theta$ that maximizes the log-likelihood."
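Underflow is easy to demonstrate. In this sketch (with arbitrary per-point probabilities), the product of 1,000 probabilities collapses to zero in float64, while the sum of their logs stays perfectly representable:

```python
import numpy as np

rng = np.random.default_rng(1)
probs = rng.uniform(0.01, 0.5, size=1000)  # 1,000 per-point probabilities

product = np.prod(probs)         # underflows to exactly 0.0
log_sum = np.sum(np.log(probs))  # an ordinary (large negative) float

print(f"Product of probabilities: {product}")   # 0.0 (underflow)
print(f"Sum of log-probabilities: {log_sum:.2f}")
```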
MLE for the Gaussian Distribution¶
This is the most important example. Suppose our data comes from a Gaussian (normal) distribution $\mathcal{N}(\mu, \sigma^2)$, and we want to find the best $\mu$ and $\sigma^2$.
The density function is: $$f(x_i \mid \mu, \sigma^2) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\dfrac{(x_i - \mu)^2}{2\sigma^2}\right)$$
Taking the log-likelihood and setting derivatives to zero (the calculus details are in many textbooks), the MLE estimates turn out to be beautifully simple:
MLE for the mean: $$\hat{\mu} = \bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$$
It's just the sample mean — exactly what you'd expect!
MLE for the variance: $$\hat{\sigma}^2 = \dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
Notice this divides by $n$, not $n-1$. The MLE estimate of variance is slightly biased (it tends to underestimate the true variance), but it's the value that maximizes the likelihood.
Why Does This Matter for ML?¶
Here's a beautiful connection: when you minimize Mean Squared Error (MSE) loss in linear regression, you're implicitly doing MLE under the assumption that errors follow a Gaussian distribution.
Think of it this way:
- Assume your model's prediction errors are Gaussian: $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$
- The log-likelihood of the data is proportional to $-\sum_i (y_i - \hat{y}_i)^2$
- Maximizing this = minimizing $\sum_i (y_i - \hat{y}_i)^2$ = minimizing MSE!
So MSE isn't just an arbitrary choice — it's the statistically principled loss function when you assume Gaussian errors.
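A numeric sketch of this equivalence, using made-up toy data and a one-parameter model $\hat{y} = wx$ (both are illustrative assumptions, not part of the derivation above): minimizing MSE over a grid of $w$ values and maximizing the Gaussian log-likelihood over the same grid pick out the same $w$.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 50)
y = 3.0 * x + rng.normal(0, 1, 50)  # true slope 3, Gaussian noise

w_grid = np.linspace(2.0, 4.0, 2001)
mse = np.array([np.mean((y - w * x) ** 2) for w in w_grid])
# Gaussian log-likelihood with σ fixed at 1, additive constants dropped
loglik = np.array([-0.5 * np.sum((y - w * x) ** 2) for w in w_grid])

w_mse = w_grid[np.argmin(mse)]
w_mle = w_grid[np.argmax(loglik)]
print(f"argmin MSE:            w = {w_mse:.3f}")
print(f"argmax log-likelihood: w = {w_mle:.3f}")  # the same w
```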
# Generate Gaussian data with KNOWN parameters, then estimate them with MLE
np.random.seed(42)
# True parameters (pretend we don't know these)
true_mu = 5.0
true_sigma = 2.0
# Generate 200 data points from N(5, 2²)
data = np.random.normal(loc=true_mu, scale=true_sigma, size=200)
# MLE estimates
mu_hat = np.mean(data) # MLE for mean = sample mean
sigma2_hat = np.mean((data - mu_hat) ** 2) # MLE for variance (divides by n, not n-1)
sigma_hat = np.sqrt(sigma2_hat) # MLE for std dev
# Compare with true values
print(f"True parameters: μ = {true_mu}, σ = {true_sigma}")
print(f"MLE estimates: μ̂ = {mu_hat:.4f}, σ̂ = {sigma_hat:.4f}")
print(f"\nThe estimates are close to the true values!")
print(f"With more data, they would converge even closer.")
# Show the difference between MLE variance (÷n) and unbiased variance (÷(n-1))
var_mle = np.var(data, ddof=0) # MLE: divides by n
var_unbiased = np.var(data, ddof=1) # unbiased: divides by n-1
print(f"\nMLE variance (÷n): {var_mle:.4f}")
print(f"Unbiased variance (÷n-1): {var_unbiased:.4f}")
print(f"True variance: {true_sigma**2:.4f}")
print(f"\nFor n={len(data)}, the difference is small: {abs(var_mle - var_unbiased):.4f}")
# Contour plot of the log-likelihood surface over (μ, σ) space
# Define a grid of μ and σ values to evaluate
mu_range = np.linspace(3.5, 6.5, 200) # range of possible μ values
sigma_range = np.linspace(1.0, 3.5, 200) # range of possible σ values
MU, SIGMA = np.meshgrid(mu_range, sigma_range) # 2D grid
# Compute log-likelihood at each (μ, σ) point
# ℓ(μ, σ) = -n/2 * ln(2π) - n * ln(σ) - 1/(2σ²) * Σ(xᵢ - μ)²
n = len(data)
LL = np.zeros_like(MU) # log-likelihood values
for i in range(MU.shape[0]):
for j in range(MU.shape[1]):
mu_val = MU[i, j]
sigma_val = SIGMA[i, j]
# log-likelihood formula for Gaussian
LL[i, j] = (-n / 2) * np.log(2 * np.pi) - n * np.log(sigma_val) \
- np.sum((data - mu_val) ** 2) / (2 * sigma_val ** 2)
# Plot the contour
fig, ax = plt.subplots(figsize=(8, 6))
contour = ax.contourf(MU, SIGMA, LL, levels=50, cmap='viridis') # filled contours
ax.contour(MU, SIGMA, LL, levels=15, colors='white', linewidths=0.5, alpha=0.5) # contour lines
plt.colorbar(contour, ax=ax, label='Log-Likelihood')
# Mark the MLE estimate with a star
ax.plot(mu_hat, sigma_hat, 'r*', markersize=20, label=f'MLE: μ̂={mu_hat:.2f}, σ̂={sigma_hat:.2f}')
# Mark the true values with a triangle
ax.plot(true_mu, true_sigma, 'w^', markersize=12, label=f'True: μ={true_mu}, σ={true_sigma}')
ax.set_xlabel('μ (mean)', fontsize=12)
ax.set_ylabel('σ (standard deviation)', fontsize=12)
ax.set_title('Log-Likelihood Surface for Gaussian Parameters', fontsize=14)
ax.legend(fontsize=11, loc='upper right')
plt.tight_layout()
plt.show()
print("The MLE estimate (red star) sits at the peak of the log-likelihood surface.")
print("This is the (μ, σ) pair that makes our observed data MOST probable.")
4. Information Theory — Measuring Uncertainty¶
Information theory was invented by Claude Shannon in 1948 to study communication, but it turns out to be deeply connected to machine learning. The key concepts — entropy, cross-entropy, and KL divergence — directly give rise to the loss functions used in classification.
Entropy — How Uncertain Are We?¶
Think of entropy as a measure of "surprise" or "uncertainty" in a distribution. The more uncertain you are about the outcome, the higher the entropy.
Let's build intuition with coin flips:
| Coin | P(Heads) | P(Tails) | How uncertain? | Entropy |
|---|---|---|---|---|
| Fair coin | 0.50 | 0.50 | Very — could go either way | HIGH |
| Biased coin | 0.90 | 0.10 | Fairly confident it's heads | LOW |
| Two-headed coin | 1.00 | 0.00 | No uncertainty at all | ZERO |
Think of it this way: if someone tells you the result of a fair coin flip, you learn a lot (you were very uncertain). If someone tells you the result of a two-headed coin flip, you learn nothing (you already knew it would be heads). Entropy quantifies this "information content."
The Entropy Formula¶
$$H(X) = -\sum_{i} P(x_i) \ln P(x_i)$$
Let's break down every part:
- $H(X)$: the entropy of random variable $X$
- $P(x_i)$: the probability of outcome $x_i$
- $\ln P(x_i)$: the natural logarithm (log base $e$) of that probability
- The negative sign: since $\ln$ of a number between 0 and 1 is always negative, the minus sign makes entropy positive
- The sum $\sum_{i}$: add up over all possible outcomes
Worked Example — Fair Coin¶
$P(\text{heads}) = 0.5$, $P(\text{tails}) = 0.5$
$$H = -[0.5 \ln(0.5) + 0.5 \ln(0.5)]$$ $$= -[0.5 \times (-0.693) + 0.5 \times (-0.693)]$$ $$= -[-0.347 - 0.347]$$ $$= 0.693 \text{ nats}$$
(When using $\ln$, entropy is measured in "nats". When using $\log_2$, it's measured in "bits".)
Worked Example — Biased Coin (90/10)¶
$P(\text{heads}) = 0.9$, $P(\text{tails}) = 0.1$
$$H = -[0.9 \ln(0.9) + 0.1 \ln(0.1)]$$ $$= -[0.9 \times (-0.105) + 0.1 \times (-2.303)]$$ $$= -[-0.095 - 0.230]$$ $$= 0.325 \text{ nats}$$
Lower entropy = less uncertainty = more predictable.
Notice: $0.325 < 0.693$ — the biased coin has less entropy than the fair coin, confirming our intuition.
Cross-Entropy — How Good Are Our Predictions?¶
Now the key connection to machine learning. Suppose:
- $p$ is the TRUE distribution (actual labels)
- $q$ is the PREDICTED distribution (model's predictions)
Cross-entropy measures how well $q$ approximates $p$:
$$H(p, q) = -\sum_{i} p(x_i) \ln q(x_i)$$
Notice the key difference from entropy: we use probabilities from $p$ (the truth) but take the log of $q$ (our predictions).
Key insight: Cross-entropy is minimized when $q = p$ — when our predictions perfectly match reality. In fact, the minimum value of cross-entropy equals the entropy of $p$:
$$H(p, q) \geq H(p)$$
So minimizing cross-entropy = making better predictions = getting $q$ closer to $p$.
Binary Cross-Entropy (Log Loss)¶
For binary classification (two classes: 0 or 1), the cross-entropy simplifies to the log loss (binary cross-entropy):
$$L = -[y \ln(\hat{y}) + (1 - y) \ln(1 - \hat{y})]$$
where:
- $y$ is the true label (0 or 1)
- $\hat{y}$ is the predicted probability of class 1
Let's see why this makes sense:
- If the true label is $y = 1$ and we predict $\hat{y} = 0.9$: $L = -\ln(0.9) = 0.105$ (small loss — good!)
- If the true label is $y = 1$ and we predict $\hat{y} = 0.1$: $L = -\ln(0.1) = 2.303$ (large loss — bad!)
- If the true label is $y = 0$ and we predict $\hat{y} = 0.1$: $L = -\ln(0.9) = 0.105$ (small loss — good!)
This is the loss function used in logistic regression and many neural networks for classification.
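The three cases can be verified with a tiny helper (a sketch; clipping $\hat{y}$ away from 0 and 1 is a standard numerical guard, not part of the formula):

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy for a single prediction."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid ln(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(f"y=1, ŷ=0.9 → L = {log_loss(1, 0.9):.3f}")  # 0.105, small loss
print(f"y=1, ŷ=0.1 → L = {log_loss(1, 0.1):.3f}")  # 2.303, large loss
print(f"y=0, ŷ=0.1 → L = {log_loss(0, 0.1):.3f}")  # 0.105, small loss
```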
KL Divergence — How Different Are Two Distributions?¶
The Kullback-Leibler (KL) divergence measures the "distance" between two probability distributions:
$$D_{KL}(p \| q) = \sum_{i} p(x_i) \ln \dfrac{p(x_i)}{q(x_i)}$$
This can also be written as:
$$D_{KL}(p \| q) = H(p, q) - H(p)$$
In other words: KL divergence = cross-entropy minus entropy. It measures the extra cost of using distribution $q$ instead of the true distribution $p$.
Think of it this way: if you design a communication system optimized for distribution $q$, but the actual data follows distribution $p$, the KL divergence tells you how many extra nats (or bits) you waste per message.
Properties of KL Divergence¶
- Non-negative: $D_{KL}(p \| q) \geq 0$ — always
- Zero iff equal: $D_{KL}(p \| q) = 0$ if and only if $p = q$
- NOT symmetric: $D_{KL}(p \| q) \neq D_{KL}(q \| p)$ in general
Because it's not symmetric, KL divergence isn't a true "distance" (despite sometimes being called "KL distance"). The notation $D_{KL}(p \| q)$ is often read as "the KL divergence from $q$ to $p$": it measures how much the approximation $q$ diverges from the true distribution $p$.
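A self-contained sketch of the non-symmetry, using two arbitrary example distributions:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions (assumes q > 0 wherever p > 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # 0 * ln(0/q) = 0 by convention
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.7, 0.2, 0.1]
q = [0.3, 0.3, 0.4]
print(f"D_KL(p || q) = {kl(p, q):.4f}")
print(f"D_KL(q || p) = {kl(q, p):.4f}")  # different value: not symmetric
print(f"D_KL(p || p) = {kl(p, p):.4f}")  # zero when the distributions match
```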
Why Does This Matter for ML?¶
Since $H(p)$ is constant (the true distribution doesn't change during training), minimizing cross-entropy $H(p, q)$ is the same as minimizing KL divergence $D_{KL}(p \| q)$. So when you train a classifier with cross-entropy loss, you're really minimizing the KL divergence between the true label distribution and your model's predicted distribution.
def entropy(p):
"""
Compute the entropy of a discrete probability distribution.
H(p) = -Σ p(x) * ln(p(x))
We handle p(x) = 0 by defining 0 * ln(0) = 0 (the mathematical limit).
"""
p = np.array(p, dtype=float)
# Filter out zero probabilities to avoid ln(0)
mask = p > 0
return -np.sum(p[mask] * np.log(p[mask]))
def cross_entropy(p, q):
"""
Compute the cross-entropy between two distributions.
H(p, q) = -Σ p(x) * ln(q(x))
"""
p = np.array(p, dtype=float)
q = np.array(q, dtype=float)
# Only sum where p > 0 (0 * ln(q) = 0 by convention)
mask = p > 0
return -np.sum(p[mask] * np.log(q[mask]))
def kl_divergence(p, q):
"""
Compute the KL divergence D_KL(p || q).
D_KL(p || q) = Σ p(x) * ln(p(x) / q(x)) = H(p, q) - H(p)
"""
return cross_entropy(p, q) - entropy(p)
# --- Test on coins ---
print("=== Entropy of different coins ===")
fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
very_biased = [0.99, 0.01]
print(f"Fair coin (50/50): H = {entropy(fair_coin):.4f} nats")
print(f"Biased coin (90/10): H = {entropy(biased_coin):.4f} nats")
print(f"Very biased coin (99/1): H = {entropy(very_biased):.4f} nats")
print(f"Two-headed coin (100/0): H = {entropy([1.0, 0.0]):.4f} nats")
# --- Cross-entropy and KL divergence ---
print(f"\n=== Cross-Entropy and KL Divergence ===")
# True distribution
p = [0.7, 0.2, 0.1] # true distribution over 3 classes
# Good prediction (close to p)
q_good = [0.65, 0.25, 0.10]
# Bad prediction (far from p)
q_bad = [0.2, 0.3, 0.5]
print(f"True distribution p = {p}")
print(f"Entropy H(p) = {entropy(p):.4f}")
print(f"\nGood prediction q = {q_good}")
print(f" Cross-entropy H(p, q) = {cross_entropy(p, q_good):.4f}")
print(f" KL divergence D_KL(p || q) = {kl_divergence(p, q_good):.4f}")
print(f"\nBad prediction q = {q_bad}")
print(f" Cross-entropy H(p, q) = {cross_entropy(p, q_bad):.4f}")
print(f" KL divergence D_KL(p || q) = {kl_divergence(p, q_bad):.4f}")
# Verify: cross-entropy ≥ entropy, and equals entropy when p = q
print(f"\n=== Verification ===")
print(f"Cross-entropy when q = p: H(p, p) = {cross_entropy(p, p):.4f}")
print(f"Entropy: H(p) = {entropy(p):.4f}")
print(f"H(p, p) = H(p)? {np.isclose(cross_entropy(p, p), entropy(p))} (yes, minimum possible!)")
print(f"KL(p || p) = {kl_divergence(p, p):.4f} (zero, as expected)")
# Visualization: comparing true vs predicted distributions
p = np.array([0.5, 0.3, 0.2]) # true distribution
q_close = np.array([0.45, 0.35, 0.20]) # good approximation
q_far = np.array([0.1, 0.2, 0.7]) # poor approximation
categories = ['Class A', 'Class B', 'Class C']
x = np.arange(len(categories))
width = 0.3 # bar width
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
# Left: Good approximation
axes[0].bar(x - width/2, p, width, label='True p', color='steelblue', edgecolor='white')
axes[0].bar(x + width/2, q_close, width, label='Predicted q', color='coral', edgecolor='white')
axes[0].set_xticks(x)
axes[0].set_xticklabels(categories)
axes[0].set_ylabel('Probability')
axes[0].set_ylim(0, 0.8)
axes[0].legend(fontsize=11)
ce_close = cross_entropy(p, q_close)
kl_close = kl_divergence(p, q_close)
axes[0].set_title(f'Good Prediction\nH(p,q) = {ce_close:.4f}, KL = {kl_close:.4f}', fontsize=12)
# Right: Poor approximation
axes[1].bar(x - width/2, p, width, label='True p', color='steelblue', edgecolor='white')
axes[1].bar(x + width/2, q_far, width, label='Predicted q', color='coral', edgecolor='white')
axes[1].set_xticks(x)
axes[1].set_xticklabels(categories)
axes[1].set_ylabel('Probability')
axes[1].set_ylim(0, 0.8)
axes[1].legend(fontsize=11)
ce_far = cross_entropy(p, q_far)
kl_far = kl_divergence(p, q_far)
axes[1].set_title(f'Poor Prediction\nH(p,q) = {ce_far:.4f}, KL = {kl_far:.4f}', fontsize=12)
plt.suptitle('Cross-Entropy and KL Divergence: Good vs Poor Predictions',
fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
print(f"Entropy of true distribution H(p) = {entropy(p):.4f} (lower bound for cross-entropy)")
print(f"\nGood prediction: cross-entropy = {ce_close:.4f}, KL = {kl_close:.4f} (small extra cost)")
print(f"Poor prediction: cross-entropy = {ce_far:.4f}, KL = {kl_far:.4f} (large extra cost)")
print(f"\nTraining a classifier = minimizing cross-entropy = reducing KL divergence = getting q closer to p.")
Connecting It All to Machine Learning¶
Let's summarize the beautiful connections between statistics and ML:
| Statistical Concept | ML Application |
|---|---|
| Cross-entropy $H(p, q)$ | Loss function for classification (nn.CrossEntropyLoss in PyTorch) |
| Minimizing cross-entropy | Same as maximizing log-likelihood (same math, different sign!) |
| Minimizing MSE | Same as MLE under Gaussian noise assumptions |
| Entropy $H(p)$ | Lower bound on cross-entropy — no model's loss can go below $H(p)$ |
| KL divergence $D_{KL}(p \| q)$ | Measures how far your model's predictions are from perfect |
| Covariance matrix | Foundation for PCA, Gaussian models, feature analysis |
The key takeaway: loss functions in ML aren't arbitrary choices — they arise naturally from statistical principles. When you understand why we use cross-entropy or MSE, you can make better choices about which loss function to use for your specific problem.
| Loss Function | When to Use | Statistical Justification |
|---|---|---|
| MSE | Regression (continuous output) | MLE with Gaussian noise |
| Binary cross-entropy | Binary classification | MLE with Bernoulli distribution |
| Categorical cross-entropy | Multi-class classification | MLE with Categorical distribution |
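To make the MSE row concrete, here's a minimal sketch (with synthetic targets and predictions invented for illustration) showing that the Gaussian negative log-likelihood is just MSE scaled and shifted by constants — so both are minimized by the same predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(0, 1, 100)              # synthetic targets
y_pred = y_true + rng.normal(0, 0.1, 100)   # synthetic predictions with small errors

sigma = 1.0  # assumed Gaussian noise std
mse = np.mean((y_true - y_pred) ** 2)

# Average Gaussian negative log-likelihood per sample:
# -ln N(y | ŷ, σ²) = 0.5·ln(2πσ²) + (y − ŷ)² / (2σ²)
nll = 0.5 * np.log(2 * np.pi * sigma**2) + mse / (2 * sigma**2)

print(f"MSE                = {mse:.6f}")
print(f"Gaussian NLL       = {nll:.6f}")
print(f"NLL minus constant = {nll - 0.5 * np.log(2 * np.pi * sigma**2):.6f} = MSE / (2σ²)")
```

The constant term doesn't depend on the predictions, so gradient descent on the NLL and on the MSE moves the model in exactly the same direction.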
5. Check Your Understanding¶
Let's put everything together with two hands-on tasks that connect statistics to machine learning.
Task 1: From Covariance to Principal Components¶
Given a 2D dataset:
- Compute the covariance matrix
- Perform eigendecomposition (connecting back to your Linear Algebra notebook!)
- Identify the principal component direction — the direction of maximum variance
This is the heart of Principal Component Analysis (PCA), one of the most important dimensionality reduction techniques in ML.
# --- Task 1 Solution: PCA from scratch ---
np.random.seed(42)
# Step 1: Generate 2D correlated data
n = 200
# Start with uncorrelated data
x1 = np.random.normal(0, 2, n) # feature 1: std = 2
x2 = np.random.normal(0, 0.5, n) # feature 2: std = 0.5
# Rotate it by 30 degrees to create correlation
angle = np.radians(30)
rotation_matrix = np.array([
[np.cos(angle), -np.sin(angle)],
[np.sin(angle), np.cos(angle)]
])
data = np.column_stack([x1, x2]) @ rotation_matrix.T # apply rotation
print("Step 1: Generated 2D correlated data")
print(f" Shape: {data.shape} ({n} samples, 2 features)")
# Step 2: Compute the covariance matrix
cov_matrix = np.cov(data, rowvar=False) # rowvar=False: columns are variables
print(f"\nStep 2: Covariance matrix:")
print(f" {cov_matrix.round(4)}")
# Step 3: Eigendecomposition of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix) # eigh for symmetric matrices
# Sort by eigenvalue (largest first)
sort_idx = np.argsort(eigenvalues)[::-1] # indices that sort in descending order
eigenvalues = eigenvalues[sort_idx]
eigenvectors = eigenvectors[:, sort_idx]
print(f"\nStep 3: Eigendecomposition:")
print(f" Eigenvalue 1: {eigenvalues[0]:.4f} (variance along PC1)")
print(f" Eigenvalue 2: {eigenvalues[1]:.4f} (variance along PC2)")
print(f" Eigenvector 1 (PC1): {eigenvectors[:, 0].round(4)} — direction of MAXIMUM variance")
print(f" Eigenvector 2 (PC2): {eigenvectors[:, 1].round(4)} — direction of MINIMUM variance")
# Variance explained
total_var = np.sum(eigenvalues)
print(f"\n Variance explained by PC1: {eigenvalues[0]/total_var*100:.1f}%")
print(f" Variance explained by PC2: {eigenvalues[1]/total_var*100:.1f}%")
# Step 4: Visualize
fig, ax = plt.subplots(figsize=(8, 8))
# Plot the data points
ax.scatter(data[:, 0], data[:, 1], alpha=0.4, s=20, color='steelblue', label='Data')
# Plot eigenvectors as arrows, scaled by eigenvalues
origin = np.mean(data, axis=0) # center of the data
colors = ['red', 'orange']
labels = ['PC1 (max variance)', 'PC2 (min variance)']
for i in range(2):
# Scale arrow length by square root of eigenvalue for visual clarity
vec = eigenvectors[:, i] * np.sqrt(eigenvalues[i]) * 2
ax.annotate('', xy=origin + vec, xytext=origin,
arrowprops=dict(arrowstyle='->', color=colors[i], lw=3))
ax.annotate(f'{labels[i]}\nλ={eigenvalues[i]:.2f}',
xy=origin + vec * 1.1, fontsize=11, color=colors[i], fontweight='bold')
ax.set_xlabel('Feature 1', fontsize=12)
ax.set_ylabel('Feature 2', fontsize=12)
ax.set_title('PCA: Eigenvectors of the Covariance Matrix', fontsize=14)
ax.set_aspect('equal') # equal scaling so arrows look right
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nThe RED arrow (PC1) points along the direction of maximum spread in the data.")
print("If we had to reduce from 2D to 1D, we'd project onto this direction — that's PCA!")
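The projection step itself is a one-liner. Here's a minimal, self-contained sketch of the 2D→1D reduction (it regenerates its own small illustrative dataset rather than reusing the variables above):

```python
import numpy as np

rng = np.random.default_rng(42)
# Small correlated 2D dataset (illustrative)
x = rng.normal(0, 2, 100)
data = np.column_stack([x, 0.5 * x + rng.normal(0, 0.5, 100)])

# Center, compute covariance, eigendecompose
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, np.argmax(eigvals)]   # direction of maximum variance

# Project each 2D point onto PC1 — the 1D PCA representation
projected = centered @ pc1             # shape (100,)

print(f"Original shape: {data.shape}, reduced shape: {projected.shape}")
print(f"Variance of projection: {projected.var(ddof=1):.4f}")
print(f"Largest eigenvalue:     {eigvals.max():.4f}")
```

The variance of the projected points equals the largest eigenvalue — the eigendecomposition told us in advance exactly how much variance the 1D representation would keep.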
Task 2: Binary Cross-Entropy (Log Loss) from Scratch¶
Implement the binary cross-entropy loss function using only NumPy. Then compare your result with sklearn.metrics.log_loss.
Recall the formula:
$$L = -\dfrac{1}{n} \sum_{i=1}^{n} [y_i \ln(\hat{y}_i) + (1 - y_i) \ln(1 - \hat{y}_i)]$$
where:
- $y_i \in \{0, 1\}$: true label for sample $i$
- $\hat{y}_i \in (0, 1)$: predicted probability for sample $i$
- $n$: number of samples
from sklearn.metrics import log_loss # for comparison
def binary_cross_entropy(y_true, y_pred, eps=1e-15):
"""
Compute binary cross-entropy (log loss) from scratch.
Parameters:
y_true: array of true labels (0 or 1)
y_pred: array of predicted probabilities (between 0 and 1)
eps: small constant to avoid ln(0) which is -infinity
Returns:
The average binary cross-entropy loss
"""
y_true = np.array(y_true, dtype=float)
y_pred = np.array(y_pred, dtype=float)
# Clip predictions to avoid ln(0) — stay in (eps, 1-eps)
y_pred = np.clip(y_pred, eps, 1 - eps)
# Apply the formula: L = -1/n * Σ [y*ln(ŷ) + (1-y)*ln(1-ŷ)]
n = len(y_true)
loss = -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)) / n
return loss
# --- Test data ---
# True labels: 5 positive, 5 negative
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
# Good predictions (confident and correct)
y_pred_good = np.array([0.9, 0.8, 0.85, 0.95, 0.7, 0.1, 0.2, 0.05, 0.15, 0.1])
# Bad predictions (confident but WRONG)
y_pred_bad = np.array([0.2, 0.3, 0.1, 0.4, 0.15, 0.8, 0.9, 0.7, 0.85, 0.6])
# --- Our implementation ---
loss_good_ours = binary_cross_entropy(y_true, y_pred_good)
loss_bad_ours = binary_cross_entropy(y_true, y_pred_bad)
# --- sklearn's implementation ---
loss_good_sklearn = log_loss(y_true, y_pred_good)
loss_bad_sklearn = log_loss(y_true, y_pred_bad)
print("=== Good Predictions ===")
print(f" Our implementation: {loss_good_ours:.6f}")
print(f" sklearn log_loss: {loss_good_sklearn:.6f}")
print(f" Match: {np.isclose(loss_good_ours, loss_good_sklearn)}")
print(f"\n=== Bad Predictions ===")
print(f" Our implementation: {loss_bad_ours:.6f}")
print(f" sklearn log_loss: {loss_bad_sklearn:.6f}")
print(f" Match: {np.isclose(loss_bad_ours, loss_bad_sklearn)}")
print(f"\n=== Summary ===")
print(f"Good predictions loss: {loss_good_ours:.4f} (low — model is confident and correct)")
print(f"Bad predictions loss: {loss_bad_ours:.4f} (high — model is confident but WRONG)")
print(f"\nThe loss function heavily penalizes confident wrong predictions.")
print(f"This is what makes cross-entropy so effective for training classifiers!")
Summary¶
In this notebook, we covered the statistical foundations that underpin machine learning:
Descriptive Statistics — Mean, median, mode, variance, and standard deviation give us the tools to summarize datasets with a few key numbers.
Covariance and Correlation — Covariance tells us if two features move together; correlation normalizes this to $[-1, 1]$. The covariance matrix captures all pairwise relationships and is central to PCA.
Maximum Likelihood Estimation — MLE formalizes the intuitive idea of finding the parameters that make our data most probable. It connects directly to loss functions: MSE = MLE under Gaussian noise.
Information Theory — Entropy measures uncertainty, cross-entropy measures prediction quality, and KL divergence measures the gap between two distributions. Cross-entropy loss = the standard classification loss in deep learning.
With probability and statistics under your belt, you now have the mathematical foundation to understand why machine learning algorithms work the way they do — not just how to use them.