Linear Algebra for Machine Learning¶
Linear algebra is the branch of mathematics that deals with vectors, matrices, and linear transformations. If that sounds intimidating, don't worry — at its core, linear algebra is about organizing numbers into rows and columns and learning how to work with them efficiently.
Why does this matter for machine learning? Every dataset you'll ever work with is essentially a table of numbers — a matrix. Every data point is a list of numbers — a vector. And every operation a machine learning model performs (making predictions, learning from errors) boils down to matrix and vector operations. Understanding linear algebra gives you the language to understand what's really happening inside these models.
Prerequisites: Basic high school algebra (variables, equations, arithmetic). That's it.
import numpy as np
import matplotlib.pyplot as plt
1. Scalars, Vectors, and Matrices¶
Before we dive in, let's understand the three fundamental objects in linear algebra. Think of them as building blocks — each one builds on the previous.
Scalars¶
A scalar is just a single number. That's it. When we write $a = 5$, $a$ is a scalar.
You already know scalars from high school math — temperature (72°F), your test score (85), the price of a coffee ($4.50). In machine learning, a scalar might represent a single prediction or a single weight in a model.
Mathematically, we say a scalar belongs to the set of real numbers: $a \in \mathbb{R}$. The symbol $\in$ means "belongs to" and $\mathbb{R}$ represents all real numbers (any number on the number line).
Vectors¶
A vector is an ordered list of numbers. Think of it as a row in a spreadsheet or a list of measurements.
For example, if you measure a house's square footage (1500), number of bedrooms (3), and age in years (20), you could represent this house as a vector:
$$\mathbf{x} = \begin{bmatrix} 1500 \\ 3 \\ 20 \end{bmatrix}$$
Here's what the notation means:
- $\mathbf{x}$ — we use bold lowercase letters for vectors
- The numbers stacked vertically form a column vector
- This vector has 3 entries, so we say $\mathbf{x} \in \mathbb{R}^3$ (a vector with 3 real numbers)
- More generally, a vector with $n$ entries is written as $\mathbf{x} \in \mathbb{R}^n$
Each individual entry is accessed by its position (called an index): $x_1 = 1500$, $x_2 = 3$, $x_3 = 20$.
In machine learning, each entry in a vector is called a feature. A single house with 3 measured properties is a vector with 3 features.
# Creating vectors in NumPy
# A vector is just a 1D array of numbers
x = np.array([1500, 3, 20])
print("Vector x:", x)
print("Number of elements:", x.shape[0]) # shape tells us the size
print("Dimensions:", x.ndim) # 1D = vector
# Accessing individual elements (Python uses 0-based indexing)
print("\nFirst element (x[0]):", x[0]) # 1500 (square footage)
print("Second element (x[1]):", x[1]) # 3 (bedrooms)
print("Third element (x[2]):", x[2]) # 20 (age)
# Column vector — reshape to (3, 1) for a column
x_column = x.reshape(-1, 1) # -1 means "figure out this dimension automatically"
print("\nColumn vector:")
print(x_column)
print("Shape:", x_column.shape) # (3, 1) means 3 rows, 1 column
Matrices¶
A matrix is a rectangular grid of numbers arranged in rows and columns. Think of it as a complete spreadsheet or table.
If you have data on 4 houses, each with 3 features (square footage, bedrooms, age), you'd arrange them in a matrix:
$$\mathbf{A} = \begin{bmatrix} 1500 & 3 & 20 \\ 1200 & 2 & 35 \\ 1800 & 4 & 10 \\ 900 & 1 & 50 \end{bmatrix}$$
Here's what the notation means:
- $\mathbf{A}$ — we use bold uppercase letters for matrices
- This matrix has 4 rows and 3 columns, so we write $\mathbf{A} \in \mathbb{R}^{4 \times 3}$
- More generally, a matrix with $m$ rows and $n$ columns is $\mathbf{A} \in \mathbb{R}^{m \times n}$
- Individual entries are written as $a_{ij}$ where $i$ is the row and $j$ is the column
- For example, $a_{2,3} = 35$ (row 2, column 3 — the age of the second house)
In machine learning, a dataset is almost always a matrix where:
- Each row is one data point (one house, one patient, one image)
- Each column is one feature (one measurement, one attribute)
# Creating a matrix in NumPy — a 2D array
A = np.array([
[1500, 3, 20],
[1200, 2, 35],
[1800, 4, 10],
[900, 1, 50]
])
print("Matrix A:")
print(A)
print("\nShape:", A.shape) # (4, 3) means 4 rows, 3 columns
print("Dimensions:", A.ndim) # 2D = matrix
# Accessing elements: A[row, column] (0-indexed)
print("\nElement at row 2, column 3 (A[1, 2]):", A[1, 2]) # 35
# Accessing entire rows and columns
print("\nFirst row:", A[0]) # [1500, 3, 20]
print("Second column:", A[:, 1]) # [3, 2, 4, 1] — all bedrooms
# Transpose: flip rows and columns
print("\nTranspose of A (rows become columns):")
print(A.T)
print("Shape of transpose:", A.T.shape) # (3, 4) — swapped!
2. Basic Operations with Vectors and Matrices¶
Now that we know what scalars, vectors, and matrices are, let's learn how to do math with them. These operations are the building blocks of every machine learning algorithm.
Addition and Scalar Multiplication¶
Vector addition works element by element. You simply add corresponding entries:
$$\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} + \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix} = \begin{bmatrix} 1+4 \\ 2+5 \\ 3+6 \end{bmatrix} = \begin{bmatrix} 5 \\ 7 \\ 9 \end{bmatrix}$$
Important: You can only add vectors (or matrices) of the same shape. Adding a 3-element vector to a 4-element vector doesn't make sense.
Scalar multiplication means multiplying every entry by a single number:
$$3 \times \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 3 \\ 6 \\ 9 \end{bmatrix}$$
These same rules apply to matrices — addition is element-by-element, and scalar multiplication scales every entry.
# Vector addition — element by element
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print("a:", a)
print("b:", b)
print("a + b:", a + b) # [5, 7, 9]
print("a - b:", a - b) # [-3, -3, -3]
# Scalar multiplication — multiply every element by a number
print("\n3 * a:", 3 * a) # [3, 6, 9]
# Matrix addition
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print("\nMatrix A:")
print(A)
print("Matrix B:")
print(B)
print("A + B:")
print(A + B)
Matrix Multiplication (The Most Important Operation)¶
Matrix multiplication is different from element-wise addition — it's not simply multiplying corresponding entries. Instead, it follows a specific pattern called the dot product of rows and columns.
Let's start with a concrete example. Suppose we have:
$$\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}$$
To compute $\mathbf{C} = \mathbf{A} \times \mathbf{B}$, we calculate each entry $c_{ij}$ by taking row $i$ of $\mathbf{A}$ and column $j$ of $\mathbf{B}$, multiplying corresponding elements, and adding them up:
$$c_{11} = (1 \times 5) + (2 \times 7) = 5 + 14 = 19$$ $$c_{12} = (1 \times 6) + (2 \times 8) = 6 + 16 = 22$$ $$c_{21} = (3 \times 5) + (4 \times 7) = 15 + 28 = 43$$ $$c_{22} = (3 \times 6) + (4 \times 8) = 18 + 32 = 50$$
So: $\mathbf{C} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}$
The general formula is:
$$(\mathbf{AB})_{ij} = \sum_{k=1}^{n} a_{ik} \cdot b_{kj}$$
The $\sum$ symbol means "add up" — we're summing over $k$ from 1 to $n$.
Size rule: For matrix multiplication to work, the number of columns in $\mathbf{A}$ must equal the number of rows in $\mathbf{B}$. If $\mathbf{A}$ is $m \times n$ and $\mathbf{B}$ is $n \times p$, the result is $m \times p$.
# Matrix multiplication in NumPy: use the @ operator
A = np.array([[1, 2],
[3, 4]])
B = np.array([[5, 6],
[7, 8]])
C = A @ B # Matrix multiplication
print("A @ B =")
print(C) # [[19, 22], [43, 50]]
# Let's verify by hand for c[0,0]:
c_00 = A[0, 0] * B[0, 0] + A[0, 1] * B[1, 0]
print(f"\nManual check: c[0,0] = {A[0,0]}*{B[0,0]} + {A[0,1]}*{B[1,0]} = {c_00}")
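The size rule is worth checking in code as well. A quick sketch (the shapes below are chosen purely for illustration):

```python
import numpy as np

# Size rule: (m x n) @ (n x p) -> (m x p)
A = np.ones((2, 3))  # 2 rows, 3 columns
B = np.ones((3, 4))  # 3 rows, 4 columns
C = A @ B
print(C.shape)  # (2, 4) — the inner dimensions (3 and 3) matched and disappeared

# Mismatched inner dimensions raise an error instead of silently failing
try:
    np.ones((2, 3)) @ np.ones((4, 5))
except ValueError as e:
    print("Incompatible shapes:", e)
```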
Element-wise (Hadamard) Product vs Matrix Multiplication¶
This is a common source of confusion. There are two different ways to multiply matrices:
- Matrix multiplication (`@` in NumPy): follows the row-column dot product rule we just learned
- Element-wise multiplication (`*` in NumPy): simply multiplies corresponding entries
$$\text{Element-wise: } (\mathbf{A} \odot \mathbf{B})_{ij} = a_{ij} \cdot b_{ij}$$
The symbol $\odot$ represents element-wise multiplication. Let's see the difference:
A = np.array([[1, 2],
[3, 4]])
B = np.array([[5, 6],
[7, 8]])
print("Matrix multiplication (A @ B):")
print(A @ B) # [[19, 22], [43, 50]]
print("\nElement-wise multiplication (A * B):")
print(A * B) # [[5, 12], [21, 32]] — just 1*5, 2*6, 3*7, 4*8
print("\nThey give DIFFERENT results!")
print("This distinction is crucial in machine learning code.")
Why Matrix Multiplication Matters in ML¶
In machine learning, making a prediction is literally a matrix multiplication. Consider a simple model that predicts house prices:
$$\hat{y} = w_1 \cdot x_1 + w_2 \cdot x_2 + w_3 \cdot x_3 + b$$
where $x_1, x_2, x_3$ are features (square footage, bedrooms, age) and $w_1, w_2, w_3$ are weights the model learns. The value $b$ is a bias term.
For a whole dataset at once, this becomes:
$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b$$
This is a single matrix multiplication that computes predictions for ALL data points simultaneously!
# ML example: predicting house prices
# 4 houses, 3 features each
X = np.array([
[1500, 3, 20], # House 1
[1200, 2, 35], # House 2
[1800, 4, 10], # House 3
[900, 1, 50], # House 4
])
# Model weights (learned by the model — we'll just make some up)
w = np.array([100, 5000, -200]) # price per sqft, per bedroom, per year of age
b = 50000 # base price (bias)
# Predictions for ALL houses at once!
predictions = X @ w + b
print("Predictions:")
for i, price in enumerate(predictions):
print(f" House {i+1}: ${price:,.0f}")
# Without matrix multiplication, you'd need a loop:
print("\nSame result with a loop (slower, more code):")
for i in range(len(X)):
price = w[0]*X[i,0] + w[1]*X[i,1] + w[2]*X[i,2] + b
print(f" House {i+1}: ${price:,.0f}")
3. Dot Product and Vector Norms¶
Two more essential operations that show up everywhere in machine learning.
The Dot Product¶
The dot product of two vectors is a single number you get by multiplying corresponding entries and adding them up:
$$\mathbf{a} \cdot \mathbf{b} = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n = \sum_{i=1}^{n} a_i b_i$$
Example by hand:
$$\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \cdot \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix} = (1 \times 4) + (2 \times 5) + (3 \times 6) = 4 + 10 + 18 = 32$$
The dot product also has a beautiful geometric interpretation:
$$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \, \|\mathbf{b}\| \cos\theta$$
where $\theta$ is the angle between the two vectors, and $\|\mathbf{a}\|$ is the length (norm) of vector $\mathbf{a}$.
What does this tell us?
- If the dot product is positive: the vectors point in roughly the same direction ($\theta < 90°$)
- If the dot product is zero: the vectors are perpendicular ($\theta = 90°$)
- If the dot product is negative: the vectors point in roughly opposite directions ($\theta > 90°$)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Dot product
dot_product = np.dot(a, b)
print(f"a · b = {dot_product}") # 32
# Manual calculation to verify
manual = a[0]*b[0] + a[1]*b[1] + a[2]*b[2]
print(f"Manual: {a[0]}×{b[0]} + {a[1]}×{b[1]} + {a[2]}×{b[2]} = {manual}")
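The sign behavior described above is easy to verify with a few hand-picked 2D vectors (these particular vectors are illustrative choices):

```python
import numpy as np

right = np.array([1, 0])       # points right
up = np.array([0, 1])          # points up
up_right = np.array([1, 1])    # points diagonally up-right
left = np.array([-1, 0])       # points left

print(np.dot(right, up_right))  # 1  — positive: angle < 90°
print(np.dot(right, up))        # 0  — zero: perpendicular (90°)
print(np.dot(right, left))      # -1 — negative: angle > 90°
```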
Vector Norms (Measuring Length)¶
A norm measures the "size" or "length" of a vector. Think of it as the distance from the origin to the point represented by the vector.
L2 Norm (Euclidean Norm) — the "normal" distance¶
This is the distance formula you know from high school, extended to any number of dimensions:
$$\|\mathbf{x}\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} = \sqrt{\sum_{i=1}^{n} x_i^2}$$
For a 2D vector $\mathbf{x} = [3, 4]$: $\|\mathbf{x}\|_2 = \sqrt{3^2 + 4^2} = \sqrt{9 + 16} = \sqrt{25} = 5$
This is just the Pythagorean theorem!
L1 Norm (Manhattan Norm) — the "city block" distance¶
Instead of going in a straight line, imagine walking along city blocks (only horizontal and vertical):
$$\|\mathbf{x}\|_1 = |x_1| + |x_2| + \cdots + |x_n| = \sum_{i=1}^{n} |x_i|$$
For $\mathbf{x} = [3, 4]$: $\|\mathbf{x}\|_1 = |3| + |4| = 7$
Why two norms? In ML, L1 and L2 norms are used in regularization — a technique to prevent models from becoming too complex. L1 regularization tends to make some weights exactly zero (feature selection), while L2 regularization makes weights small but rarely exactly zero.
x = np.array([3, 4])
# L2 norm (Euclidean) — the "straight line" distance
l2_norm = np.linalg.norm(x) # default is L2
print(f"Vector: {x}")
print(f"L2 norm: \u221a(3\u00b2 + 4\u00b2) = \u221a(9 + 16) = \u221a25 = {l2_norm}")
# L1 norm (Manhattan) — the "city block" distance
l1_norm = np.linalg.norm(x, ord=1)
print(f"L1 norm: |3| + |4| = {l1_norm}")
# Cosine similarity — how similar are two vectors in direction?
a = np.array([1, 0]) # points right
b = np.array([0, 1]) # points up
c = np.array([1, 1]) # points diagonally
cos_ab = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
cos_ac = np.dot(a, c) / (np.linalg.norm(a) * np.linalg.norm(c))
print(f"\nCosine similarity (right vs up): {cos_ab:.4f}") # 0 — perpendicular
print(f"Cosine similarity (right vs diagonal): {cos_ac:.4f}") # ~0.707 — 45 degrees
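The regularization contrast mentioned above can be made concrete with a minimal sketch of the proximal (shrinkage) updates these penalties produce: L1 soft-thresholds weights toward exactly zero, while L2 only scales them down. The weights and penalty strength here are made up for illustration:

```python
import numpy as np

w = np.array([2.0, 0.3, -0.1, 1.5])  # hypothetical model weights
lam = 0.5                            # penalty strength (made up)

# L1 shrinkage (soft-thresholding): weights smaller than lam snap to exactly 0
w_l1 = np.sign(w) * np.maximum(np.abs(w) - lam, 0)

# L2 shrinkage: every weight gets uniformly smaller, but none hits exactly 0
w_l2 = w / (1 + lam)

print("Original:", w)
print("After L1:", w_l1)  # the small entries become exactly 0
print("After L2:", w_l2)  # all entries shrink, none are 0
```

This is exactly the "feature selection" effect: after the L1 step, two of the four weights are exactly zero, while the L2 step leaves all four nonzero.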
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# --- Left plot: Two 2D vectors and the angle between them ---
ax = axes[0]
v1 = np.array([3, 1])
v2 = np.array([1, 3])
# Draw the two vectors as arrows from the origin
ax.quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy', scale=1,
color='steelblue', linewidth=2, label=r'$\mathbf{a} = (3, 1)$')
ax.quiver(0, 0, v2[0], v2[1], angles='xy', scale_units='xy', scale=1,
color='coral', linewidth=2, label=r'$\mathbf{b} = (1, 3)$')
# Draw an arc to show the angle theta between the two vectors
angle_a = np.arctan2(v1[1], v1[0]) # angle of v1 from x-axis
angle_b = np.arctan2(v2[1], v2[0]) # angle of v2 from x-axis
theta_range = np.linspace(angle_a, angle_b, 30)
arc_radius = 0.8
ax.plot(arc_radius * np.cos(theta_range), arc_radius * np.sin(theta_range),
color='gray', linewidth=1.5)
# Compute and label the angle
cos_sim_val = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
angle_deg = np.degrees(np.arccos(cos_sim_val))
mid_angle = (angle_a + angle_b) / 2
ax.text(1.1 * np.cos(mid_angle), 1.1 * np.sin(mid_angle),
f'{angle_deg:.1f}\u00b0', fontsize=12, ha='center')
ax.set_xlim(-0.5, 4)
ax.set_ylim(-0.5, 4)
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
ax.legend(fontsize=11)
ax.set_title('Two Vectors and the Angle Between Them', fontsize=13)
ax.set_xlabel('x')
ax.set_ylabel('y')
# --- Right plot: L1 and L2 unit circles ---
ax = axes[1]
theta = np.linspace(0, 2 * np.pi, 300)
# L2 unit circle: all points where x^2 + y^2 = 1 (a circle)
l2_x = np.cos(theta)
l2_y = np.sin(theta)
ax.plot(l2_x, l2_y, color='steelblue', linewidth=2, label=r'$L_2$ unit circle')
# L1 unit circle: all points where |x| + |y| = 1 (a diamond)
l1_x = np.array([1, 0, -1, 0, 1])
l1_y = np.array([0, 1, 0, -1, 0])
ax.plot(l1_x, l1_y, color='coral', linewidth=2, label=r'$L_1$ unit circle')
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-1.5, 1.5)
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
ax.axhline(0, color='black', linewidth=0.5)
ax.axvline(0, color='black', linewidth=0.5)
ax.legend(fontsize=11)
ax.set_title(r'$L_1$ vs $L_2$ Unit Circles', fontsize=13)
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.tight_layout()
plt.show()
4. Special Matrices and Matrix Properties¶
Some matrices have special properties that make them particularly useful. Let's meet the most important ones.
The Identity Matrix¶
The identity matrix $\mathbf{I}$ is the matrix equivalent of the number 1. Just as $1 \times a = a$ for any number, $\mathbf{I} \times \mathbf{A} = \mathbf{A}$ for any matrix:
$$\mathbf{I}_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
It has 1s on the diagonal and 0s everywhere else. The subscript 3 indicates it's a $3 \times 3$ identity matrix.
The Matrix Inverse¶
Remember how division works with numbers: $5 \times \dfrac{1}{5} = 1$? The inverse of a matrix $\mathbf{A}$, written $\mathbf{A}^{-1}$, is similar:
$$\mathbf{A} \mathbf{A}^{-1} = \mathbf{I}$$
Not all matrices have inverses. Only square matrices (same number of rows and columns) can have one, and even some square matrices don't (just like you can't divide by zero). A matrix without an inverse is called singular.
The Determinant¶
The determinant $\det(\mathbf{A})$ is a single number that tells you whether a matrix has an inverse:
- If $\det(\mathbf{A}) \neq 0$: the matrix has an inverse
- If $\det(\mathbf{A}) = 0$: the matrix is singular (no inverse)
For a $2 \times 2$ matrix:
$$\det\begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc$$
# Identity matrix
I = np.eye(3) # 3x3 identity matrix
print("Identity matrix:")
print(I)
# Any matrix times identity equals itself
A = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
print("\nA @ I = A?", np.allclose(A @ I, A)) # True
# Matrix inverse
B = np.array([[2, 1],
[5, 3]])
B_inv = np.linalg.inv(B)
print("\nMatrix B:")
print(B)
print("Inverse of B:")
print(B_inv)
print("B @ B_inv (should be identity):")
print(np.round(B @ B_inv)) # round to avoid tiny floating point errors
# Determinant
det_B = np.linalg.det(B)
print(f"\nDeterminant of B: {det_B:.4f}")
print("Since det \u2260 0, B has an inverse \u2713")
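The $2 \times 2$ formula $ad - bc$ can be checked by hand against NumPy's result:

```python
import numpy as np

B = np.array([[2, 1],
              [5, 3]])

# det = ad - bc for a 2x2 matrix
det_manual = B[0, 0] * B[1, 1] - B[0, 1] * B[1, 0]
print("Manual:", det_manual)           # 2*3 - 1*5 = 1
print("NumPy:", np.linalg.det(B))      # 1.0 (up to floating point)
```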
The Normal Equation — Linear Regression in One Line¶
One of the most beautiful applications of linear algebra is solving linear regression exactly using the normal equation:
$$\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
Let's break this down piece by piece:
- $\mathbf{X}$ is your dataset matrix (data points × features)
- $\mathbf{y}$ is the vector of target values you want to predict
- $\mathbf{X}^T$ is the transpose of $\mathbf{X}$ (flip rows and columns)
- $(\mathbf{X}^T\mathbf{X})^{-1}$ is the inverse of $\mathbf{X}^T\mathbf{X}$
- $\mathbf{w}$ is the vector of weights that gives the best fit
This single equation finds the weights that minimize the prediction error — no iterative training needed!
# Linear regression using the normal equation
np.random.seed(42)
# Generate synthetic data: y = 3x + 2 + noise
x_data = np.random.rand(50) * 10 # 50 random x values between 0 and 10
y_data = 3 * x_data + 2 + np.random.randn(50) * 2 # y = 3x + 2 + noise
# Build the design matrix X (add a column of 1s for the bias/intercept term)
X = np.column_stack([np.ones(50), x_data]) # shape: (50, 2)
print("Design matrix X (first 5 rows):")
print(X[:5])
# Apply the normal equation: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y_data
print(f"\nEstimated weights: intercept = {w[0]:.4f}, slope = {w[1]:.4f}")
print(f"True values: intercept = 2.0000, slope = 3.0000")
# Plot the result
plt.figure(figsize=(8, 5))
plt.scatter(x_data, y_data, alpha=0.5, label='Data points')
x_line = np.linspace(0, 10, 100)
y_line = w[0] + w[1] * x_line
plt.plot(x_line, y_line, 'r-', linewidth=2, label=f'Fit: y = {w[1]:.2f}x + {w[0]:.2f}')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Linear Regression via the Normal Equation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
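One practical caveat: explicitly inverting $\mathbf{X}^T\mathbf{X}$ can be numerically unstable when features are strongly correlated. In practice, `np.linalg.lstsq` (or `np.linalg.solve`) solves the same least-squares problem without forming an inverse. A sketch on the same kind of synthetic data (the random data here is generated just for the comparison):

```python
import numpy as np

rng = np.random.default_rng(0)
x_data = rng.random(50) * 10
y_data = 3 * x_data + 2 + rng.standard_normal(50) * 2
X = np.column_stack([np.ones(50), x_data])

# Solve the least-squares problem directly — no explicit inverse needed
w_lstsq, *_ = np.linalg.lstsq(X, y_data, rcond=None)

# Same answer as the normal equation on well-behaved data
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y_data
print("lstsq: ", w_lstsq)
print("normal:", w_normal)
print("Match:", np.allclose(w_lstsq, w_normal))  # True
```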
Geometric View: Matrices as Transformations¶
Here's a powerful way to think about matrices: a matrix multiplication transforms space. When you multiply a vector by a matrix, you move it to a new position.
Let's see this visually by watching what happens to a unit square when we multiply it by different matrices:
# Visualize how different 2x2 matrices transform a unit square
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Define the unit square vertices (4 corners + close the shape)
square = np.array([[0, 0],
[1, 0],
[1, 1],
[0, 1],
[0, 0]]) # close the loop
# Three different transformation matrices
matrices = [
(np.array([[2, 0], [0, 1.5]]), "Scaling\n(stretch x by 2, y by 1.5)"),
(np.array([[np.cos(np.pi/4), -np.sin(np.pi/4)],
[np.sin(np.pi/4), np.cos(np.pi/4)]]), "Rotation\n(45\u00b0 counter-clockwise)"),
(np.array([[1, 0.5], [0, 1]]), "Shear\n(slant along x-axis)"),
]
for ax, (M, title) in zip(axes, matrices):
# Apply transformation: multiply each point by the matrix
transformed = (M @ square.T).T # transform all points
# Plot the original square in blue
ax.fill(square[:, 0], square[:, 1], alpha=0.3, color='steelblue', label='Original')
ax.plot(square[:, 0], square[:, 1], 'b-', linewidth=2)
# Plot the transformed shape in red
ax.fill(transformed[:, 0], transformed[:, 1], alpha=0.3, color='coral', label='Transformed')
ax.plot(transformed[:, 0], transformed[:, 1], 'r-', linewidth=2)
ax.set_xlim(-2, 3)
ax.set_ylim(-1, 3)
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
ax.axhline(0, color='black', linewidth=0.5)
ax.axvline(0, color='black', linewidth=0.5)
ax.legend(fontsize=9)
ax.set_title(title, fontsize=12)
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.suptitle('Matrices as Geometric Transformations', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
5. Eigenvalues and Eigenvectors¶
This sounds complex, but the idea is surprisingly simple.
The Big Idea¶
We just saw that matrices transform vectors — they stretch, rotate, and skew them. But for any given matrix, there are some special vectors that don't change direction when transformed. They might get longer or shorter, but they keep pointing the same way.
These special vectors are called eigenvectors, and the amount they get scaled by is called the eigenvalue.
Mathematically:
$$\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$$
where:
- $\mathbf{A}$ is the matrix (the transformation)
- $\mathbf{v}$ is the eigenvector (the special direction that doesn't rotate)
- $\lambda$ (lambda) is the eigenvalue (how much the eigenvector gets scaled)
Think of it this way: Imagine pushing on a door. Most pushes make the door rotate. But if you push exactly along the hinge axis, the door doesn't rotate — it might compress or stretch, but it stays in the same direction. That direction is like an eigenvector.
Why does this matter for ML? The most important application is Principal Component Analysis (PCA) — a technique for reducing the number of features in your data. PCA finds the eigenvectors of the covariance matrix to identify the directions of maximum variation in your data.
# Find eigenvalues and eigenvectors
A = np.array([[2, 1],
[1, 3]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Matrix A:")
print(A)
print(f"\nEigenvalues: \u03bb\u2081 = {eigenvalues[0]:.4f}, \u03bb\u2082 = {eigenvalues[1]:.4f}")
print(f"\nEigenvectors:")
print(f" v\u2081 = {eigenvectors[:, 0]}")
print(f" v\u2082 = {eigenvectors[:, 1]}")
# Verify: A @ v should equal \u03bb * v
print("\nVerification:")
for i in range(2):
v = eigenvectors[:, i]
lam = eigenvalues[i]
Av = A @ v
lam_v = lam * v
print(f" A @ v{i+1} = {np.round(Av, 4)}")
print(f" \u03bb{i+1} \u00d7 v{i+1} = {np.round(lam_v, 4)}")
print(f" Equal? {np.allclose(Av, lam_v)}")
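The PCA connection mentioned earlier can be sketched in a few lines: compute the covariance matrix of the data, take its eigenvectors, and project onto the one with the largest eigenvalue. The 2D data below is synthetic, generated just for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic 2D data, stretched along a diagonal direction
X = rng.standard_normal((200, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])
X_centered = X - X.mean(axis=0)

# Eigendecomposition of the covariance matrix
C = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(C)  # eigh: for symmetric matrices

# Sort largest-first — the top eigenvector is the direction of max variance
order = np.argsort(eigenvalues)[::-1]
top_component = eigenvectors[:, order[0]]

# Project the data onto the top principal component (2D -> 1D)
projected = X_centered @ top_component
print("Top eigenvalue:", eigenvalues[order[0]])
print("Projected shape:", projected.shape)  # (200,) — one number per point
```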
# Visualize: eigenvectors keep their direction after transformation
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
A = np.array([[2, 1],
[1, 3]])
eigenvalues, eigenvectors = np.linalg.eig(A)
# --- Left plot: before transformation ---
ax = axes[0]
# Plot eigenvectors (blue and coral)
colors = ['steelblue', 'coral']
for i in range(2):
v = eigenvectors[:, i]
ax.quiver(0, 0, v[0], v[1], angles='xy', scale_units='xy', scale=1,
color=colors[i], linewidth=2.5, label=f'Eigenvector v{i+1}')
# Also plot a non-eigenvector for comparison (green)
non_eigen = np.array([1, 0]) # a regular vector
ax.quiver(0, 0, non_eigen[0], non_eigen[1], angles='xy', scale_units='xy', scale=1,
color='green', linewidth=2, linestyle='dashed', label='Non-eigenvector')
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-1.5, 1.5)
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
ax.axhline(0, color='black', linewidth=0.5)
ax.axvline(0, color='black', linewidth=0.5)
ax.legend(fontsize=10)
ax.set_title('Before Transformation', fontsize=13)
ax.set_xlabel('x')
ax.set_ylabel('y')
# --- Right plot: after transformation (A @ v) ---
ax = axes[1]
for i in range(2):
v = eigenvectors[:, i]
Av = A @ v # = lambda * v — same direction, different length!
ax.quiver(0, 0, Av[0], Av[1], angles='xy', scale_units='xy', scale=1,
color=colors[i], linewidth=2.5,
label=f'A @ v{i+1} (scaled by \u03bb={eigenvalues[i]:.2f})')
# Faint dashed line showing original direction
ax.quiver(0, 0, v[0], v[1], angles='xy', scale_units='xy', scale=1,
color=colors[i], linewidth=1, alpha=0.3)
# The non-eigenvector CHANGES direction after transformation
non_eigen_transformed = A @ non_eigen
ax.quiver(0, 0, non_eigen_transformed[0], non_eigen_transformed[1],
angles='xy', scale_units='xy', scale=1,
color='green', linewidth=2, linestyle='dashed',
label='A @ non-eigen (direction CHANGED!)')
ax.quiver(0, 0, non_eigen[0], non_eigen[1], angles='xy', scale_units='xy', scale=1,
color='green', linewidth=1, alpha=0.3)
ax.set_xlim(-3.5, 3.5)
ax.set_ylim(-3.5, 3.5)
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
ax.axhline(0, color='black', linewidth=0.5)
ax.axvline(0, color='black', linewidth=0.5)
ax.legend(fontsize=9)
ax.set_title('After Transformation: A @ v', fontsize=13)
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.suptitle('Eigenvectors Keep Their Direction — Other Vectors Do Not', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
6. Broadcasting and Vectorization¶
This section is less about math and more about practical programming for ML. The key idea: whenever possible, use NumPy's built-in operations instead of Python loops. This can make your code 10x to 100x faster.
Why Loops Are Slow and NumPy Is Fast¶
Python is a wonderful language, but it's slow at running loops over millions of numbers. Every iteration of a Python for loop has overhead — type checking, memory management, interpretation.
NumPy solves this by doing the looping in C code under the hood. When you write a + b with NumPy arrays, Python makes a single call to highly optimized C code that processes the entire array in one shot. This is called vectorization.
Broadcasting is a related concept: NumPy can automatically "stretch" arrays of different shapes to make operations work. For example, subtracting a mean vector $\bar{\mathbf{x}} \in \mathbb{R}^d$ from every row of a data matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ requires no loop — NumPy broadcasts the vector across all rows automatically.
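The mean-centering example just described takes a single line in NumPy (a tiny matrix is used here so the result is easy to read):

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])  # 3 data points, 2 features

x_bar = X.mean(axis=0)       # shape (2,) — one mean per feature: [2., 20.]
X_centered = X - x_bar       # broadcasting subtracts x_bar from every row

print(X_centered)
print(X_centered.mean(axis=0))  # [0. 0.] — each column now has zero mean
```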
Example: Pairwise Distances¶
A common ML task is computing the distance between every pair of data points. Given $n$ points, the pairwise Euclidean distance between point $i$ and point $j$ is:
$$D_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\|_2$$
A naive approach uses two nested loops ($n^2$ iterations). A vectorized approach uses broadcasting to compute all distances at once. Let's compare:
import time
# Create a dataset: 500 points in 10 dimensions
n = 500
d = 10
X = np.random.randn(n, d)
# --- Method 1: Slow loop-based approach ---
start = time.time()
D_loop = np.zeros((n, n)) # pre-allocate the distance matrix
for i in range(n):
for j in range(n):
# Compute Euclidean distance between point i and point j
D_loop[i, j] = np.sqrt(np.sum((X[i] - X[j]) ** 2))
loop_time = time.time() - start
# --- Method 2: Fast vectorized approach using broadcasting ---
start = time.time()
# X[:, np.newaxis, :] has shape (500, 1, 10)
# X[np.newaxis, :, :] has shape (1, 500, 10)
# Broadcasting gives all pairwise differences: shape (500, 500, 10)
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
D_vec = np.sqrt(np.sum(diff ** 2, axis=2)) # sum over features, take sqrt
vec_time = time.time() - start
print(f"Loop time: {loop_time:.4f} seconds")
print(f"Vectorized time: {vec_time:.4f} seconds")
print(f"Speedup: {loop_time / vec_time:.1f}x faster!")
print(f"Results match: {np.allclose(D_loop, D_vec)}")
Check Your Understanding¶
Test your knowledge with these exercises. Try to work through them on your own before looking at the solutions below.
Task 1: Given a dataset matrix $\mathbf{X}$ with 5 data points and 3 features, and a weight vector $\mathbf{w}$ with 3 weights, compute the predictions $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b$ where $b = 10$.
# Task 1 Solution: Matrix-vector prediction
np.random.seed(7)
# Create a dataset: 5 data points, 3 features each
X = np.array([
[2.0, 1.0, 3.0],
[1.5, -0.5, 2.0],
[3.0, 2.0, 1.0],
[0.5, 0.0, 4.0],
[2.5, 1.5, 2.5],
])
# Weight vector and bias
w = np.array([1.5, -2.0, 0.5])
b = 10 # bias term
# Compute predictions: y_hat = X @ w + b
y_hat = X @ w + b
print("Dataset X:")
print(X)
print(f"\nWeights w: {w}")
print(f"Bias b: {b}")
print(f"\nPredictions (X @ w + b): {y_hat}")
# Verify the first prediction by hand:
manual_pred = w[0]*X[0,0] + w[1]*X[0,1] + w[2]*X[0,2] + b
print(f"\nManual check for row 0: {w[0]}*{X[0,0]} + {w[1]}*{X[0,1]} + {w[2]}*{X[0,2]} + {b} = {manual_pred}")
Task 2: Compute the covariance matrix of a dataset manually using the formula:
$$\mathbf{C} = \dfrac{1}{n-1}(\mathbf{X} - \bar{\mathbf{X}})^T(\mathbf{X} - \bar{\mathbf{X}})$$
Then verify your result matches np.cov(X.T).
# Task 2 Solution: Manual covariance matrix computation
np.random.seed(42)
# Create a dataset: 200 samples, 4 features
n_samples = 200
n_features = 4
X = np.random.randn(n_samples, n_features)
# Step 1: Compute the mean of each feature (column-wise mean)
X_bar = X.mean(axis=0) # shape: (4,) — one mean per feature
print(f"Feature means: {X_bar}")
# Step 2: Center the data (subtract the mean from each column)
X_centered = X - X_bar # broadcasting handles this automatically!
# Step 3: Apply the covariance formula: C = (1/(n-1)) * X_centered^T @ X_centered
C_manual = (1 / (n_samples - 1)) * (X_centered.T @ X_centered)
print(f"\nCovariance matrix (manual):")
print(np.round(C_manual, 4))
# Step 4: Verify against NumPy's built-in function
C_numpy = np.cov(X, rowvar=False) # rowvar=False means rows are observations
print(f"\nCovariance matrix (np.cov):")
print(np.round(C_numpy, 4))
# Check if they match
print(f"\nResults match: {np.allclose(C_manual, C_numpy)}")
# The covariance matrix is always symmetric
print(f"Covariance matrix is symmetric: {np.allclose(C_manual, C_manual.T)}")