Calculus¶
Welcome! This notebook will teach you the calculus you need to understand how machine learning models actually learn. Don't worry if you haven't touched calculus since high school (or ever) — we'll build every idea from the ground up, with intuition first and formulas second.
Why does calculus matter for ML?¶
Think of it this way: a machine learning model starts out knowing nothing. It makes terrible predictions. Then, step by step, it gets better. But how does it know which direction "better" is?
That's where calculus comes in:
- Derivatives tell us how a function changes — if we tweak an input a tiny bit, how much does the output move?
- Gradients are derivatives for functions with many inputs — they tell us how to adjust every model parameter at once to reduce error.
- Gradient descent is the algorithm that uses gradients to iteratively improve model parameters — it's the engine behind virtually all of modern deep learning.
Without calculus, there would be no gradient descent. Without gradient descent, there would be no modern deep learning — no ChatGPT/Claude/Gemini, no image recognition, no self-driving cars.
Prerequisites¶
You only need:
- High school algebra — you're comfortable with expressions like $y = 3x + 2$ and can solve simple equations.
- Understanding of functions — you know that $y = f(x)$ means "$y$ depends on $x$", and you can read a graph of a function.
- Basic exponents and logarithms — you know that $x^2$ means $x \times x$ and that $\ln$ is the natural logarithm.
Let's dive in!
import numpy as np
import matplotlib.pyplot as plt
What is a Derivative?¶
The Big Idea: Rate of Change¶
Let's start with something you already understand: speed.
Imagine you drive from home to a store that's 30 miles away. The trip takes you 30 minutes (0.5 hours). Your average speed is:
$$\text{Average speed} = \dfrac{\text{distance traveled}}{\text{time elapsed}} = \dfrac{30 \text{ miles}}{0.5 \text{ hours}} = 60 \text{ mph}$$
But your speed wasn't actually 60 mph the whole time. At a red light, it was 0. On the highway, it was maybe 70. Pulling into the parking lot, it was 5.
Speed at any given instant is what we call the instantaneous rate of change of your position. And that's exactly what a derivative is — the instantaneous rate of change of one quantity with respect to another.
From Average to Instantaneous¶
Think of it this way: to get a better estimate of your speed at a specific moment, you could measure your position over a shorter and shorter time interval. Over 1 second, over 0.1 seconds, over 0.001 seconds... The shorter the interval, the closer you get to your true instantaneous speed.
This is exactly what the derivative does for any function.
The Formal Definition¶
For a function $y = f(x)$, the derivative $f'(x)$ (read "f prime of x") tells you:
"If I change $x$ by a tiny amount, how much does $y$ change?"
Here is the formal definition:
$$f'(x) = \lim_{h \to 0} \dfrac{f(x + h) - f(x)}{h}$$
Let's break down every piece of this formula:
| Symbol | Meaning |
|---|---|
| $x$ | The point where we want to know the rate of change |
| $h$ | A tiny change in $x$ (think of it as a small nudge) |
| $f(x + h)$ | The value of the function after we nudge $x$ by $h$ |
| $f(x + h) - f(x)$ | How much the output changed because of our nudge |
| $\dfrac{f(x+h) - f(x)}{h}$ | The rate of change: output change divided by input change |
| $\lim_{h \to 0}$ | Take the limit as $h$ shrinks to zero — make the nudge infinitely small |
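You can watch this limit happen numerically. A small sketch (the function $f(x) = x^2$ and the point $x = 3$ are illustrative choices; the exact slope there turns out to be 6):

```python
# Watch the difference quotient approach the true derivative as h shrinks.
# For f(x) = x^2 at x = 3, the exact answer is 6.
f = lambda x: x**2
x0 = 3.0

for h in [1.0, 0.1, 0.01, 0.001]:
    quotient = (f(x0 + h) - f(x0)) / h
    print(f"h = {h:<6} -> (f(x+h) - f(x)) / h = {quotient:.6f}")
```

Each time we shrink the nudge by a factor of ten, the quotient gets ten times closer to the true rate of change.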
Geometric Meaning¶
Geometrically, the derivative $f'(x)$ is the slope of the tangent line to the curve $y = f(x)$ at the point $(x, f(x))$.
- If the slope is positive, the function is going uphill (increasing).
- If the slope is negative, the function is going downhill (decreasing).
- If the slope is zero, the function is momentarily flat — this could be a peak, a valley, or a flat spot.
# Define f(x) = -x^3 + 3x (has a peak, a valley, and clear slopes)
x = np.linspace(-2, 2.5, 300)
f = lambda x: -x**3 + 3 * x
# Its derivative: f'(x) = -3x^2 + 3
fp = lambda x: -3 * x**2 + 3
y = f(x)
# Three key points
points = [
    (-0.5, "Positive slope\n(going uphill)", "#2E7D32"),
    (1.0, "Zero slope\n(peak)", "#E65100"),
    (2.0, "Negative slope\n(going downhill)", "#C62828"),
]
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, y, color="#1565C0", linewidth=2.5, label="$f(x) = -x^3 + 3x$")
for x0, label, color in points:
y0 = f(x0)
slope = fp(x0)
# Tangent line: y - y0 = slope * (x - x0)
tx = np.linspace(x0 - 0.8, x0 + 0.8, 50)
ty = y0 + slope * (tx - x0)
ax.plot(tx, ty, color=color, linewidth=2, linestyle="--")
ax.plot(x0, y0, "o", color=color, markersize=10, zorder=5)
ax.annotate(f"{label}\nslope = {slope:.1f}",
xy=(x0, y0),
xytext=(15, 25),
textcoords="offset points",
fontsize=10,
color=color,
fontweight="bold",
arrowprops=dict(arrowstyle="->", color=color, lw=1.5))
ax.set_xlabel("x", fontsize=12)
ax.set_ylabel("f(x)", fontsize=12)
ax.set_title("Geometric Meaning of the Derivative: Slope of the Tangent Line",
fontsize=14,
fontweight="bold")
ax.axhline(0, color="gray", linewidth=0.5)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Quick Example by Hand¶
Let's find the derivative of $f(x) = x^2$ using the definition:
$$f'(x) = \lim_{h \to 0} \dfrac{(x+h)^2 - x^2}{h}$$
Step 1: Expand $(x+h)^2 = x^2 + 2xh + h^2$
$$= \lim_{h \to 0} \dfrac{x^2 + 2xh + h^2 - x^2}{h}$$
Step 2: Cancel the $x^2$ terms:
$$= \lim_{h \to 0} \dfrac{2xh + h^2}{h}$$
Step 3: Factor out $h$:
$$= \lim_{h \to 0} (2x + h)$$
Step 4: Let $h \to 0$:
$$f'(x) = 2x$$
So the derivative of $x^2$ is $2x$. At $x = 3$, the slope is $2(3) = 6$. At $x = -1$, the slope is $2(-1) = -2$. At $x = 0$, the slope is $0$ (the bottom of the parabola).
Common Derivatives You'll Need¶
You don't need to derive these from scratch every time. Here are the most important derivatives for machine learning, with an explanation for each:
| Function $f(x)$ | Derivative $f'(x)$ | Example | Why it works |
|---|---|---|---|
| $x^n$ (power rule) | $nx^{n-1}$ | $f(x) = x^3 \Rightarrow f'(x) = 3x^2$ | Bring the exponent down, reduce it by 1 |
| $e^x$ (exponential) | $e^x$ | $f(x) = e^x \Rightarrow f'(x) = e^x$ | Equal to its own derivative (the defining property of $e^x$)! |
| $\ln(x)$ (natural log) | $\dfrac{1}{x}$ | $f(x) = \ln(x) \Rightarrow f'(x) = \dfrac{1}{x}$ | The inverse of the exponential |
| $c$ (constant) | $0$ | $f(x) = 7 \Rightarrow f'(x) = 0$ | Constants don't change, so rate of change is 0 |
| $cx$ (linear) | $c$ | $f(x) = 5x \Rightarrow f'(x) = 5$ | A line has constant slope everywhere |
Notice that these rules can be combined. For example:
- $f(x) = 3x^2 + 5x - 7$
- $f'(x) = 3 \cdot 2x + 5 \cdot 1 - 0 = 6x + 5$
This works because the derivative of a sum is the sum of the derivatives, and constants just multiply through.
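These combined rules can be checked numerically. A small sketch (the symmetric quotient and $h = 10^{-6}$ are illustrative choices):

```python
# Check f(x) = 3x^2 + 5x - 7 against its derivative f'(x) = 6x + 5
# using a tiny symmetric nudge around each test point.
f = lambda x: 3 * x**2 + 5 * x - 7
fp = lambda x: 6 * x + 5

h = 1e-6
for x0 in [-2.0, 0.0, 1.5]:
    approx = (f(x0 + h) - f(x0 - h)) / (2 * h)
    print(f"x = {x0:>4}: approx = {approx:.6f}, exact = {fp(x0):.6f}")
```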
Computing Derivatives Numerically¶
Sometimes you have a function that's hard to differentiate by hand. In those cases, we can approximate the derivative using a computer.
The most common method is the central difference formula:
$$f'(x) \approx \dfrac{f(x + h) - f(x - h)}{2h}$$
Let's break this down:
| Symbol | Meaning |
|---|---|
| $h$ | A very small number (we'll use $h = 10^{-7} = 0.0000001$) |
| $f(x + h)$ | The function value slightly to the right of $x$ |
| $f(x - h)$ | The function value slightly to the left of $x$ |
| $2h$ | The total width of the interval (from $x - h$ to $x + h$) |
Why central differences instead of just forward differences?
The simpler formula $\dfrac{f(x+h) - f(x)}{h}$ (forward difference) only looks at the function to the right of $x$. The central difference looks at both sides equally, which cancels the leading error term. As a result, the central difference's truncation error shrinks like $h^2$ instead of $h$: halving $h$ cuts the error by a factor of four rather than two. (In practice, floating-point round-off limits how small $h$ can usefully be, so the tiny $h$ we use is a compromise between truncation error and round-off.) Much better!
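To make the accuracy difference concrete, here is a quick comparison (the function $f(x) = e^x$, the point $x = 1$, and $h = 10^{-5}$ are illustrative choices; the exact derivative there is $e$ itself):

```python
import math

# Compare forward and central differences on f(x) = e^x at x = 1,
# where the exact derivative is e.
f = math.exp
x0, h = 1.0, 1e-5  # h small, but large enough that round-off
                   # does not dominate the comparison
exact = math.exp(x0)

forward = (f(x0 + h) - f(x0)) / h            # truncation error ~ O(h)
central = (f(x0 + h) - f(x0 - h)) / (2 * h)  # truncation error ~ O(h^2)

print(f"forward-difference error: {abs(forward - exact):.2e}")
print(f"central-difference error: {abs(central - exact):.2e}")
```

The central difference's error is several orders of magnitude smaller for the same $h$.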
from typing import Callable
def numerical_derivative(
f: Callable,
x: np.float64,
h: np.float64 = np.float64(1e-7),
) -> np.float64:
"""Approximate the derivative of `f` at point `x` using the central difference
formula.
Args:
f: function that takes a single number and returns a number
x: the point at which to compute the derivative
h: the tiny nudge size (default 1e-7)
Returns:
The approximate value of f'(x)
"""
# Evaluate f slightly to the right and left of x
f_right = f(x + h) # f(x + h)
f_left = f(x - h) # f(x - h)
# The central difference formula
return (f_right - f_left) / (2 * h)
# Test 1: f(x) = x^2, derivative should be 2x
# Define: f(x) = x^2
f = lambda x: x**2
x_test = np.float64(3.0)
numerical = numerical_derivative(f, x_test)
# f'(x) = 2x = 2(3) = 6
analytical = 2 * x_test
print("Test 1: f(x) = x^2 at x = 3.0")
print(f" Numerical derivative: {numerical:.10f}")
print(f" Analytical derivative: {analytical:.10f}")
print(f" Match: {np.isclose(numerical, analytical)}")
Test 1: f(x) = x^2 at x = 3.0
  Numerical derivative: 5.9999999902
  Analytical derivative: 6.0000000000
  Match: True
# Test 2: f(x) = x^3, derivative should be 3x^2
# Define: f(x) = x^3
f = lambda x: x**3
x_test = np.float64(2.0)
numerical = numerical_derivative(f, x_test)
# f'(x) = 3x^2 = 3(4) = 12
analytical = 3 * x_test**2
print("\nTest 2: f(x) = x^3 at x = 2.0")
print(f" Numerical derivative: {numerical:.10f}")
print(f" Analytical derivative: {analytical:.10f}")
print(f" Match: {np.isclose(numerical, analytical)}")
Test 2: f(x) = x^3 at x = 2.0
  Numerical derivative: 11.9999999937
  Analytical derivative: 12.0000000000
  Match: True
# Test 3: f(x) = e^x, derivative should be e^x
# Define: f(x) = e^x
f = lambda x: np.exp(x)
x_test = np.float64(1.0)
numerical = numerical_derivative(f, x_test)
# f'(x) = e^x = e^1 ≈ 2.71828
analytical = np.exp(x_test)
print("\nTest 3: f(x) = e^x at x = 1.0")
print(f" Numerical derivative: {numerical:.10f}")
print(f" Analytical derivative: {analytical:.10f}")
print(f" Match: {np.isclose(numerical, analytical)}")
Test 3: f(x) = e^x at x = 1.0
  Numerical derivative: 2.7182818285
  Analytical derivative: 2.7182818285
  Match: True
# Test 4: f(x) = ln(x), derivative should be 1/x
# Define: f(x) = ln(x)
f = lambda x: np.log(x)
x_test = np.float64(2.0)
numerical = numerical_derivative(f, x_test)
# f'(x) = 1/x = 1/2 = 0.5
analytical = 1.0 / x_test
print("\nTest 4: f(x) = ln(x) at x = 2.0")
print(f" Numerical derivative: {numerical:.10f}")
print(f" Analytical derivative: {analytical:.10f}")
print(f" Match: {np.isclose(numerical, analytical)}")
Test 4: f(x) = ln(x) at x = 2.0
  Numerical derivative: 0.4999999997
  Analytical derivative: 0.5000000000
  Match: True
Let's visualize $f(x) = x^2$ and its tangent lines at three different points.
# The tangent line at a point (x0, f(x0)) has the equation:
# y = f'(x0) * (x - x0) + f(x0)
# y = 2*x0 * (x - x0) + x0^2
# Create x values for the main curve
# 300 evenly spaced points from -3 to 3
x = np.linspace(-3, 3, 300)
# f(x) = x^2
y = x**2
# Create the figure
fig, ax = plt.subplots(figsize=(9, 6))
# Plot the parabola
ax.plot(x, y, 'b-', linewidth=2.5, label='$f(x) = x^2$')
# Colors for each tangent line: red, green, purple
colors = ['#e74c3c', '#2ecc71', '#9b59b6']
# Draw tangent lines at x = -2, 0, and 2
for x0, color in zip([-2, 0, 2], colors):
# f'(x0) = 2*x0
slope = 2 * x0
# f(x0) = x0^2
y0 = x0**2
# Create points for the tangent line (short segment around x0)
tangent_x = np.linspace(x0 - 1.5, x0 + 1.5, 50)
# tangent line equation
tangent_y = slope * (tangent_x - x0) + y0
# Plot the tangent line
ax.plot(tangent_x,
tangent_y,
'--',
color=color,
linewidth=2,
label=f'Tangent at x={x0} (slope={slope})')
# Mark the point of tangency with a black dot
ax.plot(x0, y0, 'ko', markersize=7, zorder=5)
ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('f(x)', fontsize=12)
ax.set_title('The Derivative as the Slope of the Tangent Line', fontsize=14)
ax.legend(fontsize=10, loc='upper center')
ax.set_ylim(-3, 10)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Notice how the tangent line captures the local behavior of the curve:
- At $x = -2$: $\text{slope} = -4$ (steeply downhill, going left to right)
- At $x = 0$: $\text{slope} = 0$ (perfectly flat — the bottom of the bowl)
- At $x = 2$: $\text{slope} = 4$ (steeply uphill)
The Chain Rule — The Most Important Rule for Deep Learning¶
Composing Functions: One Thing Inside Another¶
In real life and in machine learning, we constantly compose functions — the output of one function becomes the input of another.
Think of it this way: Suppose a car travels 2 times as fast as a bicycle, and the bicycle travels 3 times as fast as walking. How fast is the car compared to walking?
$$2 \times 3 = 6 \text{ times as fast}$$
Rates of change multiply when you chain them together. That's the chain rule!
The Chain Rule, Formally¶
If $y = f(g(x))$ — that is, $y$ depends on $g$, and $g$ depends on $x$ — then:
$$\dfrac{dy}{dx} = \dfrac{dy}{dg} \cdot \dfrac{dg}{dx} = f'(g(x)) \cdot g'(x)$$
In words: differentiate the outer function (evaluated at the inner function) times the derivative of the inner function.
| Symbol | Meaning |
|---|---|
| $f$ | The "outer" function — the last operation applied |
| $g$ | The "inner" function — the first operation applied |
| $f'(g(x))$ | Derivative of the outer, evaluated at the inner |
| $g'(x)$ | Derivative of the inner |
Complete Worked Example¶
Let's differentiate $y = (3x + 2)^2$ step by step.
Step 1: Identify the inner and outer functions
- Inner function: $u = g(x) = 3x + 2$
- Outer function: $y = f(u) = u^2$
Step 2: Differentiate each piece
- $\dfrac{dy}{du} = 2u = 2(3x + 2)$ — power rule on the outer
- $\dfrac{du}{dx} = 3$ — derivative of the inner
Step 3: Multiply them together (chain rule)
$$\dfrac{dy}{dx} = \dfrac{dy}{du} \cdot \dfrac{du}{dx} = 2(3x + 2) \cdot 3 = 6(3x + 2)$$
Let's verify: at $x = 1$, the derivative should be $6(3(1) + 2) = 6(5) = 30$.
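That value can be double-checked with a symmetric difference quotient (the nudge size $h = 10^{-6}$ is an illustrative choice):

```python
# Verify the chain-rule result dy/dx = 6(3x + 2) for y = (3x + 2)^2.
y = lambda x: (3 * x + 2) ** 2
dy = lambda x: 6 * (3 * x + 2)

h = 1e-6
x0 = 1.0
approx = (y(x0 + h) - y(x0 - h)) / (2 * h)
print(f"numerical: {approx:.6f}, analytical: {dy(x0):.6f}")  # both ~ 30
```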
Why the Chain Rule Matters for Deep Learning¶
A neural network is essentially a giant composition of functions: Layer 1 feeds into Layer 2, which feeds into Layer 3, and so on. The chain rule is THE mathematical foundation of backpropagation — the algorithm that allows neural networks to learn.
Every single time a model learns from its mistakes, the chain rule is being applied — potentially millions of times per second.
The Product Rule¶
Sometimes you need to differentiate a product of two functions. The product rule says:
$$\dfrac{d}{dx}[f(x) \cdot g(x)] = f'(x) \cdot g(x) + f(x) \cdot g'(x)$$
Think of it this way: if both $f$ and $g$ are changing, the total change comes from two sources — $f$ changing while $g$ stays put, and $g$ changing while $f$ stays put.
Worked Example¶
Differentiate $y = x^2 \cdot e^x$.
- Let $f(x) = x^2$ and $g(x) = e^x$
- $f'(x) = 2x$ and $g'(x) = e^x$
Apply the product rule:
$$y' = f'(x) \cdot g(x) + f(x) \cdot g'(x) = 2x \cdot e^x + x^2 \cdot e^x = e^x(2x + x^2)$$
We can verify at $x = 1$: $y'(1) = e^1(2 + 1) = 3e \approx 8.1548$.
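The same numeric check works here (again, $h = 10^{-6}$ is an illustrative choice):

```python
import numpy as np

# Verify the product-rule result y' = e^x (2x + x^2) for y = x^2 * e^x.
y = lambda x: x**2 * np.exp(x)
yp = lambda x: np.exp(x) * (2 * x + x**2)

h = 1e-6
x0 = 1.0
approx = (y(x0 + h) - y(x0 - h)) / (2 * h)
print(f"numerical: {approx:.4f}, analytical: {yp(x0):.4f}")  # both ~ 3e
```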
Worked Example: The Sigmoid Derivative¶
The sigmoid function is one of the most important functions in all of deep learning. It squishes any real number into the range $(0, 1)$, making it perfect for modeling probabilities:
$$\sigma(z) = \dfrac{1}{1 + e^{-z}}$$
| Symbol | Meaning |
|---|---|
| $z$ | The input (could be any real number, from $-\infty$ to $+\infty$) |
| $e$ | Euler's number ($\approx 2.71828$) |
| $e^{-z}$ | $e$ raised to the power of negative $z$ (always positive) |
| $\sigma(z)$ | The output, always between 0 and 1 |
Let's derive its derivative step by step using the chain rule.
Step 1: Rewrite as a power: $\sigma(z) = (1 + e^{-z})^{-1}$
Step 2: Identify inner and outer functions:
- Inner: $u = 1 + e^{-z}$
- Outer: $\sigma = u^{-1}$
Step 3: Apply the chain rule:
$$\dfrac{d\sigma}{dz} = \underbrace{\dfrac{d}{du}(u^{-1})}_{\text{outer derivative}} \cdot \underbrace{\dfrac{du}{dz}}_{\text{inner derivative}}$$
- Outer derivative: $\dfrac{d}{du}(u^{-1}) = -u^{-2} = -(1 + e^{-z})^{-2}$
- Inner derivative: $\dfrac{du}{dz} = \dfrac{d}{dz}(1 + e^{-z}) = 0 + e^{-z} \cdot (-1) = -e^{-z}$
Notice we used the chain rule again inside the inner derivative: the derivative of $e^{-z}$ with respect to $z$ is $e^{-z} \cdot (-1)$.
Step 4: Multiply:
$$\sigma'(z) = -(1 + e^{-z})^{-2} \cdot (-e^{-z}) = \dfrac{e^{-z}}{(1 + e^{-z})^2}$$
Step 5: Simplify to the elegant form.
Notice that:
- $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
- $1 - \sigma(z) = 1 - \dfrac{1}{1 + e^{-z}} = \dfrac{e^{-z}}{1 + e^{-z}}$
So:
$$\dfrac{e^{-z}}{(1 + e^{-z})^2} = \dfrac{1}{1 + e^{-z}} \cdot \dfrac{e^{-z}}{1 + e^{-z}} = \sigma(z) \cdot (1 - \sigma(z))$$
$$\boxed{\sigma'(z) = \sigma(z)(1 - \sigma(z))}$$
This is a beautifully elegant result: you can compute the derivative of the sigmoid using just its output value! If you already know $\sigma(z) = 0.8$, then the derivative is $0.8 \times 0.2 = 0.16$. No need to redo the whole computation. This makes neural networks much more efficient.
def sigmoid(z: np.float64 | np.ndarray) -> np.float64 | np.ndarray:
"""The sigmoid activation function.
Squishes any real number into the range (0, 1).
Args:
z: a number or NumPy array
Returns:
sigma(z) = 1 / (1 + e^(-z))
"""
return 1 / (1 + np.exp(-z))
def sigmoid_derivative(z: np.float64 | np.ndarray) -> np.float64 | np.ndarray:
"""The derivative of the sigmoid function, using the elegant formula:
sigma'(z) = sigma(z) * (1 - sigma(z))
Args:
z: a number or NumPy array
Returns:
The derivative of sigmoid at z
"""
# compute sigmoid(z)
s = sigmoid(z)
# use the elegant formula
return s * (1 - s)
# Verify that our analytical derivative matches the numerical one
# Test at several points
test_points = [-2.0, -1.0, 0.0, 1.0, 2.0]
print("Verifying sigmoid derivative at several points:")
print()
print(f"{'z':>6} | {'Analytical':>12} | {'Numerical':>12} | {'Match':>6}")
print("-" * 50)
for z in test_points:
analytical = sigmoid_derivative(np.float64(z))
numerical = numerical_derivative(sigmoid, np.float64(z))
# Check if they agree
match = np.isclose(analytical, numerical)
print(
f"{z:>6.1f} | {analytical:>12.8f} | {numerical:>12.8f} | {str(match):>6}")
print()
print("All values match — our derivation is correct!")
Verifying sigmoid derivative at several points:
z | Analytical | Numerical | Match
--------------------------------------------------
-2.0 | 0.10499359 | 0.10499359 | True
-1.0 | 0.19661193 | 0.19661193 | True
0.0 | 0.25000000 | 0.25000000 | True
1.0 | 0.19661193 | 0.19661193 | True
2.0 | 0.10499359 | 0.10499358 | True
All values match — our derivation is correct!
# Create z values from -6 to 6 (covers the interesting range of sigmoid)
z = np.linspace(-6, 6, 300)
# Create two side-by-side plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Left plot: The sigmoid function itself
ax1.plot(z, sigmoid(z), 'b-', linewidth=2.5)
ax1.set_title('Sigmoid Function $\\sigma(z)$', fontsize=14)
ax1.set_xlabel('z', fontsize=12)
ax1.set_ylabel('$\\sigma(z)$', fontsize=12)
# Add reference lines to show the output is always between 0 and 1
ax1.axhline(y=0.0, color='gray', linestyle=':', alpha=0.4)
ax1.axhline(y=0.5, color='gray', linestyle=':', alpha=0.4)
ax1.axhline(y=1.0, color='gray', linestyle=':', alpha=0.4)
ax1.axvline(x=0, color='gray', linestyle=':', alpha=0.4)
# Annotate key properties
ax1.annotate('Output always\nbetween 0 and 1', xy=(-5, 0.5),
fontsize=10, color='blue',
bbox=dict(boxstyle='round,pad=0.3', facecolor='lightyellow'))
ax1.annotate('$\\sigma(0) = 0.5$', xy=(0, 0.5), xytext=(1.5, 0.3),
arrowprops=dict(arrowstyle='->', color='black'), fontsize=10)
ax1.grid(True, alpha=0.3)
# Right plot: The sigmoid derivative
ax2.plot(z, sigmoid_derivative(z), 'r-', linewidth=2.5)
ax2.set_title("Sigmoid Derivative $\\sigma'(z) = \\sigma(z)(1 - \\sigma(z))$", fontsize=14)
ax2.set_xlabel('z', fontsize=12)
ax2.set_ylabel("$\\sigma'(z)$", fontsize=12)
ax2.axvline(x=0, color='gray', linestyle=':', alpha=0.4)
# Annotate the maximum and the vanishing gradient problem
ax2.annotate('Maximum = 0.25 at z=0\n(steepest change)', xy=(0, 0.25),
xytext=(2.0, 0.20),
arrowprops=dict(arrowstyle='->', color='black'), fontsize=10)
ax2.annotate('Gradient vanishes!\n(near zero for large |z|)',
xy=(4.5, 0.01), fontsize=9, color='red',
bbox=dict(boxstyle='round,pad=0.3', facecolor='lightyellow'))
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Key observations:
- The sigmoid squishes any input into $(0, 1)$ — perfect for probabilities.
- The derivative is largest at $z=0$ $(0.25)$ and shrinks toward $0$ for large $|z|$.
- This 'vanishing gradient' problem is why deep networks sometimes struggle to learn with sigmoid — the gradients become too tiny for the weights to update.
Partial Derivatives — Functions with Multiple Inputs¶
When Functions Have More Than One Input¶
So far, we've looked at functions of a single variable: $f(x)$. But in machine learning, we almost always deal with functions that depend on many variables. A neural network's loss depends on thousands or millions of parameters!
Think of it this way: The temperature in your room depends on at least two things:
- The thermostat setting (you control this)
- The outside temperature (you don't control this)
A partial derivative asks a very specific question: "If I change just one of these inputs — say, I turn up the thermostat by 1 degree — while keeping everything else fixed, how much does the room temperature change?"
The Definition¶
For a function $f(x, y)$ that depends on two variables, the partial derivative with respect to $x$ is:
$$\dfrac{\partial f}{\partial x} = \lim_{h \to 0} \dfrac{f(x + h, \ y) - f(x, \ y)}{h}$$
| Symbol | Meaning |
|---|---|
| $\partial$ | The "partial" symbol (curly d) — means we're changing only ONE variable |
| $\dfrac{\partial f}{\partial x}$ | "The partial derivative of $f$ with respect to $x$" |
| $y$ stays fixed | We treat $y$ as a constant when differentiating with respect to $x$ |
The practical recipe is simple: to find $\dfrac{\partial f}{\partial x}$, just treat every other variable as a constant and use all the same derivative rules you already know!
Complete Worked Example¶
Let $f(x, y) = x^2 y + 3y^2$.
Finding $\dfrac{\partial f}{\partial x}$ (treat $y$ as a constant):
- The term $x^2 y$: since $y$ is just a constant multiplier, $\dfrac{\partial}{\partial x}(x^2 y) = y \cdot 2x = 2xy$
- The term $3y^2$: this has no $x$ in it at all, so it's a constant, and $\dfrac{\partial}{\partial x}(3y^2) = 0$
$$\dfrac{\partial f}{\partial x} = 2xy$$
Finding $\dfrac{\partial f}{\partial y}$ (treat $x$ as a constant):
- The term $x^2 y$: since $x^2$ is just a constant multiplier, $\dfrac{\partial}{\partial y}(x^2 y) = x^2 \cdot 1 = x^2$
- The term $3y^2$: $\dfrac{\partial}{\partial y}(3y^2) = 3 \cdot 2y = 6y$
$$\dfrac{\partial f}{\partial y} = x^2 + 6y$$
Verification at point $(2, 3)$:
$\dfrac{\partial f}{\partial x}\bigg|_{(2,3)} = 2(2)(3) = 12$
This means: if we increase $x$ slightly from 2 (keeping $y = 3$), $f$ increases at a rate of 12.

$\dfrac{\partial f}{\partial y}\bigg|_{(2,3)} = 2^2 + 6(3) = 4 + 18 = 22$
This means: if we increase $y$ slightly from 3 (keeping $x = 2$), $f$ increases at a rate of 22.
from typing import List
def partial_derivative(f: Callable,
var_index: int,
point: List[float] | np.ndarray,
h: float = 1e-7) -> np.float64:
"""Compute the partial derivative of f with respect to the variable
at position var_index, evaluated at the given point.
Args:
f: function that takes a NumPy array and returns a number
var_index: which variable to differentiate with respect to (0, 1, 2, ...)
point: the point at which to evaluate (list or array)
h: tiny nudge size for numerical approximation
Returns:
The approximate partial derivative
"""
point = np.array(point, dtype=float)
# Create copies of the point, nudged forward and backward
point_fwd = point.copy()
point_bwd = point.copy()
# Nudge only the variable we care about
point_fwd[var_index] += h
point_bwd[var_index] -= h
# Central difference formula
return (f(point_fwd) - f(point_bwd)) / (2 * h)
# Define our test function: f(x, y) = x^2 * y + 3 * y^2
def f_test(p):
"""f(x, y) = x^2 * y + 3 * y^2, where p = [x, y]"""
x, y = p[0], p[1]
return x**2 * y + 3 * y**2
# Evaluate at point (2, 3)
point = [2.0, 3.0]
# Compute numerical partial derivatives
df_dx_num = partial_derivative(f_test, 0, point)  # Partial w.r.t. x (var_index=0)
df_dy_num = partial_derivative(f_test, 1, point)  # Partial w.r.t. y (var_index=1)
# Compute analytical partial derivatives for comparison
x_val, y_val = point[0], point[1]
# df/dx = 2xy = 2(2)(3) = 12
df_dx_analytical = 2 * x_val * y_val
# df/dy = x^2 + 6y = 4 + 18 = 22
df_dy_analytical = x_val**2 + 6 * y_val
# Print results
print(f"f(x, y) = x^2 * y + 3y^2 at point ({point[0]}, {point[1]})")
print(f"f({point[0]}, {point[1]}) = {f_test(np.array(point)):.1f}")
print()
print(f"Partial derivative with respect to x:")
print(f" Numerical: {df_dx_num:.6f}")
print(f" Analytical: {df_dx_analytical:.6f} (2xy = 2*2*3 = 12)")
print(f" Match: {np.isclose(df_dx_num, df_dx_analytical)}")
print()
print(f"Partial derivative with respect to y:")
print(f" Numerical: {df_dy_num:.6f}")
print(f" Analytical: {df_dy_analytical:.6f} (x^2 + 6y = 4 + 18 = 22)")
print(f" Match: {np.isclose(df_dy_num, df_dy_analytical)}")
f(x, y) = x^2 * y + 3y^2 at point (2.0, 3.0)
f(2.0, 3.0) = 39.0

Partial derivative with respect to x:
  Numerical: 12.000000
  Analytical: 12.000000 (2xy = 2*2*3 = 12)
  Match: True

Partial derivative with respect to y:
  Numerical: 22.000000
  Analytical: 22.000000 (x^2 + 6y = 4 + 18 = 22)
  Match: True
The Gradient — Direction of Steepest Change¶
From Partial Derivatives to the Gradient¶
Now that we know how to compute partial derivatives (the rate of change along each input direction separately), we can put them all together into a single object: the gradient.
Think of it this way: Imagine you're standing on a hilly landscape and it's foggy. You want to know: "Which direction is the steepest uphill from where I'm standing?" You could test the slope in the north-south direction and the east-west direction separately (those are partial derivatives), and then combine them to find the true steepest direction. That combined result is the gradient.
Definition¶
For a function $f : \mathbb{R}^n \to \mathbb{R}$ (a function that takes $n$ inputs and returns a single number), the gradient is the vector of ALL its partial derivatives:
$$\nabla f(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\[8pt] \dfrac{\partial f}{\partial x_2} \\[4pt] \vdots \\[4pt] \dfrac{\partial f}{\partial x_n} \end{bmatrix}$$
| Symbol | Meaning |
|---|---|
| $\nabla$ | "Nabla" or "del" — the gradient operator (an upside-down triangle) |
| $\nabla f(\mathbf{x})$ | The gradient of $f$ evaluated at the point $\mathbf{x}$ |
| $\mathbb{R}^n$ | The set of all vectors with $n$ real-number components |
| $\mathbf{x}$ | A vector of inputs $[x_1, x_2, \ldots, x_n]^T$ (bold = vector) |
Key Properties of the Gradient¶
- The gradient points in the direction of steepest ascent — the direction where $f$ increases the fastest.
- The negative gradient $-\nabla f$ points in the direction of steepest descent — the direction where $f$ decreases the fastest.
- The magnitude $\|\nabla f(\mathbf{x})\|$ tells you how steep the slope is in that direction.
For machine learning, the second property is the most important: we want to minimize a loss function, so we move in the direction of the negative gradient.
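A quick numeric illustration of that descent property, using the bowl-shaped function $f(x, y) = x^2 + y^2$ (an illustrative choice) whose gradient is $[2x, 2y]$: at any point, the negative gradient points straight back toward the minimum at the origin.

```python
import numpy as np

# f(x, y) = x^2 + y^2 has gradient [2x, 2y] and its minimum at the origin.
point = np.array([3.0, 4.0])
grad = 2 * point          # nabla f = [6, 8]
descent_dir = -grad       # steepest-descent direction

# Compare with the unit vector pointing from `point` toward (0, 0):
toward_min = -point / np.linalg.norm(point)
unit_descent = descent_dir / np.linalg.norm(descent_dir)

print(unit_descent, toward_min)  # identical: descent aims at the minimum
```

For this symmetric bowl the two directions coincide exactly; for general functions the negative gradient is only the *locally* steepest way down.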
The MSE Gradient: A Key Formula for Linear Regression¶
In linear regression, we have:
- A data matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$ — $N$ data points, each with $d$ features
- A target vector $\mathbf{y} \in \mathbb{R}^N$ — the values we want to predict
- A weight vector $\mathbf{w} \in \mathbb{R}^d$ — the parameters we want to learn
The mean squared error (MSE) loss measures how bad our predictions are:
$$L = \dfrac{1}{N} \sum_{i=1}^{N} (y_i - \mathbf{w}^T\mathbf{x}_i)^2$$
| Symbol | Meaning |
|---|---|
| $N$ | Number of data points |
| $y_i$ | True target value for data point $i$ |
| $\mathbf{w}^T\mathbf{x}_i$ | Our prediction for data point $i$ (dot product of weights and features) |
| $(y_i - \mathbf{w}^T\mathbf{x}_i)^2$ | Squared error for data point $i$ |
In matrix notation, the loss can be written compactly as:
$$L = \dfrac{1}{N} (\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w})$$
Taking the gradient with respect to $\mathbf{w}$ (using matrix calculus rules) gives:
$$\nabla_{\mathbf{w}} L = -\dfrac{2}{N} \mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w})$$
Let's break this result down:
| Piece | Meaning |
|---|---|
| $\mathbf{y} - \mathbf{X}\mathbf{w}$ | The residuals — the errors between true values and predictions |
| $\mathbf{X}^T(\cdots)$ | Multiplying by $\mathbf{X}^T$ correlates each feature with the errors |
| $-\dfrac{2}{N}$ | Scaling factor from the derivative of the squared term and the mean |
The gradient tells us exactly how to adjust each weight to reduce the total error.
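For the curious, the matrix-calculus step can be unpacked (a sketch using two standard identities). Expanding the quadratic form:

$$L = \dfrac{1}{N}\left(\mathbf{y}^T\mathbf{y} - 2\,\mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}\right)$$

Applying $\nabla_{\mathbf{w}}(\mathbf{w}^T\mathbf{b}) = \mathbf{b}$ and $\nabla_{\mathbf{w}}(\mathbf{w}^T\mathbf{A}\mathbf{w}) = 2\mathbf{A}\mathbf{w}$ (valid since $\mathbf{X}^T\mathbf{X}$ is symmetric):

$$\nabla_{\mathbf{w}} L = \dfrac{1}{N}\left(-2\,\mathbf{X}^T\mathbf{y} + 2\,\mathbf{X}^T\mathbf{X}\mathbf{w}\right) = -\dfrac{2}{N}\,\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w})$$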
from numpy.typing import NDArray
def compute_gradient(X: NDArray, y: NDArray, w: NDArray) -> NDArray:
"""Compute the analytical gradient of the MSE loss with respect to w.
Formula: grad = -(2/N) * X^T * (y - X*w)
Args:
X: data matrix, shape (N, d)
y: target vector, shape (N,)
w: weight vector, shape (d,)
Returns:
The gradient vector, shape (d,)
"""
n = len(y)
# X*w: our predictions (@ = matrix multiply)
predictions = X @ w
# How far off we are
residuals = y - predictions
# The MSE gradient formula
gradient = -2 / n * (X.T @ residuals)
return gradient
def numerical_gradient(loss_fn: Callable,
w: np.ndarray,
h: float = 1e-7) -> np.ndarray:
"""Compute the gradient numerically using central differences.
This is slower but gives us a way to verify analytical gradients.
Args:
loss_fn: function that takes w and returns a scalar loss
w: the point at which to compute the gradient
h: nudge size
Returns:
The gradient vector (same shape as w)
"""
# Initialize gradient to all zeros
grad = np.zeros_like(w)
# For each weight, nudge it and see how the loss changes
for i in range(len(w)):
w_pos = w.copy()
w_neg = w.copy()
w_pos[i] += h # Nudge weight i forward
w_neg[i] -= h # Nudge weight i backward
# Central difference for this weight
grad[i] = (loss_fn(w_pos) - loss_fn(w_neg)) / (2 * h)
return grad
# Verify that analytical and numerical gradients match
np.random.seed(42) # For reproducible results
# Create a small dataset: 20 data points with a bias column + 1 feature
X = np.column_stack([np.ones(20), np.random.randn(20)]) # Shape: (20, 2)
# True weights that generated the data
w_true = np.array([2.0, 3.0]) # bias=2, slope=3
# Generate targets with a little noise
y = X @ w_true + np.random.randn(20) * 0.1 # y = 2 + 3x + noise
# Start from w = [0, 0] and compute the gradient
w = np.array([0.0, 0.0])
# MSE loss function (takes w, returns a number)
mse_loss = lambda w: np.mean((y - X @ w)**2)
# Compute both gradients
analytical_grad = compute_gradient(X, y, w)
numerical_grad = numerical_gradient(mse_loss, w)
# Display results
print("Gradient verification (should match closely):")
print(f" Analytical gradient: [{analytical_grad[0]:.8f}, {analytical_grad[1]:.8f}]")
print(f" Numerical gradient: [{numerical_grad[0]:.8f}, {numerical_grad[1]:.8f}]")
print(f" Match: {np.allclose(analytical_grad, numerical_grad)}")
Gradient verification (should match closely):
  Analytical gradient: [-2.91901361, -4.72563395]
  Numerical gradient:  [-2.91901359, -4.72563395]
  Match: True
Both components are negative, which means the loss would decrease if we increased both w[0] (bias) and w[1] (slope) — makes sense, since the true values are [2.0, 3.0] and we started at [0.0, 0.0].
# Visualize the gradient on a simple 2D function: f(x, y) = x^2 + y^2
# This is a "bowl" shape — the minimum is at (0, 0).
# Create a grid of x, y values
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X_grid, Y_grid = np.meshgrid(x, y) # Create 2D grid
Z = X_grid**2 + Y_grid**2 # f(x, y) = x^2 + y^2
# Create two side-by-side plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Left plot: Filled contour showing function values
contour = ax1.contourf(X_grid, Y_grid, Z, levels=20, cmap='viridis')
plt.colorbar(contour, ax=ax1, label='$f(x, y)$') # Color legend
ax1.set_title('$f(x, y) = x^2 + y^2$ (the "bowl")', fontsize=13)
ax1.set_xlabel('x', fontsize=12)
ax1.set_ylabel('y', fontsize=12)
ax1.plot(0, 0, 'r*', markersize=15,
label='Minimum at (0,0)') # Mark the minimum
ax1.legend(fontsize=10)
# Right plot: Contour lines with negative gradient arrows
ax2.contour(X_grid, Y_grid, Z, levels=15, cmap='viridis', alpha=0.5)
# Create a coarser grid for the arrows (so they don't overlap)
gx = np.linspace(-2.5, 2.5, 8)
gy = np.linspace(-2.5, 2.5, 8)
GX, GY = np.meshgrid(gx, gy)
# The gradient of f(x,y) = x^2 + y^2 is [2x, 2y]
# The NEGATIVE gradient (descent direction) is [-2x, -2y]
U = -2 * GX # x-component of negative gradient
V = -2 * GY # y-component of negative gradient
# Plot arrows showing descent direction
ax2.quiver(GX, GY, U, V, color='red', alpha=0.7, scale=50)
ax2.set_title('Negative Gradient (Descent Direction)', fontsize=13)
ax2.set_xlabel('x', fontsize=12)
ax2.set_ylabel('y', fontsize=12)
ax2.set_aspect('equal')
ax2.plot(0, 0, 'r*', markersize=15) # Mark the minimum
plt.tight_layout()
plt.show()
Notice how every arrow points toward the minimum at $(0, 0)$. The arrows are longer (stronger gradient) when far from the minimum, and shorter (weaker gradient) when close — the bowl is flatter near the bottom.
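We can check the shrinking arrow lengths numerically. The sketch below (using a hypothetical helper `grad_bowl`, not defined elsewhere in this notebook) evaluates the gradient $\nabla f = [2x, 2y]$ at a point far from the minimum and at a point close to it:

```python
import numpy as np

def grad_bowl(x, y):
    """Gradient of f(x, y) = x^2 + y^2, which is [2x, 2y]."""
    return np.array([2 * x, 2 * y])

# Far from the minimum the gradient is large...
far = grad_bowl(2.0, 2.0)
# ...and near the minimum it shrinks toward zero.
near = grad_bowl(0.1, 0.1)

print("Gradient at (2, 2):    ", far, " norm:", np.linalg.norm(far))
print("Gradient at (0.1, 0.1):", near, " norm:", np.linalg.norm(near))
```

The gradient's norm at $(2, 2)$ is $\sqrt{32} \approx 5.66$, versus $\approx 0.28$ at $(0.1, 0.1)$ — exactly the long-arrows-far, short-arrows-near pattern in the plot.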
Gradient Descent — How Models Learn¶
The Hiking-in-Fog Analogy¶
Imagine you're standing on a mountain, and it's extremely foggy. You want to reach the bottom of the valley, but you can't see the whole landscape — you can only feel the ground right under your feet.
Your strategy: feel which direction is steepest downhill, take a step in that direction, then stop and reassess. Repeat. Eventually, you'll reach the bottom.
That's gradient descent! It's the most fundamental optimization algorithm in machine learning.
The Update Rule¶
At each step, we update the weights using this formula:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \, \nabla_{\mathbf{w}} L$$
Let's carefully define every symbol:
| Symbol | Meaning |
|---|---|
| $\mathbf{w}_t$ | The current weight values at step $t$ (where we are NOW on the mountain) |
| $\nabla_{\mathbf{w}} L$ | The gradient of the loss — points uphill (direction of steepest ascent) |
| $-$ (minus sign) | We go in the opposite direction of the gradient — i.e., downhill |
| $\alpha$ | The learning rate — how big each step is (a small positive number like 0.01) |
| $\mathbf{w}_{t+1}$ | The updated weight values (where we'll be AFTER this step) |
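To make the formula concrete before we tackle the full regression problem, here is a minimal sketch of the update rule on the one-dimensional toy function $f(w) = w^2$, whose derivative is $2w$ (this toy setup is for illustration only, not part of the regression example below):

```python
# Minimize f(w) = w^2 with gradient descent; the gradient is f'(w) = 2w.
w = 5.0      # starting point (w_0)
alpha = 0.1  # learning rate

for t in range(50):
    grad = 2 * w          # gradient of f at the current w
    w = w - alpha * grad  # the update rule: w_{t+1} = w_t - alpha * grad

print(w)  # very close to the minimum at w = 0
```

Each step multiplies $w$ by $(1 - 2\alpha) = 0.8$, so $w$ shrinks geometrically toward the minimum at $0$.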
The Learning Rate: Goldilocks Problem¶
The learning rate $\alpha$ is a crucial hyperparameter (a setting you choose, not something the model learns):
- Too small ($\alpha = 0.00001$): Each step is tiny. Training takes forever. You might get stuck in a flat area and give up before reaching the bottom.
- Too large ($\alpha = 1.0$): Each step overshoots the minimum. Instead of settling into the valley, you bounce wildly from side to side — or even fly off the mountain entirely (the loss increases instead of decreasing).
- Just right ($\alpha = 0.001$ or so): Steady, smooth convergence to the minimum. The loss decreases consistently and levels off.
Think of it this way: if the fog is thick and the terrain is steep, you want to take careful, moderate steps. Too timid and you'll never reach the bottom; too reckless and you'll fall off a cliff.
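You can see the "too large" failure mode directly on a toy function. Here is a minimal sketch (assuming the illustrative function $f(w) = w^2$, with gradient $2w$) where $\alpha = 1.1$ makes every step overshoot so badly that $w$ moves *away* from the minimum:

```python
# Minimize f(w) = w^2, but with a learning rate that's too large.
# Each update multiplies w by (1 - 2*alpha); with alpha = 1.1 that factor
# is -1.2, so |w| GROWS by 20% per step instead of shrinking.
w = 5.0
alpha = 1.1

for t in range(20):
    w = w - alpha * (2 * w)  # the usual update rule, oversized steps

print(abs(w))  # far larger than the starting point — we diverged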
Gradient Descent for Linear Regression — From Scratch¶
# Goal: learn the relationship y = 3x + 2 from noisy data.
# We'll start with random guesses for the slope and intercept,
# then use gradient descent to iteratively improve them.
np.random.seed(42) # Ensures you get the same results every time
# Step 1: Generate synthetic (fake) data
# The TRUE relationship is y = 3x + 2, but we add random noise
# to simulate real-world data that isn't perfectly on a line.
n_samples = 100
x = np.random.rand(n_samples) * 10 # 100 random x values between 0 and 10
noise = np.random.randn(n_samples) * 2 # Random noise (mean=0, std=2)
y = 3 * x + 2 + noise # True relationship + noise
# Step 2: Build the design matrix
# We add a column of 1s for the bias (intercept) term.
# Each row is [1, x_i], so X @ w = w[0]*1 + w[1]*x_i = bias + slope*x
X = np.column_stack([np.ones(n_samples), x]) # Shape: (100, 2)
# Step 3: Initialize weights to zero (could be random)
# w[0] = bias (intercept), w[1] = slope
w = np.array([0.0, 0.0])
# Step 4: Set hyperparameters
alpha = 0.001 # Step size (alpha)
n_iter = 200 # Number of gradient descent steps
# Step 5: Run gradient descent
# Track the loss at each step (for plotting later)
loss_history = []
for i in range(n_iter):
    y_hat = X @ w                         # Predictions: y_hat = X @ w
    loss = np.mean((y - y_hat)**2)        # MSE loss: L = (1/N) * sum((y - y_hat)^2)
    loss_history.append(loss)             # Save for plotting
    gradient = compute_gradient(X, y, w)  # Gradient: grad = -(2/N) * X^T (y - X w)
    w = w - alpha * gradient              # Update: w_new = w_old - alpha * gradient
# Step 6: Print results
print("=" * 50)
print("GRADIENT DESCENT RESULTS")
print("=" * 50)
print(f"After {n_iter} iterations:")
print(f" Learned bias (intercept): {w[0]:.4f} (true value: 2.0000)")
print(f" Learned slope: {w[1]:.4f} (true value: 3.0000)")
print(f" Final MSE loss: {loss_history[-1]:.4f}")
print(f" Initial MSE loss: {loss_history[0]:.4f}")
print(f" Loss reduction: {loss_history[0] - loss_history[-1]:.4f}")
==================================================
GRADIENT DESCENT RESULTS
==================================================
After 200 iterations:
  Learned bias (intercept): 0.6962 (true value: 2.0000)
  Learned slope: 3.1745 (true value: 3.0000)
  Final MSE loss: 4.0808
  Initial MSE loss: 336.6350
  Loss reduction: 332.5541
Visualize the training process: loss curve and fitted line.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Left plot: Loss over iterations
ax1.plot(loss_history, 'b-', linewidth=2)
ax1.set_xlabel('Iteration', fontsize=12)
ax1.set_ylabel('MSE Loss', fontsize=12)
ax1.set_title('Loss Decreases During Training', fontsize=14)
ax1.grid(True, alpha=0.3)
# Annotate the starting and ending loss
ax1.annotate(f'Start: {loss_history[0]:.1f}',
xy=(0, loss_history[0]),
xytext=(40, loss_history[0] * 0.85),
arrowprops=dict(arrowstyle='->', color='red'),
fontsize=10,
color='red')
ax1.annotate(f'End: {loss_history[-1]:.2f}',
xy=(199, loss_history[-1]),
xytext=(120, loss_history[-1] + 20),
arrowprops=dict(arrowstyle='->', color='green'),
fontsize=10,
color='green')
# Right plot: Data points and fitted line
ax2.scatter(x, y, alpha=0.5, s=25, color='steelblue', label='Data points')
# Draw the fitted line
x_line = np.linspace(0, 10, 100)
y_line = w[0] + w[1] * x_line # y = bias + slope * x
ax2.plot(x_line,
y_line,
'r-',
linewidth=2.5,
label=f'Learned: y = {w[1]:.2f}x + {w[0]:.2f}')
# Also draw the true line for comparison
y_true_line = 2 + 3 * x_line
ax2.plot(x_line,
y_true_line,
'g--',
linewidth=1.5,
alpha=0.7,
label='True: y = 3.00x + 2.00')
ax2.set_xlabel('x', fontsize=12)
ax2.set_ylabel('y', fontsize=12)
ax2.set_title('Linear Regression via Gradient Descent', fontsize=14)
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
The loss curve shows rapid improvement at first, then leveling off — this is typical of gradient descent (big gains early, fine-tuning later). The fitted line (red) closely matches the true relationship (green dashed).
Learning Rate Comparison¶
Let's see what happens with different learning rates.
fig, axes = plt.subplots(1, 3, figsize=(16, 4.5))
# Three learning rates to compare
alphas = [0.00001, 0.001, 0.01]
titles = [
'Too Small ($\\alpha$ = 0.00001)', 'Just Right ($\\alpha$ = 0.001)',
'Too Large ($\\alpha$ = 0.01)'
]
colors = ['#e74c3c', '#2ecc71', '#3498db']
for ax, lr, title, color in zip(axes, alphas, titles, colors):
    # Start from the same initial weights each time
    w_temp = np.array([0.0, 0.0])
    losses = []

    # Run 200 iterations of gradient descent
    for _ in range(200):
        loss = np.mean((y - X @ w_temp)**2)    # MSE loss
        losses.append(loss)                    # Save for plotting
        grad = compute_gradient(X, y, w_temp)  # Gradient of the loss
        w_temp = w_temp - lr * grad            # Update: w_new = w_old - lr * grad

    # Plot the loss curve
    ax.plot(losses, linewidth=2, color=color)
    ax.set_title(title, fontsize=12)
    ax.set_xlabel('Iteration', fontsize=11)
    ax.set_ylabel('MSE Loss', fontsize=11)
    ax.grid(True, alpha=0.3)

    # Show final loss value
    ax.text(0.95,
            0.95,
            f'Final loss: {losses[-1]:.2f}',
            transform=ax.transAxes,
            fontsize=9,
            verticalalignment='top',
            horizontalalignment='right',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))
plt.tight_layout()
plt.show()
Observations:
- Too small: Loss barely decreases — learning is way too slow.
- Just right: Smooth, steady decrease — this is what we want!
- Too large: Loss is unstable or diverges — overshooting the minimum.
The Chain Rule in Action: Backpropagation Preview¶
Computational Graphs: Breaking Down Complex Calculations¶
A computational graph breaks down a complex computation into a sequence of simple, elementary steps. Each step performs one basic operation (add, multiply, square, etc.). This is exactly how neural network frameworks like PyTorch and TensorFlow organize their calculations internally.
A Simple Example¶
Let's trace through the loss for a single data point in linear regression:
$$L = (y - (wx + b))^2$$
We can break this into three simple steps:
| Step | Computation | What it does |
|---|---|---|
| 1 | $\hat{y} = wx + b$ | Make a prediction |
| 2 | $e = y - \hat{y}$ | Compute the error (how far off we are) |
| 3 | $L = e^2$ | Compute the loss (squared error, always positive) |
The Forward Pass¶
The forward pass computes the output step by step, left to right (inputs $\to$ output).
For example, with $w = 1$, $x = 2$, $b = 1$, $y = 7$:
- $\hat{y} = 1 \cdot 2 + 1 = 3$
- $e = 7 - 3 = 4$
- $L = 4^2 = 16$
The Backward Pass (Backpropagation)¶
The backward pass computes derivatives step by step, right to left (output $\to$ inputs), using the chain rule.
We want $\dfrac{\partial L}{\partial w}$ — how does the loss change when we tweak $w$?
Using the chain rule, we multiply derivatives along the path from $L$ back to $w$:
$$\dfrac{\partial L}{\partial w} = \dfrac{\partial L}{\partial e} \cdot \dfrac{\partial e}{\partial \hat{y}} \cdot \dfrac{\partial \hat{y}}{\partial w}$$
Let's compute each piece:
| Derivative | Computation | Value (for our example) |
|---|---|---|
| $\dfrac{\partial L}{\partial e} = 2e$ | Derivative of $e^2$ is $2e$ | $2(4) = 8$ |
| $\dfrac{\partial e}{\partial \hat{y}} = -1$ | $e = y - \hat{y}$, so derivative w.r.t. $\hat{y}$ is $-1$ | $-1$ |
| $\dfrac{\partial \hat{y}}{\partial w} = x$ | $\hat{y} = wx + b$, so derivative w.r.t. $w$ is $x$ | $2$ |
Multiplying them all together:
$$\dfrac{\partial L}{\partial w} = 2e \cdot (-1) \cdot x = -2(y - wx - b)x$$
With our numbers: $= 8 \times (-1) \times 2 = -16$.
This tells us: if we increase $w$ slightly, the loss decreases (because the gradient is negative). So we should increase $w$ — which makes sense, since our prediction (3) is too low compared to the target (7).
Similarly for $b$:
$$\dfrac{\partial L}{\partial b} = \dfrac{\partial L}{\partial e} \cdot \dfrac{\partial e}{\partial \hat{y}} \cdot \dfrac{\partial \hat{y}}{\partial b} = 2e \cdot (-1) \cdot 1 = -2(y - wx - b)$$
With our numbers: $= 8 \times (-1) \times 1 = -8$.
The Big Picture¶
This is exactly how PyTorch's autograd and TensorFlow's GradientTape work internally — they build a computational graph during the forward pass and then walk backward through it to compute gradients for every parameter. This procedure, known as backpropagation, scales efficiently to networks with millions (or even billions!) of parameters.
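The bookkeeping behind autograd can be sketched in a few dozen lines of plain Python. The toy `Node` class below is an illustrative micrograd-style sketch, NOT the real PyTorch or TensorFlow API: each node remembers which nodes produced it and the local derivative of its operation, and `backward()` pushes gradients from the loss back to every input via the chain rule.

```python
class Node:
    """A scalar in a computational graph that can backpropagate gradients."""

    def __init__(self, value, parents=()):
        self.value = value      # forward-pass result
        self.grad = 0.0         # dL/d(this node), filled in by backward()
        self.parents = parents  # (parent_node, local_derivative) pairs

    def __add__(self, other):
        other = other if isinstance(other, Node) else Node(other)
        return Node(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __sub__(self, other):
        other = other if isinstance(other, Node) else Node(other)
        return Node(self.value - other.value, [(self, 1.0), (other, -1.0)])

    def __mul__(self, other):
        other = other if isinstance(other, Node) else Node(other)
        return Node(self.value * other.value,
                    [(self, other.value), (other, self.value)])

    def square(self):
        return Node(self.value**2, [(self, 2 * self.value)])

    def backward(self):
        # Seed dL/dL = 1, then walk the graph backward, applying the
        # chain rule: parent.grad += child.grad * local_derivative.
        self.grad = 1.0
        stack = [self]
        while stack:
            node = stack.pop()
            for parent, local_grad in node.parents:
                parent.grad += node.grad * local_grad
                stack.append(parent)

# Rebuild L = (y - (w*x + b))^2 with the same numbers as above
w, b = Node(1.0), Node(1.0)
x, y = Node(2.0), Node(7.0)
L = (y - (w * x + b)).square()  # forward pass builds the graph
L.backward()                     # backward pass fills in gradients

print(L.value, w.grad, b.grad)  # 16.0 -16.0 -8.0
```

Notice that we never wrote a formula for $\partial L / \partial w$ — the graph recorded the local derivative of each elementary operation during the forward pass, and the backward pass multiplied them together, reproducing the hand-computed values $-16$ and $-8$.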
from typing import Tuple
def forward_backward(x: float, y: float, w: float,
                     b: float) -> Tuple[float, float, float]:
    """Demonstrate forward and backward pass through a computational graph for
    the loss: L = (y - (w*x + b))^2.

    Args:
        x: input value
        y: true target value
        w: weight parameter
        b: bias parameter

    Returns:
        L: the loss value (forward pass result)
        dL_dw: partial derivative of L with respect to w
        dL_db: partial derivative of L with respect to b
    """
    # FORWARD PASS
    y_hat = w * x + b  # prediction = weight * input + bias
    e = y - y_hat      # error = true value - prediction
    L = e**2           # loss = error squared

    # BACKWARD PASS
    # We walk backward through the graph, applying the chain rule.
    dL_de = 2 * e    # dL/de = 2e (derivative of e^2)
    de_dy_hat = -1   # de/dy_hat = -1 (derivative of y - y_hat w.r.t. y_hat)
    dy_hat_dw = x    # dy_hat/dw = x (derivative of w*x + b w.r.t. w)
    dy_hat_db = 1    # dy_hat/db = 1 (derivative of w*x + b w.r.t. b)

    # Chain rule: multiply the local derivatives along the path
    dL_dw = dL_de * de_dy_hat * dy_hat_dw  # dL/dw = 2e * (-1) * x
    dL_db = dL_de * de_dy_hat * dy_hat_db  # dL/db = 2e * (-1) * 1
    return L, dL_dw, dL_db
# Test with concrete values
x_val = 2.0 # Input
y_val = 7.0 # True target
w_val = 1.0 # Current weight
b_val = 1.0 # Current bias
# Run forward and backward pass
loss, grad_w, grad_b = forward_backward(x_val, y_val, w_val, b_val)
# Display results
print("=" * 50)
print("FORWARD PASS")
print("=" * 50)
print(f" y_hat = w*x + b = {w_val}*{x_val} + {b_val} = {w_val*x_val + b_val}")
print(f" error = y - y_hat = {y_val} - {w_val*x_val + b_val} = {y_val - (w_val*x_val + b_val)}")
print(f" Loss = error^2 = {y_val - (w_val*x_val + b_val)}^2 = {loss}")
print()
print("=" * 50)
print("BACKWARD PASS (Gradients via Chain Rule)")
print("=" * 50)
print(f" dL/dw = {grad_w}")
print(f" dL/db = {grad_b}")
# Verify numerically
h = 1e-7
# Numerical dL/dw: nudge w and see how loss changes
loss_w_plus = (y_val - ((w_val + h) * x_val + b_val))**2
loss_w_minus = (y_val - ((w_val - h) * x_val + b_val))**2
numerical_dL_dw = (loss_w_plus - loss_w_minus) / (2 * h)
# Numerical dL/db: nudge b and see how loss changes
loss_b_plus = (y_val - (w_val * x_val + (b_val + h)))**2
loss_b_minus = (y_val - (w_val * x_val + (b_val - h)))**2
numerical_dL_db = (loss_b_plus - loss_b_minus) / (2 * h)
print()
print("=" * 50)
print("NUMERICAL VERIFICATION")
print("=" * 50)
print(f" dL/dw analytical: {grad_w:.6f} numerical: {numerical_dL_dw:.6f} match: {np.isclose(grad_w, numerical_dL_dw)}")
print(f" dL/db analytical: {grad_b:.6f} numerical: {numerical_dL_db:.6f} match: {np.isclose(grad_b, numerical_dL_db)}")
==================================================
FORWARD PASS
==================================================
  y_hat = w*x + b = 1.0*2.0 + 1.0 = 3.0
  error = y - y_hat = 7.0 - 3.0 = 4.0
  Loss = error^2 = 4.0^2 = 16.0

==================================================
BACKWARD PASS (Gradients via Chain Rule)
==================================================
  dL/dw = -16.0
  dL/db = -8.0

==================================================
NUMERICAL VERIFICATION
==================================================
  dL/dw analytical: -16.000000  numerical: -16.000000  match: True
  dL/db analytical: -8.000000   numerical: -8.000000  match: True