Gradient Descent¶
To train a model, we need a good way to reduce the model’s loss. An iterative approach is one widely used method for reducing loss, and is as easy and efficient as walking down a hill.
The MNIST Dataset: Our Training Textbook¶
Before a network can recognize a digit it has never seen, it needs to see thousands of examples. We use the MNIST dataset, a collection of $70,000$ handwritten digits.
The data is split into two crucial parts:
- Training Data ($60,000$ images): This is what the network uses to adjust its weights and biases. It’s like the practice problems a student does before an exam.
- Test Data ($10,000$ images): This is the final exam. These images come from a different set of people than the training data, ensuring the network isn't just "memorizing" specific handwriting but is actually learning the general shapes of the numbers.
How the computer "sees" the data: Each $28 \times 28$ image is flattened into a vector $x$ with $784$ components. If the image is a "$6$," we tell the network the "target" or "correct" answer is a vector $y(x)$: $$y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$$ (The $1$ is at the 7th position, representing the digit 6).
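This encoding is easy to sketch in a few lines. Below is a minimal illustration (the names `image` and `one_hot` are ours, not from any library; the image is a random stand-in for a real MNIST digit):

```python
import numpy as np

image = np.random.rand(28, 28)   # stand-in for one 28x28 grayscale digit image
x = image.reshape(784, 1)        # flatten into a 784-component column vector

def one_hot(digit):
    """Return the 10-component target vector y(x) for a digit 0-9."""
    y = np.zeros((10, 1))
    y[digit] = 1.0
    return y

y = one_hot(6)   # the 1 lands at index 6, i.e. the 7th position
```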
The Cost Function: Measuring "Wrongness"¶
To help the network learn, we need a mathematical way to tell it how far off its guess is from the truth. We define a Cost Function (also called a Loss Function), which we'll denote as $C$.
We use the Quadratic Cost (or Mean Squared Error): $$C(w,b) \equiv \frac{1}{2n} \sum_{x} \|y(x) - a\|^2 \tag{1}$$
A Beginner's Breakdown of the Formula:
- $w$ and $b$ are the weights and biases (the "knobs" we can turn).
- $n$ is the number of training images.
- $y(x)$ is the correct answer (the target).
- $a$ is the network's actual output (the guess).
- $\|y(x) - a\|^2$ is simply the square of the distance between the guess and the truth.
Why this works: If the network is perfect, $a$ will equal $y(x)$, and the cost $C$ will be $0$. If the network is totally wrong, $C$ will be a large number. Learning, therefore, is simply the process of finding weights and biases that make $C$ as small as possible.
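Equation (1) translates almost directly into code. Here is a small sketch (the function name `quadratic_cost` is ours) that confirms a perfect guess gives zero cost:

```python
import numpy as np

def quadratic_cost(outputs, targets):
    """C = (1/2n) * sum over examples of ||y(x) - a||^2  (equation 1)."""
    n = len(outputs)
    return sum(np.sum((y - a) ** 2) for a, y in zip(outputs, targets)) / (2 * n)

# A perfect guess (a equals y) costs nothing:
y = np.zeros((10, 1)); y[6] = 1.0
print(quadratic_cost([y.copy()], [y]))   # 0.0
```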
Gradient Descent: Rolling Downhill¶
How do we find that minimum point where the cost is lowest? Imagine the cost function is a physical landscape—a rugged valley.
If you place a ball at a random spot on the side of this valley, it will naturally roll down toward the bottom. This is the intuition behind Gradient Descent.
Why not just use high-school calculus?¶
In a first-year calculus problem, you find the minimum by setting the derivative to zero and solving. But our neural network might have millions of weights. Trying to solve that many equations at once is a computational nightmare. Instead, we use an iterative approach: we take a small step "downhill," recalculate our position, and repeat.
The "Law of Motion"¶
To know which way is "down," we calculate the Gradient, denoted by $\nabla C$. Think of the gradient as a compass that points exactly uphill. Since we want to go down, we move in the opposite direction: $$\Delta v = -\eta \nabla C \tag{2}$$
- $\nabla C$: The gradient (direction of steepest increase).
- $\eta$ (Eta): The Learning Rate. This is a small positive number that determines how big of a "step" we take.
The Update Rule: Every time we take a step, we update our position (our weights and biases) like this: $$v \to v' = v - \eta \nabla C \tag{3}$$
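The update rule in equation (3) is a one-line loop in code. As a minimal sketch, here it is applied to the simple bowl-shaped function $C(v) = v_1^2 + v_2^2$ (our choice of example function, not from the text), whose gradient is $(2v_1, 2v_2)$:

```python
import numpy as np

def grad_C(v):
    """Gradient of C(v) = v1^2 + v2^2."""
    return 2 * v

v = np.array([3.0, -4.0])    # an arbitrary starting position on the "hillside"
eta = 0.1                    # the learning rate
for _ in range(100):
    v = v - eta * grad_C(v)  # the update rule: v -> v - eta * grad C

print(v)   # ends up very close to the minimum at (0, 0)
```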
The "Goldilocks" Learning Rate¶
Choosing the right $\eta$ (learning rate) is the "Art of AI":
- If $\eta$ is too large: The ball might bounce right over the bottom of the valley and end up higher on the other side! The cost can oscillate or even grow, and the network may never settle into the minimum.
- If $\eta$ is too small: The ball moves so slowly that it would take years to reach the bottom.
Intuition Check: Gradient descent doesn't have "momentum" like a real bowling ball. It is more like a cautious hiker in a fog who can only see the ground directly beneath their feet. They feel the slope with their boots and take one small step in the steepest downward direction. By repeating this thousands of times, they eventually reach the valley floor.
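You can watch all three "Goldilocks" regimes with a tiny experiment on the one-dimensional function $C(v) = v^2$ (our toy example; its gradient is $2v$). With $\eta$ too large, each step actually moves the ball further from the minimum; with $\eta$ too small, it barely moves; with a moderate $\eta$, it converges quickly:

```python
def step(v, eta, n_steps):
    """Run n_steps of gradient descent on C(v) = v^2, whose gradient is 2v."""
    for _ in range(n_steps):
        v = v - eta * 2 * v
    return v

print(step(1.0, 1.1, 10))     # too large: |v| grows each step (overshoots the valley)
print(step(1.0, 0.0001, 10))  # too small: barely moves from the starting point
print(step(1.0, 0.3, 10))     # moderate: converges rapidly toward 0
```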
Stochastic Gradient Descent — The "Shortcut" to Learning¶
We’ve seen that Gradient Descent is like a ball rolling down a valley. In a perfect world, the ball calculates the slope of the entire valley before taking a single step. However, when dealing with modern data, that "perfect" approach becomes a massive bottleneck.
The Problem with "Perfect" Physics¶
Some researchers have tried to make the ball act exactly like a real physical object with momentum and friction. While this sounds great, it requires calculating second partial derivatives.
Why is this a problem for beginners? If your network has $1,000,000$ weights (which is small by modern standards), calculating the "slope of the slope" would require roughly a trillion calculations for every single step! It’s computationally exhausting. For this reason, we stick to the "first-order" gradient descent, which only looks at the immediate slope beneath our feet.
The Bottleneck of Big Data¶
Even without the complex physics, we have a data problem. To calculate the true gradient $\nabla C$, the math requires us to:
- Take every single image in the $60,000$-item MNIST set.
- Calculate the error for each one.
- Average them all together.
Only then do we take one tiny step. If we have to do this thousands of times, the network will take forever to learn.
The Solution: Stochastic Gradient Descent (SGD)¶
Think of Stochastic Gradient Descent like political polling. If you want to know how an entire country feels about an issue, you don't ask all 300 million citizens; you ask a representative sample of $1,000$ people.
In SGD, we pick a small, random sample of our training data called a mini-batch, whose size we denote by $m$. We calculate the gradient for just those $10$ or $100$ images and use that as an estimate for the whole group.
$$\nabla C \approx \frac{1}{m} \sum_{j=1}^{m} \nabla C_{X_j} \tag{1}$$
Why this works: While the estimate isn't perfect—it might be a bit "noisy" or "wobbly"—it points in the generally correct direction.
- The Speedup: If we use a mini-batch of $10$ images instead of the full $60,000$, we are estimating the gradient $6,000$ times faster.
- The Trade-off: The ball might zig-zag a bit as it rolls down the hill because each mini-batch is slightly different, but it still reaches the bottom of the valley much sooner.
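The "polling" idea is easy to verify numerically. In this sketch we fabricate a toy array of per-example gradients (purely illustrative data, not real MNIST gradients) and compare the full average against a 10-example mini-batch estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "per-example gradients": one 5-component gradient per training example,
# drawn around a true mean of 1.0.
per_example_grads = rng.normal(loc=1.0, scale=2.0, size=(60_000, 5))

full_gradient = per_example_grads.mean(axis=0)   # the "true" averaged gradient

batch = rng.choice(60_000, size=10, replace=False)
mini_batch_estimate = per_example_grads[batch].mean(axis=0)  # noisy but cheap
```

The mini-batch estimate is computed from 6,000 times fewer examples; it is noisier, but it still points in roughly the right direction.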
The Update Rule in Practice¶
When we apply this to our weights ($w$) and biases ($b$), our learning rules become:
$$w_k \to w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{2}$$ $$b_l \to b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l} \tag{3}$$
Terminology to Remember:
- Mini-batch: The small group of random images (e.g., $10$ images) used for one update.
- Epoch: One full pass through the entire dataset. If we have $60,000$ images and a mini-batch size of $10$, one epoch consists of $6,000$ steps. After one epoch, we shuffle the data and start again.
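The whole shuffle/mini-batch/update cycle can be sketched as a short training loop. This is a simplified outline, not a full network: `grad_for_batch` is a stand-in for whatever computes the averaged per-parameter gradients (in a real network, backpropagation), and the toy problem below, which we made up for illustration, just finds the $w$ minimizing the average of $(w - x)^2$, i.e. the data's mean:

```python
import numpy as np

def sgd(params, data, eta, mini_batch_size, epochs, grad_for_batch):
    """Shuffle the data each epoch, then apply one update per mini-batch
    following equations (2)-(3)."""
    n = len(data)
    for epoch in range(epochs):
        np.random.shuffle(data)
        for k in range(0, n, mini_batch_size):
            batch = data[k:k + mini_batch_size]
            grads = grad_for_batch(params, batch)        # (1/m) * sum of grad C_x
            params = [p - eta * g for p, g in zip(params, grads)]
    return params

# Toy problem: minimize the average of (w - x)^2; the gradient per example
# is 2(w - x), and the minimum sits at the mean of the data (about 5.0 here).
data = list(np.random.default_rng(1).normal(5.0, 1.0, size=100))
g = lambda params, batch: [np.mean([2 * (params[0] - x) for x in batch])]
w, = sgd([0.0], data, eta=0.1, mini_batch_size=10, epochs=5, grad_for_batch=g)
```

With 100 examples and mini-batches of 10, each epoch here is 10 update steps; 5 epochs are enough for $w$ to "zig-zag" its way close to the data's mean.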
Analogy for the Road: Standard Gradient Descent is like a person who refuses to take a step until they have surveyed the entire mountain range. Stochastic Gradient Descent is like a person who looks at the three feet of ground directly in front of them, takes a quick step, and repeats. The second person will be a bit "shaky," but they'll reach the finish line while the first person is still unfolding their map.
The "Dimension" Dilemma — How to Think Like a Mathematician¶
As we wrap up our discussion on Gradient Descent, it is common to feel a sense of mental vertigo. We’ve been using the analogy of a ball rolling down a 3D valley, but in a real neural network, we aren't dealing with two or three variables ($v_1, v_2$). We are dealing with millions of weights and biases.
This often leads to a common panic for beginners: "How am I supposed to visualize a million-dimensional surface?"
The Secret of High-Dimensional Thinking¶
If you feel like you can’t "see" four dimensions—let alone five million—don't worry. You aren't missing a "superpower" that professional mathematicians have. The truth is, almost no one can truly visualize a four-dimensional object in their mind's eye.
Instead of trying to "see" the space, experts use a different strategy: They build a library of representations.
Algebra as a Visual Substitute¶
Think back to our "algebraic trick" from earlier. We didn't need to see the valley to know how to move. We simply looked at the sign of the gradient and moved in the opposite direction.
- In 3D: You look at the slope and step down.
- In 5,000,000D: You calculate the gradient vector $\nabla C$ and subtract a fraction of it from your current position.
The math works exactly the same way regardless of the number of dimensions. Whether you are in a backyard or a hyper-dimensional galaxy, the rule $v \to v - \eta \nabla C$ always points "down."
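The dimension-independence is visible in code: the same line implements the update whether $v$ has two components or millions. As a sketch, here is gradient descent on $\|v\|^2$ (whose gradient is $2v$ in any dimension, an example of our choosing) run at both scales:

```python
import numpy as np

def descend(v, eta=0.1, steps=100):
    """Apply v -> v - eta * grad(C) for C(v) = ||v||^2; same code, any dimension."""
    for _ in range(steps):
        v = v - eta * 2 * v   # the gradient of ||v||^2 is 2v
    return v

small = descend(np.ones(2))          # a 2-D "backyard"
big = descend(np.ones(5_000_000))    # a 5,000,000-D "galaxy": identical rule
```

Both runs converge to the origin; NumPy never asks how many dimensions you meant.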
Developing Your Mental Library¶
Thinking in high dimensions is less about "vision" and more about analogies and patterns.
- Slicing: Imagine taking a 3D object and looking at it one 2D "slice" at a time. Mathematicians do the same with high dimensions by fixing all variables except two and seeing how the function behaves.
- Projection: We can "squash" high-dimensional data down into a 2D plot to see clusters or trends, even if the individual points technically have thousands of coordinates.
The Takeaway: Don't let the "millions of dimensions" scare you. The beauty of the gradient descent algorithm is that it treats every dimension with the same simple logic. You don't need to visualize the mountain to know which way is downhill; you just need to feel the slope beneath your feet.