Neural Networks¶
Neural networks are a family of model architectures designed to find nonlinear patterns in data. We'll see what this means by solving a problem of recognizing handwritten digits with a neural network.
Recognizing Handwritten Digits¶
Consider the following sequence of handwritten digits:
Most people effortlessly recognize those digits as $504192$. In each hemisphere of our brain, humans have a primary visual cortex, also known as $V_1$, containing $140$ million neurons, with tens of billions of connections between them. And yet human vision involves not just $V_1$, but an entire series of visual cortices - $V_2$, $V_3$, $V_4$, and $V_5$ - doing progressively more complex image processing. We carry in our heads a supercomputer, tuned by evolution over hundreds of millions of years, and superbly adapted to understand the visual world.
The difficulty of visual pattern recognition becomes apparent if you attempt to write a computer program to recognize digits like those above. What seems easy when we do it ourselves suddenly becomes extremely difficult. Simple intuitions about how we recognize shapes - "a $9$ has a loop at the top, and a vertical stroke in the bottom right" - turn out to be not so simple to express algorithmically. When you try to make such rules precise, you quickly get lost in a morass of exceptions and special cases.
Neural networks approach the problem in a different way. The idea is to take a large number of handwritten digits, known as training examples,
and then develop a system which can learn from those training examples. In other words, the neural network uses the examples to automatically infer rules for recognizing handwritten digits. Furthermore, by increasing the number of training examples, the network can learn more about handwriting and so improve its accuracy, although that may require millions or even billions of training examples.
Perceptron¶
To understand a neural network we first need to understand what a neuron is. The perceptron is one type of artificial neuron. Perceptrons were developed in the $1950\text{s}$ and $1960\text{s}$ by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts.
So how do perceptrons work? A perceptron takes several binary inputs, $x_1, x_2, \dots$, and produces a single binary output:
In the example shown the perceptron has three inputs, $x_1, x_2, x_3$. In general it could have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He introduced weights, $w_1, w_2, \dots$, real numbers expressing the importance of the respective inputs to the output. The neuron's output, $0$ or $1$, is determined by whether the weighted sum $\sum_{j} w_j x_j$ is less than or greater than some threshold value. Just like the weights, the threshold is a real number which is a parameter of the neuron. To put it in more precise algebraic terms:
$$ \text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \leq \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases} $$
That's all there is to how a perceptron works!
In this network, the first column of perceptrons - what we'll call the first layer of perceptrons - is making three very simple decisions, by weighing the input evidence. What about the perceptrons in the second layer? Each of those perceptrons is making a decision by weighing up the results from the first layer of decision-making. In this way a perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer. And even more complex decisions can be made by the perceptron in the third layer. In this way, a many-layer network of perceptrons can engage in sophisticated decision making.
When I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons.
Let's simplify the way we describe perceptrons. The condition $\sum_{j} w_j x_j > \text{threshold}$ is cumbersome, and we can make two notational changes to simplify it. The first change is to write $\sum_{j} w_j x_j$ as a dot product, $w \cdot x \equiv \sum_{j} w_j x_j$, where $w$ and $x$ are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality, and to replace it by what's known as the perceptron's bias, $b \equiv -\text{threshold}$. Using the bias instead of the threshold, the perceptron rule can be rewritten:
$$ \text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} $$
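In vector form the rule becomes a one-line check. Here is a minimal sketch using NumPy (the `perceptron` helper and the example weights are our own, chosen for illustration):

```python
import numpy as np

def perceptron(w: np.ndarray, x: np.ndarray, b: float) -> int:
    """Returns 1 if w . x + b > 0, otherwise 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# A neuron that fires only when both inputs are on
w = np.array([2.0, 2.0])
print(perceptron(w, np.array([1, 1]), -3.0))  # 1, since 2 + 2 - 3 > 0
print(perceptron(w, np.array([1, 0]), -3.0))  # 0, since 2 - 3 <= 0
```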
I've described perceptrons as a method for weighing evidence to make decisions. Another way perceptrons can be used is to compute the elementary logical functions we usually think of as underlying computation, functions such as AND, OR, and NAND.
For example, suppose we have a perceptron with two inputs, each with weight $-2$, and an overall bias of $3$. Here's our perceptron:
- Input $x_1$: weight $-2$
- Input $x_2$: weight $-2$
- Bias $b$: $3$
Then we see that input $00$ produces output $1$, since $(-2)*0 + (-2)*0 + 3 = 3$ is positive. Here, I've introduced the $*$ symbol to make the multiplications explicit. Similar calculations show that the inputs $01$ and $10$ produce output $1$. But the input $11$ produces output $0$, since $(-2)*1 + (-2)*1 + 3 = -1$ is negative. And so our perceptron implements a NAND gate!
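We can verify the full NAND truth table directly, using the weights and bias from the text (the `nand_perceptron` helper is ours):

```python
def nand_perceptron(x1: int, x2: int) -> int:
    # Weights -2, -2 and bias 3, as in the text
    return 1 if (-2) * x1 + (-2) * x2 + 3 > 0 else 0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", nand_perceptron(x1, x2))
# Only the input (1, 1) produces 0 - exactly a NAND gate
```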
Input Perceptrons¶
There is a special type of perceptron called an input perceptron, which is drawn with an output but no inputs. This depiction
is a shorthand. It doesn't actually mean a perceptron with no inputs. To see this, suppose we did have a perceptron with no inputs. Then the weighted sum $\sum_{j} w_j x_j$ would always be zero, and so the perceptron would output $1$ if $b > 0$, and $0$ if $b \leq 0$.
That is, the perceptron would simply output a fixed value, not the desired value ($x_1$, in the example above). It's better to think of the input perceptrons as not really being perceptrons at all, but rather special units which are simply defined to output the desired values, $x_1, x_2, \dots$.
Sigmoid Neurons¶
Suppose we have a network of perceptrons that we'd like to use to learn to solve some problem. For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we'd like is for this small change in weight to cause only a small corresponding change in the output from the network. As we'll see in a moment, this property will make learning possible. Schematically, here's what we want (obviously this network is too simple to do handwriting recognition!):
If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an "$8$" when it should be a "$9$". We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a "$9$". And then we'd repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.
The problem is that this isn't what happens when our network contains perceptrons. In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from $0$ to $1$. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way. So while your "$9$" might now be classified correctly, the behaviour of the network on all the other images is likely to have completely changed in some hard-to-control way. That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behaviour. Perhaps there's some clever way of getting around this problem. But it's not immediately obvious how we can get a network of perceptrons to learn.
We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn.
Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we depicted perceptrons:
Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2, \dots$. But instead of being just $0$ or $1$, these inputs can also take on any values between $0$ and $1$. So, for instance, $0.638\dots$ is a valid input for a sigmoid neuron. Also just like a perceptron, the sigmoid neuron has weights for each input, $w_1, w_2, \dots$, and an overall bias, $b$. But the output is not $0$ or $1$. Instead, it's $\sigma(w \cdot x + b)$, where $\sigma$ is called the **sigmoid function**:
Incidentally, $\sigma$ is sometimes called the logistic function, and this new class of neurons is sometimes called logistic neurons. It's useful to remember this terminology, since these terms are used by many people working with neural nets. However, we'll stick with the sigmoid terminology.
The sigmoid function is defined by: $$\sigma(z) \equiv \frac{1}{1 + e^{-z}} \tag{3}$$
To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1, x_2, \dots$, weights $w_1, w_2, \dots$, and bias $b$ is: $$\frac{1}{1 + \exp(-\sum_{j} w_j x_j - b)} \tag{4}$$
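Equation (4) translates directly into code. In this sketch the particular weights, inputs, and the `sigmoid_neuron` helper are our own, chosen just for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron(w: np.ndarray, x: np.ndarray, b: float) -> float:
    """Output of a sigmoid neuron: sigma(w . x + b), per equation (4)."""
    return sigmoid(np.dot(w, x) + b)

w = np.array([0.7, -0.3])
x = np.array([0.638, 0.2])
print(sigmoid_neuron(w, x, 0.1))  # a value strictly between 0 and 1
```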
At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of the sigmoid function may seem opaque and forbidding if you're not already familiar with it. In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.
To understand the similarity to the perceptron model, suppose $z \equiv w \cdot x + b$ is a large positive number. Then $e^{-z} \approx 0$ and so $\sigma(z) \approx 1$. In other words, when $z = w \cdot x + b$ is large and positive, the output from the sigmoid neuron is approximately $1$, just as it would have been for a perceptron. Suppose on the other hand that $z = w \cdot x + b$ is very negative. Then $e^{-z} \to \infty$, and $\sigma(z) \approx 0$. So when $z = w \cdot x + b$ is very negative, the behaviour of a sigmoid neuron also closely approximates a perceptron. It's only when $w \cdot x + b$ is of modest size that there's much deviation from the perceptron model.
What about the algebraic form of $\sigma$? How can we understand that? In fact, the exact form of $\sigma$ isn't so important — what really matters is the shape of the function when plotted. Here's the shape:
This shape is a smoothed out version of a step function:
If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be $1$ or $0$ depending on whether $w \cdot x + b$ was positive or negative.
Actually, when $w \cdot x + b = 0$ the perceptron outputs $0$, while the step function outputs $1$. So, strictly speaking, we'd need to modify the step function at that one point. But you get the idea.
By using the actual $\sigma$ function we get, as already implied above, a smoothed out perceptron. Indeed, it's the smoothness of the $\sigma$ function that is the crucial fact, not its detailed form. The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta\text{output}$ in the output from the neuron. In fact, calculus tells us that $\Delta\text{output}$ is well approximated by:
$$\Delta\text{output} \approx \sum_{j} \frac{\partial\text{output}}{\partial w_j} \Delta w_j + \frac{\partial\text{output}}{\partial b} \Delta b \tag{5}$$
where the sum is over all the weights, $w_j$, and $\partial\text{output}/\partial w_j$ and $\partial\text{output}/\partial b$ denote partial derivatives of the $\text{output}$ with respect to $w_j$ and $b$, respectively. Don't panic if you're not comfortable with partial derivatives! While the expression above looks complicated, it's actually saying something very simple (and which is very good news): $\Delta\text{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.
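Equation (5) can be sanity-checked numerically: nudge one weight, and compare the true change in output against the linear approximation. The particular weights, inputs, and step size below are arbitrary choices for the check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

w, x, b = np.array([0.5, -1.2]), np.array([0.8, 0.3]), 0.1
z = np.dot(w, x) + b

# Actual change in output when w_0 is nudged by a small amount
dw = 1e-3
actual = sigmoid(np.dot(w + np.array([dw, 0.0]), x) + b) - sigmoid(z)

# Linear approximation from equation (5):
# (d output / d w_0) * delta_w_0 = sigma'(z) * x_0 * delta_w_0
approx = sigmoid_prime(z) * x[0] * dw

print(actual, approx)  # the two agree to several decimal places
```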
Anatomy of a Neural Network¶
Network Topology: Understanding the Layers¶
When we organize neurons into a functional system, we group them into Layers. Each layer serves a specific purpose in the pipeline of data processing.
The Input Layer¶
The leftmost layer is the Input Layer. Its neurons—known as input neurons—don't perform calculations; they simply hold the raw data we want the network to process.
- Design Example: If you are analyzing a $64 \times 64$ grayscale image, you would need $4,096$ input neurons ($64 \times 64 = 4,096$). Each neuron represents the intensity of one pixel, scaled between $0$ and $1$.
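As a sketch of that design (the array below is random stand-in data, not a real image), the flattening and scaling look like:

```python
import numpy as np

# Hypothetical 64x64 grayscale image with raw pixel values 0-255
image = np.random.randint(0, 256, size=(64, 64))

# Flatten to a (4096, 1) column of input activations scaled to [0, 1]
a = image.reshape(4096, 1) / 255.0
print(a.shape)  # (4096, 1)
```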
The Output Layer¶
The rightmost layer is the Output Layer. This is where the network provides its final answer.
- Decision Making: In a "9-detector" network, a single output neuron might suffice. An output $> 0.5$ could indicate the image is a "$9$," while an output $\leq 0.5$ suggests it is not.
The Hidden Layers¶
Any layer sitting between the input and output is a Hidden Layer. Despite the name, there is nothing "mysterious" about them—they are simply internal processing stages that are neither the raw input nor the final result.
- Architecture: While some networks have a single hidden layer, others use multiple layers to capture more complex patterns. These are sometimes historically referred to as Multilayer Perceptrons (MLPs), even when they are actually built using sigmoid neurons.
Designing for Performance¶
While the input and output layers are usually determined by your specific problem (like image size or number of categories), designing Hidden Layers is an art form. Researchers use various "heuristics" (rules of thumb) to decide:
- Depth: How many hidden layers are needed to capture the logic?
- Breadth: How many neurons should be in each layer?
- Trade-offs: Balancing the complexity of the layers against the time and computational power required for training.
Flow of Information: Feedforward vs. Recurrent¶
Feedforward Networks¶
The most common architecture is the Feedforward Neural Network. In this model, information travels in one direction: from the input, through the hidden layers, to the output. There are no loops.
- Why no loops? If an output fed back into its own input instantaneously, a neuron's input would depend on its own output, leaving the activations circularly defined rather than computable in a single pass.
Recurrent Neural Networks (RNNs)¶
In contrast, Recurrent Neural Networks allow for feedback loops. These models are inspired by the temporal nature of the human brain:
- Time-Delayed Firing: A neuron's output can affect its own input, but only at a later point in time.
- Use Case: This makes RNNs powerful for processing sequences, like speech or text, where the order of information matters.
Our Focus: While RNNs are a fascinating area of study, the majority of modern foundational learning is built on Feedforward Networks. We will concentrate on these widely-used models as we move into training and optimization.
A Simple Digit Classifier Network¶
Recognizing digits is a "Grand Challenge" because, while it feels instantaneous to us, it is a multi-step miracle of biological processing. To replicate this in a machine, we must first split the problem into two distinct sub-tasks.
The Segmentation Problem¶
Before a computer can tell you what a number is, it has to find it. If you have a string of digits—like a zip code on an envelope—the computer first needs to "cut" that image into individual boxes, each containing exactly one digit.
Interestingly, we won't spend much time coding this part. Why? Because the Segmentation Problem is actually much easier to solve once you have a world-class Classification engine. Imagine a program that tries a hundred different ways to cut an image; it keeps the version where the classifier says, "I am $99\%$ sure these are all individual numbers," and throws away the version where the classifier is confused. Since the classifier is the "brain" of the operation, we will focus our energy there.
Designing the Network Architecture¶
To identify individual digits, we will use a classic Three-Layer Feedforward Network.
The Input Layer ($784$ Neurons): Our raw data comes from scanned digits that are $28 \times 28$ pixels. If you multiply those dimensions, you get $784$ total pixels. Each pixel becomes a neuron. We scale the "ink" intensity: $0.0$ for a clean white pixel, $1.0$ for solid black, and anything in between for those soft gray edges.
The Hidden Layer ($n$ Neurons): This is the "black box" where the magic happens. We’ll call the number of neurons here $n$. We might start with $n=15$. This layer’s job is to look for "clues"—not pixels, but features.
The Output Layer ($10$ Neurons): We want the network to pick a category from $0$ to $9$. We assign one neuron to each digit. If the neuron at index $6$ has the highest activation (closest to $1$), the network is shouting, "I think this is a $6$!"
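Picking the most active output neuron is exactly NumPy's `argmax`; the activation values below are made up for illustration:

```python
import numpy as np

# Hypothetical output activations for one image; index 6 is the strongest
output_activations = np.array(
    [0.02, 0.01, 0.10, 0.05, 0.03, 0.04, 0.91, 0.02, 0.08, 0.20])
print(np.argmax(output_activations))  # 6 -> the network "thinks" it saw a 6
```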
Why Not Use Binary? (The Efficiency Paradox)¶
A common question arises: "Since $4$ neurons can represent $16$ different states ($2^4 = 16$), couldn't we just use $4$ output neurons to represent the digits $0$ through $9$?" It seems more efficient, right?
The answer lies in how the network actually "thinks" through Feature Detection.
Imagine the hidden layer is looking for specific geometric components. One neuron might fire if it sees a circular loop in the top half of the image. Another fires if it sees a horizontal stroke at the top.
- The 10-Output Logic: If the "top loop" neuron fires AND the "right-side vertical stroke" neuron fires, the output neuron for $9$ has a very simple job: it just weighs that evidence and turns "on."
- The 4-Output Logic: If we used only $4$ neurons, the first neuron would have to decide something abstract, like "Is the most significant bit of this digit's binary code a $1$?"
There is no "shape" for a binary bit. By using $10$ neurons, we align the network's architecture with the physical geometry of the problem. It turns out that a network that can "see" a loop as part of a $9$ learns much faster and more accurately than one trying to calculate binary code from scratch.
Note: Before moving on to the next section, you should be familiar with Gradient Descent.
Implementing our Network to Classify Digits¶
Alright, let's write a program that learns how to recognize handwritten digits using stochastic gradient descent and the MNIST training data. We'll do this with a short Python program—just 74 lines of core code!
import random
from typing import List, Tuple, Optional
import numpy as np
class Network(object):
"""A class implementing the stochastic gradient descent learning algorithm.
This class represents a feedforward neural network. It provides methods
for initialization, feedforward passes, and training via mini-batch
stochastic gradient descent.
"""
def __init__(self, sizes: List[int]):
"""Initializes the neural network with random weights and biases.
The weights and biases are initialized using a Gaussian distribution
(mean 0, variance 1). Note that we don't set biases for the first layer,
as it serves solely as the input layer.
Args:
sizes: A list containing the number of neurons in the respective layers.
For example, [784, 30, 10] defines a 3-layer network where the input
layer has 784 neurons, the hidden layer has 30, and the output layer
has 10.
"""
self.num_layers = len(sizes)
self.sizes = sizes
# Biases are stored as a list of column vectors for each layer except the first
self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
# Weights are stored as a list of matrices connecting adjacent layers
# weights[0] connects layer 0 to layer 1, etc.
self.weights = [
np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])
]
What’s happening here? We use np.random.randn to give the network a random starting point. We don't set biases for the first layer because the first layer is just the input (the pixels); it doesn't "calculate" anything itself.
Notice the weight matrix dimensions: if layer 1 has $x$ neurons and layer 2 has $y$ neurons, the weight matrix is $y \times x$. This allows us to use the elegant matrix equation: $$a' = \sigma(wa + b)$$
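We can confirm those dimensions with a quick sketch, using a hypothetical 784-to-30 layer pair as an example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A layer with x = 784 inputs and y = 30 neurons: w is (30, 784), b is (30, 1)
w = np.random.randn(30, 784)
b = np.random.randn(30, 1)
a = np.random.rand(784, 1)           # column vector of input activations

a_prime = sigmoid(np.dot(w, a) + b)  # a' = sigma(w a + b)
print(a_prime.shape)  # (30, 1)
```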
The Feedforward Mechanism¶
The feedforward method takes an input $a$ and calculates the output by moving through each layer.
def sigmoid(z: np.ndarray) -> np.ndarray:
"""Computes the sigmoid function for an input scalar or array.
The sigmoid function is defined as 1 / (1 + exp(-z)). When z is a
numpy array, the operation is performed element-wise.
Args:
z: A scalar value or a numpy array representing the input.
Returns:
A numpy array or scalar containing the sigmoid activation values.
"""
return 1.0 / (1.0 + np.exp(-z))
def feedforward(self, a: np.ndarray) -> np.ndarray:
"""Calculates the output of the network for a given input.
This method applies the activation function layer-by-layer to transform
the input activations into the final output activations.
Args:
a: An (n, 1) numpy ndarray representing the input activations (pixels).
Returns:
An (m, 1) numpy ndarray representing the activations of the output layer.
"""
for b, w in zip(self.biases, self.weights):
# Apply the weight matrix, add bias, and pass through the sigmoid
a = sigmoid(np.dot(w, a) + b)
return a
Training with Stochastic Gradient Descent¶
The SGD method is the heart of the learning process. It shuffles the data, breaks it into mini-batches, and updates the network's internal "knobs" (weights and biases).
def SGD(self,
training_data: List[Tuple[np.ndarray, np.ndarray]],
epochs: int,
mini_batch_size: int,
eta: float,
test_data: Optional[List[Tuple[np.ndarray, int]]] = None) -> None:
"""Trains the neural network using mini-batch stochastic gradient descent.
The algorithm shuffles the training data each epoch to ensure variety in
mini-batches, then updates weights and biases for each batch.
Args:
training_data: A list of tuples (x, y) where x is the input image and
y is the desired output vector.
epochs: The total number of complete passes over the training dataset.
mini_batch_size: The number of training examples used to estimate the
gradient in each step.
eta: The learning rate, which controls the size of the weight updates.
test_data: An optional list of tuples (x, y) for evaluating performance
after each epoch. If provided, progress is printed to the console.
"""
if test_data:
n_test = len(test_data)
n = len(training_data)
for j in range(epochs):
# Shuffle the data to prevent the network from learning order patterns
random.shuffle(training_data)
# Partition the shuffled data into manageable chunks
mini_batches = [
training_data[k:k + mini_batch_size]
for k in range(0, n, mini_batch_size)
]
for mini_batch in mini_batches:
self.update_mini_batch(mini_batch, eta)
if test_data:
print("Epoch {0}: {1} / {2}".format(j, self.evaluate(test_data), n_test))
else:
print("Epoch {0} complete".format(j))
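The mini-batch partitioning comprehension is easy to check on toy data, where the "training data" is just a list of integers:

```python
# The same slicing pattern used in SGD, on toy data
n = 10
training_data = list(range(n))
mini_batch_size = 3
mini_batches = [
    training_data[k:k + mini_batch_size] for k in range(0, n, mini_batch_size)
]
print(mini_batches)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

Note that the final mini-batch may be smaller than the rest when `mini_batch_size` does not divide the dataset size evenly.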
To update a mini-batch, we use an algorithm called Backpropagation (which we will study in detail in the next chapter). It calculates the gradient (the direction of "steepest descent") for us.
def update_mini_batch(self, mini_batch: List[Tuple[np.ndarray, np.ndarray]],
eta: float) -> None:
"""Updates weights and biases by applying gradient descent to a mini-batch.
This method computes the gradient for each example in the mini-batch
using backpropagation, then updates the network's parameters once per batch.
Args:
mini_batch: A list of tuples (x, y) representing a subset of the training
data.
eta: The learning rate, used to scale the gradient updates.
"""
# Initialize gradient accumulators as zero matrices
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]
for x, y in mini_batch:
# Use backpropagation to find the gradient for a single example
delta_nabla_b, delta_nabla_w = self.backprop(x, y)
# Sum up the gradients over the mini-batch
nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
# Update weights and biases using the averaged gradient and learning rate
self.weights = [
w - (eta / len(mini_batch)) * nw for w, nw in zip(self.weights, nabla_w)
]
self.biases = [
b - (eta / len(mini_batch)) * nb for b, nb in zip(self.biases, nabla_b)
]
The Full Listing¶
Here is the complete implementation of the Network class.
import random
from typing import List, Tuple, Optional
import numpy as np
class Network(object):
"""A class implementing stochastic gradient descent for a neural network.
This class implements the core logic for a feedforward neural network,
including initialization, feedforward passes, and training via
backpropagation.
"""
def __init__(self, sizes: List[int]):
"""Initializes the network architecture with random weights and biases.
Args:
sizes: List of integers representing neurons per layer.
"""
self.num_layers = len(sizes)
self.sizes = sizes
self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
self.weights = [
np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])
]
def feedforward(self, a: np.ndarray) -> np.ndarray:
"""Computes output activations for a given input vector.
Args:
a: Input activation vector.
Returns:
Output activation vector.
"""
for b, w in zip(self.biases, self.weights):
a = sigmoid(np.dot(w, a) + b)
return a
def SGD(self,
training_data: List[Tuple[np.ndarray, np.ndarray]],
epochs: int,
mini_batch_size: int,
eta: float,
test_data: Optional[List[Tuple[np.ndarray, int]]] = None) -> None:
"""Trains the network using mini-batch stochastic gradient descent.
Args:
training_data: List of (input, target) pairs.
epochs: Total training passes through the data.
mini_batch_size: Number of samples per weight update.
eta: Learning rate.
test_data: Data for post-epoch evaluation.
"""
if test_data:
n_test = len(test_data)
n = len(training_data)
for j in range(epochs):
random.shuffle(training_data)
mini_batches = [
training_data[k:k + mini_batch_size]
for k in range(0, n, mini_batch_size)
]
for mini_batch in mini_batches:
self.update_mini_batch(mini_batch, eta)
if test_data:
print("Epoch {0}: {1} / {2}".format(j, self.evaluate(test_data),
n_test))
else:
print("Epoch {0} complete".format(j))
def update_mini_batch(self, mini_batch: List[Tuple[np.ndarray, np.ndarray]],
eta: float) -> None:
"""Updates parameters based on gradient descent over a mini-batch.
Args:
mini_batch: Subset of training data.
eta: Learning rate.
"""
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]
for x, y in mini_batch:
delta_nabla_b, delta_nabla_w = self.backprop(x, y)
nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
self.weights = [
w - (eta / len(mini_batch)) * nw
for w, nw in zip(self.weights, nabla_w)
]
self.biases = [
b - (eta / len(mini_batch)) * nb for b, nb in zip(self.biases, nabla_b)
]
def backprop(self, x: np.ndarray,
y: np.ndarray) -> Tuple[List[np.ndarray], List[np.ndarray]]:
"""Calculates the gradient of the cost function for a single example.
Args:
x: Input vector.
y: Target activation vector.
Returns:
A tuple containing lists of gradient vectors/matrices for biases/weights.
"""
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]
# Feedforward pass to store all intermediate activations
activation = x
activations = [x] # Store activations layer by layer
zs = [] # Store weighted inputs layer by layer
for b, w in zip(self.biases, self.weights):
z = np.dot(w, activation) + b
zs.append(z)
activation = sigmoid(z)
activations.append(activation)
# Backward pass to compute errors and gradients
# Output error (delta) depends on cost derivative and activation slope
delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])
nabla_b[-1] = delta
nabla_w[-1] = np.dot(delta, activations[-2].transpose())
# Loop backward through layers to propagate the error
for l in range(2, self.num_layers):
z = zs[-l]
sp = sigmoid_prime(z)
delta = np.dot(self.weights[-l + 1].transpose(), delta) * sp
nabla_b[-l] = delta
nabla_w[-l] = np.dot(delta, activations[-l - 1].transpose())
return (nabla_b, nabla_w)
def evaluate(self, test_data: List[Tuple[np.ndarray, int]]) -> int:
"""Evaluates the network accuracy on a test dataset.
Args:
test_data: List of (input, label) pairs.
Returns:
The count of correctly classified inputs.
"""
test_results = [(np.argmax(self.feedforward(x)), y) for (x, y) in test_data]
return sum(int(x == y) for (x, y) in test_results)
def cost_derivative(self, output_activations: np.ndarray,
y: np.ndarray) -> np.ndarray:
"""Calculates the vector of partial derivatives for the quadratic cost.
Args:
output_activations: Predicted activations.
y: Target activations.
Returns:
Difference vector between prediction and target.
"""
return (output_activations - y)
def sigmoid(z: np.ndarray) -> np.ndarray:
"""The sigmoid activation function."""
return 1.0 / (1.0 + np.exp(-z))
def sigmoid_prime(z: np.ndarray) -> np.ndarray:
"""The derivative of the sigmoid activation function."""
return sigmoid(z) * (1 - sigmoid(z))
Loading the MNIST Data¶
Before we can test our network, we need a way to load the MNIST data into a format our Python code can understand. Since the raw data is stored in a compressed format, we use a helper module called mnist_loader.py.
The primary goal of this loader is to transform the pixel data and labels into the specific shapes—like $(784, 1)$ vectors—that our Network class expects for matrix multiplication.
Data Formatting and Wrappers¶
The loader provides two main functions:
- load_data(): This retrieves the raw training, validation, and test sets.
- load_data_wrapper(): This is the function we actually use. It reformats the raw data. For example, it converts a training label (like the number $5$) into a one-hot encoded vector, where the $5^{th}$ position is $1.0$ and all others are $0.0$.
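For instance, the one-hot encoding of the label $5$ is a $(10, 1)$ vector that can be built by hand like this:

```python
import numpy as np

# One-hot encoding of the label 5: a (10, 1) vector with 1.0 in position 5
e = np.zeros((10, 1))
e[5] = 1.0
print(e.ravel())  # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
```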
import gzip
import os
import pickle
from typing import Any, List, Tuple
import numpy as np
import requests
def download_mnist(filename: str = "mnist.pkl.gz") -> None:
"""Downloads the MNIST dataset if it does not exist locally.
This ensures the 'load_data' function can find the file without manual
downloads. It creates a 'data' directory if one is missing.
Args:
filename: The name of the file to be saved locally.
"""
data_dir = "../../datasets/MNIST"
file_path = os.path.join(data_dir, filename)
# Create the directory if it's missing
if not os.path.exists(data_dir):
os.makedirs(data_dir)
print(f"Created directory: {data_dir}")
# Download the file if it's missing
if not os.path.exists(file_path):
url = "https://github.com/mnielsen/neural-networks-and-deep-learning/raw/master/data/mnist.pkl.gz"
print(f"Downloading {filename}...")
response = requests.get(url)
with open(file_path, "wb") as f:
f.write(response.content)
print("Download complete.")
else:
print(f"Dataset '{filename}' already exists.")
def load_data() -> Tuple[Any, Any, Any]:
"""Returns the raw MNIST data as a tuple of training, validation, and test sets.
The data is loaded from a compressed pickle file. Each set consists of
numpy ndarrays containing pixel values and labels.
Returns:
A tuple (training_data, validation_data, test_data).
"""
# Open the compressed file in read-binary mode
with gzip.open('../../datasets/MNIST/mnist.pkl.gz', 'rb') as f:
# Use pickle to load the serialized data (pickle is the modern cPickle)
training_data, validation_data, test_data = pickle.load(f,
encoding='latin1')
return (training_data, validation_data, test_data)
def load_data_wrapper(
) -> Tuple[List[Tuple[np.ndarray, np.ndarray]], List[Tuple[np.ndarray, int]],
List[Tuple[np.ndarray, int]]]:
"""Returns MNIST data in a format convenient for our Network implementation.
This wrapper reformats the raw data into lists of tuples. Training labels
are converted to 10-dimensional unit vectors, while validation/test labels
remain as integers for easier evaluation.
Returns:
A tuple (training_data, validation_data, test_data).
training_data: 50,000 2-tuples (x, y). x is a (784, 1) image vector,
y is a (10, 1) unit vector.
validation_data: 10,000 2-tuples (x, y). x is a (784, 1) image vector,
y is the integer label.
test_data: 10,000 2-tuples (x, y). x is a (784, 1) image vector,
y is the integer label.
"""
tr_d, va_d, te_d = load_data()
# Reshape training images into (784, 1) column vectors
training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
# Convert training labels (0-9) into one-hot encoded (10, 1) vectors
training_results = [vectorized_result(y) for y in tr_d[1]]
# Combine inputs and results into pairs
training_data = list(zip(training_inputs, training_results))
# Prepare validation and test data similarly, but keep labels as integers
validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
validation_data = list(zip(validation_inputs, va_d[1]))
test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
test_data = list(zip(test_inputs, te_d[1]))
return (training_data, validation_data, test_data)
def vectorized_result(j: int) -> np.ndarray:
"""Returns a 10-dimensional unit vector with 1.0 at index j.
This is used to convert a digit (0-9) into the 'ideal' output vector
the neural network should aim to produce.
Args:
j: The digit label to vectorize.
Returns:
A (10, 1) numpy array filled with 0.0, except at index j.
"""
e = np.zeros((10, 1))
e[j] = 1.0
return e
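To see the one-hot encoding from load_data_wrapper in action, here is a self-contained sketch that duplicates vectorized_result so it runs on its own:

```python
import numpy as np

def vectorized_result(j: int) -> np.ndarray:
    """Returns a (10, 1) unit vector with 1.0 at index j."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

# The training label 5 becomes a column vector with a single 1.0 at index 5
v = vectorized_result(5)
print(v.ravel())  # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
```

This $(10, 1)$ shape matches the network's output layer, so the cost function can compare the network's activations directly against the "ideal" answer.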
With the data loader ready, we can now move on to training and testing the network.
Testing the Performance¶
When we run this network with $30$ hidden neurons, we see something remarkable:
# Run this before calling load_data_wrapper()
download_mnist()
# Load data and initialize a network with 784 inputs, 30 hidden, and 10 outputs
training_data, validation_data, test_data = load_data_wrapper()
net = Network([784, 30, 10])
# Run SGD for 30 epochs with batch size 10 and learning rate 3.0
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
Downloading mnist.pkl.gz... Download complete.
Epoch 0: 9033 / 10000
Epoch 1: 9241 / 10000
Epoch 2: 9264 / 10000
Epoch 3: 9357 / 10000
Epoch 4: 9367 / 10000
Epoch 5: 9372 / 10000
Epoch 6: 9387 / 10000
Epoch 7: 9423 / 10000
Epoch 8: 9452 / 10000
Epoch 9: 9463 / 10000
Epoch 10: 9469 / 10000
Epoch 11: 9438 / 10000
Epoch 12: 9476 / 10000
Epoch 13: 9461 / 10000
Epoch 14: 9475 / 10000
Epoch 15: 9474 / 10000
Epoch 16: 9481 / 10000
Epoch 17: 9510 / 10000
Epoch 18: 9500 / 10000
Epoch 19: 9510 / 10000
Epoch 20: 9491 / 10000
Epoch 21: 9495 / 10000
Epoch 22: 9505 / 10000
Epoch 23: 9500 / 10000
Epoch 24: 9503 / 10000
Epoch 25: 9519 / 10000
Epoch 26: 9532 / 10000
Epoch 27: 9492 / 10000
Epoch 28: 9527 / 10000
Epoch 29: 9495 / 10000
Interpreting the Accuracy Results¶
After the very first epoch the network already classifies about $90\%$ of the test images correctly, and it climbs past $95\%$ in later epochs. In our training output, accuracy is reported as the raw count of correct classifications out of the total test set: a result of 9532 / 10000 corresponds to a classification rate of $95.32\%$.
While the network's performance may fluctuate slightly between epochs—sometimes dipping from 95.10% back to 94.91%—this is a normal byproduct of the "stochastic" nature of our algorithm. Because we sample random mini-batches, the path to the optimal weights is a jagged one rather than a smooth, straight line.
The Art of Hyper-parameters¶
Note how I chose a learning rate ($\eta$) of $3.0$. If we chose $0.001$, the learning would be painfully slow, and the network might take hundreds of epochs to reach 90%. If we chose $100.0$, the network would "overshoot" the minimum and fail to learn entirely, often resulting in accuracy no better than random guessing ($10\%$).
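The overshoot behaviour is easy to see on a toy one-dimensional cost. The sketch below runs plain gradient descent on $f(w) = w^2$ (so $f'(w) = 2w$); the learning rates here are illustrative for this toy cost, not the values used for the network above:

```python
def descend(eta: float, steps: int = 25) -> float:
    """Runs gradient descent on f(w) = w**2, starting from w = 1.0."""
    w = 1.0
    for _ in range(steps):
        w -= eta * 2 * w  # update rule: w <- w - eta * f'(w)
    return w

print(descend(0.01))  # creeps toward 0, still far away after 25 steps
print(descend(0.4))   # converges to (nearly) 0 quickly
print(descend(1.5))   # each step multiplies w by (1 - 2*eta) = -2: divergence
```

Each step multiplies $w$ by $(1 - 2\eta)$, so the iteration converges only when $|1 - 2\eta| < 1$; too small an $\eta$ converges painfully slowly, while too large an $\eta$ bounces past the minimum with growing amplitude, the one-dimensional analogue of the network failing to learn at $\eta = 100.0$.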
Debugging a neural network is an art. If your results look like random noise or the accuracy isn't climbing, you have to ask:
- Is the learning rate too high or too low? High rates cause oscillation; low rates cause stagnation.
- Do I have enough hidden neurons to "see" the patterns? More neurons allow for more complex feature detection.
- Have I run enough epochs? Sometimes the gradient is simply shallow and requires more iterations to descend.