
Stochastic Gradient Descent & The Error Backpropagation Calculus

1/8/2025

Backpropagation is the main algorithm used for training neural networks via gradient descent. Understanding the topics in this essay in depth requires a basic knowledge of calculus and linear algebra.

What does it mean to train a network so that it can learn and generalize to solve our problem? Training a network means adjusting its weights and biases according to how well the model performs on the current problem. To achieve this we aim to reduce the loss function. This is what gradient descent does: it finds a local minimum of our loss function in order to minimize the error. So gradient descent is simply an optimization routine for our model.

Backpropagation can make training a network with gradient descent as much as ten million times faster, relative to a naive implementation.

Backpropagation is mainly used for training neural networks by efficiently computing gradients needed for optimization. However, its underlying principle of reverse-mode automatic differentiation has found applications in other fields such as scientific computing, control theory, computer graphics, and engineering design. In these areas, backpropagation helps optimize complex systems by enabling fast and accurate calculation of sensitivities and parameter gradients.

In this essay I'm going to draw on sources such as the excellent free book on neural networks written by Michael Nielsen; you will find all the resources at the end of the essay.

Computational graph

Computational graphs are very useful in computer science, they allow you to track and compute complex mathematical expressions by breaking them down into a sequence of simple operations. Each node in a computational graph represents an operation (like addition, multiplication, or a more complex function), and each edge carries data, often in the form of tensors or scalars.

For example, consider the expression e = (a+b)(b+1).

We can add intermediate steps: c = a + b and d = b + 1. The result is then e = c \cdot d.

To evaluate the expression we can set the input variables to specific values, for example a = 2, b = 1.

To know how e varies when a or b change, we have to compute partial derivatives in the computational graph. To evaluate them we need the sum rule and the product rule.

\frac{\partial }{\partial a}(a+b) = \frac{\partial a}{\partial a} + \frac{\partial b}{\partial a} = 1 \qquad \frac{\partial }{\partial u}(uv) = \frac{\partial u}{\partial u}\cdot v+\frac{\partial v}{\partial u}\cdot u

The key idea for understanding the efficiency of backpropagation, and for following the rest of the essay, is to grasp the difference between forward-mode differentiation and reverse-mode differentiation. Forward-mode differentiation starts at an input to the graph and moves towards the end. At every node, it sums all the paths feeding in. Each of those paths represents one way in which the input affects that node. By adding them up, we get the total way in which the node is affected by the input: its derivative. Reverse-mode differentiation, on the other hand, starts at an output of the graph and moves towards the beginning. At each node, it merges all paths which originated at that node.

Suppose, for example, that we have a function of three variables:

f = f(x,y,z)

The total differential for a function of three variables is:

df = \frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial y}dy+\frac{\partial f}{\partial z}dz

If x, y, z in turn depend on a variable t, we can apply the chain rule and the total differential becomes:

\frac{\partial f}{\partial t} = \frac{\partial f}{\partial x}\cdot\frac{\partial x}{\partial t} + \frac{\partial f}{\partial y}\cdot\frac{\partial y}{\partial t}+\frac{\partial f}{\partial z}\cdot\frac{\partial z}{\partial t}

To understand how e changes with respect to nodes that are not directly connected to it, we just apply the chain rule. Let's analyze how e changes if a changes:

\frac{\partial e}{\partial a} = \frac{\partial e}{\partial c}\cdot \frac{\partial c}{\partial a} = 2\cdot 1=2

Let's analyze how e changes if b changes; in this case we have to sum the contributions from multiple paths.

\frac{\partial e}{\partial b} = \frac{\partial e}{\partial c}\cdot \frac{\partial c}{\partial b} + \frac{\partial e}{\partial d}\cdot \frac{\partial d}{\partial b} = 2 \cdot 1 + 3 \cdot 1 = 5

Now let’s make a concrete example using the previous expression.
Understanding this will help you appreciate why backpropagation is so efficient when training neural networks.

To see how the output e changes when the input b changes, we could use a forward-mode differentiation approach:
We would start from the node where b=1 and follow all paths forward through the graph, summing the contributions that lead to e.

However, we would have to repeat this process separately for each input (like a, b, etc.) — which can be slow when there are many inputs, as in neural networks.

With this approach we would have to do the same thing for the a node, and then sum up the contributions of all the derivatives of all the input nodes.

Using a backward differentiation approach (reverse-mode), we start from the final output and propagate gradients backward through the graph.
This allows us to compute the partial derivatives of the output with respect to all input variables in a single backward pass, thanks to factoring and reusing common paths via the chain rule.

This example shows that if we have a function with many inputs and few outputs, a function like:

f:\mathbb{R}^n\to \mathbb{R}

it is much more efficient to use reverse-mode differentiation to understand how the output changes as the inputs change. If, for example, we have 1,000,000 inputs, forward-mode differentiation would require 1,000,000 passes through the network, while reverse-mode differentiation would require just 2 passes (one for evaluating the expression at each node and one for calculating the derivatives). So reverse-mode is ~500,000× faster in practice for this case!
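To make this concrete, here is a minimal Python sketch (my own illustration, not taken from the sources) that evaluates the graph e = (a+b)(b+1) at a = 2, b = 1 with one forward pass and then accumulates all the partial derivatives in a single backward pass; the node names c, d, e follow the intermediate steps defined above.

# Forward pass: evaluate every node of e = (a + b) * (b + 1) once.
a, b = 2.0, 1.0
c = a + b        # c = 3
d = b + 1        # d = 2
e = c * d        # e = 6

# Backward pass: start from de/de = 1 and push derivatives toward the
# inputs, summing the contributions of every path.
de_dc = d        # product rule: de/dc = d = 2
de_dd = c        # de/dd = c = 3
de_da = de_dc * 1.0                  # c = a + b  ->  dc/da = 1
de_db = de_dc * 1.0 + de_dd * 1.0    # two paths: through c and through d

print(de_da)  # 2.0
print(de_db)  # 5.0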

Structure of a neural network and notation

This is a simple schema of a neural network: it's composed of one input layer, one hidden layer and one output layer.

notation

The simple picture that follows shows the notation for neurons connected to each other. Two neurons are connected by a weight; the value of a neuron is given by a_{j}^l, where l represents the current layer while j stands for which neuron in the layer we are referring to. Each neuron also has a bias (b_{j}^l), and the meaning of the notation is the same as before.

In a neural network each neuron is connected with all the other neurons in the next layer, and the strength of each connection is characterized by a weight. In addition to these weighted connections, each neuron also has a bias term, which acts like an adjustable constant input that allows the neuron to shift its activation function, helping the network better fit the data by enabling the output to be offset independently of the inputs. The neuron computes a weighted sum of its inputs, adds the bias, and then passes the result through an activation function — a nonlinear transformation such as ReLU, sigmoid, or tanh. This nonlinearity is crucial, as it allows the network to learn and approximate complex, non-linear mappings between inputs and outputs. Without activation functions, no matter how many layers the network has, it would behave like a linear model.

This picture shows the steps to evaluate the activation level of one neuron: z_{j}^1 is an intermediate step, while a_{j}^1 is the final activation level.

Each neuron of layer \ell takes as input the activation level of all the neurons in layer \ell-1 of the network. We can calculate the activation of neuron j in layer \ell as:

a_{j}^{\ell} = \sigma \left(\sum_{k} w_{jk}^{\ell} a_{k}^{\ell -1} + b_{j}^{\ell}\right) \tag{1}

It will be much easier to manage if we translate this into vector form; it's simple to show that the last expression is equivalent to:

a^{\ell}=\sigma\left( W^{\ell}a^{\ell-1} + b^{\ell} \right)

And our intermediate step will be:

z^{\ell} = W^{\ell}a^{\ell-1} + b^{\ell}
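As a quick illustration, here is a minimal numpy sketch of the vector form above; the layer sizes and the variable names W, b and a_prev are arbitrary choices made just for this example.

import numpy as np

def sigmoid(z):
	return 1.0/(1.0 + np.exp(-z))

# One layer with 3 inputs and 2 neurons: W is 2x3, b and a are column vectors.
W = np.random.randn(2, 3)        # weights W^l
b = np.random.randn(2, 1)        # biases  b^l
a_prev = np.random.randn(3, 1)   # activations a^{l-1} of the previous layer

z = np.dot(W, a_prev) + b        # z^l = W^l a^{l-1} + b^l
a = sigmoid(z)                   # a^l = sigma(z^l)
print(a.shape)                   # (2, 1)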

Cost function

When training a neural network, we begin by defining a loss function \ell(h(x),y) that quantifies how far the model's prediction h(x) is from the true output y for a single data point. This single-point loss is exactly what we use to compute gradients via backpropagation, updating the model's weights to reduce the error. However, in practice we don't just have one example: we are given a dataset of samples \{(X_{1},Y_{1}),\dots,(X_{n},Y_{n})\}, drawn from some unknown distribution P(x,y). Our goal would ideally be to minimize the expected loss over this distribution:

\mathbb{E}_{(x,y) \sim P} \left[ \ell(h(x), y) \right] = \int \ell(h(x),y)\, dP(x,y)

But since P(x,y) is unknown, we cannot compute this expectation directly. Instead, we approximate it using the empirical average over our dataset:

\hat{C} = \frac{1}{n} \sum_{i=1}^{n} \ell(h(X_{i}),Y_{i}) \tag{2}

This empirical loss is what we minimize in practice. And since it’s a sum of differentiable single-sample losses, we can still apply backpropagation, either over the entire sum (in batch gradient descent) or over small subsets (in stochastic or mini-batch gradient descent). Thus, the justification for using backpropagation over many samples stems from the fact that the empirical loss is an estimator of the expected loss — and the best tool we have for reducing it with gradient-based methods.

This cost function takes as input all the weights and biases of the network and measures how far off the network’s predictions are compared to the actual target outputs.

The choice of cost function varies according to the problem we are trying to solve. For this essay I'm going to consider the simplest cost function, the mean squared error:

C(w,b)=\frac{1}{n}\sum_{x} ||y(x) - a^L(x)||^2 \tag{3}

Where y(x) is the desired output and a^L(x) is the output the network produces; it obviously depends on the weights and biases of the network, while x ranges over the training samples.

Indeed, the cost for a single training example is:

C_{x} = \frac{1}{2} ||y-a^L||^2 \tag{4}
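As a small numeric illustration of equations (3) and (4), here is a sketch with placeholder arrays standing in for the network outputs and the targets (note the conventional extra factor of 1/2 in the per-sample cost):

import numpy as np

# Placeholder outputs a^L(x) and targets y(x) for n = 4 training samples,
# each with a 3-dimensional output (random data, just to show the formulas).
outputs = [np.random.randn(3, 1) for _ in range(4)]   # a^L(x)
targets = [np.random.randn(3, 1) for _ in range(4)]   # y(x)

def sample_cost(a_L, y):
	# equation (4): cost of a single training example
	return 0.5 * np.linalg.norm(y - a_L) ** 2

# equation (3): mean squared error over the whole training set
C = sum(np.linalg.norm(y - a_L) ** 2 for a_L, y in zip(outputs, targets)) / len(outputs)

print(sample_cost(outputs[0], targets[0]), C)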

What backpropagation allows us to do is to calculate \frac{\partial C_{x}}{\partial w} and \frac{\partial C_{x}}{\partial b}, and then find \frac{\partial C}{\partial w} and \frac{\partial C}{\partial b} by averaging over the training samples.

I said earlier that the cost function takes as input all the weights and biases of the network; I need to state that more precisely. The cost function takes as input the last layer of the network, also called the "prediction" of the network, and the prediction depends on all the weights and biases of the network. The cost function measures how close or far the prediction is compared to the actual result; backpropagation, through gradient descent, is an algorithm to make this prediction as close as possible.

Backpropagation derivation

Having understood the problem we are trying to solve, let's address it in a more mathematical way. The heart of backpropagation is understanding how the cost function varies when the weight matrix of a given layer varies. Or, mathematically:

\frac{\partial C \left( a^L\left( a^{L-1}\right) \dots\left( a^l\left(w^l \right)\right)\right)}{\partial w^l}
  • C is a scalar loss function (like MSE or cross-entropy)

  • w^l \in \mathbb{R}^{n\times m} is the weight matrix for layer l

  • The output C depends on the activations, which depend on the weights from all previous layers, including w^l

What we want to do is to apply the chain rule layer by layer through the computational graph.

We have a scalar function C that depends on a vector a^L, which depends on a vector, which depends on a vector… and so on until we have a vector that depends on a matrix w^l.

Since the cost C is a scalar, and the weights w^l are a matrix, we're computing the derivative of a scalar function with respect to a matrix.

But this scalar depends on a composition of many functions — vectors and matrices interacting across layers. So to compute this derivative, we’ll need to use:

  • Derivative of a scalar with respect to a vector (gradient)

  • Derivative of a scalar with respect to a matrix

  • Derivative of a vector with respect to a vector (Jacobian)

  • Chain rule for matrix calculus

Derivative of a scalar function with respect to a vector

Let C:\mathbb{R}^n\to \mathbb{R}, with v=(x_1,x_{2},\dots,x_{n}).

\frac{\partial C}{\partial v} = \nabla C =\begin{pmatrix} \frac{\partial C}{\partial x_{1}} \\ \frac{\partial C}{\partial x_{2}} \\ \vdots \\ \frac{\partial C}{\partial x_{n}} \end{pmatrix}

Example: suppose v=a^L and C=C(a^L); then:

\frac{\partial C}{\partial a^L} = \begin{pmatrix} \frac{\partial C}{\partial a_{1}^L} \\ \frac{\partial C}{\partial a_{2}^L} \\ \vdots \\ \frac{\partial C}{\partial a_{n}^L} \end{pmatrix}

So: \frac{\partial C}{\partial a^L} = \nabla_{a^{L}} C\in \mathbb{R}^{n\times1}

The derivative of a scalar function with respect to a vector is just the gradient of that function; it tells us how much the cost changes when each component of the vector a^L changes.

Vector function of a vector variable

Let f:\mathbb{R}^n\to \mathbb{R}^m, x=(x_{1},x_{2},\dots,x_{n}), then:

f(x) = \begin{pmatrix} f_{1}(x) \\ f_{2}(x) \\ \vdots \\ f_{m}(x) \end{pmatrix}

Each f_{i} is a scalar function of the same vector x \in \mathbb{R}^n. Then the derivative is the Jacobian matrix:

\frac{\partial f}{\partial x} = J= \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}

This generalizes the gradient: it tells how each component of the output vector changes with respect to every input component. Each row is the gradient of one component function f_{i}(x), so the Jacobian is a stack of gradients.

Derivative of a scalar function with respect to a matrix

f:\mathbb{R}^{n \times m}\to \mathbb{R}

So f is a scalar function of a matrix. The derivative is defined as:

\frac{\partial f}{\partial W} = \begin{pmatrix} \frac{\partial f}{\partial w_{11}} & \cdots & \frac{\partial f}{\partial w_{1m}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial w_{n1}} & \cdots & \frac{\partial f}{\partial w_{nm}} \\ \end{pmatrix} \in \mathbb{R}^{n \times m}

We differentiate the scalar function with respect to each component of the matrix — and the result is again a matrix.

Let's turn back to our final objective, which is to compute \frac{\partial C}{\partial w^l}. This falls under the case of a scalar function of a matrix, and produces a matrix of partial derivatives. However, to compute it via the chain rule, we often encounter intermediate expressions such as the derivative of a vector with respect to a matrix.

This brings us to study vector-valued functions of matrix variables:

f:\mathbb{R}^{n\times m}\to \mathbb{R}^k

This function takes as input a matrix and gives out a vector, so the function would look something like:

f(W)=\begin{pmatrix} f_{1}(W) \\ f_{2}(W) \\ \vdots \\ f_{k}(W) \end{pmatrix}

What is \frac{\partial f}{\partial W}?

Each component function f_{i}(W) is a scalar function of the matrix W. So we can take its derivative just like before:

\frac{\partial f_{i}}{\partial W}\in \mathbb{R}^{n\times m}

Now, since we have k of them, we just stack these matrices one after the other. That gives us a cube-like object with shape:

\frac{\partial f}{\partial W} \in \mathbb{R}^{k \times n \times m}

This is a rank-3 tensor. This tensor tells us how each output component f_{i} changes with each element w_{jk} of the input matrix.

In our case the function f corresponds to a^l, and the components of a^l are the individual activations a_{i}^l, which correspond to the different f_{i}.

This is an example of the matrices for the first 3 layers of the network: layer 1, layer 2 and layer 3.

Figure 2: In our case, the function f corresponds to a^l, and the components of a^l are the individual activations a^l_i, which correspond to the different f_i.

We have seen how to differentiate with respect to vectors and matrices; now we need the analogue of the chain rule for vectors. So we need to evaluate something like:

\frac{\partial C \left(y(x) \right)}{\partial x}

Where:

  • C is a scalar function
  • y:\mathbb{R}^n\to \mathbb{R}^m is a vector valued function
  • x \in \mathbb{R}^n

The composition C(y(x)) is still a scalar function — it maps x \in \mathbb{R}^n to a real number via y.

As we saw before, we are taking the derivative of a scalar function with respect to a vector, and that is just the gradient.

\frac{\partial C \left(y(x) \right)}{\partial x} =\nabla_{x} C=\begin{pmatrix} \frac{\partial C}{\partial x_{1}} \\ \frac{\partial C}{\partial x_{2}} \\ \vdots \\ \frac{\partial C}{\partial x_{n}} \end{pmatrix}

But now, we can go a step further and express how C depends on x through y(x).

C=C(y_{1}(x),y_{2}(x),\dots,y_{m}(x))

So what we are dealing with is a composition of functions, and we can use the chain rule for multivariable calculus that we saw when dealing with computational graphs:

\frac{\partial C}{\partial x_{j}}=\sum_{i=1}^m \frac{\partial C}{\partial y_{i}} \cdot\frac{\partial y_{i}}{\partial x_{j}}

This tells us: to compute how C changes as we vary x_{j}, we look at how each y_{i} changes with x_{j}, weighted by how sensitive C is to y_{i}.

x is a vector of n components, so if we stack all these pieces of chain rule together we get:

\nabla_{x} C=\begin{pmatrix} \sum_{i=1}^m \frac{\partial C}{\partial y_{i}} \cdot\frac{\partial y_{i}}{\partial x_{1}} \\ \sum_{i=1}^m \frac{\partial C}{\partial y_{i}} \cdot\frac{\partial y_{i}}{\partial x_{2}} \\ \vdots \\ \sum_{i=1}^m \frac{\partial C}{\partial y_{i}} \cdot\frac{\partial y_{i}}{\partial x_{n}} \end{pmatrix}

Matrix Form: Vector Chain Rule

Rather than writing all those sums explicitly, we can express this compactly using matrix multiplication.

Let’s define:

  • The Jacobian matrix \frac{\partial y}{\partial x}\in \mathbb{R}^{m \times n}, whose (i,j)-th entry is \frac{\partial y_{i}}{\partial x_{j}}
  • The gradient of C with respect to y, \nabla_{y} C \in \mathbb{R}^m, whose i-th component is \frac{\partial C}{\partial y_{i}}

Then the full gradient is:

\nabla_{x} C = \left(\frac{\partial y}{\partial x}\right)^{\top} \cdot \nabla_{y} C \tag{5}

That is the chain rule for vector-valued functions.
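As a sanity check, equation (5) can be verified numerically; the functions y and C below are arbitrary choices made just for this sketch, and the result is compared against finite differences:

import numpy as np

# Arbitrary y: R^2 -> R^3 and C: R^3 -> R, so C(y(x)) is a scalar function of x.
def y(x):
	return np.array([x[0] * x[1], x[0] ** 2, np.sin(x[1])])

def C(v):
	return np.sum(v ** 2)

x0 = np.array([1.0, 2.0])

# Analytic Jacobian dy/dx (3x2) and gradient dC/dy (3,) at x0.
J = np.array([[x0[1],     x0[0]],
			  [2 * x0[0], 0.0],
			  [0.0,       np.cos(x0[1])]])
grad_C_y = 2 * y(x0)

# Equation (5): grad_x C = (dy/dx)^T . grad_y C
grad_x = J.T.dot(grad_C_y)

# Finite-difference check of the same gradient.
eps = 1e-6
fd = np.array([(C(y(x0 + eps * np.eye(2)[i])) - C(y(x0 - eps * np.eye(2)[i]))) / (2 * eps)
			   for i in range(2)])
print(grad_x, fd)  # the two vectors agree to ~1e-6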

Chain Rule with Matrix-Valued Variables

We now want to compute the derivative of a scalar function composed with a function of a matrix:

\frac{\partial C(Y(W))}{\partial W}

Differentiate a scalar with respect to a matrix:

\frac{\partial C}{\partial W} = \begin{pmatrix} \frac{\partial C}{\partial w_{11}} & \frac{\partial C}{\partial w_{12}} & \cdots & \frac{\partial C}{\partial w_{1n}} \\ \frac{\partial C}{\partial w_{21}} & \frac{\partial C}{\partial w_{22}} & \cdots & \frac{\partial C}{\partial w_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial C}{\partial w_{m1}} & \frac{\partial C}{\partial w_{m2}} & \cdots & \frac{\partial C}{\partial w_{mn}} \\ \end{pmatrix}

In our case the scalar C depends on a vector y, and that vector depends on the matrix W:

  • C: \mathbb{R}^{m} \to \mathbb{R} is a scalar function
  • y: \mathbb{R}^{k \times n} \to \mathbb{R}^{m} is a vector-valued function
  • W \in \mathbb{R}^{k \times n} is a matrix input

This leads us to the composition

C\left(y\left(W\right)\right)

With y of the form:

y=\begin{pmatrix} y_{1}(W) \\ y_{2}(W) \\ \vdots \\ y_{k}(W) \end{pmatrix}

Now we have to use the matrix definition, looking at the components of the matrix W:

\left[ \frac{\partial C\left(y(W)\right)}{\partial W}\right]_{jk} = \frac{\partial C\left(y(W)\right)}{\partial w_{jk}}

How does C depend on w_{jk}? It depends on y, which depends on W. So let's apply the chain rule for scalars.

\frac{\partial C(y(W))}{\partial w_{jk}} = \sum_{i=1}^m \frac{\partial C}{\partial y_i} \cdot \frac{\partial y_i}{\partial w_{jk}}

Now we can put it back into the full matrix:

\frac{\partial C(y(W))}{\partial W} = \begin{pmatrix} \sum_i \frac{\partial C}{\partial y_i} \cdot \frac{\partial y_i}{\partial w_{11}} & \cdots & \sum_i \frac{\partial C}{\partial y_i} \cdot \frac{\partial y_i}{\partial w_{1n}} \\ \sum_i \frac{\partial C}{\partial y_i} \cdot \frac{\partial y_i}{\partial w_{21}} & \cdots & \sum_i \frac{\partial C}{\partial y_i} \cdot \frac{\partial y_i}{\partial w_{2n}} \\ \vdots & \ddots & \vdots \\ \sum_i \frac{\partial C}{\partial y_i} \cdot \frac{\partial y_i}{\partial w_{m1}} & \cdots & \sum_i \frac{\partial C}{\partial y_i} \cdot \frac{\partial y_i}{\partial w_{mn}} \\ \end{pmatrix}

Each entry in this matrix is a sum of scalar-by-scalar products. We just applied the scalar chain rule to each component and it’s all there in one matrix.

Let’s simplify the notation.

  • \frac{\partial y}{\partial W} is a tensor, because it is the derivative of a vector with respect to a matrix.
  • \frac{\partial C}{\partial y} is a vector.

So we can write the expression above as:

\frac{\partial C(y(W))}{\partial W} = \sum_{i=1}^k \frac{\partial C}{\partial y_{i}} \cdot \frac{\partial y_{i}}{\partial W}

Using Tensor contraction.

To simplify our notation further, so that it scales better to our neural network, we use a tensor contraction. A tensor contraction is an operation on a tensor that arises from the canonical pairing of a vector space and its dual. When we "contract" over an axis, we're summing along that dimension, reducing its size.

So the final expression is:

\frac{\partial C}{\partial W} = \sum_{i=1}^k \frac{\partial C}{\partial y_{i}} \cdot \frac{\partial y_{i}}{\partial W} \to \frac{\partial C}{\partial W} = \frac{\partial y}{\partial W} \overset{i}{\cdot} \frac{\partial C}{\partial y} \tag{6}

That’s how we simplify the chain rule when the variable is a matrix and the intermediate function is vector-valued.
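Numerically, the contraction over the index i in equation (6) is exactly what np.einsum expresses. Here is a small sketch with random tensors of the right shapes; the values are arbitrary, the point is the index bookkeeping:

import numpy as np

k, n, m = 4, 3, 5
dC_dy = np.random.randn(k)         # dC/dy, shape (k,)
dy_dW = np.random.randn(k, n, m)   # dy/dW, rank-3 tensor of shape (k, n, m)

# Equation (6): contract over the output index i, leaving an (n, m) matrix.
dC_dW = np.einsum('i,ijk->jk', dC_dy, dy_dW)

# The same contraction written as an explicit sum, for comparison.
dC_dW_loop = sum(dC_dy[i] * dy_dW[i] for i in range(k))
print(np.allclose(dC_dW, dC_dW_loop))  # True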

Backpropagation algorithm

Our main goal is to compute:

\frac{\partial C \left( a^L\left( a^{L-1}\right) \dots\left( a^l\left(w^l \right)\right)\right)}{\partial w^l} \tag{7}

We are taking the derivative with respect to a matrix, but this scalar depends on a vector, which depends on a vector, which depends on a vector, and so on until it depends on the matrix w^l.

We start from the layer linked with the matrix w^l and we compute the tensor contraction:

\frac{\partial C}{\partial w^l} = \frac{\partial a^l}{\partial w^l} \overset{i}{\cdot} \frac{\partial C}{\partial a^l} \tag{8}

But C depends on the vector a^l through the vector a^{l+1}, and so on up to the last layer a^L; to compute how each layer affects the next we use the chain rule for vectors we found earlier:

\frac{\partial C}{\partial a^l} = \left(\frac{\partial a^{l+1}}{\partial a^l}\right)^{\top} \cdot \frac{\partial C}{\partial a^{l+1}} \tag{9}

And this is true for every layer:

\frac{\partial C}{\partial a^{l+1}} = \left(\frac{\partial a^{l+2}}{\partial a^{l+1}}\right)^{\top} \cdot \frac{\partial C}{\partial a^{l+2}}

And that is the recursive heart of backpropagation, applied backwards from the last layer a^L.

Error backpropagation signal

Some sources define the error signal as:

\delta^l := \frac{\partial C}{\partial a^l}

This can be used recursively:

\frac{\partial C}{\partial a^l} = \left(\frac{\partial a^{l+1}}{\partial a^l}\right)^{\top} \cdot \frac{\partial C}{\partial a^{l+1}} \to \delta^l= \left(\frac{\partial a^{l+1}}{\partial a^l}\right)^{\top} \cdot \delta^{l+1}

However, in standard backpropagation, the error is usually defined as:

\delta^l := \frac{\partial C}{\partial z^l}

The reason for the latter definition is straightforward:

Suppose, for example, that we are at the j-th neuron in layer one. If we make a little change to the weighted input of the neuron, z_{j}^1, adding the quantity \Delta z_{j}^1, then instead of outputting \sigma(z_{j}^1) the neuron will output \sigma(z_{j}^1 + \Delta z_{j}^1). This change propagates through the whole network and finally changes the cost function by a quantity:

\frac{\partial C}{\partial z_{j}^1}\cdot\Delta z_{j}^1

If we want to minimize the cost function, we can try to choose the quantity \Delta z_{j}^1 so that it makes the cost smaller. If the quantity \frac{\partial C}{\partial z_{j}^1} is close to zero then there is little to gain; instead, if the quantity \frac{\partial C}{\partial z_{j}^1} is big, we can choose \Delta z_{j}^1 with the opposite sign to reduce the cost.

In this heuristic sense, the quantity \frac{\partial C}{\partial z_{j}^1} is a measure of the error of the j-th neuron in layer 1.

\delta_{j}^1=\frac{\partial C}{\partial z_{j}^1}

Why both definitions can coexist

There is no contradiction between defining the error as \frac{\partial C}{\partial a^l} or \frac{\partial C}{\partial z^l}, because they are directly related through the chain rule. Since a^l = \sigma(z^l), we apply the chain rule:

\frac{\partial C}{\partial z^l} = \frac{\partial C}{\partial a^l} \cdot \frac{\partial a^l}{\partial z^l}

Because σ\sigma is applied elementwise, this becomes an elementwise (Hadamard) product:

\frac{\partial C}{\partial z^l} = \frac{\partial C}{\partial a^l} \odot \sigma'(z^l)

That is,

\delta^l = \tilde{\delta}^l \odot \sigma'(z^l)

where

\tilde{\delta}^l := \frac{\partial C}{\partial a^l}

So depending on the context, one may define and propagate \tilde{\delta}^l (via the recursive expression), and then obtain the standard error \delta^l by applying the derivative of the activation. They are just two steps in the same computation, not conflicting definitions.

This is the key equation of backpropagation, linking the error of one layer to the error of the next one.

\delta^l= \left(\frac{\partial a^{l+1}}{\partial a^l}\right)^{\top} \cdot \delta^{l+1} \tag{11}

Since we are interested in finding:

\frac{\partial C}{\partial w^l} \text{ for all layers } l

Once we know the vector \delta^l we can compute the weight gradient using a tensor contraction (using equation 8).

\frac{\partial C}{\partial w^l} = \frac{\partial a^l}{\partial w^l} \overset{i}{\cdot} \delta^l \tag{12}

Before jumping to the algorithm itself and its implementation in code, I think it is worth spending a little time explaining how gradient descent works. How exactly is computing all these derivatives going to make our neural network learn something?

Gradient descent

The gradient is a vector that points in the direction of the greatest rate of increase of a scalar field.

The gradient can be computed only for a scalar-valued differentiable function.

So the gradient is defined only for functions that take an n-dimensional vector and give a scalar as output.

f:\mathbb{R}^n\to\mathbb{R}

So you can use gradients only with scalar fields.

We can define the gradient of a function f(r) as:

df = \nabla f \cdot dr

What this formula says is how much the function will change for a given displacement dr. If we are in 3D space then dr = (dx,dy,dz).

\nabla f is a vector that points in the direction of the steepest increase of the function.

If the displacement is along \nabla f then df is going to be largest, while if the displacement is in the opposite direction the function decreases the most. If the displacement is orthogonal to the \nabla f vector, there is going to be no change in the function; this follows from the definition of the dot product.

The gradient of a scalar function at a point p is defined as:

\nabla f(p)=\begin{bmatrix}\frac {\partial f}{\partial x_{1}}(p)\\ \vdots \\ \frac {\partial f}{\partial x_{n}}(p)\end{bmatrix} \tag{13}

This definition works only if the function is differentiable at p, otherwise we obviously cannot take the partial derivatives.

From the definition of the gradient it follows that, if we want to minimize our scalar function, which quantifies how far the model's prediction h(x) is from the true output y for a single data point, then we have to move in the direction opposite to the gradient at the point p.

The update rule for gradient descent is:

a_{n+1} = a_{n} - \eta\nabla f(a_{n})

\eta is the learning rate; it represents how long a step you take towards the function's local minimum. a_{n} is the point where you are evaluating the gradient, and from that point you move in the direction of the fastest decrease of the function.
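Here is a tiny sketch of this update rule on a toy function, f(x, y) = x^2 + y^2, whose minimum is at the origin (the starting point and the learning rate are arbitrary choices for the example):

import numpy as np

def grad_f(p):
	# gradient of f(x, y) = x^2 + y^2
	return 2 * p

eta = 0.1                    # learning rate
p = np.array([3.0, -4.0])    # starting point a_0

for _ in range(100):
	p = p - eta * grad_f(p)  # a_{n+1} = a_n - eta * grad f(a_n)

print(p)  # very close to [0, 0]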

How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights w_{k} and biases b_{l} which minimize the cost in Equation (3). To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the generic variables. In other words, our "position" now has components w_{k} and b_{l}, and the gradient vector \nabla C has corresponding components \frac{\partial C}{\partial w_{k}} and \frac{\partial C}{\partial b_{l}}. Writing out the gradient descent update rule in terms of components, we have the update rule for the biases:

b_{l} \to b_{l}' = b_{l} - \eta \frac{\partial C}{\partial b_{l}} \tag{14}

And the update rule for the weights of every single layer:

w_{k} \to w_{k}' = w_{k}-\eta \frac{\partial C}{\partial w_{k}} \tag{15}

By applying this rule we can "roll down the hill" and hopefully find a minimum of the cost function. Of course we cannot be sure that the minimum we find will be the global minimum of the function; it may just be a local minimum. The lesson is that we cannot be sure whether there is a better configuration of weights and biases that would make our model predict better.

A better implementation of gradient descent is stochastic gradient descent. As shown in the section on the cost function, we defined a cost function over multiple samples. The idea of stochastic gradient descent is to estimate the gradient \nabla C by computing \nabla C_{x} for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient \nabla C, and this helps speed up gradient descent, and thus learning.

We call these random training inputs X_{1},X_{2},\dots,X_{m} a mini-batch. The size m must be large enough that the averaged gradient is roughly similar to the true one, so that the average of the \nabla C_{X_{j}} and \nabla C will be almost the same, while being much easier to compute.

If this is true, we have:

\frac{\sum_{j=1}^m \nabla C_{X_{j}}}{m} \approx \frac{\sum_{X} \nabla C_{X}}{n} =\nabla C

Where the second sum is over the whole set of training data; so we get the approximation used for stochastic gradient descent:

\frac{\sum_{j=1}^m \nabla C_{X_{j}}}{m} \approx \nabla C

Taking this into account, the update rule for stochastic gradient descent becomes:

w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{16}

b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l} \tag{17}

Let's make a concrete example of how a neural network is trained using stochastic gradient descent:

Start of Training:

  • Your model begins with random weights and biases.
  • It doesn’t know anything yet — its predictions are probably garbage.

Epoch 1:

  • Shuffle the full training set (e.g. 100,000 examples).
  • Break it into mini-batches (e.g. 100,000 examples → 1,562 mini-batches of 64 examples).

For each mini-batch:

  1. Backpropagation step: compute the gradients, i.e. how much each weight and bias should change to reduce the error.
  2. Update the weights and biases using the SGD update rules (16) and (17); a sketch of the full loop follows this list.

Epoch 2:

  • Shuffle the data again (important to avoid fixed patterns).
  • Repeat the mini-batch training process.
  • Now, weights and biases are slightly better.
  • The model should start making better predictions.

Repeat for Many Epochs (e.g., 10, 50, 100):

  • With each epoch, the model sees all training examples again (in new order).
  • Parameters are updated gradually.
    • The cost function C (the overall error) should decrease over time.
  • The model becomes better at:
    • Generalizing
    • Predicting accurately
    • Learning meaningful patterns
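Schematically, the training loop just described could look like the following sketch, written in the spirit of Nielsen's SGD method; training_data is assumed to be a list of (x, y) pairs and update_mini_batch is the method discussed below:

import random

def SGD(self, training_data, epochs, mini_batch_size, eta):
	# training_data: list of (x, y) training pairs
	n = len(training_data)
	for epoch in range(epochs):
		random.shuffle(training_data)  # reshuffle at every epoch
		mini_batches = [training_data[k:k + mini_batch_size]
						for k in range(0, n, mini_batch_size)]
		for mini_batch in mini_batches:
			# backpropagation + parameter update (equations 16 and 17)
			self.update_mini_batch(mini_batch, eta)
		print("Epoch {0} complete".format(epoch))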

Algorithm and implementation

The algorithm for backpropagation is composed of three phases:

  1. Input a set of training examples
  2. For each training sample:
    1. Feedforward phase
    2. Compute output error
    3. Backpropagate the error
  3. Gradient descent

Phase 2

We need to find the quantities \frac{\partial C}{\partial w^l} and \frac{\partial C}{\partial b^l} for all layers l.

So for each layer ll starting from the last one and moving backwards we have to compute first:

\delta^L :=\frac{\partial C}{\partial a^L}

And then propagate the error backwards:

\delta^l = \left( \frac{\partial a^{l+1}}{\partial a^l} \right)^\top \cdot \delta^{l+1}, \quad \text{for } l = L-1, L-2, \dots, 1 \tag{11}

And for each layer compute what we are looking for as:

\frac{\partial C}{\partial w^l} = \frac{\partial a^l}{\partial w^l} \overset{i}{\cdot} \delta^l \tag{12}

And:

\frac{\partial C}{\partial b^l} = \delta^l

Phase 3

Once we have computed these derivatives, we apply gradient descent:

w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{16}

b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l} \tag{17}

This is the big-picture summary of the backpropagation algorithm in its most general form. If we want to implement this algorithm we need to decide which activation function and cost function we are going to use. We are going to use the sigmoid function:

\sigma(x)=\frac{1}{1+e^{-x}},\quad \text{with} \quad \sigma'(x)=\sigma(x)(1-\sigma(x))

Meanwhile, as the cost function, we use the squared error:

C = \frac{1}{2} \| a^L - y \|^2
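For reference, these two choices correspond to the helper functions that the code below relies on. This is a minimal sketch consistent with Nielsen's sigmoid, sigmoid_prime and cost_derivative (in his code cost_derivative is a method of the Network class):

import numpy as np

def sigmoid(z):
	# the sigmoid activation, applied elementwise
	return 1.0/(1.0 + np.exp(-z))

def sigmoid_prime(z):
	# derivative of the sigmoid: sigma(z) * (1 - sigma(z))
	return sigmoid(z)*(1 - sigmoid(z))

def cost_derivative(output_activations, y):
	# dC/da^L for the squared error cost: a^L - y
	return output_activations - y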

I'm going to use the code by Michael Nielsen that you can find in his book; you can also find the complete code on my github page, together with the training samples and the MNIST dataset.

Before writing the code we need to understand how the equations seen in phase 2 of the algorithm change if we use the sigmoid as activation function and the squared error as cost function.

We start by computing \delta^L, the error of the last layer, which we defined as:

\delta^L :=\frac{\partial C}{\partial a^L}

Now, if we take the derivative of the squared error with respect to a^L we obtain (check for yourself):

\frac{\partial C}{\partial a^L}=a^L-y

However, in standard backpropagation the error term at the output layer is defined as

\delta^L = \frac{\partial C}{\partial z^L}.

There is no contradiction in defining the output layer error as

\delta^L := \frac{\partial C}{\partial a^L},

because this is a valid partial derivative that describes how the cost changes with respect to the activation at layer L. However, during backpropagation, we ultimately need the derivative with respect to z^L, since z^L depends directly on the weights. By applying the chain rule, we refine our expression:

\delta^L = \frac{\partial C}{\partial z^L} = \frac{\partial C}{\partial a^L} \odot \frac{\partial a^L}{\partial z^L}.

Since we know that \frac{\partial C}{\partial a^L} = a^L - y, and we know the form of the derivative of the activation function:

\frac{\partial a^L}{\partial z^L} = \sigma'(z^L) = a^L \odot (1 - a^L),

this leads us to the final expression for the error on the last layer.

\delta^L = (a^L - y) \odot a^L \odot (1 - a^L).

Now we need to propagate the error back recursively using the rule:

\delta^l = \left( \frac{\partial a^{l+1}}{\partial a^l} \right)^\top \cdot \delta^{l+1}, \quad \text{for } l = L-1, L-2, \dots, 1 \tag{11}

What form does this take with the sigmoid activation function?

\begin{align*} a^{l+1} &= \sigma(z^{l+1}) \\ &= \sigma(W^{l+1} a^l + b^{l+1}) \\ \\ \frac{\partial a^{l+1}}{\partial a^l} &= \frac{\partial \sigma(z^{l+1})}{\partial a^l} \\ &= \frac{\partial \sigma(z^{l+1})}{\partial z^{l+1}} \cdot \frac{\partial z^{l+1}}{\partial a^l} \end{align*}

In the last term I’ve just applied the chain rule. We know:

z^{l+1} = W^{l+1} a^l + b^{l+1}

Since this is a linear function of a^l, taking its derivative, the second term of the last expression becomes:

\frac{\partial z^{l+1}}{\partial a^l} = W^{l+1}

Activations are applied element by element:

a^{l+1} = \sigma(z^{l+1})

So the derivative is a diagonal matrix with the derivative of σ\sigma applied to each entry:

\frac{\partial \sigma(z^{l+1})}{\partial z^{l+1}} = \mathrm{diag}(\sigma'(z^{l+1}))

This matrix has \sigma'(z_i) on the diagonal and zeros elsewhere. So, putting it all together, the transpose becomes:

\left( \frac{\partial a^{l+1}}{\partial a^l} \right)^\top = \left(\mathrm{diag}(\sigma'(z^{l+1})) \cdot W^{l+1} \right)^\top = (W^{l+1})^\top \cdot \mathrm{diag}(\sigma'(z^{l+1}))

We have not finished yet: we still have to multiply this by \frac{\partial C}{\partial a^{l+1}}.

But multiplying a diagonal matrix by a vector is just an elementwise product!

\left( \frac{\partial a^{l+1}}{\partial a^l} \right)^\top \cdot \frac{\partial C}{\partial a^{l+1}} = (W^{l+1})^\top \cdot \left( \sigma'(z^{l+1}) \odot \frac{\partial C}{\partial a^{l+1}} \right)

This gives us the clean recursive formula we love in backpropagation: recognizing that \sigma'(z^{l+1}) \odot \frac{\partial C}{\partial a^{l+1}} = \frac{\partial C}{\partial z^{l+1}} = \delta^{l+1}, and converting \frac{\partial C}{\partial a^l} into \delta^l = \frac{\partial C}{\partial z^l} by multiplying elementwise by \sigma'(z^l), we obtain:

\delta^l = \left((W^{l+1})^\top \cdot \delta^{l+1}\right) \odot \sigma'(z^l)

Which we can also write (remembering the form of the derivative of the activation function):

\delta^l = \left((W^{l+1})^\top \cdot \delta^{l+1}\right) \odot a^l \odot (1 - a^l)
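In numpy this recursion is a one-liner per layer; here is a minimal sketch, where w_next, delta_next and a_l are placeholder names for W^{l+1}, \delta^{l+1} and a^l:

import numpy as np

def backward_step(w_next, delta_next, a_l):
	# delta^l = ((W^{l+1})^T . delta^{l+1}) * a^l * (1 - a^l), elementwise products
	return np.dot(w_next.transpose(), delta_next) * a_l * (1 - a_l)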

Gradient of the weights

Once we have the error vector \delta^l, we can compute the gradient of the cost with respect to the weights:

\frac{\partial C}{\partial w^l} = \frac{\partial a^l}{\partial w^l} \overset{i}{\cdot} \delta^l

Let’s examine a single element of this expression:

\frac{\partial a^l_i}{\partial w^l_{jk}} = \frac{\partial \sigma(z^l_i)}{\partial w^l_{jk}} = \sigma'(z^l_i) \cdot \frac{\partial z^l_i}{\partial w^l_{jk}}

Since:

z^l_i = \sum_k w^l_{ik} a^{l-1}_k + b^l_i

Then:

\frac{\partial z^l_i}{\partial w^l_{jk}} = \begin{cases} a^{l-1}_k & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}

So:

\frac{\partial a^l_j}{\partial w^l_{jk}} = \sigma'(z^l_j) \cdot a^{l-1}_k

Multiplying by \frac{\partial C}{\partial a^l_j}, and noting that \frac{\partial C}{\partial a^l_j} \cdot \sigma'(z^l_j) = \delta^l_j, we get:

\frac{\partial C}{\partial w^l_{jk}} = \delta^l_j \cdot a^{l-1}_k

Switching to matrix form the full gradient is given by:

\frac{\partial C}{\partial w^l} = \delta^l (a^{l-1})^\top

This is the outer product between the error vector at layer l and the activations from the previous layer.

What about the bias b^l?

The bias appears in the pre-activation like this:

z^l = W^l a^{l-1} + b^l

So when we differentiate with respect to b^l, it's even simpler.

We know:

a^l = \sigma(z^l) = \sigma(W^l a^{l-1} + b^l)

This means the bias b^l directly and linearly affects the pre-activation z^l.
That is:

\frac{\partial z^l}{\partial b^l} = I

(the identity matrix).

So, using the chain rule:

\frac{\partial C}{\partial b^l} = \frac{\partial z^l}{\partial b^l} \cdot \frac{\partial C}{\partial z^l} = I \cdot \delta^l = \delta^l

The gradient of the cost with respect to the bias is simply the error vector:

\frac{\partial C}{\partial b^l} = \delta^l

Conclusion

Key Takeaway: From General Backpropagation to Implementation

Starting from the most general backpropagation equations, we derived the specific forms used in practice, depending on the activation and cost functions chosen. Here is a concise summary of how the expressions evolve:


Output Layer Error

\delta^L := \frac{\partial C}{\partial a^L} \quad \to \quad \delta^L = (a^L - y) \odot a^L \odot (1 - a^L)

Recursive Error for Hidden Layers

\delta^l = \left( \frac{\partial a^{l+1}}{\partial a^l} \right)^\top \cdot \delta^{l+1} \quad \to \quad \delta^l = \left((W^{l+1})^\top \cdot \delta^{l+1}\right) \odot a^l \odot (1 - a^l)

Gradient of Weights

\frac{\partial C}{\partial w^l} = \frac{\partial a^l}{\partial w^l} \overset{i}{\cdot} \delta^l \quad \to \quad \frac{\partial C}{\partial w^l} = \delta^l (a^{l-1})^\top

Gradient of Biases

\frac{\partial C}{\partial b^l} = \delta^l

This summary highlights the transition from theoretical derivatives to practical formulas used in neural network training.

Implementation

Now we can finally dive into the implementation of the algorithm using the equations found above. You can find the complete code on my github page; below you will find a brief explanation of the key concepts.

Setup

We are going to set up the data structures used in the class. The method np.random.randn creates an array of the shape indicated inside the parentheses, filled with random values.

self.num_layers = len(sizes)
self.sizes = sizes
# one (y, 1) bias column vector per layer, skipping the input layer
self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
# one (y, x) weight matrix per pair of adjacent layers
self.weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

The zip function pairs up two sequences, for example:

names = ["Simon", "Erik"]
scores = [15, 22]
res = zip(names, scores)
print(list(res))

# the result would be:
# [('Simon', 15), ('Erik', 22)]

For the biases we are creating a matrix of dimensions y \times 1; we are not assigning a simple vector because with a matrix it will be simpler to do later operations, like calculating the activation of a neuron.

The code sizes[1:] is used to ignore the first layer (the input layer of the network); this is because we don't want to assign biases to this layer.

In the self.weights part of the code we are randomly assigning the weights of the network; the code zip(sizes[:-1], sizes[1:]) pairs each layer with the next one, so we get one weight matrix per pair of adjacent layers. If for example our sizes list is [1,2,1] we would have two matrices, of dimensions 2 \times 1 and 1 \times 2:

\begin{pmatrix} w_{1,1}^1 \\ w_{2,1}^1 \end{pmatrix} \text{ and } \begin{pmatrix} w_{1,1}^2 & w_{1,2}^2 \end{pmatrix}

(remember the notation for weights of a network I’ve introduced at the beginning of this article).

This is just the introductory code that sets up the weights and biases of the network randomly.
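As a quick usage sketch, assuming (as in Nielsen's code) that the snippet above lives in the __init__ method of a Network class that takes the list of layer sizes:

import numpy as np

class Network:
	def __init__(self, sizes):
		self.num_layers = len(sizes)
		self.sizes = sizes
		self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
		self.weights = [np.random.randn(y, x)
						for x, y in zip(sizes[:-1], sizes[1:])]

net = Network([2, 3, 1])  # 2 inputs, one hidden layer of 3 neurons, 1 output
print([w.shape for w in net.weights])  # [(3, 2), (1, 3)]
print([b.shape for b in net.biases])   # [(3, 1), (1, 1)]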

Then we have the main function of this algorithm, update_mini_batch; it is the function that updates the weights and biases by applying gradient descent using backpropagation.

def update_mini_batch(self, mini_batch, eta):
	# accumulators for the gradients, one array per layer
	nabla_b = [np.zeros(b.shape) for b in self.biases]
	nabla_w = [np.zeros(w.shape) for w in self.weights]

	for x, y in mini_batch:
		# gradients of the cost for a single training example (x, y)
		delta_nabla_b, delta_nabla_w = self.backprop(x, y)
		nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
		nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]

	# update the parameters once per mini-batch, using the accumulated gradients
	self.weights = [w-(eta/len(mini_batch))*nw
					for w, nw in zip(self.weights, nabla_w)]
	self.biases = [b-(eta/len(mini_batch))*nb
					for b, nb in zip(self.biases, nabla_b)]

The first two lines are used to set up the data structures where the gradients for the biases and weights are going to be accumulated; np.zeros(b.shape) creates a numpy array of zeros with the shape of b.

The key part of this function is the loop over the mini_batch array: each item is composed of an input (x) and the expected output (y).

The most important line of code is delta_nabla_b, delta_nabla_w = self.backprop(x, y), because it calls the backprop function, which gives back, for each layer l:

\delta\nabla b^l = \frac{\partial C}{\partial b^{l}}

And

\delta\nabla w^l = \frac{\partial C}{\partial w^{l}}
nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]

Those two lines are used to accumulate the gradients (partial derivatives) for each layer in a neural network during mini-batch stochastic gradient descent (SGD).

Then we can update the weights and biases according to the equations we previously analyzed:

w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \qquad b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}
self.weights = [w-(eta/len(mini_batch))*nw 
	for w, nw in zip(self.weights, nabla_w)]

self.biases = [b-(eta/len(mini_batch))*nb
	for b, nb in zip(self.biases, nabla_b)]

Finally we are able to understand how this piece of code really works, and I'm going to explain it in detail.

def backprop(self, x, y):

	nabla_b = [np.zeros(b.shape) for b in self.biases]
	nabla_w = [np.zeros(w.shape) for w in self.weights]

	# feedforward
	activation = x
	activations = [x] # list to store all the activations, layer by layer
	zs = [] # list to store all the z vectors, layer by layer

	for b, w in zip(self.biases, self.weights):
		z = np.dot(w, activation)+b
		zs.append(z)
		activation = sigmoid(z)
		activations.append(activation)

# backward pass
	delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])

	nabla_b[-1] = delta
	nabla_w[-1] = np.dot(delta, activations[-2].transpose())

	for l in range(2, self.num_layers):
		z = zs[-l]
		sp = sigmoid_prime(z)
		delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
		nabla_b[-l] = delta
		nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())

	return (nabla_b, nabla_w)

These first two lines, as in the update_mini_batch function, are used to set up the shape of the numpy arrays that we are going to return at the end of the function.

nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]

The next piece of code is the feedforward phase, a key part of the process: we need to know the activation of every single layer if we want to backpropagate the error. Remember from before how we defined the activation for a layer \ell.

a^{\ell}=\sigma\left( W^{\ell}a^{\ell-1} + b^{\ell} \right)

The input x stores the activation of the first layer of the network, while the activations list is going to store the activation arrays of all the layers of the network. Remember also how we defined the intermediate quantity z:

z^{\ell} = W^{\ell}a^{\ell-1} + b^{\ell}
# feedforward
	activation = x
	activations = [x] # list to store all the activations, layer by layer
	zs = [] # list to store all the z vectors, layer by layer

	for b, w in zip(self.biases, self.weights):
		z = np.dot(w, activation)+b
		zs.append(z)
		activation = sigmoid(z)
		activations.append(activation)

Remembering that, it's easy to understand what this code is actually doing: it just applies the dot product between the weight matrix and the activations, adds the bias, and applies the activation function.

Here comes the key part of the algorithm: how do we write the code for a backward pass? First of all, remember we need to compute the error of the last layer as:

\delta^L = (a^L - y) \odot a^L \odot (1 - a^L)

And we compute also the partial derivative for the weights and biases for the last layer.

\frac{\partial C}{\partial w^L} = \delta^L (a^{L-1})^\top \qquad \frac{\partial C}{\partial b^L} = \delta^L

Then we repeat this for all the layers down to the first one: for each layer we take its stored activations and z values and calculate the error thanks to the backpropagation equation:

\delta^l = \left((W^{l+1})^\top \cdot \delta^{l+1}\right) \odot a^l \odot (1 - a^l)

And we compute again the gradients of the weights and biases; we stop once we have reached the first layer and return the result.

# backward pass
	delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])

	nabla_b[-1] = delta
	nabla_w[-1] = np.dot(delta, activations[-2].transpose())

	for l in range(2, self.num_layers):
		z = zs[-l]
		sp = sigmoid_prime(z)
		delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
		nabla_b[-l] = delta
		nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())

	return (nabla_b, nabla_w)

References