7  Neural Networks

Now that we have constructed one of the simplest NNs possible (if we can even call it a NN), we build on it and extend our model. The first step is to introduce some (non-linear) activation functions, also called transfer functions, \[ y = f(A, b, x). \]

Some common functions are:

  1. linear \[ f(x) = x. \]

  2. binary step \[ f(x) = \begin{cases} 0 & \text{for}\quad x \leq 0,\\ 1 & \text{for}\quad x > 0.\\ \end{cases} \]

  3. logistic (soft step) or sigmoid \[ f(x) = \frac{1}{1 + \exp (-x)}. \]

  4. tanh \[ f(x) = \tanh (x). \]

  5. rectified linear unit (ReLU) \[ f(x) = \begin{cases} 0 & \text{for}\quad x \leq 0,\\ x & \text{for}\quad x > 0.\\ \end{cases} \]

We can visualize them and their derivatives (we need them later).

Show the code for the figure
import numpy as np
import scipy.differentiate
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ["svg"]

x = np.linspace(-4, 4, 1025, endpoint=True)
linear = lambda x: x
binary = lambda x: np.heaviside(x, 1)
logistic = lambda x: 1 / (1 + np.exp(-x))
tanh = lambda x: np.tanh(x)
relu = lambda x: np.maximum(0, x)

plt.figure()
plt.plot(x, linear(x), "-", label="Linear")
plt.plot(x, binary(x), ":", label="Binary")
plt.plot(x, logistic(x),"--",  label="Logistic")
plt.plot(x, tanh(x), "-.", label="tanh")
plt.plot(x, relu(x), "-.", label="ReLU",
         marker="o", markevery=100, markersize=3)
plt.legend()
plt.xlim([-4, 4])
plt.ylim([-2, 2])
plt.gca().set_aspect(8 / (4 * 3))

plt.figure()
plt.plot(x, scipy.differentiate.derivative(linear, x).df, "-", label="Linear")
plt.plot(x, scipy.differentiate.derivative(binary, x).df, ":", label="Binary")
plt.plot(x, scipy.differentiate.derivative(logistic, x).df,"--",  label="Logistic")
plt.plot(x, scipy.differentiate.derivative(tanh, x).df, "-.", label="tanh")
plt.plot(x, scipy.differentiate.derivative(relu, x).df, "-.", label="ReLU",
         marker="o", markevery=100, markersize=3)
plt.legend()
plt.xlim([-4, 4])
plt.ylim([-0.1, 1.1])
plt.gca().set_aspect(8 / (1.2 * 3))
plt.show()
(a) Activation functions.
(b) Numerical derivatives of the activation functions.
Figure 7.1: Different activation functions and their derivatives.

7.1 The trainable parameters of a Neural Network

Before we build our first NN in the next section, we need to get a better understanding of the dimensions involved and of the parameters available for the optimization when training a NN. As discussed, these parameters are the matrices combining the weights, and the biases. Of course, they are related to the neurons.

Figure 7.2: Generic linear NN with weights and biases.

Exercise 7.1 (Compute the parameters of the Neural Network) For the NN displayed in Figure 7.2, answer the following questions:

  1. What are the shapes of the matrices \(A_1, A_2, A_3, A_4\)?
  2. Which of these matrices are sparse (have multiple zeros)?
  3. What are the shapes of the biases \(b_1, b_2, b_3, b_4\)?
  4. Write down the (expanded) formula for the output \(y\) with respect to the input \(x\) resulting from the composition of the different layers for \(f(x) = x\) in each layer (as seen in Figure 7.2).
  5. Is this formulation a good option to compute \(y\) from \(x\) (provide some reasoning)?

In Figure 7.2 we also referred to all layers inside the NN as hidden layers. This is a common formulation reflecting the nature of the NN: their activations (the transfer from their specific input to their output, \(f_j(A_j, b_j, x^{(j-1)})\)) are not exposed and cannot be observed by the user directly.
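
To make this nested structure concrete before we move to pytorch, the following is a minimal numpy sketch of a generic linear NN as a composition of layers; the layer sizes are hypothetical and chosen only for illustration.

import numpy as np

rng = np.random.default_rng(6020)


def layer(A, b, x, f=lambda x: x):
    # One generic layer: x^(j) = f(A_j x^(j-1) + b_j)
    return f(A @ x + b)


# Hypothetical layer sizes (input -> two hidden layers -> output)
sizes = [6, 4, 3, 2]
A = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [rng.standard_normal(m) for m in sizes[1:]]

x = rng.standard_normal(sizes[0])    # input x = x^(0)
for A_j, b_j in zip(A, b):
    x = layer(A_j, b_j, x)           # here with the linear activation f(x) = x
y = x                                # output y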

Now let us build our first NN.

7.2 A Neural Network with pytorch

Important

There are multiple frameworks available for NNs in Python. We will focus on pytorch for these notes.

Nevertheless, see Appendix B for an implementation of the same NN with an alternative framework.

Note

There are a couple of steps required, and some foreshadow later decisions. We tried to split the material in a way that provides better understanding and is easy to follow.

Let us start with the input and how to prepare our data for the model to be built. Before we load everything into the pytorch-specific structures, we split our data into a training and a test set. This time it is important that we rescale our images to \([0, 1]\), otherwise the optimization algorithms perform poorly. We divide by \(255\), the theoretical maximum in a generic image.

import numpy as np
import scipy
import requests
import io
import matplotlib.pyplot as plt
import torch
from torch.utils.data import TensorDataset
%config InlineBackend.figure_formats = ["svg"]
# Initialize everything for reproducibility
torch.manual_seed(6020)
np.random.seed(6020)

response = requests.get(
    "https://github.com/dynamicslab/databook_python/"
    "raw/refs/heads/master/DATA/catData_w.mat")
cats_w = scipy.io.loadmat(io.BytesIO(response.content))["cat_wave"]

response = requests.get(
    "https://github.com/dynamicslab/databook_python/"
    "raw/refs/heads/master/DATA/dogData_w.mat")
dogs_w = scipy.io.loadmat(io.BytesIO(response.content))["dog_wave"]

s = 40
X_train = np.concatenate((dogs_w[:, :s], cats_w[:, :s]), axis=1).T / 255.0
y_train = np.repeat(np.array([0, 1]), s)
X_test = np.concatenate((dogs_w[:, s:], cats_w[:, s:]), axis=1).T / 255.0
y_test = np.repeat(np.array([0, 1]), 80 - s)

1X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.uint8)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.uint8)
2dataset = TensorDataset(X_train_tensor, y_train_tensor)
1
We need to convert the training data to PyTorch tensors. Unfortunately, this might result in larger data if we are not careful; uint8 works for the loss computation we aim for (see later).
2
Combine \(X\) and \(y\) such that they can easily be used with a DataLoader.

The easiest way to provide data for training is to use the mentioned DataLoader class. As we have only a very limited amount of training data, we split our available data into a training and a validation set. We do so anew and at random in each epoch. To provide a better overview, we use a dedicated function for this procedure. This function will be called at the beginning of each epoch of our training.

from torch.utils.data import DataLoader, random_split


def get_dataset(dataset: TensorDataset, val_split: float, batch_size: int):
    val_size = int(val_split * len(dataset))
    train_size = len(dataset) - val_size
    train_ds, val_ds = random_split(dataset, [train_size, val_size])

    # Create data loaders for the training and validation subsets
    train = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    val = DataLoader(val_ds, batch_size=batch_size, shuffle=True)
    return train, val

Now that we have taken care of our input, we need to discuss the output of our model. Instead of having a single neuron that is zero or one, we will have two neurons, i.e. \(y\in\mathbb{R}^2\), using a so-called one-hot encoding for our two classes. \[ y = \left[\begin{array}{c}1 \\ 0\end{array}\right] = \text{dog}, \quad\text{and}\quad y = \left[\begin{array}{c}0 \\ 1\end{array}\right] = \text{cat}. \]

The idea is that the model will provide us with the probability that an image shows a dog in \(y_1\) or a cat in \(y_2\) (the sum of both will be \(1\)).
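
In the code above we kept the labels as integer class indices (0 for dog, 1 for cat), which is what the loss function we use later expects; the one-hot vectors are mainly how we read the output. A minimal sketch of the correspondence, using torch.nn.functional.one_hot with a few example labels:

import torch

labels = torch.tensor([0, 1, 1, 0])    # dog, cat, cat, dog as class indices
one_hot = torch.nn.functional.one_hot(labels, num_classes=2)
print(one_hot)                         # rows [1, 0] for dogs and [0, 1] for cats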

Next, we define the activation function of the first layer as the non-linear function \(f_1 = \tanh\). The idea is to use a more complicated function to reflect the nature of our images. The rest remains unchanged, with the weights aggregated into the matrix \(A_1\) and the bias into the vector \(b_1\). In the next layer we have a linear transform to provide some more freedom, and then use the softmax function \(\sigma\) (see Section A.3 for a more detailed explanation and example) to translate the result into a probability \[ \sigma: \mathbb{R}^n \to (0, 1)^n, \quad \sigma(x)_i = \frac{\exp(x_i)}{\sum_j{\exp(x_j)}}. \]
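
As a quick sanity check of this definition, the following minimal sketch compares a manual implementation with torch.nn.Softmax; the logits are arbitrary example values.

import torch

x = torch.tensor([[2.0, -1.0]])                                # example logits
manual = torch.exp(x) / torch.exp(x).sum(dim=1, keepdim=True)  # the definition above
builtin = torch.nn.Softmax(dim=1)(x)
print(manual, builtin)                                         # identical values
print(manual.sum(dim=1))                                       # each row sums to 1

Each row sums to one, which is what allows us to interpret the output as probabilities for the two classes.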

We visualized the resulting network in Figure 7.3.

Figure 7.3: A two layer structure for the cats vs. dogs classification with a non-linear model.

The following couple of code sections translate Figure 7.3 into a pytorch model. The main idea is that we create a model class that inherits from torch.nn.Module and then perform our training on this class, benefiting from the inherited capabilities.

Tip

As we will see in Exercise 7.9, softmax and our loss function can be combined in a numerically stable way. Therefore, it is often advised not to include softmax during training but only during inference.

In the following code snippet, we keep it as a comment to show where it would be included.

import torch


1class MyFirstNN(torch.nn.Module):
2    def __init__(self, input_params):
3        super(MyFirstNN, self).__init__()
4        self.model = torch.nn.Sequential(
            torch.nn.Linear(input_params, 2),           
            torch.nn.Tanh(),                            
            torch.nn.Linear(2, 2),                      
            #torch.nn.Softmax(dim=1),                    
        )

5    def forward(self, x):
        y = self.model(x)
        return y
1
Create a class inheriting from torch.nn.Module.
2
The layers of the NN are defined in the initialization of the class.
3
Do not forget to initialize the super class.
4
We define a sequential model; this best reflects our desired structure: a first linear layer that reduces down to two neurons and applies the function \(\tanh\), followed by a second linear layer and the (commented-out) softmax.
5
In forward we define how data moves through the network.
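
To see why we can leave softmax out of the model during training (as mentioned in the Tip above), the following minimal check shows that torch.nn.CrossEntropyLoss applied to raw logits gives the same value as the negative log-likelihood of the log-softmax output; the logits and targets are arbitrary example values.

import torch

logits = torch.tensor([[1.5, -0.5], [0.2, 0.9]])    # raw model outputs
target = torch.tensor([0, 1])                       # class indices

ce = torch.nn.CrossEntropyLoss()(logits, target)
nll = torch.nn.NLLLoss()(torch.log_softmax(logits, dim=1), target)
print(ce, nll)                                      # both give the same value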

Of course it is important to check whether the code corresponds to the model we have in mind. For this, torchinfo is quite a useful tool. It provides a tabular overview.

import torchinfo

model = MyFirstNN(X_train.shape[1])
batch_size = 8
info = torchinfo.summary(model, (batch_size, X_train.shape[1]), col_width=12)
# Nicer formatting for the notes
info.layer_name_width=15
print(info)
================================================================
Layer (type:depth-idx)                   Output Shape Param #
================================================================
MyFirstNN                                [8, 2]       --
├─Sequential: 1-1                        [8, 2]       --
│    └─Linear: 2-1                       [8, 2]       2,050
│    └─Tanh: 2-2                         [8, 2]       --
│    └─Linear: 2-3                       [8, 2]       6
================================================================
Total params: 2,056
Trainable params: 2,056
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 0.02
================================================================
Input size (MB): 0.03
Forward/backward pass size (MB): 0.00
Params size (MB): 0.01
Estimated Total Size (MB): 0.04
================================================================

Exercise 7.2 (Explain the output of .summary()) For the (extended) output of .summary() in the following listing answer the questions below.

Show the code for the slightly extended summary.
import torchinfo

model = MyFirstNN(X_train.shape[1])
batch_size = 8
info = torchinfo.summary(model, (batch_size, X_train.shape[1]),
                  col_names=["input_size", "output_size", "num_params"],
                  col_width=12)
info.layer_name_width=15
print(info)

print(f"\n\nThe estimated total size in KB {info.total_param_bytes / 1024}")
============================================================================
Layer (type:depth-idx)                   Input Shape  Output Shape Param #
============================================================================
MyFirstNN                                [8, 1024]    [8, 2]       --
├─Sequential: 1-1                        [8, 1024]    [8, 2]       --
│    └─Linear: 2-1                       [8, 1024]    [8, 2]       2,050
│    └─Tanh: 2-2                         [8, 2]       [8, 2]       --
│    └─Linear: 2-3                       [8, 2]       [8, 2]       6
============================================================================
Total params: 2,056
Trainable params: 2,056
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 0.02
============================================================================
Input size (MB): 0.03
Forward/backward pass size (MB): 0.00
Params size (MB): 0.01
Estimated Total Size (MB): 0.04
============================================================================


The estimated total size in KB 8.03125
  1. Explain how the Param # column is computed and retrace the details.
  2. If we express the Params size in KB we get 8.03; what does this imply about the data type of the parameters?

It is often also quite useful to export the model as an image. Here the torchviz package comes in handy. In our case we can again work with a dot file, which is rendered as Figure 7.4.

import torchviz
model.eval()

x_viz = X_train_tensor[0, :].reshape([1, -1])
y_viz = model(x_viz)
dot = torchviz.make_dot(y_viz, params=dict(model.named_parameters()))
dot.save("model.dot")
Figure 7.4: Model with all its parameters visualized.

In pytorch we are responsible for writing our own training function. This allows for a lot of flexibility. The structure is always quite similar.

def train_model(model, dataset, loss_fn, optimizer,
                epochs=250, validation_split=0.1, 
                batch_size=8):

1    met = {
        "train_loss": [],
        "val_loss": [],
        "train_acc": [],
        "val_acc": []}

2    for epoch in range(epochs):
3        model.train()
        train_loss, train_corr = 0, 0
        train_dl, val_dl = get_dataset(dataset, validation_split, batch_size)

4        for X_batch, y_batch in train_dl:
            y_pred = model(X_batch)            # Forward pass through the model
            loss = loss_fn(y_pred, y_batch)    # Compute the loss
            loss.backward()                    # Backpropagation
            optimizer.step()                   # Update the model parameters
            optimizer.zero_grad()              # Reset the gradients

            train_loss += loss.item()
            train_corr += (y_pred.argmax(1) == y_batch).sum().item()

5        model.eval()
        val_loss, val_corr = 0, 0                                       
        with torch.no_grad():                   # No gradient calculation
6            for X_val, y_val in val_dl:
                y_val_pred = model(X_val)
                val_loss += loss_fn(y_val_pred, y_val).item()
                val_corr += (y_val_pred.argmax(1) == y_val).sum().item()

7        met["train_loss"].append(train_loss / len(train_dl))
        met["val_loss"].append(val_loss / len(val_dl))
        met["train_acc"].append(train_corr / len(train_dl.dataset))
        met["val_acc"].append(val_corr / len(val_dl.dataset))

    return met
1
Create a dict for our metrics.
2
The training loop for each epoch.
3
The model needs to be in train mode; this tells layers that behave differently during training and inference (such as dropout or batch normalization) to use their training behaviour.
4
The actual training: the data is processed in batches and the optimization step is performed for each batch.
5
To allow validation, the model needs to be in eval mode.
6
Validation step that computes the necessary metrics for our validation set.
7
Keep track of the metrics of our training. We log the average loss and the accuracy.

Now all parts are available and we can finally train our model.

1loss_fn = torch.nn.CrossEntropyLoss()
2optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

epochs = 120
3history = train_model(model, dataset, loss_fn, optimizer,
                      epochs=epochs,
                      validation_split=0.1,
                      batch_size=batch_size)
1
Define the loss function.
2
Define the optimizer.
3
Call the training loop.

This looks like a lot, but most of it is boilerplate code, and it allows for quite a lot of flexibility if required.

Now that training is complete we can have a look at the performance.

1model.eval()
with torch.no_grad():
2    y_proba = torch.nn.Softmax(dim=1)(model(X_test_tensor))
3y_predict = y_proba.argmax(axis=-1)
4acc = (y_predict == y_test_tensor).sum().item() / len(y_test)
1
Switch to evaluation mode.
2
Apply softmax to the output of the model.
3
Convert the probability into a class.
4
Compute the accuracy for the test images.

We can also visualize the findings for a better understanding.

Show the code for the figure
def myplot(y):
    plt.figure()
    n = y.shape[0]
    plt.bar(range(n), y)
    plt.plot([-0.5, n - 0.5], [0, 0], "k", linewidth=1.0)
    plt.plot([n // 2 - 0.5, y.shape[0] // 2 - 0.5], [-1.1, 1.1], "r-.", linewidth=3)
    plt.yticks([-0.5, 0.5], ["cats", "dogs"], rotation=90, va="center")
    plt.text(n // 4, 1.05, "dogs")
    plt.text(n // 4 * 3, 1.05, "cats")
    plt.gca().set_aspect(n / (2 * 3))


myplot(y_predict * (-2) + 1)

n = y_proba.shape[0]
plt.figure()
plt.bar(range(n), y_proba[:, 0], color='y', label=r"p(dog)")
plt.bar(range(n), y_proba[:, 1], color='b', label=r"p(cat)", 
        bottom=y_proba[:, 0])
plt.plot([-0.5, n - 0.5], [0.5, 0.5], "r", linewidth=1.0)
plt.gca().set_aspect(n / (1 * 3))
plt.legend()

plt.figure()

epochs_range = range(len(history["train_acc"]))
plt_list = [["train_acc", "-"], ["train_loss", ":"],
            ["val_acc", "--"], ["val_loss", "-."]]

for name, style in plt_list:
    plt.plot(epochs_range, history[name], style, label=name)
plt.legend(loc="lower right")
plt.gca().set_aspect(epochs / (1 * 3))
plt.show()
(a) Classification for our test set.
(b) Probabilities of the two classes - our actual output of the model.
(c) Summary of the key metrics of the model training.
Figure 7.5: Performance of our model.

In Figure 7.5 (a) we can see the final classification of our model with regard to the test set. For 11 dogs our model is convinced they are not good dogs but cats, and 3 cats are classified as dogs. If we look at the probabilities in Figure 7.5 (b), we can see that we have a couple of close calls, but in general our model is quite sure about its classification. Regarding the history of our optimization, we can see three phases: first, our accuracy stays roughly constant for about the first 80 epochs; then the network starts learning; and towards epoch 120 only the loss keeps declining while the accuracy stays high.

At the end, we have an accuracy of 82.5% for our test set, a bit better than with our linear models (Figure 7.1).

Exercise 7.3 (Learning rate and momentum) The used optimizer SGD has two options we want to investigate further. The learning rate lr and the momentum.

  1. Change the learning rate and see how this influences the performance; you might want to increase epochs as well. Try \(lr \in [10^{-4}, 10^{-1}]\).
  2. Change the momentum between \(0\) and \(1\). Can this improve the predictions for different learning rates, such that the NN becomes more sure of its decision (i.e. in Figure 7.5 (b) the bars are not almost equal but lean towards one side)?

Exercise 7.4 (Change the optimizer) As we have seen so often before, the optimizer used for the problem can change how fast convergence is reached (if at all).

  1. Have a look at the possibilities provided in the framework Overview - Optimizers - they are basically all improvements on Stochastic Gradient Descent.

  2. Test different approaches, especially one of the Adam (Adam, AdamW, Adamax) and Ada (Adadelta, Adafactor, Adagrad, SparseAdam) implementations and record the performance (final accuracy).

Exercise 7.5 (Train and validation split) In the above version of the training loop we split our dataset in each epoch. Change this such that the split is done once per call of the training function.

  1. What is the influence on the training?
  2. How is the performance for different optimizers?

Exercise 7.6 (Train with softmax) Include softmax into the model for training and see how the performance is influenced.

Exercise 7.7 (Optional: PyTorch Lightning) The module pytorch lightning1 promises to streamline the training process and to reduce the amount of code.

With the Lightning in 15 minutes (or any other tutorial) rewrite your code to use this framework.

Note: This will make the dvclive integration required below slightly easier.

7.3 How to save a pytorch model

Of course we can save our pytorch model with the methods discussed in Chapter 4 but it is more convenient to use the dedicated functions, see docs.

In short, we only need to call torch.save(model.state_dict(), file) to save in the default format using pickle; be careful when using it, see Section 4.2.

torch.save(model.state_dict(), "model.pt")
loaded_model = MyFirstNN(X_train.shape[1])
loaded_model.load_state_dict(torch.load("model.pt", weights_only=True))

It is also possible to implement continuous checkpoints during training, see docs. This allows us to resume training after an interrupt or simply store the model at the end of the training.
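
A minimal sketch of what such a checkpoint could look like, assuming the model, optimizer, and history objects from the training above; the file name checkpoint.pt and the stored keys are our own choice, see the docs for the full recipe.

checkpoint = {
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "history": history,                   # our recorded metrics
}
torch.save(checkpoint, "checkpoint.pt")

# ... later, to resume training or inspect the metrics:
checkpoint = torch.load("checkpoint.pt", weights_only=True)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])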

All of this can be done via the already discussed dvclive interface; the implementation is left as an exercise (if your module uses lightning this is even easier - see Exercise 7.7).

Exercise 7.8 (dvclive integration into the pytorch training) Implement the DVCLive integration to track the metrics.

Store the model as ONNX

It is also possible to export the model in the ONNX format, see Section 4.1 and for specifics the docs.
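
As a minimal sketch (the file name model.onnx and the tensor names are our own choice), the export needs a dummy input that fixes the expected input shape:

dummy_input = torch.randn(1, X_train.shape[1])   # one sample with the right width
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["image"], output_names=["logits"])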

7.4 Backward Propagation of Error - Backpropagation

Our first NN is a success, but how did it learn the necessary parameters to perform its task?

The answer is a technique called Backward Propagation of Error, or in short Backpropagation. This essential component of machine learning helps us work out how the loss we compute translates into changes of the weights and biases in our network. In our training loop train_model the lines

y_pred = model(X_batch)            # Forward pass through the model
loss = loss_fn(y_pred, y_batch)    # Compute the loss
loss.backward()                    # Backpropagation
optimizer.step()                   # Update the model parameters
optimizer.zero_grad()              # Reset the gradients

are what we are focusing on in this section.

Important

A very nice and structured introduction by IBM can be found here2.

An introduction closer to the mathematics is shown in (Brunton and Kutz 2022, chap. 6.3).

The introduction of the technique is attributed to the paper of Rumelhart, Hinton, and Williams (1986). As usual, predecessors and independent similar proposals go back to the 1960s. As we have seen, our NN can mathematically be described by nested functions that are called inside the loss function, \(\mathscr{L}(\Theta)\), for \(\Theta\) being all trainable parameters.

During the training, we can compute the change between the NN output and the provided label and use it in a gradient descent method3. The result is the following iteration \[ \Theta^{(n+1)} = \Theta^{(n)} - \delta \nabla \mathscr{L}\left(\Theta^{(n)}\right), \] where \(\delta\) is called the learning rate and it is prescribed.
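
As a minimal sketch of what one such update looks like by hand, assuming the gradients have already been computed by loss.backward() and using a hypothetical learning rate; in pytorch this is essentially what optimizer.step() does for plain SGD.

delta = 1e-2                                # learning rate (hypothetical value)
with torch.no_grad():                       # the update itself is not differentiated
    for theta in model.parameters():
        if theta.grad is not None:
            theta -= delta * theta.grad     # Theta <- Theta - delta * grad L(Theta)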

To compute the derivative of \(\mathscr{L}(\Theta)\), we can use the chain rule as this propagates the error backwards through the network. For \(h(x) = g(f(x))\) the derivative \(h'(x)\) is computed as \[ h'(x) = g'(f(x))f'(x) \quad \Leftrightarrow \quad \frac{\partial\, h}{\partial\, x} (x) = \frac{\partial\, g}{\partial\, x}(f(x)) \cdot \frac{\partial\, f}{\partial\, x}(x), \] or in Leibniz notation for a variable \(z\) that depends on \(y\), which itself depends on \(x\) we get \[ \frac{\mathrm{d}\, z}{\mathrm{d}\, x} = \frac{\mathrm{d}\, z}{\mathrm{d}\, y} \cdot \frac{\mathrm{d}\, y}{\mathrm{d}\, x}. \]

Example 7.1 (A simple case) To illustrate the procedure we start with the simplest example, illustrated in Figure 7.6.

Figure 7.6: One node, one layer model for illustration of the backpropagation algorithm.

To get the output \(y\) from \(x\) the following computation is required \[ y = g(z, b) = g(f(x, a), b). \]

If we now assume a mean squared error for the final loss, \[ \mathscr{L} = \frac12 (y_0 - y)^2, \] we get an error depending on the weights \(a\) and \(b\), with \(y_0\) being the ground truth or correct result. In order to minimize the error with respect to \(a\) and \(b\) we need to compute the partial derivatives with respect to these variables

\[ \begin{aligned} \frac{\partial\, \mathscr{L}}{\partial\, a} &= \frac{\partial\, \mathscr{L}}{\partial\, y} \cdot \frac{\partial\, y}{\partial\, a} &=& \big[y = g(z, b)\big]\\ &= \frac{\partial\, \mathscr{L}}{\partial\, y}\cdot \frac{\partial\, g}{\partial\, z} \cdot \frac{\partial\, z}{\partial\, a} &=& \big[z = f(x, a)\big]\\ &= \frac{\partial\, \mathscr{L}}{\partial\, y}\cdot \frac{\partial\, g}{\partial\, z} \cdot \frac{\partial\, f}{\partial\, a} \end{aligned} \]

\[ \begin{aligned} \frac{\partial\, \mathscr{L}}{\partial\, b} &= \frac{\partial\, \mathscr{L}}{\partial\, y} \cdot \frac{\partial\, y}{\partial\, b} &=& \big[y = g(z,b)\big] \\ &= \frac{\partial\, \mathscr{L}}{\partial\, y}\cdot \frac{\partial\, g}{\partial\, b} \phantom{ \cdot \frac{\partial\, f}{\partial\, a}} \end{aligned} \]

For a particular \(\mathscr{L}\), \(g\), and \(f\) we can compute it explicitly, e.g. \(\mathscr{L}=\tfrac12(y_0 - y)^2\), \(g(z,b) = b z\), and \(f(x,a) = \tanh(a x)\)

\[ \begin{aligned} \frac{\partial\, \mathscr{L}}{\partial\, a} &= \frac{\partial\, \mathscr{L}}{\partial\, y} \frac{\partial\, g}{\partial\, z}\frac{\partial\, f}{\partial\, a} &=& - (y_0 - y) \cdot b \cdot (1 - \tanh^2(a x)) \cdot x,\\ \frac{\partial\, \mathscr{L}}{\partial\, b} &= \frac{\partial\, \mathscr{L}}{\partial\, y} \frac{\partial\, g}{\partial\, b} &=& - (y_0 - y) \cdot \tanh(a x). \end{aligned} \]

With this information we can define the gradient descent update \[ \begin{aligned} a^{(k+1)} &= a^{(k)} - \delta \frac{\partial\, \mathscr{L}}{\partial\, a} = a^{(k)} - \delta \left(- (y_0 - y) \cdot b^{(k)} \cdot (1 - \tanh^2(a^{(k)} x)) \cdot x\right), \\ b^{(k+1)} &= b^{(k)} - \delta \frac{\partial\, \mathscr{L}}{\partial\, b} = b^{(k)} - \delta \left( - (y_0 - y) \cdot \tanh(a^{(k)} x)\right). \end{aligned} \]
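
We can check these expressions numerically against pytorch's autograd; the values of \(x\), \(y_0\), \(a\), and \(b\) below are arbitrary example values.

import torch

x, y0 = 0.7, 1.0
a = torch.tensor(0.3, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)

y = b * torch.tanh(a * x)             # y = g(f(x, a), b)
loss = 0.5 * (y0 - y) ** 2            # L = 1/2 (y0 - y)^2
loss.backward()                       # backpropagation

dL_da = -(y0 - y) * b * (1 - torch.tanh(a * x) ** 2) * x
dL_db = -(y0 - y) * torch.tanh(a * x)
print(a.grad.item(), dL_da.item())    # identical values
print(b.grad.item(), dL_db.item())    # identical values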

Note

Definition 7.1 (Backpropagation) Now that we have a better understanding we can define the backpropagation procedure, see Brunton and Kutz (2022) as a reference:

  1. Specify the NN along with the labeled training data.
  2. Initialize the weights and biases of the NN with random values. If they are initialized with zero, the gradient method will update all of them in the same fashion, which is not what we are looking for.
  3. In a loop until convergence or a maximum of iterations is achieved:
    1. Run the training data through the NN to compute \(y\). Compute the corresponding loss and its derivatives with respect to each weight and bias.
    2. For a given learning rate \(\delta\) update the NN parameters via the gradient method.

We can see this reflected in our code for the NN above.

Exercise 7.9 (Dogs and cats) Let us translate these findings to our example visualized in Figure 7.3. In order to do so, we need to specify our variables in more detail. To simplify things a bit, we set the biases to the zero vector.

First, we call our output \(p = [p_1, p_2]\) and our labels are encoded in one-hot encoding \(y = [y_1, y_2]\) where \(y_1=1\) and \(y_2=0\) if the image is a dog, and vice versa if the image is a cat.

As our loss function we use cross-entropy \[ \begin{aligned} \mathscr{L}(\Theta) &= - \frac12 \sum_{i=1}^2 y_i \log(p_i) = - \frac{y_1 \log(p_1) + y_2 \log(p_2)}{2} \\ &= -\frac{1}{2} \left( y_1 \log(p_1) + (1-y_1) \log(1-p_1) \right), \end{aligned} \] for a single sample with the above notation. The last line follows from the fact that the entries of \(y\) and of \(p\) each sum to \(1\).

In order to make the computation of the derivatives easier we use the variables as described in Figure 7.7.

Figure 7.7: Variables of the two-layer cats vs. dogs network used to illustrate the backpropagation computation.

Therefore, we get \(p = g(z)=\sigma(B v)\) (softmax), and \(v = f(u)= \tanh(A x)\). Overall we want to compute the change in \(B_{i, j}\) (for some fixed indices \(i\), and \(j\)). \[ \frac{\partial\, \mathscr{L}}{\partial \, B_{i, j}} = \frac{\partial\, \mathscr{L}}{\partial \, p} \cdot \frac{\partial\, p}{\partial \, z} \cdot \frac{\partial\, z}{\partial \, B_{i, j}} \]

Perform this task in the following steps:

  1. Compute \(\partial_{p_i} \mathscr{L}\).
  2. The computation of the Jacobian of \(\sigma(z)\) is tricky, but together with the cross-entropy loss it is straightforward; therefore compute \(\partial_{z_i} \mathscr{L}(\sigma(z))\).
  3. Compute \(\partial_{B_{i, j}} z\).
  4. Write down the components appearing in the chain rule for \(\frac{\partial\, \mathscr{L}}{\partial \, A_{i, j}}\) (similar to the above).

  1. Install it via pdm add lightning.↩︎

  2. Access on 4th April 2025.↩︎

  3. see Kandolf (2025), Section 6.1 or follow the direct link.↩︎