Now that we have constructed one of the simplest NNs possible (if we can even call it a NN), we build on it and extend our model. The first step is to introduce some (non-linear) activation functions, also called transfer functions, \[
y = f(A, b, x).
\]
Some common functions are:

- linear \[ f(x) = x. \]
- binary step \[ f(x) =
\begin{cases}
0 & \text{for}\quad x \leq 0,\\
1 & \text{for}\quad x > 0.
\end{cases}
\]
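To make this concrete, here is a small sketch of both activations together with a numerical derivative via central differences; the helper names are our own, and the derivative approximation is one possible way to produce plots like Figure 7.1 (b):

```python
import numpy as np

def linear(x):
    return x

def binary_step(x):
    # 0 for x <= 0, 1 for x > 0
    return np.where(x > 0, 1.0, 0.0)

def numerical_derivative(f, x, h=1e-6):
    # Central differences: (f(x + h) - f(x - h)) / (2h)
    return (f(x + h) - f(x - h)) / (2 * h)
```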
(a) The activation functions. (b) Numerical derivatives of the activation functions.
Figure 7.1: Different activation functions and their derivatives.
7.1 The trainable parameters of a Neural Network
Before we build our first NN in the next section, we need to get a better understanding of the dimensions and the parameters available for optimization when training a NN. As discussed, these parameters are the matrices containing the weights, and the biases. Of course, they are directly related to the neurons of each layer.
Figure 7.2: Generic linear NN with weights and biases.
Exercise 7.1 (Compute the parameters of the Neural Network) For the NN displayed in Figure 7.2, answer the following questions:
1. What are the shapes of the matrices \(A_1, A_2, A_3, A_4\)?
2. Which of these matrices are sparse (contain multiple zeros)?
3. What are the shapes of the biases \(b_1, b_2, b_3, b_4\)?
4. Write down the (expanded) formula for the output \(y\) with respect to the input \(x\) resulting from the composition of the different layers for \(f(x) = x\) in each layer (as seen in Figure 7.2).
5. Is this formulation a good option to compute \(y\) from \(x\) (provide some reasoning)?
In Figure 7.2 we also used the term hidden layers for all the layers inside the NN. This is a common formulation reflecting the nature of a NN: the activations of these layers (the transfer from their specific input to their output, \(f_j(A_j, b_j, x^{(j-1)})\)) are not exposed and cannot be observed by the user directly.
Now let us build our first NN.
7.2 A Neural Network with pytorch
Important
There are multiple frameworks available for NNs in Python. We will focus on pytorch for these notes.
Nevertheless, see Appendix B for an implementation of the same NN with an alternative framework.
Note
There are a couple of steps required, and some of them foreshadow later decisions. We tried to split the material in a way that provides better understanding and makes it easy to follow.
Let us start with the input and how to prepare our data for the model-to-be. Before we load everything into the pytorch specific structures, we split our data into a training and a test set. This time it is important that we rescale our images to \([0, 1]\), otherwise the optimization algorithms perform poorly. We divide by \(255\), the theoretical maximum in a generic (8-bit) image.
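A minimal sketch of this preparation could look as follows; the array names X and Y and the use of scikit-learn for the split are assumptions, not necessarily the exact setup from earlier chapters:

```python
from sklearn.model_selection import train_test_split

# X: flattened images, Y: labels (names assumed for illustration)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

# Rescale to [0, 1] with the theoretical maximum of a generic 8-bit image
X_train = X_train / 255.0
X_test = X_test / 255.0
```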
We need to convert the training data to PyTorch tensors. Unfortunately, this might result in larger data if we are not careful; uint8 works for the loss computation we aim for (see later). Afterwards, we combine \(X\) and \(Y\) such that they can easily be used with a DataLoader.
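A sketch of this conversion, with tensor names chosen for illustration:

```python
import torch
from torch.utils.data import TensorDataset

# Features as float32 (required after the rescaling above)
X_t = torch.tensor(X_train, dtype=torch.float32)
Y_t = torch.tensor(Y_train)

# Combine X and Y so the pairs can be consumed by a DataLoader
dataset = TensorDataset(X_t, Y_t)
```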
The easiest way to provide data for training is to use the mentioned DataLoader class. As we only have a very limited number of training samples, we split our available data into a training and a validation set. We do so anew and at random in each epoch (optimization step). To provide a better overview, we use a dedicated function for this procedure. This function will be called during each optimization step in our training.
```python
from torch.utils.data import DataLoader, TensorDataset, random_split

def get_dataset(dataset: TensorDataset, val_split: float, batch_size: int):
    val_size = int(val_split * len(dataset))
    train_size = len(dataset) - val_size
    train_ds, val_ds = random_split(dataset, [train_size, val_size])
    # Create a DataLoader for each split
    train = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    val = DataLoader(val_ds, batch_size=batch_size, shuffle=True)
    return train, val
```
Now that we have taken care of our input, we need to discuss the output of our model. Instead of having a single neuron that is zero or one, we will have two neurons, i.e. \(y\in\mathbb{R}^2\), using a so-called one-hot encoding for our two classes \[
y = \left[\begin{array}{c}1 \\ 0\end{array}\right] = \text{dog},
\quad\text{and}\quad
y = \left[\begin{array}{c}0 \\ 1\end{array}\right] = \text{cat}.
\]
The idea is that the model will provide us with the probability that an image shows a dog in \(y_1\) or a cat in \(y_2\) (the sum of both will be \(1\)).
Next, we define the activation function in the first layer as the non-linear function \(f_1 = \tanh\). The idea is to use a more complicated function to reflect the nature of our images. The rest remains unchanged, with the weights aggregated into the matrix \(A_1\) and the biases into the vector \(b_1\). In the next layer we apply a linear transform to provide some more freedom and then use the softmax function \(\sigma\) (see Section A.3 for a more detailed explanation and example) to translate the result into a probability \[
\sigma: \mathbb{R}^n \to (0, 1)^n, \quad \sigma(x)_i = \frac{\exp(x_i)}{\sum_j{\exp(x_j)}}.
\]
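As a quick numerical illustration (the scores are made up):

```python
import torch

# Softmax turns arbitrary scores into probabilities that sum to one
z = torch.tensor([2.0, 0.5])
p = torch.softmax(z, dim=0)
print(p)        # tensor([0.8176, 0.1824])
print(p.sum())  # ~1.0
```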
We visualized the resulting network in Figure 7.3.
Figure 7.3: A two layer structure for the cats vs. dogs classification with a non-linear model.
The following couple of code sections translate Figure 7.3 into a pytorch model. The main idea is that we create a model class that inherits from torch.nn.Module and then perform our training on this class, benefiting from the inherited capabilities.
Tip
As we will see in Exercise 7.9, softmax and our loss function can be combined in a numerically stable way. Therefore, it is often advised not to include softmax during training, but only for inference.
For the following code snippet, we keep it in as a comment to show where it would be included.
```python
import torch

class MyFirstNN(torch.nn.Module):  # (1)
    def __init__(self, input_params):  # (2)
        super(MyFirstNN, self).__init__()  # (3)
        self.model = torch.nn.Sequential(  # (4)
            torch.nn.Linear(input_params, 2),
            torch.nn.Tanh(),
            torch.nn.Linear(2, 2),
            # torch.nn.Softmax(dim=1),
        )

    def forward(self, x):  # (5)
        y = self.model(x)
        return y
```
1. Create a class inheriting from torch.nn.Module.
2. The layers of the NN are defined in the initialization of the class.
3. Do not forget to initialize the super class.
4. We define a sequential model, as this best reflects our desired structure: a first layer that reduces down to two neurons and applies the function \(\tanh\), and a second layer followed by the (here commented out) softmax.
5. In forward we define how data moves through the network.
Of course, it is important to check that the code corresponds to the model we have in mind. For this, torchinfo is quite a useful tool; it provides a tabular overview.
```python
import torchinfo

model = MyFirstNN(X_train.shape[1])
batch_size = 8
info = torchinfo.summary(model, (batch_size, X_train.shape[1]), col_width=12)
# Nicer formatting for the notes
info.layer_name_width = 15
print(info)
```
Exercise 7.2 (The model summary) For the summary above, answer the following questions:
1. Explain how the Param # column is computed and retrace the details.
2. If we expand the Params size to (KB) we get 8.03, what does this imply about the data type of the parameters?
It is often also quite useful to export the model as an image. Here the torchviz package comes in handy. In our case we can again work with a dot file, which is rendered as Figure 7.4.
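A possible export sketch (the dummy input and the file name are our choices):

```python
import torch
import torchviz

# Trace one forward pass so the computation graph can be recorded
dummy = torch.randn(1, X_train.shape[1])
graph = torchviz.make_dot(model(dummy), params=dict(model.named_parameters()))
graph.save("model.dot")  # the dot file can then be rendered with graphviz
```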
(a) Final classification on the test set. (b) Probabilities of the two classes - our actual output of the model. (c) Summary of the key metrics of the model training.
Figure 7.5: Performance of our model.
In Figure 7.5 (a) we can see the final classification of our model with regard to the test set. For 11 dogs our model is convinced they are not good dogs but cats, and 3 cats are classified as dogs. If we look at the probabilities in Figure 7.5 (b), we can see that we have a couple of close calls, but in general our model is quite sure about the classification. Regarding the history of our optimization, we can see three phases: first, our accuracy stays roughly constant for about the first 80 iterations; then the network starts learning up to around iteration 120; after that only the loss function declines, while the accuracy stays high.
At the end, we have an accuracy of 82.5% on our test set, a bit better than with our linear models (Figure 7.1).
Exercise 7.3 (Learning rate and momentum) The optimizer SGD we used has two options we want to investigate further: the learning rate lr and the momentum.
1. Change the learning rate and see how this influences the performance; you might want to increase the number of epochs as well. Try \(lr \in [10^{-4}, 10^{-1}]\).
2. Change the momentum between \(0\) and \(1\). Can this improve the predictions for different learning rates, such that the NN is more sure of its decision (i.e. in Figure 7.5 (b) the bars are not almost equal but lean to one side)?
Exercise 7.4 (Change the optimizer) As we have seen so often before, the optimizer used for the problem can change how fast convergence is reached (if at all).
1. Have a look at the possibilities provided in the framework (Overview - Optimizers); they are basically all improvements on Stochastic Gradient Descent.
2. Test different approaches, especially one of the Adam (Adam, AdamW, Adamax) and Ada (Adadelta, Adafactor, Adagrad, SparseAdam) implementations, and record the performance (final accuracy); see the sketch below for how to swap the optimizer.
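A minimal sketch of such a swap, assuming the training setup from above:

```python
import torch

# Replace the SGD instance in the training setup, e.g. with Adam
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Other candidates to compare:
# torch.optim.AdamW(model.parameters(), lr=1e-3)
# torch.optim.Adagrad(model.parameters(), lr=1e-2)
```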
Exercise 7.5 (Train and validation split) In the above version of the training loop we split our dataset anew in each epoch. Change this such that the split is done only once per call of the training.
What is the influence on the training?
How is the performance for different optimizers?
Exercise 7.6 (Train with softmax) Include softmax into the model for training and see how the performance is influenced.
Exercise 7.7 (Optional: PyTorch Lightning) The module pytorch lightning promises to streamline the training process and to reduce the code.
Following Lightning in 15 minutes (or any other tutorial), rewrite your code to use this framework.
Note: This will make the dvclive integration required below slightly easier.
7.3 How to save a pytorch model
Of course we can save our pytorch model with the methods discussed in Chapter 4, but it is more convenient to use the dedicated functions, see the docs.
In short, we only need to call torch.save(model.state_dict(), file) to save in the default format using pickle; be careful when using it, see Section 4.2.
torch.save("model.pt")loaded_model = model = MyFirstNN(X_train.shape[1])loaded_model.load_state_dict(torch.load("model.pt", weights_only=True))
It is also possible to implement continuous checkpoints during training, see the docs. This allows us to resume training after an interruption or simply to store the model at the end of the training.
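A sketch following the checkpoint pattern from the pytorch documentation; the variables epoch, optimizer, and loss are assumed to exist in the training loop:

```python
# Save a checkpoint, e.g. at the end of an epoch
torch.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "loss": loss,
}, "checkpoint.pt")

# Resume later: restore the model and optimizer state
checkpoint = torch.load("checkpoint.pt", weights_only=True)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
```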
All of this can be done via the already discussed dvclive interface; the implementation is left as an exercise (if your module uses lightning this is even easier - see Exercise 7.7).
Exercise 7.8 (dvclive integration into the pytorch training) Implement the DVCLive integration to track the metrics.
Store the model as ONNX
It is also possible to export the model in the ONNX format, see Section 4.1 and, for the specifics, the docs.
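A minimal export sketch (the file name and the dummy input are chosen for illustration):

```python
# The export traces the model with an example input
dummy_input = torch.randn(1, X_train.shape[1])
torch.onnx.export(model, dummy_input, "model.onnx")
```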
7.4 Backward Propagation of Error - Backpropagation
Our first NN is a success, but how did it learn the necessary parameters to perform its task?
The answer is a technique called Backward Propagation of Error, or in short Backpropagation. This essential component of machine learning helps us to work out how the loss we compute translates into changes of the weights and biases in our network. In our training loop train_model, the lines
```python
y_pred = model(X_batch)          # Forward pass through the model
loss = loss_fn(y_pred, y_batch)  # Compute the loss
loss.backward()                  # Backpropagation
optimizer.step()                 # Update the model parameters
optimizer.zero_grad()            # Reset the gradients
```
are what we are focusing on in this section.
Important
A very nice and structured introduction by IBM can be found here.
The introduction of the technique is attributed to the paper of Rumelhart, Hinton, and Williams (1986). As usual, predecessors and independent similar proposals go back to the 1960s. As we have seen, our NN can mathematically be described by nested functions that are called inside the loss function \(\mathscr{L}(\Theta)\), for \(\Theta\) being all trainable parameters.
During the training, we can compute the change between the NN output and the provided label and use it in a gradient descent method. The result is the following iteration \[
\Theta^{(n+1)} = \Theta^{(n)} - \delta \nabla \mathscr{L}\left(\Theta^{(n)}\right),
\] where \(\delta\) is called the learning rate and is prescribed.
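In pytorch this iteration is exactly what plain SGD implements; a sketch of the correspondence (the concrete value of \(\delta\) is made up):

```python
import torch

# Theta^(n+1) = Theta^(n) - delta * grad L(Theta^(n)),
# realized by SGD without momentum over all trainable parameters
delta = 0.01  # learning rate, value chosen for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=delta)
```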
To compute the derivative of \(\mathscr{L}(\Theta)\), we can use the chain rule, as this propagates the error backwards through the network. For \(h(x) = g(f(x))\) the derivative \(h'(x)\) is computed as \[
h'(x) = g'(f(x))\,f'(x) \quad \Leftrightarrow \quad
\frac{\partial\, h}{\partial\, x} (x) = \frac{\partial\, g}{\partial\, f}(f(x)) \cdot \frac{\partial\, f}{\partial\, x}(x),
\] or in Leibniz notation for a variable \(z\) that depends on \(y\), which itself depends on \(x\) we get \[
\frac{\mathrm{d}\, z}{\mathrm{d}\, x} = \frac{\mathrm{d}\, z}{\mathrm{d}\, y} \cdot \frac{\mathrm{d}\, y}{\mathrm{d}\, x}.
\]
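We can check the chain rule numerically with autograd; the concrete choices \(f = \tanh\) and \(g(v) = v^2\) are picked only for illustration:

```python
import torch

# h(x) = g(f(x)) with f = tanh and g the square
x = torch.tensor(0.3, requires_grad=True)
h = torch.tanh(x) ** 2
h.backward()  # autograd applies the chain rule for us

# Manual chain rule: g'(f(x)) * f'(x) = 2 tanh(x) * (1 - tanh(x)^2)
manual = 2 * torch.tanh(x) * (1 - torch.tanh(x) ** 2)
print(torch.isclose(x.grad, manual))  # tensor(True)
```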
Example 7.1 (A simple case) To illustrate the procedure we start with the simplest example, illustrated in Figure 7.6.
Figure 7.6: One node, one layer model for illustration of the backpropagation algorithm.
To get the output \(y\) from \(x\) the following computation is required \[
y = g(z, b) = g(f(x, a), b).
\]
If we now assume a mean squared error for the final loss, \[
\mathscr{L} = \frac12 (y_0 - y)^2,
\] we get an error depending on the weights \(a\) and \(b\), with \(y_0\) being the ground truth or correct result. In order to minimize the error according to \(a\) and \(b\), we need to compute the partial derivatives with respect to these variables, \[
\frac{\partial\, \mathscr{L}}{\partial\, a} = \frac{\partial\, \mathscr{L}}{\partial\, y} \cdot \frac{\partial\, g}{\partial\, z} \cdot \frac{\partial\, f}{\partial\, a},
\quad\text{and}\quad
\frac{\partial\, \mathscr{L}}{\partial\, b} = \frac{\partial\, \mathscr{L}}{\partial\, y} \cdot \frac{\partial\, g}{\partial\, b}.
\]
For a particular \(\mathscr{L}\), \(g\), and \(f\) we can compute these explicitly, e.g. for \(\mathscr{L}=\tfrac12(y_0 - y)^2\), \(g(z,b) = b z\), and \(f(x,a) = \tanh(a x)\).
With this information we can define the gradient descent update \[
\begin{aligned}
a^{(k+1)} &= a^{(k)} - \delta \frac{\partial\, \mathscr{L}}{\partial\, a} = a^{(k)} - \delta \left(- (y_0 - y) \cdot b^{(k)} \cdot (1 - \tanh^2(a^{(k)} x)) \cdot x\right), \\
b^{(k+1)} &= b^{(k)} - \delta \frac{\partial\, \mathscr{L}}{\partial\, b} = b^{(k)} - \delta \left( - (y_0 - y) \cdot \tanh(a^{(k)} x)\right).
\end{aligned}
\]
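To see these updates in action, here is a small autograd sketch of Example 7.1; the values for \(x\), \(y_0\), the initial weights, and \(\delta\) are made up:

```python
import torch

x, y0, delta = torch.tensor(0.5), torch.tensor(1.0), 0.1
a = torch.tensor(0.3, requires_grad=True)  # "random" initial weights
b = torch.tensor(0.7, requires_grad=True)

for _ in range(100):
    y = b * torch.tanh(a * x)   # forward pass: y = g(f(x, a), b)
    loss = 0.5 * (y0 - y) ** 2  # squared error loss
    loss.backward()             # backpropagation yields dL/da and dL/db
    with torch.no_grad():       # gradient descent update as above
        a -= delta * a.grad
        b -= delta * b.grad
    a.grad.zero_()
    b.grad.zero_()
```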
Note
Definition 7.1 (Backpropagation) Now that we have a better understanding, we can define the backpropagation procedure, see Brunton and Kutz (2022) as reference:
1. Specify the NN along with the labeled training data.
2. Initialize the weights and biases of the NN with random values. If they are initialized with zero, the gradient method will update all of them in the same fashion, which is not what we are looking for.
3. In a loop, until convergence or a maximum number of iterations is reached:
   1. Run the training data through the NN to compute \(y\). Compute the according loss and its derivatives with respect to each weight and bias.
   2. For a given learning rate \(\delta\), update the NN parameters via the gradient method.
We can see this reflected in our code for the NN above.
Exercise 7.9 (Dogs and cats) Let us translate these findings to our example visualized in Figure 7.3. In order to do so, we need to specify our variables in more detail. To simplify things a bit, we set the biases to the zero vector.
First, we call our output \(p = [p_1, p_2]\), and our labels are encoded in one-hot encoding as \(y = [y_1, y_2]\), where \(y_1=1\) and \(y_2=0\) if the image is a dog, and vice versa if the image is a cat.
As our loss function we use cross-entropy \[
\begin{aligned}
\mathscr{L}(\Theta) &= - \frac12 \sum_{i=1}^2 y_i \log(p_i) = - \frac{y_1 \log(p_1) + y_2 \log(p_2)}{2} \\
&= -\frac{1}{2} \left( y_1 \log(p_1) + (1-y_1) \log(1-p_1) \right),
\end{aligned}
\] for a single sample with the above notation. The last line follows from the fact that the entries of \(y\) and of \(p\) each sum to \(1\).
In order to make the computation of the derivatives easier, we use the variables as described in Figure 7.7.
Figure 7.7: The two layer model of Figure 7.3 with the variables used for the backpropagation computation.
Therefore, we get \(p = g(z)=\sigma(B v)\) (softmax) and \(v = f(u)= \tanh(A x)\). Overall, we want to compute the change in \(B_{i, j}\) (for some fixed indices \(i\) and \(j\)). \[
\frac{\partial\, \mathscr{L}}{\partial \, B_{i, j}} = \frac{\partial\, \mathscr{L}}{\partial \, p} \cdot \frac{\partial\, p}{\partial \, z} \cdot \frac{\partial\, z}{\partial \, B_{i, j}}
\]
Perform this task in the following steps:
1. Compute \(\partial_{p_i} \mathscr{L}\).
2. The computation of the Jacobian of \(\sigma(z)\) is tricky, but together with the cross-entropy loss it becomes straightforward; therefore compute \(\partial_{z_i} \mathscr{L}(\sigma(z))\).
3. Compute \(\partial_{B_{i, j}} z\).
4. Write down the components showing up in the chain rule for \(\frac{\partial\, \mathscr{L}}{\partial \, A_{i, j}}\) (similar to above).
Brunton, Steven L., and J. Nathan Kutz. 2022. Data-Driven Science and Engineering - Machine Learning, Dynamical Systems, and Control. 2nd ed. Cambridge: Cambridge University Press.
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36. https://doi.org/10.1038/323533a0.