10 Transfer learning

When working with neural networks, it is always a good idea to have a look what other people did. It is very unlikely that we are the first to look at a particular class of problems and somebody else likely solved (or tried to solve) a similar problem. In addition, most of the time we will not have enough training data to do a full training. If we find such a NN we can look at the architecture, reuse parts of it, build on the training included and so forth. This technique is called transfer learning. As mentioned, if our training set is small, this is especially beneficial.

The main scenarios for transfer learning are:

Pretrained model: It takes a lot of resources to train a modern CNN. Quite often people release their final checkpoint and others can benefit from it. This means we can load the model and instead of working with random weights we have the trained weights of somebody else to start off with. We can now continue training with our dataset.
Fixed feature extractor: After loading a pretrained model, we can also remove, for example, the last classification layers of the network and replace them with our own classification. In this process we can specify the number of features we need for our problem and we can get results for our case. Of course we need to make sure that we hold up the principles of the training of the network. So we basically we get a feature extraction like PCA and build our network on this features.
Fine-tuning: Again, we can build on a pretrained model and use our data to fine-tune some layers of the network (with or without replacing the final stages). Usually, we would fix the first layers and only include single layers at the end of the network for this process. This goes along with the idea that the broad features are extracted in the beginning of the network (edges, colors, …) and only the small details later on. This approach limits the probability of overfitting but can shift the network to better identify the specifics of our dataset.

Of course it is not easy to decide exactly in which of the categories our problem falls but as a general rules:

if the new dataset is small we should not not fine-tune but rather only train a classifier.
- if the dataset is very different for the original training set we should probably remove a couple of layers and not the final stage.
if we have a large dataset we can risc fine-tuning of the network as overfitting is limited.
- if the dataset is very different from the original training set we can also train the entire model from scratch but we should still consider initializing with existing weights so we do not need to relearn everything (e.g. edges).

In (Geron 2022, chap. 14) we can find amongst a detailed explanation how CNNs work also a discussion about several CNNs architectures and models that are available for transfer learning.

10.1 Fine-tuning an ImageNet for cats and dogs

As example, we use our already well known dogs and cats example. As mentioned before, the dataset is small and we know that the ImageNet has the capabilities to classify for different species in dogs and cats. Therefore, we will switch the final layer of the network but retrain with all parameters trainable.

10.1.1 Model selection

Via thetorchvision module we have the capabilities to load some pre-existing models and weights, see docs. As mentioned, we want a model trained on the ImageNet dataset, we use EfficientNet (Tan and Le (2019)). We will see how to load the model a bit later but it is important to have a look at the docs to get an idea how we need to prepare our images and how to replace the final hidden layer. There are multiple versions of the network available, we use B0.

10.1.2 Data preparation

As the ImageNet dataset uses normal images we do not use our images in the wavelet basis but our original images.

Show the code for loading the dataset

import numpy as np
import scipy
import requests
import io
import matplotlib.pyplot as plt
import torch
import torchvision as tv
import torchvision.transforms.v2 as transforms
from torch.utils.data import TensorDataset, Dataset, DataLoader, random_split
%config InlineBackend.figure_formats = ["svg"]
# Initialize everything for reproducibility
torch.manual_seed(6020)
np.random.seed(6020)

response = requests.get(
    "https://github.com/dynamicslab/databook_python/"
    "raw/refs/heads/master/DATA/catData.mat")
cats_w = scipy.io.loadmat(io.BytesIO(response.content))["cat"]

response = requests.get(
    "https://github.com/dynamicslab/databook_python/"
    "raw/refs/heads/master/DATA/dogData.mat")
dogs_w = scipy.io.loadmat(io.BytesIO(response.content))["dog"]

size = int(np.sqrt(dogs_w.shape[0]))
s = 40
X_train = np.concatenate((dogs_w[:, :s], cats_w[:, :s]), axis=1).T
y_train = np.repeat(np.array([0, 1]), s)
X_test = np.concatenate((dogs_w[:, s:], cats_w[:, s:]), axis=1).T
y_test = np.repeat(np.array([0, 1]), 80 - s)

First, we need to make sure our input is in a format that can be handled by the NN. As the input size of the image is \(256 \times 256\) with RGB channels and with specific normalizations. To do this we use some helper functions from the torchvision.transforms.v2 module.

1T = transforms.Compose([
2    transforms.ToPILImage(),
3    transforms.Grayscale(num_output_channels=3),
4    transforms.Resize((256, 256)),
    transforms.ToTensor(),
5    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

1: Define a pipeline to apply several transformations to a single image.
2: Transform a numpy array to a PIL image.
3: Repeat the colour channel three times to simulate an RGB image.
4: Rescale via interpolation the image from \(64\times 64\) to \(256 \times 256\) pixel.
5: Normalize via the provided parameters.

Some steps can not be directly be applied in the above pipeline, therefore we reshape the flattened image to a square and rotate it before we apply the pipeline to each image.

X_train_tensor = torch.tensor(np.array(
    [T(img.reshape((size, size)).T) for img in X_train]))
y_train_tensor = torch.tensor(y_train, dtype=torch.uint8)
X_test_tensor = torch.tensor(np.array(
    [T(img.reshape((size, size)).T) for img in X_test]))
y_test_tensor = torch.tensor(y_test, dtype=torch.uint8)
dataset = TensorDataset(X_train_tensor, y_train_tensor)

The functions for preparing the training and validation dataset, the training loop and plotting stay the same.

Show the code for some basic functions for training

def myplot(y):
    plt.figure()
    n = y.shape[0]
    plt.bar(range(n), y)
    plt.plot([-0.5, n - 0.5], [0, 0], "k", linewidth=1.0)
    plt.plot([n // 2 - 0.5, y.shape[0] // 2 - 0.5], [-1.1, 1.1],
             "r-.", linewidth=3)
    plt.yticks([-0.5, 0.5], ["cats", "dogs"], rotation=90, va="center")
    plt.text(n // 4, 1.05, "dogs")
    plt.text(n // 4 * 3, 1.05, "cats")
    plt.gca().set_aspect(n / (2 * 3))

def get_dataset(dataset: TensorDataset, val_split: float, batch_size: int):
    val_size = int(val_split * len(dataset))
    train_size = len(dataset) - val_size
    train_ds, val_ds = random_split(dataset, [train_size, val_size])

    # Create a dataset
    train = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    val = DataLoader(val_ds, batch_size=batch_size, shuffle=False)
    return train, val

def train_model(model, dataset, loss_fn, optimizer,
                epochs=250, validation_split=0.1, 
                batch_size=8):

    met = {
        "train_loss": [],
        "val_loss": [],
        "train_acc": [],
        "val_acc": []}

    for epoch in range(epochs):
        model.train()
        train_loss, train_corr = 0, 0
        train_dl, val_dl = get_dataset(dataset, validation_split, batch_size)

        for X_batch, y_batch in train_dl:
            y_pred = model(X_batch)            # Forward pass through the model
            loss = loss_fn(y_pred, y_batch)    # Compute the loss
            loss.backward()                    # Backpropagation
            optimizer.step()                   # Update the model parameters
            optimizer.zero_grad()              # Reset the gradients

            train_loss += loss.item()
            train_corr += (y_pred.argmax(1) == y_batch).sum().item()

        model.eval()
        val_loss, val_corr = 0, 0                                       
        with torch.no_grad():                   # No gradient calculation
            for X_val, y_val in val_dl:
                y_val_pred = model(X_val)
                val_loss += loss_fn(y_val_pred, y_val).item()
                val_corr += (y_val_pred.argmax(1) == y_val).sum().item()

        met["train_loss"].append(train_loss / len(train_dl))
        met["val_loss"].append(val_loss / len(val_dl))
        met["train_acc"].append(train_corr / len(train_dl.dataset))
        met["val_acc"].append(val_corr / len(val_dl.dataset))

    return met

10.1.3 Model loading, modification, and training

Now, we can finally load our pertained model and modify its final layer.

1model = tv.models.efficientnet_b0(weights='IMAGENET1K_V1')
2num_classes = 2
in_features = model.classifier[1].in_features
3model.classifier[1] = torch.nn.Linear(in_features, num_classes)

batch_size = 8
epochs = 10
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
history = train_model(model, dataset, loss_fn, optimizer,
                      epochs=epochs,
                      validation_split=0.1,
                      batch_size=batch_size)

1: Load the model via the torchvision interface.
2: Define the input an output size of the new layer
3: Exchange the final layer of the network to allow for only two classes

Downloading: "https://download.pytorch.org/models/efficientnet_b0_rwightman-7f5810bc.pth" to /home/runner/.cache/torch/hub/checkpoints/efficientnet_b0_rwightman-7f5810bc.pth
  0%|          | 0.00/20.5M [00:00<?, ?B/s]100%|██████████| 20.5M/20.5M [00:00<00:00, 350MB/s]

Now that we have our model and it was trained for 1ß epochs we can have a look at its performance.

Show the code for the figure

model.eval()
with torch.no_grad():
    y_proba = torch.nn.Softmax(dim=1)(model(X_test_tensor))
y_predict = y_proba.argmax(axis=-1)
myplot(y_predict * (-2) + 1)
acc = (y_predict == y_test_tensor).sum().item() / len(y_test)

n = y_proba.shape[0]
plt.figure()
plt.bar(range(n), y_proba[:, 0], color='y', label=r"p(dog)")
plt.bar(range(n), y_proba[:, 1], color='b', label=r"p(cat)", 
        bottom=y_proba[:, 0])
plt.plot([-0.5, n - 0.5], [0.5, 0.5], "r", linewidth=1.0)
plt.gca().set_aspect(n / (1 * 3))
plt.legend()


epochs_range = range(len(history["train_acc"]))
plt_list = [["train_acc", "-"], ["train_loss", ":"],
            ["val_acc", "--"], ["val_loss", "-."]]
plt.figure()
for name, style in plt_list:
    plt.plot(epochs_range, history[name], style, label=name)
plt.legend(loc="lower right")
plt.gca().set_aspect(epochs / (1 * 3))
plt.show()

In Figure 10.1 (a) we can see the final classification of our model with regards to the test set. For only 7 dogs our model is convinced they are not good dogs but cats, and 0 cats are classified as dogs, in comparison to Figure 7.5 (a) and Figure 8.4 (a) we gained some correct classifications. If we look at the probabilities Figure 10.1 (b), we can see that we have a no close calls, but rather a model that is quite sure about the classification. Regarding the history of our optimization, we can see that we start quite good (our initial guess of what is a dog and what is a cat seems to be accurate).

At the end, we have an accuracy of 91.25% for our test set,

Exercise 10.1 (Classifier) Instead of replacing the final layer of the classifier, add an additional layer with only two outputs for our two classes. The simplest way is to use the loaded module as the first element in a torch.nn.Sequential model and add a second linear layer.

Exercise 10.2 (Freeze the original network) Instead of including the layers of the original model in the backpropagation try to freeze them and only train a final classifier.

See if using one of the other EfficientNet versions (e.g. v3) performs better or worse in this scenario.

Exercise 10.3 (Padding) Instead of bloating up the image via Resize use Pad to imped the image with a boundary to make sure to reach the original size constraints.

Try the following scenarios and see if there is a difference in the score: - Symmetric padding with black background - Symmetric padding with white background - Asymmetric padding (image in one corner)

Exercise 10.4 (Use a different model) There are more models trained on EfficientNet, try at least with one other model to see if similar results are possible.