9  Autoencoders

Another class of neural networks that exploits low-dimensional structure in high-dimensional data is the autoencoder. Autoencoders can be seen as a generalization of the linear subspace embedding of the singular value decomposition/principal component analysis1 to a nonlinear manifold embedding.

The main idea, as stated in (Goodfellow, Bengio, and Courville 2016, chap. 14), is to train a NN to copy its input \(x\) via the latent space \(y\) to the output \(z\). Crucially, \(y\) is usually low dimensional and represents the output of the encoding part and the input of the decoding part of the NN. This forces the latent space to reveal useful properties that we can study and exploit, just as with PCA.

Figure 9.1: General structure of an autoencoder with the encoder network on the left and the decoder network on the right. Note: they are not necessarily identical or just the reverse.

During training, the weights of the (two) neural networks are learned by minimizing a loss function that measures the difference between the input \(x\) and the reconstruction \(z\), e.g. by minimizing \(\|z - x\|^2\).

To connect this concept back to PCA and what we discussed in the Eigenfaces Example2, where we used truncated SVD to project into a low dimensional space and then back to reconstruct the image, we formulate it in a more mathematical fashion.

We call the encoder \(y=f_\Theta(x)\) and the decoder \(z=g_\Theta(y)\). Therefore, \(z=(g \circ f)(x) = g(f(x))\) and \[ \begin{aligned} f : \mathcal{X} \to \mathcal{Y}\\ g : \mathcal{Y} \to \mathcal{Z}\\ \end{aligned} \] and the loss is computed as \[ \mathscr{L}(x, g(f(x))), \] with an appropriate loss function \(\mathscr{L}\).

Most of the time the loss function includes some additional, more general constraints to enforce properties such as sparsity, similar to LASSO3 or RIDGE4; such models are often referred to as regularized autoencoders.
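
As a rough sketch of how such a regularization term can enter the training objective, consider adding an L1 penalty on the latent representation to the reconstruction loss; the encoder/decoder pair, the weight lam, and the helper regularized_loss below are our own illustrative choices, not a fixed API.

import torch
import torch.nn as nn

# hypothetical linear encoder/decoder pair, just for illustration
f = nn.Linear(784, 32)   # encoder
g = nn.Linear(32, 784)   # decoder

mse = nn.MSELoss()
lam = 1e-3               # regularization weight (assumed value)

def regularized_loss(x):
    y = f(x)                             # latent representation
    z = g(y)                             # reconstruction
    # reconstruction error plus an L1 penalty on the latent code,
    # encouraging sparsity in the same spirit as LASSO
    return mse(z, x) + lam * y.abs().mean()

x = torch.rand(8, 784)                   # dummy batch
regularized_loss(x).backward()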

Note

We can use the concept of an autoencoder to span the same space as a PCA. For a linear encoder/decoder with a latent space smaller than the input and the mean squared error as loss function, the autoencoder learns to span the same subspace as the principal components of the data. For nonlinear functions \(f\) and \(g\) a more general form is learned.
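
To make this connection concrete, the following small experiment (our own construction with synthetic low-rank data) fits a linear autoencoder with the mean squared error and checks that the decoder spans (almost) the same subspace as the leading principal components; the cosines of the principal angles between the two subspaces should all be close to 1.

import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

torch.manual_seed(0)
rng = np.random.default_rng(0)

# synthetic data of (essentially) rank 5 in 20 dimensions, centred as for PCA
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 20))
X -= X.mean(axis=0)
Xt = torch.tensor(X, dtype=torch.float)

# linear autoencoder: no activations, latent dimension 5
enc = nn.Linear(20, 5, bias=False)
dec = nn.Linear(5, 20, bias=False)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(dec(enc(Xt)), Xt)
    loss.backward()
    opt.step()

# compare the subspace spanned by the decoder columns with the one
# spanned by the first 5 principal components
Q_ae, _ = np.linalg.qr(dec.weight.detach().numpy())        # 20 x 5
Q_pca = PCA(n_components=5).fit(X).components_.T           # 20 x 5
# singular values = cosines of the principal angles between the subspaces
print(np.round(np.linalg.svd(Q_ae.T @ Q_pca, compute_uv=False), 3))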

9.1 Applications

We can use autoencoders for various applications:

  1. Image denoising: In this case the autoencoder is trained with noisy images and learns how to reconstruct clean images from them.
  2. Anomaly detection: An autoencoder that is trained with normal images can detect anomalies by identifying inputs that result in a high reconstruction loss, which indicates a deviation from the learned images (see the sketch after this list).
  3. Image generation: After training an autoencoder it is possible to use the decoder part to generate images from manipulated values in the latent space.
  4. Feature extraction: The latent representation of the trained autoencoder can provide compact, informative features for image classification and retrieval.
  5. Image compression: By storing only the latent space representation an image can be compressed, and it can be reconstructed by inferring through the decoder.
  6. Image enhancement: By providing low resolution images and their high resolution counterparts, an autoencoder can learn to enhance certain features of images.
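
For the anomaly detection use case from the list above, a minimal sketch could look as follows; model stands for any trained autoencoder (such as the one constructed below), and the threshold is a quantity one would have to calibrate on normal data.

import torch

def reconstruction_error(model, x):
    # per-sample mean squared reconstruction error
    model.eval()
    with torch.no_grad():
        z = model(x)
    return ((z - x) ** 2).mean(dim=1)

def flag_anomalies(model, x, threshold):
    # inputs whose reconstruction error exceeds the threshold are flagged
    # as deviating from the learned (normal) data
    return reconstruction_error(model, x) > threshold

# usage sketch (model and data as constructed below in this section):
# errs = reconstruction_error(model, x_normal)
# threshold = errs.mean() + 3 * errs.std()   # one possible heuristic
# flags = flag_anomalies(model, x_new, threshold)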

To illustrate this, we construct an autoencoder for denoising the MNIST dataset we have seen before (Section 1.5).

First, we need to prepare and augment the data with some noise.

from sklearn.datasets import fetch_openml
import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import TensorDataset, DataLoader, random_split
import torchvision.transforms.v2 as transforms
torch.manual_seed(6020)
np.random.seed(6020)

def transform(X, noise=0.4):
    X_org = torch.tensor(X / 255.0).to(torch.float)                 
    X_noise = torch.clamp(
                torch.tensor(                                         
                    np.random.normal(0.5, noise, X_org.shape)       
                ).to(torch.float) + X_org,
                0, 1)
    return X_noise, X_org

mnist = fetch_openml("mnist_784", as_frame=False)
size = 60000
X_train, X_test = mnist.data[:size], mnist.data[size:]

trainset = TensorDataset(*transform(X_train))
testset = TensorDataset(*transform(X_test))

val_size = int(0.1 * len(trainset))
train_size = len(trainset) - val_size
train, val = random_split(trainset, [train_size, val_size])

trainLoader = DataLoader(train, batch_size=32, shuffle=True)
valLoader = DataLoader(val, batch_size=32, shuffle=True)
testLoader = DataLoader(testset, batch_size=32, shuffle=False)
The transform function normalizes the images, adds Gaussian noise, and returns the noisy images together with the originals as ground truth.
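
As a quick sanity check of the data pipeline (our own addition), we can inspect a single batch:

X_batch, y_batch = next(iter(trainLoader))
print(X_batch.shape, y_batch.shape)   # both should be torch.Size([32, 784])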

Next we slightly modify our training function to suit our new task.

def train_model(model, train_dl, val_dl, loss_fn, optimizer,
                epochs=10):

    met = {
        "train_loss": [],
        "val_loss": []}

    for epoch in range(epochs):
        model.train()
        train_loss = 0.0

        for X_batch, y_batch in train_dl:
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            train_loss += loss.item()

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for X_val, y_val in val_dl:
                y_val_pred = model(X_val)
                val_loss += loss_fn(y_val_pred, y_val).item()

        met["val_loss"].append(val_loss / len(val_dl))
        met["train_loss"].append(train_loss / len(train_dl))
    return met

Our autoencoder is quite simple, with just three linear layers each for the encoder and the decoder.

class AutoEncoder(nn.Module):
    def __init__(self):
        super(AutoEncoder, self).__init__()
        
        self.encoder = torch.nn.Sequential(
            nn.Linear(784, 128),
            nn.LeakyReLU(),
            nn.Linear(128, 64),
            nn.LeakyReLU(),
            nn.Linear(64, 32),            
        )

        self.decoder = torch.nn.Sequential(
            nn.Linear(32, 64),
            nn.LeakyReLU(),
            nn.Linear(64, 128),
            nn.LeakyReLU(),
            nn.Linear(128, 784),
            nn.Sigmoid(),
        )
        
    def forward(self, x):
        y = self.encoder(x)
        z = self.decoder(y)
        return z
Here \(y\) is the intermediate, i.e. latent space, representation.

Finally we need to train our model.

model = AutoEncoder()
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())

epochs = 5
history = train_model(model, trainLoader, valLoader, loss_fn, optimizer,
                      epochs=epochs)

Note that we only used 5 epochs for the following results, and this time we used Adam as the optimizer.
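
Since train_model records the losses per epoch, we can, for example, take a quick look at how the training progressed; a small sketch using the history returned above:

import matplotlib.pyplot as plt

plt.figure()
plt.plot(range(1, epochs + 1), history["train_loss"], label="train")
plt.plot(range(1, epochs + 1), history["val_loss"], label="validation")
plt.xlabel("epoch")
plt.ylabel("MSE loss")
plt.legend()
plt.show()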

import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ["svg"]

x, y = next(iter(testLoader))
model.eval()
with torch.no_grad():
    z = model(x)

length = 10
im = np.zeros((3 * 28, length * 28), float)
for j in range(length):
    im[0:28, 28*j:28*(j+1)] = y[j].reshape(28, 28).numpy()
    im[28:28*2, 28*j:28*(j+1)] = x[j].reshape(28, 28).numpy()
    im[28*2:28*3, 28*j:28*(j+1)] = z[j].reshape(28, 28).numpy()

plt.figure()
plt.imshow(im, cmap="binary")
plt.axis("off")

plt.show()
Figure 9.2: Some of the test images denoised by the model. The top row shows the original images, the second row the noisy versions, and the last row the output of the model.
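
Besides the visual impression, we can also compute the average reconstruction loss over the entire test set (a quick check of our own):

model.eval()
test_loss = 0.0
with torch.no_grad():
    for X_batch, y_batch in testLoader:
        test_loss += loss_fn(model(X_batch), y_batch).item()
print(f"average test loss: {test_loss / len(testLoader):.4f}")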

Exercise 9.1 (Add more transformations) Rework the transform function to include the following additions.

  1. Shuffle the pixels of the image with a fixed permutation.
  2. Rotate the image with a fixed rotation angle; extend the image first, rotate, and then crop it again.
  3. Flip the image - mirror with respect to the middle.

Retrain the autoencoder with these training sets and see how it performs (can it handle all of them at the same time?).
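
One possible starting point for these additional transformations, a sketch of our own and by no means the only way, could look like this (using the functional interface of torchvision.transforms.v2 imported above):

import torch
import torchvision.transforms.v2.functional as TF

perm = torch.randperm(784)                    # fixed pixel permutation

def shuffle_pixels(X):
    # apply the same permutation to every flattened image in X (n, 784)
    return X[:, perm]

def rotate_images(X, angle=30.0):
    # reshape to image form, extend the canvas, rotate, and crop back
    imgs = X.reshape(-1, 1, 28, 28)
    imgs = TF.pad(imgs, [14, 14])
    imgs = TF.rotate(imgs, angle)
    imgs = TF.center_crop(imgs, [28, 28])
    return imgs.reshape(-1, 784)

def flip_images(X):
    # mirror the images with respect to the vertical middle axis
    return torch.flip(X.reshape(-1, 1, 28, 28), dims=[-1]).reshape(-1, 784)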

Exercise 9.2 (Explore the latent space and generate numbers) Rework the above autoencoder in such a fashion that the encoder and decoder can be accessed individually. If it is beneficial for the later tasks to rework the model such that the latent space is smaller, feel free to do so, e.g. only 2 dimensions.

  1. Explore the latent space (e.g. similar to what was done in the introduction to Clustering and Classification for the cats and dogs dataset) and see how well the clusters of numbers can be seen.
  2. Use the knowledge of the previous task to create a number generator by providing different self generated elements from the latent space to the decoder part of our autoencoder.
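
A rough sketch of how the two parts can be used individually, assuming the model from above with its encoder and decoder attributes (with a 32-dimensional latent space here; a retrained 2-dimensional version makes the scatter plot more telling):

import torch
import matplotlib.pyplot as plt

X_noisy, X_clean = testset.tensors            # noisy and clean test images
labels = mnist.target[size:].astype(int)      # digit labels for coloring

model.eval()
with torch.no_grad():
    # 1. latent representation of the clean test images
    latent = model.encoder(X_clean).numpy()

# only the first two latent dimensions are shown
plt.scatter(latent[:, 0], latent[:, 1], c=labels, s=2, cmap="tab10")
plt.colorbar()
plt.show()

# 2. generate new images by feeding self-chosen latent vectors to the decoder
with torch.no_grad():
    samples = model.decoder(torch.randn(10, latent.shape[1]))
plt.imshow(samples[0].reshape(28, 28), cmap="binary")
plt.axis("off")
plt.show()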

  1. see Kandolf (2025), Section 4 for SVD and Section 4.2 for PCA, or follow the direct link.

  2. see Kandolf (2025), Section 4.2.2, or follow the direct link.

  3. see Kandolf (2025), Section 7.1.1, or follow the direct link.

  4. see Kandolf (2025), Section 7.1.2, or follow the direct link.