Clustering and Classification

One aspect of machine learning is binning data into meaningful and distinct categories that can be used for decision-making, amongst other things. This process is referred to as clustering and classification. To achieve this task, the main mode of operation is to find a low-rank feature space that is informative and interpretable.

One way to extract the dominant features from a dataset is the singular value decomposition (SVD) or principal component analysis (PCA), see (Kandolf 2025). The main idea is to not work in the high-dimensional measurement space, but instead in the low-rank feature space. This allows us to significantly reduce the cost (of computation or learning) by performing our clustering in this low-dimensional space.
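As a minimal sketch of this idea (with random numbers standing in for real measurements; the matrix X and the choice of two components are purely for illustration), the projection onto the leading singular directions looks as follows. The datasets below use exactly this pattern.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))      # 100 samples in a 10-dimensional measurement space

Xc = X - X.mean(axis=0)             # centre the data
U, S, VT = np.linalg.svd(Xc, full_matrices=False)

k = 2                               # dimension of the low-rank feature space
features = Xc @ VT[:k, :].T         # coordinates of each sample in the feature space
print(features.shape)               # (100, 2)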

In order to find the feature space there are two main paths in machine learning: unsupervised and supervised learning.

As the names suggest, in unsupervised learning no labels are given to the algorithm, and it must find patterns in the data to generate clusters and labels such that predictions can be made. This is often used to discover previously unknown patterns in the (low-rank) subspace of the data and leads to feature engineering or feature extraction. These features can then be used for building a model, e.g. a classifier.

Supervised learning, on the other hand, uses labelled datasets. The training data is labelled for cross-validation by a teacher or expert, i.e. the input and output for a model are explicitly provided, and regression methods are used to find the best model for the given labelled data. There are several different forms of supervised learning, such as reinforcement learning, semi-supervised learning, or active learning.

The main idea behind machine learning is to construct or exploit the intrinsic low-rank feature space of a given dataset; how to find it is determined by the algorithm.

Feature selection and generation of a feature space

Before we go into specific algorithms, let us have a general look at what we are going to talk about. We follow (Brunton and Kutz 2022, sec. 5.1) for this introduction.

Fisher Iris dataset

First we look at one of the standard datasets well known in the community, the so-called Fisher Iris dataset (Fisher 1936).

Note

Luckily for us, several well-known datasets are readily available through standard packages. One of them is the Fisher Iris dataset.

Figure 1: Illustration of the recorded features and structure of the iris dataset inspired by the Medium article - Exploring the Iris flower dataset.

We use sklearn 1 to access the dataset in Python.

Let us start right away by loading and visualizing the data in a 3D scatter plot (there are 4 features present so one is not shown).

Show the code for the figure
import plotly.express as px
from sklearn.datasets import load_iris
%config InlineBackend.figure_formats = ["svg"]

iris = load_iris(as_frame=True)
iris.frame["target"] = iris.target_names[iris.target]
df = iris.frame

fig = px.scatter_3d(df, x="sepal length (cm)", y="sepal width (cm)",
                    z="petal width (cm)", color="target")
camera = dict(
    eye=dict(x=1.49, y=1.49, z=0.1)
)
fig.update_layout(scene_camera=camera)
fig.show()
Figure 2: Iris dataset with 150 samples from three distinct species, Setosa, Versicolor, and Virginica, as a 3D scatter plot along the features sepal length, sepal width, and petal width; the fourth feature, petal length, is not shown.

As can be seen in Figure 2, the three features are enough to easily separate Setosa and, to a very good degree, Versicolor from Virginica, where the last two have a small overlap in the samples provided.

The petal width seems to be the best feature for the classification, and no highly sophisticated machine learning algorithms are required. We can see this even better if we use a so-called pair plot from the seaborn package, as seen in Figure 3.
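To back this up with a quick sanity check, here is a minimal sketch of a hand-made threshold rule on petal width alone; the cut-off values 0.8 and 1.7 are read off the plots by eye and are an assumption, not part of the dataset description.

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
petal_width = iris.frame["petal width (cm)"].to_numpy()

# crude thresholds read off the plots: setosa < 0.8 <= versicolor < 1.7 <= virginica
predicted = np.where(petal_width < 0.8, 0, np.where(petal_width < 1.7, 1, 2))
print(f"accuracy: {np.mean(predicted == iris.target.to_numpy()):.2%}")

Such a simple rule already gets most of the samples right, which is why no sophisticated algorithm is needed here.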

Show the code for the figure
import seaborn as sns
from sklearn.datasets import load_iris
%config InlineBackend.figure_formats = ["svg"]

_ = sns.pairplot(iris.frame, hue="target", height=1.45)
Figure 3: Iris dataset with 150 samples from three distinct species, Setosa, Versicolor, and Virginica, as a pair grid plot. The diagonal shows the univariate distribution of each feature.

Here a grid of plots is created where each (numeric) feature is shared across a row and a column. Therefore, in a single panel we can see the combination of two features, and on the diagonal we see the univariate distribution of the values. After a close inspection, we can also see here that petal width provides the best distinguishing feature and sepal width the worst.

Even though we already have enough to give a good classification for this dataset, it is worth investigating a bit further. We illustrate how principal component analysis (PCA)2 can be applied here.

Show the code for the figure
import numpy as np
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ["svg"]

A = df.iloc[:, :4].to_numpy()           # 150 x 4 measurement matrix

Xavg = A.mean(0)                        # mean of each feature
B = A - np.tile(Xavg, (150, 1))         # mean-centred data
U, S, VT = np.linalg.svd(B, full_matrices=False)
C = B @ VT[:2, :].T                     # project onto the first two principal components
q = np.linalg.norm(S[:2])**2 / np.linalg.norm(B, "fro")**2   # variance explained by PC 1 and 2

plt.figure()
for name in iris.target_names:
    index = df["target"] == name
    plt.scatter(C[index, 0], C[index, 1], label=name)
plt.legend()
plt.xlabel(r"$PC_1$")
plt.ylabel(r"$PC_2$")
plt.show()
Figure 4: Feature reduction with PCA for the Iris dataset. We show the first two principal components.

As can be seen in Figure 4, the first two components cover about 97.77% of the total variation in the samples, and in accordance with these results a very good separation of the three species is possible with them.

Note

Definition 1 (Explained Variance) The explained variance is a measure of how much of the total variance in the data is explained by a principal component. It is determined by the singular value associated with the component, so in the above example it is nothing else than \[ \rho_i = \frac{\sigma_i^2}{\|\Sigma\|_F^2}. \]

The explained variance can be helpful for selecting the number of principal components you use for a classification. But be aware that, as we will see later, the component with the largest singular value might not be the best one to use.
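Reusing B and S from the PCA code above, the per-component explained variance from Definition 1 is a one-liner; the ratios sum to one, and the first two add up to the roughly 97.77% quoted for Figure 4.

rho = S**2 / np.linalg.norm(B, "fro")**2   # explained variance of each principal component
print(rho, rho[:2].sum())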

Dimensionality Reduction

As we can see with the Fisher Iris dataset, we often have a problem of dimensionality if we try to visualize something. The data has four dimensions, but it is very hard for us to perceive, let alone visualize, four dimensions (if the fourth dimension is not time).

Therefore, dimensionality reduction techniques are often applied. One of these is PCA; others are:

  • Multidimensional scaling (MDS) - preserve the distance between observations while doing the reduction.
  • Isomap - create a graph by connecting each observation to its neighbors and reduce the dimensions by keeping the geodesic distances, i.e. the number of nodes on the graph between observations.
  • \(t\)-distributed stochastic neighbor embedding (\(t\)-SNE) - try to reduce dimensions while keeping similar observations close and dissimilar observations apart.
  • Linear discriminant analysis (LDA) - projects data onto a hyperplane, where the hyperplane is chosen in such a way that it is most discriminative for the observations, see Linear Discriminant Analysis (LDA); a short sklearn sketch follows after the note below.
Note

The sklearn package has all the tools to perform a PCA, see Link for an example using the Iris dataset, the same as at the beginning of this section.
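As a small, hedged sketch of one of the alternatives listed above, LDA is also available in sklearn; in contrast to PCA it needs the class labels, and the choice of two components here simply mirrors the PCA example.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(iris.data, iris.target)   # project onto the two most discriminative axes
print(X_lda.shape)                                  # (150, 2)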

Now let us take a look at a more complicated example from Brunton and Kutz (2022).

Dogs and Cats

In this dataset we explore how to distinguish between images of dogs and cats, more specifically of their heads only. The dataset contains \(80\) images of dogs and \(80\) images of cats, each with \(64 \times 64\) pixels and therefore a total of \(4096\) feature measurements per image.

Important

The following dataset and the basic structure of the code is from (Brunton and Kutz 2022, Code 5.2 - 5.4). Also see GitHub.

As before, a general inspection of the dataset is always a good idea. With images we usually look at a few of them to get an overview.

Show the code for the figure
import numpy as np
import scipy.io
import requests
import io
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ["svg"]

def createImage(data, x, y, width=64, length=64):
    # arrange x * y images (one flattened image per column of data) into a single grid
    image = np.zeros((width * x, length * y))
    counter = 0
    for i in range(x):
        for j in range(y):
            img = np.flipud(np.reshape(data[:, counter], (width, length)))
            image[length * i: length * (i + 1), width * j: width * (j + 1)] \
                    = img.T
            counter += 1
    return image

response = requests.get(
    "https://github.com/dynamicslab/databook_python/"
    "raw/refs/heads/master/DATA/catData.mat")
cats = scipy.io.loadmat(io.BytesIO(response.content))["cat"]

plt.figure()
plt.imshow(createImage(cats, 4, 4), cmap=plt.get_cmap("gray"))
plt.axis("off")

response = requests.get(
    "https://github.com/dynamicslab/databook_python/"
    "raw/refs/heads/master/DATA/dogData.mat")
dogs = scipy.io.loadmat(io.BytesIO(response.content))["dog"]
plt.figure()
plt.imshow(createImage(dogs, 4, 4), cmap=plt.get_cmap("gray"))
plt.axis("off")
Figure 5: The first 16 cats from the dataset.
Figure 6: The first 16 dogs from the dataset.

Again, we use PCA to reduce the feature space of our data.

Show the code for the figure
import numpy as np
import scipy
import requests
import io
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ["svg"]

CD = np.concatenate((dogs, cats), axis=1)   # 4096 x 160 data matrix, dogs first, then cats
U, S, VT = np.linalg.svd(CD - np.mean(CD), full_matrices=False)

for j in range(4):
    plt.figure()
    U_ = np.flipud(np.reshape(U[:, j], (64, 64)))
    plt.imshow(U_.T, cmap=plt.get_cmap("gray"))
    plt.axis("off")
plt.show()
(a) First principal component.
(b) Second principal component.
(c) Third principal component.
(d) Fourth principal component.
Figure 7: First four modes of the PCA of the 160 images.

In Figure 7 we can see the first four modes of the PCA. While mode 2 (Figure 7 (b)) highlights the pointy ears common in cats, mode 3 (Figure 7 (c)) recovers more dog-like features. Consequently, these two modes make a good choice for classification.

Of course, this is not the only possible representation of the initial data. We can also use the wavelet transform3.

The multiresolution analysis features of the wavelet transform can be used to transform the images in such a way that the resulting principal components yield a better separation of the two classes.

Note

We use the following workflow to get from the original image \(A_0\) to the \(\tilde{A}\) image in a sort of Wavelet basis.

Figure 8: Workflow to get from the original image to the wavelet transformed version.

In short, we combine the vertical and horizontal rescaled feature images, as we are most interested in edge detection for this example.

Note that the image has now only a quarter of the pixels of the original image.

For a more detailed explanation see Wavelet decomposition for cats and dogs.

Show code required for the transformation.
import math

import numpy as np
import pywt


def rescale(data, nb):
    # rescale the absolute values of the coefficients to the range [1, nb]
    x = np.abs(data)
    x = x - np.min(x)
    x = nb * x / np.max(x)
    x = 1 + np.fix(x)
    x[x > nb] = nb
    return x


def img2wave(data):
    # level-1 Haar wavelet transform of every (flattened) image stored in the columns of data
    l, w = data.shape
    data_w = np.zeros((l // 4, w))
    for i in range(w):
        A = np.reshape(data[:, i], (math.isqrt(l), math.isqrt(l)))
        A_1, (cH1, cV1, cD1) = pywt.wavedec2(A, wavelet="haar", level=1)
        # combine the rescaled horizontal and vertical detail coefficients
        data_w[:, i] = (rescale(cH1, 256) + rescale(cV1, 256)).flatten()
    return data_w
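
A quick check of the transformation, reusing the cats array loaded above, confirms that each transformed image has only a quarter of the original pixels:

cats_w = img2wave(cats)
print(cats.shape)     # (4096, 80): 64 x 64 images, one per column
print(cats_w.shape)   # (1024, 80): 32 x 32 wavelet images, one per column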

Let us see how this changes our first four modes.

Show the code for the figures
CD_w = img2wave(CD)
U_w, S_w, VT_w = np.linalg.svd(CD_w - np.mean(CD_w), full_matrices=False)
l = math.isqrt(CD_w[:, 0].shape[0])

for j in range(4):
    plt.figure()
    U_ = np.flipud(np.reshape(U_w[:, j], (l, l)))
    plt.imshow(U_.T, cmap=plt.get_cmap("gray"))
    plt.axis("off")
plt.show()
(a) First principal component.
(b) Second principal component.
(c) Third principal component.
(d) Fourth principal component.
Figure 9: First four modes of the PCA of the 160 images in our wavelet basis.

Right away we can see that the ears and eyes of the cats show up more prominently in the second mode, Figure 9 (b).

Now, how does this influence our classification problem? For this let us first see how easy it is to distinguish the two classes in a scatter plot for the first four modes.

Show the code for the figure
import seaborn as sns
import pandas as pd
%config InlineBackend.figure_formats = ["svg"]


target = ["cat"] * 80 + ["dog"] * 80
C = U[:, :4].T @ (CD - np.mean(CD))
df = pd.DataFrame(C.T, columns=["PV1", "PV2", "PV3", "PV4"])
df["target"] = target
_ = sns.pairplot(df, hue="target", height=1.45)
Figure 10: First four modes of the raw images for cats and dogs.
Show the code for the figure
import seaborn as sns
import pandas as pd
%config InlineBackend.figure_formats = ["svg"]


C_w = U_w[:, :4].T @ (CD_w - np.mean(CD_w))
df_w = pd.DataFrame(C_w.T, columns=["PV1", "PV2", "PV3", "PV4"])
df_w["target"] = target
_ = sns.pairplot(df_w, hue="target", height=1.45)
Figure 11: First four modes of the wavelet images for cats and dogs.

The two figures (Figure 10, Figure 11) give us a good idea of what the different principal components can tell us. It is quite clear that the first mode is not of much use for differentiating the two classes. In general the second mode seems to work best (the ears), and in contrast to the raw images we can also see that the wavelet basis is slightly better for the separation.
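One way to put this impression on a quantitative footing is to train a simple classifier on the projected coordinates of both representations. The following sketch reuses df and df_w from above; the choice of LDA and of 5-fold cross-validation is our assumption for illustration, not part of the original analysis.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

for name, frame in [("raw", df), ("wavelet", df_w)]:
    X = frame[["PV2", "PV3", "PV4"]].to_numpy()
    y = frame["target"].to_numpy()
    scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5)
    print(name, scores.mean())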

Show the code for the figures
import plotly.express as px


fig = px.scatter_3d(df, x='PV2', y='PV3', z='PV4', color="target")
fig.show()
fig = px.scatter_3d(df_w, x="PV2", y="PV3", z="PV4", color="target")
fig.show()
Figure 12: Scatter plots of modes 2 to 4 for the raw images.
Figure 13: Scatter plots of modes 2 to 4 for the wavelet images.

In Figure 12 and Figure 13 we can see modes two to four in a 3D plot, illustrating the separation in a different way.

Now that we have an idea what we are dealing with, we can start looking into the two previously described paths in machine learning. The difference between supervised and unsupervised learning can be summarized by the following two images.

Show the code for the figure
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
%config InlineBackend.figure_formats = ["svg"]

iris = load_iris(as_frame=True)
iris.frame["target"] = iris.target_names[iris.target]
df = iris.frame

plt.figure()
for name in iris.target_names:
    index = df["target"] == name
    plt.scatter(df.iloc[:, 0][index], df.iloc[:, 2][index], label=name)
plt.legend()

plt.figure()
plt.scatter(df.iloc[:, 0], df.iloc[:, 2], color="gray")
(a) Labelled data for supervised learning.
(b) Unlabelled data for unsupervised learning.
Figure 14: Illustration of the difference between supervised and unsupervised learning.

You either provide/have labels or not, see Figure 14.
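To make the distinction concrete, here is a minimal sketch on the iris data: the supervised classifier is fitted on (data, label) pairs, while the unsupervised method only ever sees the data. The choice of k-nearest neighbours and k-means is an assumption for illustration.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

clf = KNeighborsClassifier().fit(X, y)                     # supervised: the labels y are provided
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)  # unsupervised: only X is used
print(clf.predict(X[:3]), clusters[:3])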


  1. To add it to pdm use pdm add scikit-learn.↩︎

  2. see Kandolf (2025), Section 4.2 or follow the direct link↩︎

  3. see Kandolf (2025), Section 10 or follow the direct link↩︎