We start with unsupervised learning. The goal of unsupervised learning is to discover clusters in observations that carry no labels, i.e. we have no reference to compare against. Several algorithms exist for this task; the most prominent is the \(k\)-means clustering algorithm.
Tip
We will use scikit-learn in this section heavily. The User Guide and Examples sections in the documentation are well structured and a good source for additional information.
Furthermore, there are several Examples that compare different methods on different datasets and therefore give a nice overview of their strengths and weaknesses. Have a look at them to gain further insight into the methods and their differences.
The \(k\)-means algorithm tries to partition a set of \(m\) (vector-valued) observations into \(k\) clusters. While the dimension of the data and the number of observations are generally known, the number of clusters is often not known a priori.
The general idea is to label each observation as belonging to the cluster with the nearest mean (the representative of the cluster). The resulting clusters are called Voronoi cells, see Wikipedia - Voronoi diagram.
Definition 1.1 (\(k\)-Means Algorithm) For a given set of \(m\) observations \((x_1, x_2, \ldots, x_m)\), with \(x_i\in \mathrm{R}^n\), the algorithm strives to find \(k\) sets \((S_1, S_2, \ldots, S_k)\) such that the variance inside each cluster is minimized, i.e. \[
\underset{S}{\operatorname{argmin}} \sum_{i=1}^k \sum_{x\in S_i}\| x - \mu_i \|^2,
\] where \(\mu_i\) denotes the mean of \(S_i\).
The algorithm itself is iterative; for a given \(k\):
Randomly initialize \(k\) points \(\mu_1, \mu_2, \ldots, \mu_k\), as the cluster centers,
Label each observation \(x_i\) by the nearest cluster center \(\mu_j\), all points with the same label form the set \(S_j\).
Compute the mean of each cluster \(S_j\) and replace the current \(\mu_j\) by it.
Repeat, starting from step 2, until the cluster centers stay stable up to some tolerance.
This algorithm was first introduced in Lloyd (1982) and is therefore often called Lloyd's algorithm.
In general, the problem in Definition 1.1 is NP-hard and therefore computationally not viable to solve exactly. Nevertheless, a number of heuristic algorithms exist that provide good performance in practice, despite having no guarantee of global convergence.
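The figure code below relies on a helper lloyd(data, centers, n_steps) that performs a given number of iterations and returns the labels and the updated centers. The helper itself is not reproduced in the text; a minimal sketch of such a function, implementing the steps listed above (the interface is an assumption inferred from how it is called), could look as follows.

import numpy as np

# A minimal sketch of Lloyd's algorithm; interface assumed from the figure code
def lloyd(data, centers, n_steps):
    centers = centers.copy()
    for _ in range(n_steps):
        # Step 2: label each observation by the nearest cluster center
        dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        classes = np.argmin(dist, axis=1)
        # Step 3: replace each center by the mean of its cluster
        for j in range(centers.shape[0]):
            if np.any(classes == j):
                centers[j] = data[classes == j].mean(axis=0)
    return classes, centers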
We illustrate the procedure with the help of some artificial observations in two-dimensional space and show how the clustering takes place.
Important
The following dataset and the basic structure of the code are from (Brunton and Kutz 2022, Code 5.5 - 5.6). Also see GitHub.
Show the code for the figure
import numpy as np
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ["svg"]
np.random.seed(6020)

# Helper for plotting
def plot_lloyd(data, centers, classes, ms=500):
    plt.figure()
    for i in range(centers.shape[0]):
        plt.scatter(data[classes == i, 0], data[classes == i, 1])
        plt.scatter(centers[i, 0], centers[i, 1], c="k", s=ms, marker="*")
    plt.gca().set_aspect(1)

# Create data for illustration
n = 200
# Random ellipse centred in (0, 0) and axis (1, 0.5)
X = np.random.randn(n, 2) * np.array([1, 0.5])
# Random ellipse centred in (1, -2) and axis (1, 0.2)
X2 = (np.random.randn(n, 2) * np.array([1, 0.2])) + np.array([1, -2])
# Rotate ellipse 2 by theta
theta = np.pi / 4
X2 = X2 @ np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta), np.cos(theta)]])
centers = np.array([[0., -1.], [-1., 0.]])
data = np.concatenate((X, X2))

# Plot initial step with theoretical assignment
classes = np.concatenate((np.ones(X2.shape[0]), np.zeros(X.shape[0])))
plot_lloyd(data, centers, classes)

# Compute and plot consecutive steps of Lloyd's algorithm
for i in [1, 2, 5]:
    classes, centers_ = lloyd(data, centers, i)
    plot_lloyd(data, centers, classes)
    centers = centers_
plt.show()
(a) The original two classes and the initial guesses for the centers.
(b) Point association to the clusters in the first step.
(c) Point association to the clusters in the third step.
(d) Point association to the clusters in the eighth step.
Figure 1.1: Lloyd's algorithm in action for a generic set of two datasets generated with Gaussian distributions.
In Figure 1.1 (a) we see the two distinct clusters and the initial guesses for the centers. In the successive plots, we see how the centers move and converge to the final position, as seen in Figure 1.1 (d). In this case the algorithm converges after the sixth step.
Of course the algorithm is sensitive to the initial guess, and therefore modern implementations provide strategies to determine the initial guess (e.g. the k-means++ initialization), as well as to choose the number of clusters.
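In scikit-learn this is provided by the KMeans class. A minimal sketch of how a figure like Figure 1.2 can be produced, assuming data and plot_lloyd from the code above (the exact parameters used for the figure are an assumption):

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit k-means for two and three clusters and reuse the plotting helper
for k in [2, 3]:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=6020).fit(data)
    plot_lloyd(data, kmeans.cluster_centers_, kmeans.labels_)
plt.show()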
Figure 1.2: KMeans algorithm in action for a generic set of two datasets generated with a Gaussian distribution.
As can be seen in Figure 1.2 (a), the algorithm comes up with (almost) the same split between the two sets as our crude implementation. If we try three clusters, Figure 1.2 (b), the result is sensible as well.
The major success of this algorithm in general use is based on its fast convergence, which requires no supervision. We can also see that it is not very accurate, compare Figure 1.1 (a) and Figure 1.1 (d). This is no surprise, as the algorithm does not have all the information available.
How can we determine how accurate the algorithm is? Without labels this is of course not easy to do, but cross-validation is a good tool.
In our case we can produce the labels and we can also split the data beforehand into a training set and a test set. Usually a so-called \(80:20\) split is used, i.e. \(80\%\) training data and \(20\%\) test data.
The result of this can be seen in Figure 1.3 where we have two points wrongly classified to the opposite cluster.
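A minimal sketch of such a split and evaluation, assuming data and the true labels classes from the code above; this mirrors the idea behind Figure 1.3 rather than its exact code.

from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# 80:20 split of the toy data and the (known) labels
X_train, X_test, y_train, y_test = train_test_split(
    data, classes, test_size=0.2, random_state=6020)

# Fit on the training set, assign clusters on the test set
kmeans = KMeans(n_clusters=2, n_init=10, random_state=6020).fit(X_train)
y_pred = kmeans.predict(X_test)
# Note: the cluster indices need not coincide with the original label values.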
There exist several extensions of the basic \(k\)-means algorithm that improve the results and overall performance. Two such versions are accelerated \(k\)-means and mini-batch \(k\)-means. Both can be found in Geron (2022).
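Mini-batch \(k\)-means is available in scikit-learn as MiniBatchKMeans and can be used as a drop-in replacement; the parameters below are illustrative assumptions.

from sklearn.cluster import MiniBatchKMeans

# Fit k-means on small random batches instead of the full dataset at once
mbk = MiniBatchKMeans(n_clusters=2, batch_size=64, n_init=10,
                      random_state=6020).fit(data)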
Exercise 1.1 (Apply the \(k\)-means algorithm to other datasets) As an exercise, to get some practice using \(k\)-means, apply the algorithm to some other datasets and see how it performs.
We already looked at the Fisher Iris dataset (see the introduction of this part of the notes) and discussed some basic features. Try with only two-dimensional data, as well as the full four-dimensional data, to find the three clusters.
For Figure 1.7 we will look at the moons dataset. Try working out the clusters in this case.
Apply the algorithm to the cats and dogs images (in wavelet basis, see the introduction of this part of the notes) for various principal components and find the two clusters.
1.1.1 Applications of \(k\)-means
The \(k\)-means algorithm is used in a multitude of applications; we list some here:
Image segmentation: To decompose an image into different regions we can use \(k\)-means. Applications range from robotics to surveillance, where one or more objects are separated from the rest of the image.
Customer segmentation/social network analysis: To segment customers along e.g. their purchase history/behaviour, preferences, demographic data, amount spent, search queries, social media interactions, etc. \(k\)-means is used in marketing, retail, and advertising to personalize the experience.
Text clustering: In natural language processing (NLP) \(k\)-means is often used to cluster similar documents or text to make analysing large volumes of text data feasible.
Fraud detection: \(k\)-means is a crude tool for fraud detection in finance and banking. Transactions are clustered according to similarities and anomalies are detected. There exist more sophisticated methods in finance, but \(k\)-means is an easy-to-understand starting point.
Anomaly detection: In medical (image) data \(k\)-means is often used to detect anomalies by finding points that fall outside of clusters. The same works for cybersecurity and e.g. network traffic.
Recommendation systems: By grouping users together it is easier to recommend them new songs, items for shopping, and more.
Quality control: By grouping similar products we can detect quality issues and defects in production processes. If an issue is found the part can be checked and the process adjusted.
Traffic analysis: In transport and logistics we can analyze traffic patterns and use the information to optimize and group similar trips, routes, and vehicles.
We want to highlight image segmentation as an application of \(k\)-means. The example shown here was first presented in Geron (2022).
The idea of image segmentation is to decompose an image into different segments. The following variants exist:
Colour segmentation - pixels with similar colour get assigned to the same cluster. Applications include satellite image postprocessing to find areas of forest or sea, finding objects in robotics applications, and locating organs in medical images.
Semantic segmentation - all pixels that belong to the same type of object get assigned to the same segment. An application of this is in autonomous driving to find pedestrians - one segment containing all pedestrians.
Instance segmentation - similar to semantic segmentation, but individual objects are assigned to separate segments. In the autonomous driving application we would find individual pedestrians.
In Figure 1.4 and the corresponding code we show how to perform colour segmentation with \(k\)-means.
Important
The following dataset and the basic structure of the code are from Geron (2022), see GitHub.
Show the code for the figure
import numpy as np
import imageio.v3 as iio
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ["svg"]

im = np.asarray(iio.imread(
    "https://github.com/ageron/handson-ml3/blob/main/images/"
    "unsupervised_learning/ladybug.png?raw=true"))
plt.figure()
plt.imshow(im)
plt.axis("off")

Z = im.reshape(-1, 3)
for colours in [8, 4, 2]:
    kmeans = KMeans(n_clusters=colours, n_init=10, random_state=6020).fit(Z)
    plt.figure()
    plt.imshow(kmeans.cluster_centers_[kmeans.labels_].reshape(im.shape) / 255)
    plt.axis("off")
plt.show()
(a) Original image.
(b) Segmentation by 8 colours.
(c) Segmentation by 4 colours.
(d) Segmentation by 2 colours.
Figure 1.4: Colour segmentation for the image of a ladybug.
1.2 Hierarchical clustering
Similar to \(k\)-means, a simple hierarchical algorithm can be used to create a dendrogram. The resulting tree allows us to easily see whether the data is clustered, without the need for labelling or supervision.
There are two main approaches to creating the desired hierarchy: bottom-up, often called agglomerative, and top-down, often called divisive.
For the agglomerative approach each observation \(x_i\) is initially its own cluster, and in each step points are combined according to their distance. Eventually everything ends up in a single cluster and the algorithm stops.
For the divisive approach we go in the opposite direction and start with the super cluster containing all observations \(x_i\), which we gradually split up into smaller and smaller clusters. The algorithm stops when each observation is its own leaf.
Of course the norm used has quite an influence, as can be seen in (Kandolf 2025, sec. 7.1), where we compared the LASSO and RIDGE algorithms for our optimization problem.
To illustrate the agglomerative approach and the difference in norms we use four points and construct the hierarchy by combining the two closest points, see Figure 1.5 (a) and Figure 1.5 (b).
(a) Using the two-norm (Euclidean norm) to compute the distance - \(\| \cdot \|_2\)
(b) Using the one-norm (cityblock norm) to compute the distance - \(\| \cdot \|_1\)
Figure 1.5: We always combine the two nearest points and replace them with their midpoint. This iteration continues until only one point is left. The dendrogram is built according to the order in which the points are chosen. It can also include the distance.
On a larger scale, using the first 25 points of each of the two clusters from above, we get the results shown in Figure 1.6.
Figure 1.6: Constructing an agglomerative hierarchy for our dataset.
The two dendrograms (Figure 1.6 (a) and Figure 1.6 (c)) show the hierarchical structure derived from the dataset. The number of clusters can be influenced by the thresh parameter, which is also used to label the observations accordingly. It plays a role quite similar to the number of clusters \(k\) in the \(k\)-means algorithm.
The two bar graphs on the right (Figure 1.6 (b) and Figure 1.6 (d)) show how the observations are clustered in the dendrogram. The bars correspond to the distance metric produced by the algorithm. The red lines indicate where a perfect split would occur if the algorithm placed every point in its original cluster. If we recall that the first 25 points are from set 1 and the next 25 from set 2, we can see that both the Euclidean distance and the cityblock norm place one point in the wrong cluster.
Note
The function scipy.cluster.hierarchy.linkage allows us to specify the method for computing the distance between two clusters. The method average used here corresponds to the unweighted pair group method with arithmetic mean (UPGMA).
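A minimal sketch of how such a hierarchy and dendrogram can be computed with SciPy, assuming data from Section 1.1; the subset (first 25 points of each cluster), the average linkage, and the threshold value follow the description above, not necessarily the exact figure code.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# First 25 points of each cluster (`data` stacks the two sets of 200 points)
Y = np.concatenate((data[:25], data[200:225]))

# Agglomerative hierarchy with average linkage (UPGMA) and Euclidean metric
Z = linkage(Y, method="average", metric="euclidean")

plt.figure()
dendrogram(Z)  # hierarchical structure as in Figure 1.6 (a)

# Cut the tree at a distance threshold to obtain cluster labels
thresh = 0.85 * np.max(Z[:, 2])
labels = fcluster(Z, thresh, criterion="distance")
plt.show()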
Exercise 1.2 (Apply agglomerative clustering to the Fisher Iris dataset)
Work out how to use the scikit-learn alternative.
We already looked at the Fisher Iris dataset (see the introduction of this part of the notes) and discussed some basic features. Try with only two-dimensional data, as well as the full four-dimensional data, to find the three clusters.
For Figure 1.7 we will look at the moons dataset. Try working out the clusters in this case and look at the different possibilities for the linkage option, i.e. {"ward", "complete", "average", "single"}.
1.3 Density-based spatial clustering of applications with noise (DBSCAN)
The algorithm was introduced in Ester et al. (1996) and finds areas of high density. The main idea behind the algorithm is as follows.
Our observations (we will call them points here for consistency) form the basis, and they can live in any space, but as for all the algorithms presented here, there needs to be some way to measure distance. Furthermore, DBSCAN relies on two parameters: \(\epsilon\), describing the radius of the neighbourhood of a point, and \(minPts\), the minimum number of points needed to form a cluster. With these parameters, the following notions are defined:
A point \(p\) is called a core point if at least \(minPts\) points (including \(p\)) are within distance \(\epsilon\) of \(p\). This region is called the \(\epsilon\)-neighbourhood of \(p\).
A point \(q\) is called directly reachable from a core point \(p\) iff \(q\) is in the \(\epsilon\)-neighbourhood of \(p\).
A point \(q\) is reachable from \(p\) if there exists a sequence of points (a path) \(p=p_1, p_2, \ldots, p_n=q\) where each \(p_{i+1}\) is directly reachable from \(p_i\). Consequently, all points on the path, except possibly \(q\), need to be core points.
A point that is not reachable from any other point is considered an outlier or noise point.
Therefore, we get our definition of a cluster: a core point \(p\) together with all points that are reachable from \(p\). This includes core points (at least one) and non-core points (boundary points); the latter sit on the boundary as they cannot be used to reach further points.
In contrast to \(k\)-means, the algorithm can find non-linearly shaped (i.e. curved) clusters. We illustrate this with the moons dataset.
Important
For the plotting function please see GitHub, the accompanying repository to Geron (2022), as it forms the basis.
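A minimal sketch of the clustering behind Figure 1.7; the eps and min_samples values match the captions below, while the moons dataset parameters are assumptions.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half circles with a little noise
X, y = make_moons(n_samples=1000, noise=0.05, random_state=6020)

for eps in [0.05, 0.2]:
    dbscan = DBSCAN(eps=eps, min_samples=5).fit(X)
    # Label -1 marks noise points (outliers)
    n_clusters = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)
    n_noise = (dbscan.labels_ == -1).sum()
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")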
(a) Scikit-Learn’s DBSCAN with eps=0.05 and min_samples=5.
(b) Scikit-Learn’s DBSCAN with eps=0.2 and min_samples=5.
Figure 1.7: DBSCAN illustrated with the moons dataset.
The algorithm works best for clusters that are well separated by low-density regions. Like all algorithms that depend on distance measures, it suffers from the curse of dimensionality.
As can be seen in Figure 1.7 (a), the algorithm produces quite a lot of clusters for a small \(\epsilon=0.05\). In total we get 10 clusters and quite a lot of outliers that are not all easy to spot with the naked eye.
Nevertheless, if we increase \(\epsilon\) to \(0.2\) we can see that the two clusters are neatly reproduced, Figure 1.7 (b).
Note
The DBSCAN class in Scikit-Learn does not provide a .predict() method like many other such classes. See Geron (2022) and GitHub for how to train a \(k\)-nearest neighbour (kNN) classifier to perform this task.
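A minimal sketch of that idea, assuming the dbscan estimator fitted above with eps=0.2; the number of neighbours is an assumption.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Train a kNN classifier on the core points and their cluster labels
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(dbscan.components_, dbscan.labels_[dbscan.core_sample_indices_])

# Predict the cluster of new, unseen points
X_new = np.array([[-0.5, 0.0], [1.5, -0.1]])
print(knn.predict(X_new))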
Exercise 1.3 (Application of DBSCAN)
Apply the DBSCAN algorithm to our toy example (the two ellipsoids from Section 1.1) used throughout this section to recover the two clusters as well as possible. Try different \(\epsilon\) values as well as different norms (the metric parameter).
Additionally, for a higher-dimensional problem, use DBSCAN to split the Fisher Iris dataset (see the introduction of this part of the notes) into its three clusters of flowers.
1.4 Finite Mixture Models
Finite mixture models assume that the observations \(x_i\) are mixtures of \(k\) processes which combine in the measurement. Each mixture is defined via a probability model with unknown parameters. The aim of the algorithm is to find the parameters such that the \(k\) processes best describe the observations \(x_i\). The basis of the algorithm is the so-called expectation-maximization (EM) algorithm of Dempster, Laird, and Rubin (1977), which is designed to find the maximum-likelihood parameters of the defined statistical models.
The most common choice for the probability models is the Gaussian distribution, and for this choice the method is better known as Gaussian mixture models (GMM). This is often quite a sensible choice if nothing more is known about the observations, and a Gaussian distribution is therefore a safe bet. As a result each process can be described by two parameters, mean and variance. Each cluster, in turn, is described by an ellipsoidal shape whose size, density, orientation, and semi-axes vary.
Like \(k\)-means, in its simplest version GMM is provided with the number of clusters \(k\) and starts from an initial guess of the means and corresponding variances. The parameters are then iteratively updated to find a (local) maximum. There are some problems with this approach, e.g. a process can collapse onto a single observation with zero variance and mean equal to that point. Cross-validation can often help to avoid problems stemming from a poor initial guess.
As mentioned, the main idea of the mixture models is that the observations are a linear combination of probability density functions (PDF) \[
f(x_j, \Theta) = \sum_{p=1}^k \alpha_p f_p(x_j, \Theta_p),
\] where \(f\) denotes the observed/measured PDF with parameters \(\Theta\), \(f_p\) the PDF of mixture \(p\) with parameters \(\Theta_p\), and \(k\) denotes the number of mixtures. The weights \(\alpha_p\) fulfil \(\sum_p \alpha_p = 1\).
Overall, we can therefore formulate the algorithm as follows:
Definition 1.2 (Mixture Models Algorithm) Given the observed PDF \(f(x_j, \Theta)\), estimate the mixture weights \(\alpha_p\) and the parameters of the distributions \(\Theta_p\).
For GMM, \(f\) reads as \[
f(x_j, \Theta) = \sum_{p=1}^k \alpha_p \mathcal{N}_p(x_j, \mu_p, \sigma_p),
\] and for a given \(k\), we need to find \(\alpha_p, \mu_p, \sigma_p\). We obtain \(\Theta\) by maximizing \[
L(\Theta) = \sum_{j=1}^n\log f(x_j | \Theta).
\] \(L\) is called the log-likelihood function and we sum over all observations \(x_j\). We find a maximizer by setting the derivative to zero, \[
\frac{\partial L(\Theta)}{\partial \Theta} = 0,
\] and solve it via the EM algorithm. As the name suggests, there are two steps, E and M.
The E step uses the following posterior to establish memberships. We start off by assuming an initial guess of the vector \(\Theta\), which leads to the posterior probability that \(x_j\) belongs to distribution \(p\), \[
\tau_p(x_j, \Theta) = \frac{\alpha_p f_p(x_j, \Theta_p)}{f(x_j, \Theta)}.
\] In other words, we try to figure out if \(x_j\) is part of the \(p\)-th mixture. For GMM this becomes \[
\tau_p^{(l)}(x_j) = \frac{\alpha_p^{(l)} \mathcal{N}_p(x_j, \mu_p^{(l)}, \sigma_p^{(l)})}{f(x_j, \Theta^{(l)})},
\] in the \(l\)-th iteration.
Now the M step starts to update the parameters and mixture weights as \[
\begin{aligned}
\alpha_p^{(l+1)} &= \frac1n \sum_{j=1}^n \tau_p^{(l)}(x_j), \\
\mu_p^{(l+1)} &= \frac{\sum_{j=1}^n x_j\tau_p^{(l)}(x_j)}{\sum_{j=1}^n \tau_p^{(l)}(x_j)}, \\
\Sigma_p^{(l+1)} &= \frac{\sum_{j=1}^n \tau_p^{(l)}(x_j)[(x_j - \mu_p^{(l+1)})(x_j - \mu_p^{(l+1)})^\mathrm{T}]}{\sum_{j=1}^n \tau_p^{(l)}(x_j)},
\end{aligned}
\] where \(\Sigma_p^{(l+1)}\) denotes the covariance matrix. The algorithm now alternates between the E and M steps until convergence is reached.
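The E and M steps can be written quite compactly in NumPy. The following is a minimal sketch under the assumption of a plain random initialization and a fixed number of iterations; in practice we use scikit-learn's GaussianMixture, which is more robust.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=6020):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.full(k, 1 / k)                    # mixture weights
    mu = X[rng.choice(n, k, replace=False)]      # initial means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E step: posterior tau_p(x_j) for every observation and mixture
        dens = np.column_stack([
            alpha[p] * multivariate_normal.pdf(X, mu[p], Sigma[p])
            for p in range(k)])
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M step: update weights, means, and covariances
        Nk = tau.sum(axis=0)
        alpha = Nk / n
        mu = (tau.T @ X) / Nk[:, None]
        for p in range(k):
            diff = X - mu[p]
            Sigma[p] = (tau[:, p, None] * diff).T @ diff / Nk[p] \
                       + 1e-6 * np.eye(d)
    return alpha, mu, Sigma, tau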
Let us try it with the cats and dogs images in wavelet basis from the introduction by fitting to the second and fourth principal value of the SVD. Note that we directly load the images in wavelet basis and do not recompute them.
Important
The following dataset and the basic structure of the code are from (Brunton and Kutz 2022, Code 5.8). Also see GitHub.
(a) We can see a nice split between the two animals via the Gaussian distributions.
(b) The PDFs of the fitted distributions, weighted via the GMM algorithm.
Figure 1.8: Fitting of the second and fourth principal value of the cats and dogs images in wavelet basis via the GMM method.
As can be seen in Figure 1.8 (a), the two clusters are quite well captured by the GMM algorithm. The contour shows the linear combination of the two Gaussians. We see the two Gaussians on the right in Figure 1.8 (b), where we use the same colours as for the cats and dogs respectively. Note that the distributions are weighted by the \(\alpha_p\) from GMM, but other than that there is no unit in the z direction.
As discussed above, the algorithm is prone to problems during initialization; therefore we use n_init = 10 to initialize the algorithm 10 times and only keep the best result.
We can draw new samples from the resulting object via .sample(), and we can perform both hard and soft clustering. For hard clustering the model assigns each instance to the most likely cluster (.predict()), whereas for soft clustering (.predict_proba()) we get the probability of membership for each cluster.
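A minimal sketch of these calls, assuming data, the toy observations from Section 1.1, and two mixtures; the parameters mirror the text (n_init=10).

from sklearn.mixture import GaussianMixture

# Fit a two-component Gaussian mixture, keeping the best of 10 initializations
gm = GaussianMixture(n_components=2, n_init=10, random_state=6020).fit(data)

labels = gm.predict(data)        # hard clustering: most likely cluster
probs = gm.predict_proba(data)   # soft clustering: membership probabilities
X_new, y_new = gm.sample(5)      # draw new samples from the fitted model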
If the algorithm struggles to recover the clusters, we can help it by imposing the shape of the cluster via restricting the covariance matrix \(\Sigma_p\). This can be done via the parameter covariance_type, see docs.
Exercise 1.4 (Application of GMM)
Apply the GMM algorithm to our toy example (the two ellipsoids from Section 1.1) used throughout this section to recover the two clusters as well as possible. This should be straightforward, as we constructed it via Gaussian distributions. Try constraining the algorithm by imposing the different values possible for covariance_type.
Additionally, for a higher-dimensional problem, split the Fisher Iris dataset into its three clusters of flowers.
Try to use GMM to get clusters for the moons dataset used in Section 1.3. Also try the BayesianGaussianMixture class, where only an upper bound on the number of clusters needs to be specified.
Tip
GMM can also be used for anomaly detection. Simply consider every observation located in a low-density region an anomaly. For this, the threshold needs to be defined in accordance with the expected fraction of anomalies.
E.g. to get about \(4\%\) anomalies we can use the following snippet together with the above code:
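A minimal sketch, assuming a fitted GaussianMixture gm and the observation array data as above:

import numpy as np

# Flag the ~4% of observations with the lowest estimated density as anomalies
densities = gm.score_samples(data)
density_threshold = np.percentile(densities, 4)
anomalies = data[densities < density_threshold]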
Before we make the leap to supervised learning, we introduce a new dataset and see how our unsupervised approach works in this case.
We do this with the (in)famous MNIST dataset of handwritten digits. Let's start by looking at the dataset and establishing some basic properties.
We load the dataset and look at its description:
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', as_frame=False)
print(mnist.DESCR)
**Author**: Yann LeCun, Corinna Cortes, Christopher J.C. Burges
**Source**: [MNIST Website](http://yann.lecun.com/exdb/mnist/) - Date
unknown
**Please cite**:
The MNIST database of handwritten digits with 784 features, raw data
available at: http://yann.lecun.com/exdb/mnist/. It can be split in a
training set of the first 60,000 examples, and a test set of 10,000
examples
It is a subset of a larger set available from NIST. The digits have
been size-normalized and centered in a fixed-size image. It is a good
database for people who want to try learning techniques and pattern
recognition methods on real-world data while spending minimal efforts
on preprocessing and formatting. The original black and white
(bilevel) images from NIST were size normalized to fit in a 20x20
pixel box while preserving their aspect ratio. The resulting images
contain grey levels as a result of the anti-aliasing technique used by
the normalization algorithm. the images were centered in a 28x28 image
by computing the center of mass of the pixels, and translating the
image so as to position this point at the center of the 28x28 field.
With some classification methods (particularly template-based methods,
such as SVM and K-nearest neighbors), the error rate improves when the
digits are centered by bounding box rather than center of mass. If you
do this kind of pre-processing, you should report it in your
publications. The MNIST database was constructed from NIST's NIST
originally designated SD-3 as their training set and SD-1 as their
test set. However, SD-3 is much cleaner and easier to recognize than
SD-1. The reason for this can be found on the fact that SD-3 was
collected among Census Bureau employees, while SD-1 was collected
among high-school students. Drawing sensible conclusions from learning
experiments requires that the result be independent of the choice of
training set and test among the complete set of samples. Therefore it
was necessary to build a new database by mixing NIST's datasets.
The MNIST training set is composed of 30,000 patterns from SD-3 and
30,000 patterns from SD-1. Our test set was composed of 5,000 patterns
from SD-3 and 5,000 patterns from SD-1. The 60,000 pattern training
set contained examples from approximately 250 writers. We made sure
that the sets of writers of the training set and test set were
disjoint. SD-1 contains 58,527 digit images written by 500 different
writers. In contrast to SD-3, where blocks of data from each writer
appeared in sequence, the data in SD-1 is scrambled. Writer identities
for SD-1 is available and we used this information to unscramble the
writers. We then split SD-1 in two: characters written by the first
250 writers went into our new training set. The remaining 250 writers
were placed in our test set. Thus we had two sets with nearly 30,000
examples each. The new training set was completed with enough examples
from SD-3, starting at pattern # 0, to make a full set of 60,000
training patterns. Similarly, the new test set was completed with SD-3
examples starting at pattern # 35,000 to make a full set with 60,000
test patterns. Only a subset of 10,000 test images (5,000 from SD-1
and 5,000 from SD-3) is available on this site. The full 60,000 sample
training set is available.
Downloaded from openml.org.
The images are stored in .data and the labels in .target. We can look at the first 100 digits.
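A minimal sketch of how a figure like Figure 1.10 can be produced, assuming the mnist object loaded above; the layout details are assumptions.

import matplotlib.pyplot as plt

# Show the first 100 digits as a 10x10 grid of 28x28 images
fig, axs = plt.subplots(10, 10, figsize=(8, 8))
for ax, image in zip(axs.flat, mnist.data[:100]):
    ax.imshow(image.reshape(28, 28), cmap="binary")
    ax.set_axis_off()
plt.show()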
Figure 1.10: First 100 digits from the MNIST dataset.
As can be seen from Figure 1.10, most digits are easy for us to distinguish, but there are some nasty cases included. In particular, there are 1s written in the English style (just one straight line; 1st row, 7th column), but also as a diagonal line (1st row, 4th column), and finally in the German style (connecting lines at base and top; 3rd row, 5th column).
The digit 2 also comes in various forms, with a loop (1st row, 6th column) and without (3rd row, 6th column). Finally, we all know that a 3 can hide in an 8, just as a 7 or a 1 can hide in a 9, and much more.
Now, before we go any further in the investigation, we split the dataset into a training set and a test set. We will not use the split here, but we should establish a sound way of working right from the start.
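A minimal sketch of the split and of how the cluster means in Figure 1.11 can be obtained; the conventional MNIST split (first 60,000 images for training) is used, and the plotting details are assumptions.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Conventional MNIST split: first 60,000 images for training, rest for testing
X_train, y_train = mnist.data[:60000], mnist.target[:60000]
X_test, y_test = mnist.data[60000:], mnist.target[60000:]

# k-means with 10 clusters on the training images
kmeans = KMeans(n_clusters=10, n_init=10, random_state=6020).fit(X_train)

# Show each cluster center as a 28x28 image
fig, axs = plt.subplots(2, 5, figsize=(8, 3.5))
for ax, center in zip(axs.flat, kmeans.cluster_centers_):
    ax.imshow(center.reshape(28, 28), cmap="binary")
    ax.set_axis_off()
plt.show()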
Figure 1.11: The cluster means of \(k\)-means for 10 clusters.
As can be seen from Figure 1.11, the cluster centers do not recover our ten digits. In general they are, as expected, smudged, as they do not represent an actual image. It looks like \(0, 1, 2, 3, 6, 8, 9\) are easy to discover but \(4, 5, 7\) are not. If we look closely, we can see \(4, 5, 7\) represented in our clusters, but not as separate digits. Obviously this is not a good way to proceed to find our digits. We will discuss methods to perform this task in the next section, Chapter 2. Furthermore, keep this part in mind, as we will come back to how it can still be helpful in a different setting, semi-supervised learning, in Chapter 3.
Brunton, Steven L., and J. Nathan Kutz. 2022. Data-Driven Science and Engineering - Machine Learning, Dynamical Systems, and Control. 2nd ed. Cambridge: Cambridge University Press.
Dempster, Arthur P, Nan M Laird, and Donald B Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society: Series B (Methodological) 39 (1): 1–22.
Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 226–31. KDD’96. Portland, Oregon: AAAI Press.
Geron, Aurelien. 2022. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow 3e. 3rd ed. Sebastopol, CA: O’Reilly Media.