We start with unsupervised learning. The goal of unsupervised learning is to discover clusters in observations that carry no labels, i.e. we have no reference to compare against. Several algorithms exist for this task; the most prominent is the \(k\)-means clustering algorithm.
Tip
We will use scikit-learn in this section heavily. The User Guide and Examples sections in the documentation are well structured and a good source for additional information.
Furthermore, there are several Examples that compare different methods on different datasets and therefore give a nice overview of their strengths and weaknesses. Have a look at them to gain further insight into the methods and their differences.
The \(k\)-means algorithm tries to partition a set of \(m\) (vector-valued) observations into \(k\) clusters. While the dimension of the data and the number of observations are generally known, the number of clusters is often not known a priori.
The general idea is to label each observation as belonging to the cluster with the nearest mean (the representative of the cluster). The resulting clusters are called Voronoi cells, see Wikipedia - Voronoi diagram.
Definition 1.1 (\(k\)-Means Algorithm) For a given set of \(m\) observations \((x_1, x_2, \ldots, x_m)\), with \(x_i\in \mathrm{R}^n\), the algorithm strives to find \(k\) sets \((S_1, S_2, \ldots, S_k)\) such that the variance inside each cluster is minimized, i.e. \[
\underset{S}{\operatorname{argmin}} \sum_{i=1}^k \sum_{x\in S_i}\| x - \mu_i \|^2,
\] where \(\mu_i\) denotes the mean of \(S_i\).
The algorithm itself is iterative; for a given \(k\):
Randomly initialize \(k\) points \(\mu_1, \mu_2, \ldots, \mu_k\), as the cluster centers,
Label each observation \(x_i\) by the nearest cluster center \(\mu_j\), all points with the same label form the set \(S_j\).
Compute the mean of each cluster \(S_j\) and replace the current \(\mu_j\) by it.
Repeat, starting from step 2, until the cluster centers stay stable up to some tolerance.
This algorithm was first introduced in Lloyd (1982) and is therefore often called Lloyd's algorithm.
In general, the problem in Definition 1.1 is NP-hard and therefore computationally not viable to solve exactly. Nevertheless, a number of heuristic algorithms exist that provide good performance in practice, despite having no guarantee of global convergence.
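The figure code below relies on a helper lloyd(data, centers, n_steps) that performs a given number of iterations and returns the labels and the updated centers. The helper itself is not reproduced in the text; a minimal sketch of such a function, implementing the steps listed above (the interface is an assumption inferred from how it is called), could look as follows.

import numpy as np

# A minimal sketch of Lloyd's algorithm; interface assumed from the figure code
def lloyd(data, centers, n_steps):
    centers = centers.copy()
    for _ in range(n_steps):
        # Step 2: label each observation by the nearest cluster center
        dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        classes = np.argmin(dist, axis=1)
        # Step 3: replace each center by the mean of its cluster
        for j in range(centers.shape[0]):
            if np.any(classes == j):
                centers[j] = data[classes == j].mean(axis=0)
    return classes, centers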
We illustrate the procedure with the help of some artificial observations in two-dimensional space and show how the clustering takes place.
Important
The following dataset and the basic structure of the code are from (Brunton and Kutz 2022, Code 5.5 - 5.6). Also see GitHub.
Show the code for the figure
import numpy as np
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ["svg"]
np.random.seed(6020)

# Helper for plotting
def plot_lloyd(data, centers, classes, ms=500):
    plt.figure()
    for i in range(centers.shape[0]):
        plt.scatter(data[classes == i, 0], data[classes == i, 1])
        plt.scatter(centers[i, 0], centers[i, 1], c="k", s=ms, marker="*")
    plt.gca().set_aspect(1)

# Create data for illustration
n = 200
# Random ellipse centred in (0, 0) and axis (1, 0.5)
X = np.random.randn(n, 2) * np.array([1, 0.5])
# Random ellipse centred in (1, -2) and axis (1, 0.2)
X2 = (np.random.randn(n, 2) * np.array([1, 0.2])) + np.array([1, -2])
# Rotate ellipse 2 by theta
theta = np.pi / 4
X2 = X2 @ np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta), np.cos(theta)]])
centers = np.array([[0., -1.], [-1., 0.]])
data = np.concatenate((X, X2))

# Plot initial step with theoretical assignment
classes = np.concatenate((np.ones(X2.shape[0]), np.zeros(X.shape[0])))
plot_lloyd(data, centers, classes)

# Compute and plot consecutive steps of Lloyd's algorithm
for i in [1, 2, 5]:
    classes, centers_ = lloyd(data, centers, i)
    plot_lloyd(data, centers, classes)
    centers = centers_
plt.show()
(a) The original two classes and the initial guesses for the centers.
(b) Point association to the clusters in the first step.
(c) Point association to the clusters in the third step.
(d) Point association to the clusters in the eighth step.
Figure 1.1: Lloyd's algorithm in action for a generic set of two datasets generated with Gaussian distributions.
In Figure 1.1 (a) we see the two distinct clusters and the initial guesses for the centers. In the successive plots, we see how the centers move and converge to the final position, as seen in Figure 1.1 (d). In this case the algorithm converges after the sixth step.
Of course the algorithm is sensitive to the initial guess, and therefore modern implementations provide strategies to determine the initial guess (e.g. the k-means++ initialization), as well as to choose the number of clusters.
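In scikit-learn this is provided by the KMeans class. A minimal sketch of how a figure like Figure 1.2 can be produced, assuming data and plot_lloyd from the code above (the exact parameters used for the figure are an assumption):

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit k-means for two and three clusters and reuse the plotting helper
for k in [2, 3]:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=6020).fit(data)
    plot_lloyd(data, kmeans.cluster_centers_, kmeans.labels_)
plt.show()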
Figure 1.2: KMeans algorithm in action for a generic set of two datasets generated with a Gaussian distribution.
As can be seen in Figure 1.2 (a), the algorithm comes up with (almost) the same split between the two sets as our crude implementation. If we try three clusters, Figure 1.2 (b), the result is sensible as well.
The major success of this algorithm in general use is based on its fast convergence, which requires no supervision. We can also see that it is not very accurate, compare Figure 1.1 (a) and Figure 1.1 (d). This is no surprise, as the algorithm does not have all the information available.
How can we determine how accurate the algorithm is? Without labels this is of course not easy to do, but cross-validation is a good tool.
In our case we can produce the labels and we can also split the data beforehand into a training set and a test set. Usually a so-called \(80:20\) split is used, i.e. \(80\%\) training data and \(20\%\) test data.
The result of this can be seen in Figure 1.3 where we have two points wrongly classified to the opposite cluster.
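A minimal sketch of such a split and evaluation, assuming data and the true labels classes from the code above; this mirrors the idea behind Figure 1.3 rather than its exact code.

from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# 80:20 split of the toy data and the (known) labels
X_train, X_test, y_train, y_test = train_test_split(
    data, classes, test_size=0.2, random_state=6020)

# Fit on the training set, assign clusters on the test set
kmeans = KMeans(n_clusters=2, n_init=10, random_state=6020).fit(X_train)
y_pred = kmeans.predict(X_test)
# Note: the cluster indices need not coincide with the original label values.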
There exist several extensions of the basic \(k\)-means algorithm that improve the results and overall performance. Two such versions are accelerated \(k\)-means and mini-batch \(k\)-means. Both can be found in Geron (2022).
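Mini-batch \(k\)-means is available in scikit-learn as MiniBatchKMeans and can be used as a drop-in replacement; the parameters below are illustrative assumptions.

from sklearn.cluster import MiniBatchKMeans

# Fit k-means on small random batches instead of the full dataset at once
mbk = MiniBatchKMeans(n_clusters=2, batch_size=64, n_init=10,
                      random_state=6020).fit(data)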
Exercise 1.1 (Apply the \(k\)-means algorithm to other datasets) As an exercise, to get some practice using \(k\)-means, apply the algorithm to some other datasets and see how it performs.
We already looked at the Fisher Iris dataset (see the introduction of this part of the notes) and discussed some basic features. Try with only two-dimensional data, as well as the full four-dimensional data, to find the three clusters.
For Figure 1.7 we will look at the moons dataset. Try working out the clusters in this case.
Apply the algorithm to the cats and dogs images (in wavelet basis, see the introduction of this part of the notes) for various principal components and find the two clusters.
1.1.1 Applications of \(k\)-means
The \(k\)-means algorithm is used in a multitude of applications; we list some here:
Image segmentation: To decompose an image into different regions we can use \(k\)-means. Applications range from robotics to surveillance, where one or more objects are separated from the rest of the image.
Customer segmentation/social network analysis: To segment customers along e.g. their purchase history/behaviour, preferences, demographic data, amount spent, search queries, social media interactions, etc. \(k\)-means is used in marketing, retail, and advertising to personalize the experience.
Text clustering: In natural language processing (NLP) \(k\)-means is often used to cluster similar documents or text to make analysing large volumes of text data feasible.
Fraud detection: \(k\)-means is a crude tool for fraud detection in finance and banking. Transactions are clustered according to similarities and anomalies are detected. There exist more sophisticated methods in finance, but \(k\)-means is an easy-to-understand starting point.
Anomaly detection: In medical (image) data \(k\)-means is often used to detect anomalies by finding points that fall outside of clusters. The same works for cybersecurity and e.g. network traffic.
Recommendation systems: By grouping users together it is easier to recommend them new songs, items for shopping, and more.
Quality control: By grouping similar products we can detect quality issues and defects in production processes. If an issue is found the part can be checked and the process adjusted.
Traffic analysis: In transport and logistics we can analyze traffic patterns and use the information to optimize and group similar trips, routes, and vehicles.
We want to highlight image segmentation as an application of \(k\)-means. The example shown here was first presented in Geron (2022).
The idea of image segmentation is to decompose an image into different segments. The following variants exist:
Colour segmentation - pixels with similar colour get assigned to the same cluster. Applications include satellite image postprocessing to find areas of forest or sea, finding objects in robotics applications, and locating organs in medical images.
Semantic segmentation - all pixels that belong to the same type of object get assigned to the same segment. An application of this is in autonomous driving to find pedestrians - one segment containing all pedestrians.
Instance segmentation - similar to semantic segmentation, but individual objects are assigned to separate segments. In the autonomous driving application we would find individual pedestrians.
In Figure 1.4 and the corresponding code we show how to perform colour segmentation with \(k\)-means.
Important
The following dataset and the basic structure of the code are from Geron (2022), see GitHub.
Show the code for the figure
import numpy as np
import imageio.v3 as iio
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ["svg"]

im = np.asarray(iio.imread(
    "https://github.com/ageron/handson-ml3/blob/main/images/"
    "unsupervised_learning/ladybug.png?raw=true"))
plt.figure()
plt.imshow(im)
plt.axis("off")

Z = im.reshape(-1, 3)
for colours in [8, 4, 2]:
    kmeans = KMeans(n_clusters=colours, n_init=10, random_state=6020).fit(Z)
    plt.figure()
    plt.imshow(kmeans.cluster_centers_[kmeans.labels_].reshape(im.shape) / 255)
    plt.axis("off")
plt.show()
(a) Original image.
(b) Segmentation by 8 colours.
(c) Segmentation by 4 colours.
(d) Segmentation by 2 colours.
Figure 1.4: Colour segmentation for the image of a ladybug.
1.2 Hierarchical clustering
Similar to \(k\)-means, a simple hierarchical algorithm can be used to create a dendrogram. The resulting tree allows us to easily see whether the data is clustered, without the need for labelling or supervision.
There are two main approaches to creating the desired hierarchy: bottom-up, often called agglomerative, and top-down, often called divisive.
For the agglomerative approach each observation \(x_i\) is initially its own cluster, and in each step points are combined according to their distance. Eventually everything ends up in a single cluster and the algorithm stops.
For the divisive approach we go in the opposite direction and start with the super cluster containing all observations \(x_i\), which we gradually split up into smaller and smaller clusters. The algorithm stops when each observation is its own leaf.
Of course the norm used has quite an influence, as can be seen in (Kandolf 2025, sec. 7.1), where we compared the LASSO and RIDGE algorithms for our optimization problem.
To illustrate the agglomerative approach and the difference in norms we use four points and construct the hierarchy by combining the two closest points, see Figure 1.5 (a) and Figure 1.5 (b).
(a) Using the two-norm (Euclidean norm) to compute the distance - \(\| \cdot \|_2\)
(b) Using the one-norm (cityblock norm) to compute the distance - \(\| \cdot \|_1\)
Figure 1.5: We always combine the two nearest points and replace them with their midpoint. This iteration continues until only one point is left. The dendrogram is built according to the order in which the points are chosen. It can also include the distance.
On a larger scale, using the first 25 points of each of the two clusters from above, we get the results shown in Figure 1.6.
Figure 1.6: Constructing an agglomerative hierarchy for our dataset.
The two dendrograms (Figure 1.6 (a) and Figure 1.6 (c)) show the hierarchical structure derived from the dataset. The number of clusters can be influenced by the thresh parameter, which is also used to label the observations accordingly. It plays a role quite similar to the number of clusters \(k\) in the \(k\)-means algorithm.
The two bar graphs on the right (Figure 1.6 (b) and Figure 1.6 (d)) show how the observations are clustered in the dendrogram. The bars correspond to the distance metric produced by the algorithm. The red lines indicate where a perfect split would occur if the algorithm placed every point in its original cluster. If we recall that the first 25 points are from set 1 and the next 25 from set 2, we can see that both the Euclidean distance and the cityblock norm place one point in the wrong cluster.
Note
The function scipy.cluster.hierarchy.linkage allows us to specify the method for computing the distance between two clusters. The method average used here corresponds to the unweighted pair group method with arithmetic mean (UPGMA).
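A minimal sketch of how such a hierarchy and dendrogram can be computed with SciPy, assuming data from Section 1.1; the subset (first 25 points of each cluster), the average linkage, and the threshold value follow the description above, not necessarily the exact figure code.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# First 25 points of each cluster (`data` stacks the two sets of 200 points)
Y = np.concatenate((data[:25], data[200:225]))

# Agglomerative hierarchy with average linkage (UPGMA) and Euclidean metric
Z = linkage(Y, method="average", metric="euclidean")

plt.figure()
dendrogram(Z)  # hierarchical structure as in Figure 1.6 (a)

# Cut the tree at a distance threshold to obtain cluster labels
thresh = 0.85 * np.max(Z[:, 2])
labels = fcluster(Z, thresh, criterion="distance")
plt.show()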
Exercise 1.2 (Apply agglomerative clustering to the Fisher Iris dataset)
Work out how to use the scikit-learn alternative.
We already looked at the Fisher Iris dataset (see the introduction of this part of the notes) and discussed some basic features. Try with only two-dimensional data, as well as the full four-dimensional data, to find the three clusters.
For Figure 1.7 we will look at the moons dataset. Try working out the clusters in this case and look at the different possibilities for the linkage option, i.e. {"ward", "complete", "average", "single"}.
1.3 Density-based spatial clustering of applications with noise (DBSCAN)
The algorithm was introduced in Ester et al. (1996) and finds areas of high density. The main idea behind the algorithm is as follows.
Our observations (we will call them points here for consistency) form the basis, and they can live in any space, but as for all the algorithms presented here, there needs to be some way to measure distance. Furthermore, DBSCAN relies on two parameters: \(\epsilon\), describing the radius of the neighbourhood of a point, and \(minPts\), the minimum number of points needed to form a cluster. With these parameters, the following notions are defined:
A point \(p\) is called a core point if at least \(minPts\) points (including \(p\)) are within distance \(\epsilon\) of \(p\). This region is called the \(\epsilon\)-neighbourhood of \(p\).
A point \(q\) is called directly reachable from a core point \(p\) iff \(q\) is in the \(\epsilon\)-neighbourhood of \(p\).
A point \(q\) is reachable from \(p\) if there exists a sequence of points (a path) \(p=p_1, p_2, \ldots, p_n=q\) where each \(p_{i+1}\) is directly reachable from \(p_i\). Consequently, all points on the path, except possibly \(q\), need to be core points.
A point that is not reachable from any other point is considered an outlier or noise point.
Therefore, we get our definition of a cluster: a core point \(p\) together with all points that are reachable from \(p\). This includes core points (at least one) and non-core points (boundary points); the latter sit on the boundary as they cannot be used to reach further points.
In contrast to \(k\)-means, the algorithm can find non-linearly shaped (i.e. curved) clusters. We illustrate this with the moons dataset.
Important
For the plotting function please see GitHub, the accompanying repository to Geron (2022), as it forms the basis.
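A minimal sketch of the clustering behind Figure 1.7; the eps and min_samples values match the captions below, while the moons dataset parameters are assumptions.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half circles with a little noise
X, y = make_moons(n_samples=1000, noise=0.05, random_state=6020)

for eps in [0.05, 0.2]:
    dbscan = DBSCAN(eps=eps, min_samples=5).fit(X)
    # Label -1 marks noise points (outliers)
    n_clusters = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)
    n_noise = (dbscan.labels_ == -1).sum()
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")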
(a) Scikit-Learn’s DBSCAN with eps=0.05 and min_samples=5.
(b) Scikit-Learn’s DBSCAN with eps=0.2 and min_samples=5.
Figure 1.7: DBSCAN illustrated with the moons dataset.
The algorithm works best for clusters that are well separated by low-density regions. Like all algorithms that depend on distance measures, it suffers from the curse of dimensionality.
As can be seen in Figure 1.7 (a), the algorithm produces quite a lot of clusters for a small \(\epsilon=0.05\). In total we get 10 clusters and quite a lot of outliers that are not all easy to spot with the naked eye.
Nevertheless, if we increase \(\epsilon\) to \(0.2\) we can see that the two clusters are neatly reproduced, Figure 1.7 (b).
Note
The DBSCAN class in Scikit-Learn does not provide a .predict() method like many other such classes. See Geron (2022) and GitHub for how to train a \(k\)-nearest neighbour (kNN) classifier to perform this task.
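A minimal sketch of that idea, assuming the dbscan estimator fitted above with eps=0.2; the number of neighbours is an assumption.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Train a kNN classifier on the core points and their cluster labels
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(dbscan.components_, dbscan.labels_[dbscan.core_sample_indices_])

# Predict the cluster of new, unseen points
X_new = np.array([[-0.5, 0.0], [1.5, -0.1]])
print(knn.predict(X_new))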
Exercise 1.3 (Application of DBSCAN)
Apply the DBSCAN algorithm to our toy example (the two ellipsoids from Section 1.1) used throughout this section to recover the two clusters as well as possible. Try different \(\epsilon\) values as well as different norms (the metric parameter).
Additionally, for a higher-dimensional problem, use DBSCAN to split the Fisher Iris dataset (see the introduction of this part of the notes) into its three clusters of flowers.
1.4 Finite Mixture Models
Finite mixture models assume that the observations \(x_i\) are mixtures of \(k\) processes which combine in the measurement. Each mixture is defined via a probability model with unknown parameters. The aim of the algorithm is to find the parameters such that the \(k\) processes best describe the observations \(x_i\). The basis of the algorithm is the so-called expectation-maximization (EM) algorithm of Dempster, Laird, and Rubin (1977), which is designed to find the maximum-likelihood parameters of the defined statistical models.
The most common choice for the probability models is the Gaussian distribution, and for this choice the method is better known as Gaussian mixture models (GMM). This is often quite a sensible choice if nothing more is known about the observations, and a Gaussian distribution is therefore a safe bet. As a result each process can be described by two parameters, mean and variance. Each cluster, in turn, is described by an ellipsoidal shape whose size, density, orientation, and semi-axes vary.
Like \(k\)-means, in its simplest version GMM is provided with the number of clusters \(k\) and starts from an initial guess of the means and corresponding variances. The parameters are then iteratively updated to find a (local) maximum. There are some problems with this approach, e.g. a process can collapse onto a single observation with zero variance and mean equal to that point. Cross-validation can often help to avoid problems stemming from a poor initial guess.
As mentioned, the main idea of the mixture models is that the observations are a linear combination of probability density functions (PDF) \[
f(x_j, \Theta) = \sum_{p=1}^k \alpha_p f_p(x_j, \Theta_p),
\] where \(f\) denotes the observed/measured PDF with parameters \(\Theta\), \(f_p\) the PDF of mixture \(p\) with parameters \(\Theta_p\), and \(k\) denotes the number of mixtures. The weights \(\alpha_p\) fulfil \(\sum_p \alpha_p = 1\).
Overall, we can therefore formulate the algorithm as follows:
Definition 1.2 (Mixture Models Algorithm) Given the observed PDF \(f(x_j, \Theta)\), estimate the mixture weights \(\alpha_p\) and the parameters of the distributions \(\Theta_p\).
For GMM, \(f\) reads as \[
f(x_j, \Theta) = \sum_{p=1}^k \alpha_p \mathcal{N}_p(x_j, \mu_p, \sigma_p),
\] and for a given \(k\), we need to find \(\alpha_p, \mu_p, \sigma_p\). We obtain \(\Theta\) by maximizing \[
L(\Theta) = \sum_{j=1}^n\log f(x_j | \Theta).
\] \(L\) is called the log-likelihood function and we sum over all observations \(x_j\). We find a maximizer by setting the derivative to zero, \[
\frac{\partial L(\Theta)}{\partial \Theta} = 0,
\] and solve it via the EM algorithm. As the name suggests, there are two steps, E and M.
The E step uses the following posterior to establish memberships. We start off by assuming an initial guess of the vector \(\Theta\), which leads to the posterior probability that \(x_j\) belongs to distribution \(p\), \[
\tau_p(x_j, \Theta) = \frac{\alpha_p f_p(x_j, \Theta_p)}{f(x_j, \Theta)}.
\] In other words, we try to figure out if \(x_j\) is part of the \(p\)-th mixture. For GMM this becomes \[
\tau_p^{(l)}(x_j) = \frac{\alpha_p^{(l)} \mathcal{N}_p(x_j, \mu_p^{(l)}, \sigma_p^{(l)})}{f(x_j, \Theta^{(l)})},
\] in the \(l\)-th iteration.
Now the M step starts to update the parameters and mixture weights as \[
\begin{aligned}
\alpha_p^{(l+1)} &= \frac1n \sum_{j=1}^n \tau_p^{(l)}(x_j), \\
\mu_p^{(l+1)} &= \frac{\sum_{j=1}^n x_j\tau_p^{(l)}(x_j)}{\sum_{j=1}^n \tau_p^{(l)}(x_j)}, \\
\Sigma_p^{(l+1)} &= \frac{\sum_{j=1}^n \tau_p^{(l)}(x_j)[(x_j - \mu_p^{(l+1)})(x_j - \mu_p^{(l+1)})^\mathrm{T}]}{\sum_{j=1}^n \tau_p^{(l)}(x_j)},
\end{aligned}
\] where \(\Sigma_p^{(l+1)}\) denotes the covariance matrix. The algorithm now alternates between the E and M steps until convergence is reached.
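The E and M steps can be written quite compactly in NumPy. The following is a minimal sketch under the assumption of a plain random initialization and a fixed number of iterations; in practice we use scikit-learn's GaussianMixture, which is more robust.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=6020):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.full(k, 1 / k)                    # mixture weights
    mu = X[rng.choice(n, k, replace=False)]      # initial means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E step: posterior tau_p(x_j) for every observation and mixture
        dens = np.column_stack([
            alpha[p] * multivariate_normal.pdf(X, mu[p], Sigma[p])
            for p in range(k)])
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M step: update weights, means, and covariances
        Nk = tau.sum(axis=0)
        alpha = Nk / n
        mu = (tau.T @ X) / Nk[:, None]
        for p in range(k):
            diff = X - mu[p]
            Sigma[p] = (tau[:, p, None] * diff).T @ diff / Nk[p] \
                       + 1e-6 * np.eye(d)
    return alpha, mu, Sigma, tau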
Let us try it with the cats and dogs images in wavelet basis from the introduction by fitting to the second and fourth principal value of the SVD. Note that we directly load the images in wavelet basis and do not recompute them.
Important
The following dataset and the basic structure of the code are from (Brunton and Kutz 2022, Code 5.8). Also see GitHub.
(a) We can see a nice split between the two animals via the Gaussian distributions.
(b) The PDFs of the fitted distributions, weighted via the GMM algorithm.
Figure 1.8: Fitting of the second and fourth principal value of the cats and dogs images in wavelet basis via the GMM method.
As can be seen in Figure 1.8 (a), the two clusters are quite well captured by the GMM algorithm. The contour shows the linear combination of the two Gaussians. We see the two Gaussians on the right in Figure 1.8 (b), where we use the same colours as for the cats and dogs respectively. Note that the distributions are weighted by the \(\alpha_p\) from GMM, but other than that there is no unit in the z direction.
As discussed above, the algorithm is prone to problems during initialization; therefore we use n_init = 10 to initialize the algorithm 10 times and only keep the best result.
We can draw new samples from the resulting object via .sample(), and we can perform both hard and soft clustering. For hard clustering the model assigns each instance to the most likely cluster (.predict()), whereas for soft clustering (.predict_proba()) we get the probability of membership for each cluster.
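A minimal sketch of these calls, assuming data, the toy observations from Section 1.1, and two mixtures; the parameters mirror the text (n_init=10).

from sklearn.mixture import GaussianMixture

# Fit a two-component Gaussian mixture, keeping the best of 10 initializations
gm = GaussianMixture(n_components=2, n_init=10, random_state=6020).fit(data)

labels = gm.predict(data)        # hard clustering: most likely cluster
probs = gm.predict_proba(data)   # soft clustering: membership probabilities
X_new, y_new = gm.sample(5)      # draw new samples from the fitted model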
If the algorithm struggles to recover the clusters, we can help it by imposing the shape of the cluster via restricting the covariance matrix \(\Sigma_p\). This can be done via the parameter covariance_type, see docs.
Exercise 1.4 (Application of GMM)
Apply the GMM algorithm to our toy example (the two ellipsoids from Section 1.1) used throughout this section to recover the two clusters as well as possible. This should be straightforward, as we constructed it via Gaussian distributions. Try constraining the algorithm by imposing the different values possible for covariance_type.
Additionally, for a higher-dimensional problem, split the Fisher Iris dataset into its three clusters of flowers.
Try to use GMM to get clusters for the moons dataset used in Section 1.3. Also try the BayesianGaussianMixture class, where only an upper bound on the number of clusters needs to be specified.
Tip
GMM can also be used for anomaly detection. Simply consider every observation located in a low-density region an anomaly. For this, the threshold needs to be defined in accordance with the expected fraction of anomalies.
E.g. to get about \(4\%\) anomalies we can use the following snippet together with the above code:
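A minimal sketch, assuming a fitted GaussianMixture gm and the observation array data as above:

import numpy as np

# Flag the ~4% of observations with the lowest estimated density as anomalies
densities = gm.score_samples(data)
density_threshold = np.percentile(densities, 4)
anomalies = data[densities < density_threshold]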
Before we make the leap to supervised learning, we introduce a new dataset and see how our unsupervised approach works in this case.
We do this with the (in)famous MNIST dataset of handwritten digits. Let's start by looking at the dataset and establishing some basic properties.
We load the dataset and look at its description:
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', as_frame=False)
print(mnist.DESCR)
**Author**: Yann LeCun, Corinna Cortes, Christopher J.C. Burges
**Source**: [MNIST Website](http://yann.lecun.com/exdb/mnist/) - Date
unknown
**Please cite**:
The MNIST database of handwritten digits with 784 features, raw data
available at: http://yann.lecun.com/exdb/mnist/. It can be split in a
training set of the first 60,000 examples, and a test set of 10,000
examples
It is a subset of a larger set available from NIST. The digits have
been size-normalized and centered in a fixed-size image. It is a good
database for people who want to try learning techniques and pattern
recognition methods on real-world data while spending minimal efforts
on preprocessing and formatting. The original black and white
(bilevel) images from NIST were size normalized to fit in a 20x20
pixel box while preserving their aspect ratio. The resulting images
contain grey levels as a result of the anti-aliasing technique used by
the normalization algorithm. the images were centered in a 28x28 image
by computing the center of mass of the pixels, and translating the
image so as to position this point at the center of the 28x28 field.
With some classification methods (particularly template-based methods,
such as SVM and K-nearest neighbors), the error rate improves when the
digits are centered by bounding box rather than center of mass. If you
do this kind of pre-processing, you should report it in your
publications. The MNIST database was constructed from NIST's NIST
originally designated SD-3 as their training set and SD-1 as their
test set. However, SD-3 is much cleaner and easier to recognize than
SD-1. The reason for this can be found on the fact that SD-3 was
collected among Census Bureau employees, while SD-1 was collected
among high-school students. Drawing sensible conclusions from learning
experiments requires that the result be independent of the choice of
training set and test among the complete set of samples. Therefore it
was necessary to build a new database by mixing NIST's datasets.
The MNIST training set is composed of 30,000 patterns from SD-3 and
30,000 patterns from SD-1. Our test set was composed of 5,000 patterns
from SD-3 and 5,000 patterns from SD-1. The 60,000 pattern training
set contained examples from approximately 250 writers. We made sure
that the sets of writers of the training set and test set were
disjoint. SD-1 contains 58,527 digit images written by 500 different
writers. In contrast to SD-3, where blocks of data from each writer
appeared in sequence, the data in SD-1 is scrambled. Writer identities
for SD-1 is available and we used this information to unscramble the
writers. We then split SD-1 in two: characters written by the first
250 writers went into our new training set. The remaining 250 writers
were placed in our test set. Thus we had two sets with nearly 30,000
examples each. The new training set was completed with enough examples
from SD-3, starting at pattern # 0, to make a full set of 60,000
training patterns. Similarly, the new test set was completed with SD-3
examples starting at pattern # 35,000 to make a full set with 60,000
test patterns. Only a subset of 10,000 test images (5,000 from SD-1
and 5,000 from SD-3) is available on this site. The full 60,000 sample
training set is available.
Downloaded from openml.org.
The images are stored in .data and the labels in .target. We can look at the first 100 digits.
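A minimal sketch of how a figure like Figure 1.10 can be produced, assuming the mnist object loaded above; the layout details are assumptions.

import matplotlib.pyplot as plt

# Show the first 100 digits as a 10x10 grid of 28x28 images
fig, axs = plt.subplots(10, 10, figsize=(8, 8))
for ax, image in zip(axs.flat, mnist.data[:100]):
    ax.imshow(image.reshape(28, 28), cmap="binary")
    ax.set_axis_off()
plt.show()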
Figure 1.10: First 100 digits from the MNIST dataset.
As can be seen from Figure 1.10, most digits are easy for us to distinguish, but there are some nasty cases included. In particular, there are 1s written in the English style (just one straight line; 1st row, 7th column), but also as a diagonal line (1st row, 4th column), and finally in the German style (connecting lines at base and top; 3rd row, 5th column).
The digit 2 also comes in various forms, with a loop (1st row, 6th column) and without (3rd row, 6th column). Finally, we all know that a 3 can hide in an 8, just as a 7 or a 1 can hide in a 9, and much more.
Now, before we go any further in the investigation, we split the dataset into a training set and a test set. We will not use the split here, but we should establish a sound way of working right from the start.
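A minimal sketch of the split and of how the cluster means in Figure 1.11 can be obtained; the conventional MNIST split (first 60,000 images for training) is used, and the plotting details are assumptions.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Conventional MNIST split: first 60,000 images for training, rest for testing
X_train, y_train = mnist.data[:60000], mnist.target[:60000]
X_test, y_test = mnist.data[60000:], mnist.target[60000:]

# k-means with 10 clusters on the training images
kmeans = KMeans(n_clusters=10, n_init=10, random_state=6020).fit(X_train)

# Show each cluster center as a 28x28 image
fig, axs = plt.subplots(2, 5, figsize=(8, 3.5))
for ax, center in zip(axs.flat, kmeans.cluster_centers_):
    ax.imshow(center.reshape(28, 28), cmap="binary")
    ax.set_axis_off()
plt.show()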
Figure 1.11: The cluster means of \(k\)-means for 10 clusters.
As can be seen from Figure 1.11, the cluster centers do not recover our ten digits. In general they are, as expected, smudged, as they do not represent an actual image. It looks like \(0, 1, 2, 3, 6, 8, 9\) are easy to discover but \(4, 5, 7\) are not. If we look closely, we can see \(4, 5, 7\) represented in our clusters, but not as separate digits. Obviously this is not a good way to proceed to find our digits. We will discuss methods to perform this task in the next section, Chapter 2. Furthermore, keep this part in mind, as we will come back to how it can still be helpful in a different setting, semi-supervised learning, in Chapter 3.
Brunton, Steven L., and J. Nathan Kutz. 2022. Data-Driven Science and Engineering - Machine Learning, Dynamical Systems, and Control. 2nd ed. Cambridge: Cambridge University Press.
Dempster, Arthur P, Nan M Laird, and Donald B Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society: Series B (Methodological) 39 (1): 1–22.
Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 226–31. KDD’96. Portland, Oregon: AAAI Press.
Geron, Aurelien. 2022. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow 3e. 3rd ed. Sebastopol, CA: O’Reilly Media.