3 Semi-Supervised learning

There is a hybrid between unsupervised and supervised learning, called semi supervised learning. If we have a dataset with just a couple of labelled observations, we might be able to use the discussed clustering methods to extend the labels and therefore generate more labelled data.

We try this with the (in)famous MNIST dataset of hand written digits. As we did not do this when discussion MNIST in Section 2.2 we first looking at the dataset and establishing some basic properties.

We load the dataset and look at its description (see Section 1.5 for the output)

from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", as_frame=False)
print(mnist.DESCR)

The images are stored in .data and the label in .target. We can look at the first 100 digits

Show the code for the figure

import numpy as np
from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ["svg"]

im = np.zeros((10 * 28, 10 * 28), int)
for i in range(10):
    for j in range(10):
        im[28*i:28*(i+1), 28*j:28*(j+1)] = mnist.data[i*10 + j].reshape(28, 28)
plt.figure()
plt.imshow(im, cmap="binary")
plt.axis("off")

plt.show()

Figure 3.1: First 100 digits from the MNIST dataset.

Before we go any further in the investigation, we split up the dataset into a training and a test set.

X_train, X_test = mnist.data[:60000], mnist.data[60000:]
y_train, y_test = mnist.target[:60000], mnist.target[60000:]

Now let us try with \(k\)-means to find the 10 digits.

Show the code for the figure

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ["svg"]

k = 10
kmeans = KMeans(n_clusters=k, n_init=1, random_state=6020).fit(X_train)

im = np.zeros((1 * 28, 10 * 28), int)
for i in range(1):
    for j in range(10):
        im[28*i:28*(i+1), 28*j:28*(j+1)] = \
                kmeans.cluster_centers_[i*10 + j].reshape(28, 28)
plt.figure()
plt.imshow(im, cmap="binary")
plt.axis("off")

plt.show()

Figure 3.2: The cluster means of \(k\)-means for 10 clusters.

As can be seen from Figure 3.2, the cluster centers do not recover our 10 digits. It looks like \(0, 1, 2, 3, 6, 8, 9\) are easy to discover but \(4, 5, 7\) not. If we look closely, we can see \(4, 5, 7\) represented in our clusters but not as separate digits. Obviously, this is not a good way to proceed to find our digits. We discussed methods to perform this task in Chapter 2 now let us consider a different scenario.

Our aim is to only label 50 observations and not more. How can we do this smartly? For this task \(k\)-means is a good choice. Instead of trying to find our 10 digits we try to find 50 clusters within our dataset. We use the images closest to the mean as our representative and label these images. Now instead of labeling just 50 random digits we labelled 50 cluster centers. These labels we can than spread out onto the rest of the clusters and we can test how the performance is.

Important

Due to the nature of these notes, being compiled interactively, we restrict the dataset to 2000 points.

X_train = X_train[:2000]
y_train = y_train[:2000]

Important

The following approach is presented in a similar way in Geron (2022), see GitHub for code in more details.

To get a baseline for our algorithm we use a random forest for the classification and only work with 50 labels.

from sklearn.ensemble import RandomForestClassifier
n_labels = 50
forest = RandomForestClassifier(random_state=6020).fit(
            X_train[:n_labels], y_train[:n_labels])
score_forest = forest.score(X_test, y_test)
print(f"score for training with {n_labels} labels {score_forest}")

forest_full = RandomForestClassifier(random_state=6020).fit(X_train, y_train)
score_forest_full = forest_full.score(X_test, y_test)
print(f"score for training with {len(y_train)} labels {score_forest_full}")

score for training with 50 labels 0.5618
score for training with 2000 labels 0.9117

With our 50 labels we get a bit more than 56.18% correct but of course if we use all labels we can achieve results in the 90% range (if we train with all 60000 we get 97%). So how can we approach this problem?

Instead of just randomly selecting 50 images with labels, let us create 50 clusters, label the image centers, i.e. get good representatives of the classes we are interested in.

Show the code for the figure

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ["svg"]

k = 50
kmeans = KMeans(n_clusters=k, n_init=10, random_state=6020)
X_digits_dist = kmeans.fit_transform(X_train)

represent_digit_idx = X_digits_dist.argmin(axis=0)
X_rep = X_train[represent_digit_idx]

im = np.zeros((5 * 28, 10 * 28), int)
for i in range(5):
    for j in range(10):
        im[28*i:28*(i+1), 28*j:28*(j+1)] = X_rep[i*10 + j].reshape(28, 28)
plt.figure()
plt.imshow(im, cmap="binary")
plt.axis("off")

plt.show()

Figure 3.3: The representative (closest image to the cluster mean) of the 50 clusters found with \(k\)-means for the first 1500 digits in the MNIST dataset.

We can now label these observations

y_rep = np.array([
    "3", "4", "1", "9", "9", "0", "3", "0", "9", "1",
    "9", "8", "0", "9", "3", "6", "6", "7", "1", "6",
    "6", "5", "3", "2", "2", "0", "0", "6", "7", "2",
    "5", "1", "7", "3", "4", "8", "6", "0", "8", "5",
    "0", "3", "2", "3", "7", "4", "5", "4", "2", "7" 
])

forest_rep = RandomForestClassifier(random_state=6020).fit(X_rep, y_rep)
score_forest_rep = forest_rep.score(X_test, y_test)
print(f"score for training with {len(y_rep)} labels {score_forest_rep}")

score for training with 50 labels 0.6551

This helped us increase our score significantly, but we can do better. We can extend our labels from the representatives to the entire cluster and train with that.

y_train_prop = np.empty(len(X_train), dtype=str)
for i in range(k):
    y_train_prop[kmeans.labels_ == i] = y_rep[i]

forest_prop = RandomForestClassifier(random_state=6020).fit(
    X_train, y_train_prop)
score_forest_prop = forest_prop.score(X_test, y_test)
print(f"score for training with {len(y_train_prop)} "
      f"propagated labels {score_forest_prop}")

score for training with 2000 propagated labels 0.7672

This again increases our score by another good 10%. If we check our propagated labels, we see that, we can not expect more as we only have an accuracy slightly higher than our classification result, i.e. we provide wrong results labels for our classification.

np.mean(y_train_prop == y_train)

np.float64(0.7745)

Let us try to eliminate outliers by removing the 10% of instances that are far away from our cluster centers.

percentile_closest = 90

X_cluster_dist = X_digits_dist[np.arange(len(X_train)), kmeans.labels_]
for i in range(k):
    in_cluster = (kmeans.labels_ == i)
    cluster_dist = X_cluster_dist[in_cluster]
    cutoff_distance = np.percentile(cluster_dist, percentile_closest)
    above_cutoff = (X_cluster_dist > cutoff_distance)
    X_cluster_dist[in_cluster & above_cutoff] = -1

partially_propagated = (X_cluster_dist != -1)
X_train_pprop = X_train[partially_propagated]
y_train_pprop = y_train_prop[partially_propagated]
np.mean(y_train_pprop == y_train[partially_propagated])

np.float64(0.8102189781021898)

We have cleaned up our source but does it have a big influence?

forest_pprop = RandomForestClassifier(random_state=6020).fit(
    X_train_pprop, y_train_pprop)
score_forest_pprop = forest_pprop.score(X_test, y_test)
print(f"score for training with {len(y_train_pprop)} "
      f"labels {score_forest_pprop}")

score for training with 1781 labels 0.768

We actually do not change our result with this step much.

Nevertheless, overall we could see that by smartly labeling 50 out of 2000 instances we could increase our score from from about 56.18% to 76.8%, which is not bad.

Exercise 3.1 (Improve the results further) There are several ways we can improve our results.

Try to optimize the parameters of the random forest, for number of trees and leaves.
Optimize the clustering for the first step.
We can use other methods we have seen in Chapter 2 and combine them to an ensemble learning.
By starting to additionally label observations where our classifier is the least sure about.
We can again work with clusters to smartly label additional observations.

Note: if we use all the 60000 samples we get about 84% with the presented steps.