Recalling Fisher (1936) and the Iris dataset, we also find one of the first supervised learning methods in this paper. The linear discriminant analysis (LDA) introduced there has survived over time and is still one of the standard techniques for classification, even though more generalized and improved methods are used nowadays.
2.1 Linear Discriminant Analysis (LDA)
Important
The following introduction and illustrative dataset, as well as the basic structure of the code, are from (Brunton and Kutz 2022, Code 5.9). Also see GitHub.
The idea of LDA is to find a linear combination of features that optimally separates two or more classes. The crucial part of the algorithm is that it is guided by labelled observations. At its core, the algorithm solves an optimization problem: find a low-dimensional embedding of the data that shows a clear separation between the point distributions of the classes, or in other words, maximize the inter-class distance while minimizing the intra-class distance.
Tip
For supervised learning, it is always good practice to split our dataset into a training and a test set. It is important to have a test set that the algorithm has never seen! In general an \(80:20\) split is common, but other ratios might be advisable, depending on the dataset.
It is also common practice to use \(k\)-folds cross-validation for the training set.
Definition 2.1 (\(k\)-folds cross-validation) The \(k\)-folds cross-validation technique is a method to allow better hyperparameter tuning, especially for smaller datasets where training and validation data is scarce. The main idea is that we iterate over different validation sets by splitting up our training set. Let us say we use \(5\)-fold cross-validation: we split the training set into 5 parts, train the model with four parts and validate against the fifth. We then rotate the folds and select a different one for validation. At the end, we average over the five iterations to get our final parameters.
This looks something like this:
Figure 2.1: Common split with the folds 1 to 4 for training and 5 for validation in the first iteration and folds 2 to 5 for training and 1 for validation in the last. The test set is not touched.
It is important that the test set is not included in the folds to make sure we test against observations that the algorithm has never seen!
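A minimal sketch of such a split in scikit-learn could look as follows; the Iris data, the LDA estimator, and the parameter values are only placeholders for our own training set and model:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# placeholder data; in practice this is our own dataset
X, y = load_iris(return_X_y=True)
# 80:20 split, the test set is put aside and never used during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=6020)

# 5 folds on the training set only
cv = KFold(n_splits=5, shuffle=True, random_state=6020)
scores = cross_val_score(LinearDiscriminantAnalysis(), X_train, y_train, cv=cv)
print(scores, scores.mean())
```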
The main idea of LDA is to use projection. For a two-class LDA this becomes \[
w = \operatorname{arg\, max}_w \frac{w^\mathrm{T}S_B w}{w^\mathrm{T}S_W w},
\tag{2.1}\] (the generalized Rayleigh quotient) where \(w\) is our sought-after projection. The two included scatter matrices are \[
S_B = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^\mathrm{T},
\] for between-class relation as well as \[
S_W = \sum_{j=1}^2 \sum_{x\in\mathcal{D}_j} (x - \mu_j)(x - \mu_j)^\mathrm{T},
\] for within-class data. The set \(\mathcal{D}_j\) denotes the subdomain of the data associated with cluster \(j\). The matrix \(S_W\) measures the variance within each class, while \(S_B\) measures the separation of the class means. To solve Equation 2.1 we need to solve the generalized eigenvalue problem \[
S_B w = \lambda S_W w
\] where the maximal eigenvalue and the corresponding eigenvector are our solution.
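A minimal NumPy/SciPy sketch of this computation on two synthetic point clouds (the data and all names below are purely illustrative) could look like this:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(6020)
# two synthetic classes in 2D
X1 = rng.normal([0.0, 0.0], 0.5, (50, 2))
X2 = rng.normal([2.0, 1.0], 0.5, (50, 2))
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# between-class and within-class scatter matrices
d = (mu2 - mu1).reshape(-1, 1)
S_B = d @ d.T
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)

# generalized eigenvalue problem S_B w = lambda S_W w
vals, vecs = eigh(S_B, S_W)
w = vecs[:, np.argmax(vals)]  # eigenvector of the largest eigenvalue
print(w)
```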
We try this with the cats and dogs dataset in both bases.
(b) Trained and evaluated against the data in wavelet basis.
Figure 2.2: Evaluation of the LDA for the second and fourth principal component on the test set of 40 animals. A bar going up corresponds to dogs and one going down to cats. The first 20 individuals should be dogs, the second 20 cats. The red dotted line shows the split. True positives can be found in the top-left as well as the bottom-right.
If we use our raw dataset for the classification we get an overall accuracy of 67.5%, with \(\tfrac{4}{20}\) wrongly labelled dogs and \(\tfrac{9}{20}\) wrongly labelled cats. In the wavelet basis we can increase this to an accuracy of 82.5%, with \(\tfrac{5}{20}\) wrongly labelled dogs and \(\tfrac{2}{20}\) wrongly labelled cats.
This could be expected; see Figure 10 for the separation of the principal values in the two bases.
Of course, we have very limited data with only 80 images for each of the classes. In such a case we should perform a cross-validation; additionally, we have not shuffled the data.
Let us see how selecting different test and training sets influence the behaviour.
Figure 2.3: Cross-validation for the dataset in wavelet basis, using 100 runs with different training and test sets. We always use 120 images for training and 40 for testing.
With a maximal accuracy of 90.0% and a minimal accuracy of 65.0% our initial result with 82.5% was quite good and above average (77.1%). We can also see that training the model is always better than just a simple coin toss or random guessing for cat or dog.
Instead of a linear discriminant, we can also use a quadratic discriminant. To show the difference, let us look at the classification line of the two methods for our data in the wavelet basis.
Figure 2.4: Classification line for the LDA together with actual instances.
Figure 2.5: Classification line for the QDA together with actual instances.
As we can see in Figure 2.5, a quadratic classification line can be rather beneficial, depending, as always, on the observations. QDA arises from LDA when we drop the assumption that all classes share the same covariance.
Note
LDA and QDA assume a normal distribution as the basis for each of the clusters. This also allows us to write them as an update procedure with Bayes' theorem.
Tip
While for LDA it is possible to derive the correct function for the classification line directly, this is tricky for QDA. Luckily, the scikit-learn function DecisionBoundaryDisplay.from_estimator can help in such cases.
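A hedged sketch of how this could be used for a QDA on synthetic data (the data and parameter values are assumptions for illustration only):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.inspection import DecisionBoundaryDisplay

rng = np.random.default_rng(6020)
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)), rng.normal([2, 1], 1.0, (50, 2))])
y = np.repeat([0, 1], 50)

qda = QuadraticDiscriminantAnalysis().fit(X, y)

# draw the (quadratic) decision boundary on a grid behind the scatter plot
disp = DecisionBoundaryDisplay.from_estimator(qda, X, response_method="predict", alpha=0.3)
disp.ax_.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.show()
```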
Exercise 2.1 (Application of LDA)
Apply the LDA algorithm to the toy example (see Figure 1.1) to recover the two clusters as well as possible.
Additionally, for a higher dimensional problem, use LDA to split the Fisher Iris dataset (see the introduction to this part of the notes) into two clusters. Try the harder split between the versicolor and virginica types of flowers.
2.2 Measuring Performance
As in most applications, the question of how well an algorithm performs is not easy to answer. In Figure 2.3 we implied that we are doing better than a coin toss, but we should be able to characterize this more precisely.
Important
The following approach and the basic structure of the code are from Geron (2022), see GitHub.
In order to illustrate basic properties found in machine learning, we use the MNIST dataset together with a binary classifier based on stochastic gradient descent; see Section 1.5 for more on the MNIST dataset.
Note
Stochastic Gradient Descent (SGD) is an optimization algorithm. The key idea is to replace the actual gradient by a stochastic approximation in the optimization of the loss function. This allows especially good performance for large-scale and sparse machine learning problems.
We can use this training method for classification to find the optimal parameters for our loss function and in turn, this can be used as a binary classifier.
As SGD methods are sensitive to feature scaling and to the order of the observations, we need to make sure to use normalized data and we should shuffle it.
Next, as we only want a binary classifier, we select one digit, in our case 5, and relabel our data accordingly. With the new labels we can train our classifier, compare Definition 2.9.
In order to get a score for our method, we use \(k\)-folds cross-validation (Definition 2.1) and the corresponding scikit-learn function cross_val_score to perform this task for our model.
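A sketch of these two steps, assuming we fetch MNIST via fetch_openml and keep the usual first 60000 images as the training set (the variable names are our own choice):

```python
from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# MNIST via OpenML; dividing by 255 normalizes the pixel values to [0, 1]
mnist = fetch_openml("mnist_784", as_frame=False)
X, y = mnist.data / 255.0, mnist.target
X_train, y_train = X[:60000], y[:60000]

# relabel: is the digit a 5 or not -> a binary classification problem
y_train_5 = (y_train == "5")

sgd_clf = SGDClassifier(random_state=6020)
scores = cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
print(scores)
```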
With scores in the high \(90\%\) range the results look promising, if not great, but are they really that good? To get a better idea, consider a classifier that always guesses that an image is not a 5, which is the most common class in our case. To simulate this we use the DummyClassifier class.
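A possible baseline, reusing X_train and y_train_5 from the sketch above:

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# always predicts the most frequent class, i.e. "not a 5"
dummy_clf = DummyClassifier(strategy="most_frequent")
print(cross_val_score(dummy_clf, X_train, y_train_5, cv=3, scoring="accuracy"))
```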
This baseline already reaches about \(91\%\) accuracy (as expected, since only about \(10\%\) of the images in the dataset are 5s). Accuracy alone is apparently not the gold standard to measure performance, so what other possibilities are there?
Definition 2.2 (Confusion Matrix) The confusion matrix, error matrix or for unsupervised learning sometimes called matching matrix allows an easy way of visualizing the performance of an algorithm.
The rows represent the true observations in each class, and the columns the predicted observations for each class.
In our \(2\times 2\) case we get
Figure 2.6: Names and abbreviations for a \(2\times 2\) confusion matrix together with an example from our test case.
but it can be extended to multi-class classification.
To compute the confusion matrix we first need predictions. This can be achieved by cross_val_predict instead of cross_val_score, and then we use confusion_matrix from sklearn.metrics.
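Again reusing sgd_clf, X_train, and y_train_5 from the sketches above, this could look like:

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# out-of-fold predictions for every training observation
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

# rows: true classes, columns: predicted classes
print(confusion_matrix(y_train_5, y_train_pred))
```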
In sklearn.metrics most of these values have a corresponding function. If we look at precision, recall, and the \(F_1\) score for our example, we see that our performance appears in a different light:
This tells us that when our classifier predicts a 5, it is correct 81.31% of the time. On the other hand, it only recalls or detects 77.2% of our 5s. The \(F_1\)-score is a combination of the two (the harmonic mean) and is 79.2% in our case.
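For reference, in terms of the entries of the confusion matrix (Figure 2.6) these scores are defined as \[
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2\,\frac{\text{precision}\cdot\text{recall}}{\text{precision} + \text{recall}}.
\]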
Depending on the application we might want high precision (e.g. medical diagnosis, to avoid unnecessary treatment) or high recall (e.g. fraud detection, where a missed fraudulent transaction can be costly). If we increase precision we reduce recall and vice versa, so we can hardly have both. This dilemma is called the precision/recall trade-off, see Section A.2 for some more explanations.
An alternative way to look at the performance of binary classifiers is the receiver operating characteristic (ROC) curve. It plots recall (the true positive rate, TPR) against the false positive rate (FPR). Other than that it works similarly.
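A sketch of how such a curve can be computed, once more reusing sgd_clf, X_train, and y_train_5 from the sketches above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict

# decision scores instead of hard predictions
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

plt.plot(fpr, tpr, label="SGD classifier")
plt.plot([0, 1], [0, 1], "k--", label="random guessing")
plt.xlabel("false positive rate (FPR)")
plt.ylabel("true positive rate (TPR, recall)")
plt.legend()
plt.show()
```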
2.3 Support Vector Machine (SVM)
The basic idea of Support Vector Machines (SVMs) is to split observations into distinct clusters via hyperplanes. They have a long history in data science and come in different forms and fashions. Over the years they became more flexible, and they are still one of the most used tools in industry and science.
2.3.1 Linear SVM
We start off with the linear SVM, where we construct a hyperplane \[
\langle w, x\rangle + b = 0
\] with a vector \(w\) and a constant \(b\). There is a natural degree of freedom in this selection of the hyperplane, see Figure 2.7 for two different choices.
(a) Hyperplane with small margin.
(b) Hyperplane with large margin.
Figure 2.7: We see the hyperplane for the SVM classification scheme. The margin is much larger in the second choice.
The optimization inside the SVM aims to find the line that separates the classes best (fewest wrong classifications) while also keeping the largest margin between the observations (the yellow region). The vectors touching the edge of the yellow region are called support vectors, giving the algorithm its name.
With the hyperplane it is easy to classify an observation by simply computing the sign of the projection, i.e. \[
y_j = \operatorname{sign}(\langle w, x_j \rangle + b) = \begin{cases} +1\\-1\end{cases},
\] where \(+1\) corresponds to the versicolor (orange) and \(-1\) to the setosa (blue) observations in Figure 2.7. Therefore, the classifier depends on the position of the observation relative to the hyperplane and is not invariant under scaling.
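A minimal sketch of this classification rule with scikit-learn's LinearSVC on synthetic, linearly separable data; the data and the value of C are assumptions, and LinearSVC solves a soft-margin variant of the optimization problem discussed below:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(6020)
# two synthetic, linearly separable classes labelled -1 and +1
X = np.vstack([rng.normal([0, 0], 0.4, (30, 2)), rng.normal([3, 3], 0.4, (30, 2))])
y = np.repeat([-1, 1], 30)

svm = LinearSVC(C=1.0).fit(X, y)
w, b = svm.coef_[0], svm.intercept_[0]

# classify a new observation via the sign of the projection
x_new = np.array([1.0, 1.2])
print(np.sign(w @ x_new + b), svm.predict([x_new]))
```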
Exercise 2.2 (Linear SVM)
Compute the vector \(w\) in the two cases of Figure 2.7. The vector \(w\) is normal to the line. For Figure 2.7 (a) two points on the line are \(v_1 = [1.25, 4.1]^\mathrm{T}\), \(v_2 = [5, 7.4]^\mathrm{T}\). For Figure 2.7 (b) two points on the line are \(z_1 = [2.6, 4.25]^\mathrm{T}\), \(z_2 = [2, 7]^\mathrm{T}\).
Classify the two points \[
a = [1.4, 5.1]^\mathrm{T},
\]\[
b = [4.7, 7.0]^\mathrm{T}.
\]
Stating the optimization problem for a linear SVM such that it is smooth is a bit tricky. On the other hand, this is needed for most optimization algorithms to work, as they require a gradient of some sort.
Therefore, the following formulation is quite common: \[
\underset{w, b}{\operatorname{argmin}} \sum_j H(y_j, \overline{y}_j) + \frac12\|w\|^2 \quad \text{subject to}\quad \min_j|\langle x_j, w\rangle| = 1,
\] with \(H(y_j, \overline{y}_j) = \max(0, 1 - \langle y_j, \overline{y}_j\rangle)\), the so-called hinge loss function, which counts the errors. Furthermore, \(\overline{y}_j = \operatorname{sign}(\langle w, x_j\rangle + b)\).
2.3.2 Nonlinear SVM
To extend the SVM to more complex classification curves, the feature space can be extended. To do so, SVM introduces nonlinear features via a mapping \(x \to \Phi(x)\) and computes the hyperplane on these features; the hyperplane function becomes \[
f(x) = \langle w, \Phi(x)\rangle + b,
\] and accordingly we classify along \[
\operatorname{sign}(\langle w, \Phi(x_j)\rangle + b) = \operatorname{sign}(f(x_j)).
\]
Essentially, we change the feature space such that a separation is (hopefully) easier. To illustrate this we use a simple one-dimensional example, as shown in Figure 2.8 (a). Clearly, no linear separation is possible. On the other hand, if we use \[
\Phi(x_j) = (x_j, x_j^2)
\] as our transformation function we move to 2D space and the problem can easily be solved by a line at \(y=0.25\).
(a) Observations that can not be separated linearly.
(b) Enriched feature set with Φ(x)=(x, x^2).
Figure 2.8: Nonlinear classification with SVM.
As can be seen in Figure 2.8 (b), the SVM does a great job in finding a split for the two classes, even though it does not select the optimal line, which is not surprising given the small number of observations.
As mentioned before, SVMs are sensitive to scaling. Let us use this example to illustrate the difference, together with the concept of pipelines often used in a data science context.
Pipeline
The main idea of a pipeline is to create a composite as an ordered chain of transformations and estimators, see the docs for some more insights.
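A hedged sketch of such a pipeline for a 1D example in the spirit of Figure 2.8; the data points and the value of C are made up for illustration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# illustrative 1D data: the outer points form one class, the inner points the other
x = np.array([-0.9, -0.7, -0.6, -0.3, -0.1, 0.1, 0.35, 0.6, 0.75, 0.9]).reshape(-1, 1)
y = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 1])

# ordered chain: enrich the features -> scale -> linear SVM
model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), LinearSVC(C=10))
model.fit(x, y)
print(model.predict([[-0.8], [0.2]]))  # expected: [1 0]
```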
(a) Classification in the enriched Φ(x)=(x^0, x^1, x^2) and scaled space. Note the first dimension is ignored.
(b) Difference between the classification lines when transformed back into the original space.
Figure 2.9: Classification with autoscaler vs. no scaler.
Exercise 2.3 (Nonlinear SVM)
Extend the above findings to an example in 2D with a circular classification line. Create test data for your classification by changing to np.linspace(-1, 1, 12).
Recall the moons example from Section 1.3 and use a degree-\(3\) PolynomialFeatures transformation for classification.
In both cases, plot the classification line in a projection onto the original 2D space.
2.3.3 Kernel Methods for SVM
While enriching the feature space is, without doubt, extremely helpful, the curse of dimensionality quickly starts to influence the performance: the computation of \(w\) gets harder. The so-called kernel trick solves this problem. We express \(w\) in a different basis and solve for the parameters of that basis, i.e. \[
w = \sum_{j=1}^m \alpha_j \Phi(x_j)
\] where \(\alpha_j\) are called the weights of the different nonlinear observable functions \(\Phi(x_j)\). Our \(f\) becomes \[
f(x) = \sum_{j=1}^m \alpha_j \langle \Phi(x_j), \Phi(x) \rangle + b.
\] The so-called kernel function is defined as the inner product involved, i.e. \[
K(x_j, x) = \langle \Phi(x_j), \Phi(x) \rangle.
\] The optimization problem for \(w\) now reads \[
\underset{\alpha, b}{\operatorname{argmin}} \sum_j H(y_j, \overline{y}_j) + \frac12\left\|\sum_{j=1}^m \alpha_j \Phi(x_j)\right\|^2 \quad \text{subject to}\quad \min_j|\langle x_j, w\rangle| = 1,
\] with \(\alpha\) representing the vector of all the \(\alpha_j\). The important part here is that we now minimize over \(\alpha\), which is easier.
The kernel function allows an almost arbitrary number of observables as it can, for example, represent a Taylor series expansion. Furthermore, it allows an implicit computation in higher dimensions, since only inner products (or differences) between observations need to be evaluated.
One class of such kernels are the so-called radial basis functions (RBF), the simplest being a Gaussian kernel \[
K(x_j, x) = \exp\left(-\gamma\|x_j - x\|^2\right).
\]
In scikit-learn this is supported via the SVC class.
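A short sketch of its usage on generic ring-shaped data (the dataset as well as the gamma and C values are illustrative choices) before we turn to the cats and dogs:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=6020)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=6020)

# RBF kernel SVM; gamma and C control the kernel width and the regularization
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=5, C=1))
rbf_svm.fit(X_train, y_train)
print(rbf_svm.score(X_test, y_test))
```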
Let us test this implementation with the help of our dogs and cats example.
Figure 2.11: Training an SVM with an RBF kernel for the singular vectors 2 to 22. The picture shows the classification results projected onto the principal components 2 and 4.
We get a confusion matrix for our test set as
|   | PP | PN |
|---|----|----|
| P | 18 | 3  |
| N | 2  | 17 |
In Figure 2.11 we can see the results of the classification for the entire set of observations, shaded for the training set and with crosses marking the wrongly classified data. With 8 wrongly classified images we have quite a good result compared to LDA or QDA (Figure 2.5). Note that the classification is hard to recognise for the two classes in this simple projection. With the parameters C and gamma we can influence the classification.
Exercise 2.4 (Nonlinear SVM with RBF) Recall the moons example from Section 1.3 and use an SVC classifier to distinguish the clusters. Look at four different results for \(\gamma \in \{0.1, 5\}\) and \(C \in \{0.001, 1000\}\), compare Geron (2022).
In all of the four images plot the classification line in a projection onto the original 2D space.
Exercise 2.5 (Nonlinear SVM for regression) We can use SVM for regression. Have a look at the docs and use the findings to fit the following observations with various degrees and kernel functions.
```python
import matplotlib.pyplot as plt
import numpy as np

%config InlineBackend.figure_formats = ["svg"]

np.random.seed(6020)

m = 100
x = 6 * np.random.rand(m) - 3
y = 1 / 2 * x ** 2 + x + 2 + np.random.randn(m)

fig = plt.figure()
plt.scatter(x, y, label="observations")
plt.show()
```
2.4 Decision Trees
Decision trees are a common tool in data science, both for classification and regression. They are a powerful class of algorithms that can handle not only numerical data. Furthermore, they form the basis of random forests, one of the most powerful machine learning algorithms available to date.
They were not invented for machine learning but have been a staple in business for centuries. Their basic idea is to establish an algorithmic flow chart for making decisions. The criteria that create the splits in each branch are related to the desired outcome and are therefore important. Often, experts are called upon to create such a decision tree.
Decision tree learning follows the same principles to create a predictive classification model based on the provided observations and labels. Similar to DBSCAN, trees form a hierarchical structure that tries to split in an optimal way. In this regard, they are the counterpart to DBSCAN, as they move from top to bottom and, of course, use the labels to guide the process.
The following key features make them widely used:
They usually produce interpretable results (we can draw the graph).
The algorithm mimics human decision making, which helps with the interpretation.
They can handle numerical and categorical data.
They perform well with large sets of data.
The reliability of the classification can be assessed with statistical validation.
While there are a lot of different optimizations the base algorithm follows these steps:
Look through all components (features) of the observations \(x_j\) and find the split that gives the best labelling prediction \(y_j\).
Compare the prediction accuracy over all observations; the best split is used.
Proceed with the two new branches in the same fashion.
Let us apply it to the Fisher Iris dataset to better understand what is happening.
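A sketch along the following lines (not necessarily the exact code behind Figure 2.12) produces a comparable tree:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()

# a shallow tree of depth 2, as in Figure 2.12
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=6020)
tree_clf.fit(iris.data, iris.target)

# plot_tree is a plain matplotlib alternative to the dot/graphviz export discussed below
plot_tree(tree_clf, feature_names=iris.feature_names, class_names=list(iris.target_names), filled=True)
plt.show()
```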
Figure 2.12: Decision tree for the Fisher Iris dataset with depth 2.
Displaying dot files
In the code for Figure 2.12 we generate a dot file that is interpreted by Quarto. This allows a better visual integration. In order to do the same offline you need to install graphviz (which provides dot) and also install the Python package pydotplus.
Then you should be able to use:
```python
import pydotplus
from six import StringIO
from sklearn.tree import export_graphviz

# write the tree structure into a dot string; tree_clf is the trained DecisionTreeClassifier
dot_data = StringIO()
export_graphviz(tree_clf, out_file=dot_data, filled=True, rounded=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.create_png()
```
As we can see, for our tree with depth \(2\), we only need to split along petal length (cm) and petal width (cm), leaving the two other features untouched, compare Figure 2.
As we only have the splits happening in these two variables we can also visualize them easily.
Figure 2.13: Splits for the Fisher Iris dataset, with the first two splits from the above tree; the third split would be the next step for a larger tree.
In Figure 2.13 we can see the first two splits and the next split if we were to increase the tree depth. With the first split, we immediately separate setosa with \(100\%\) accuracy. The two other classes are a bit trickier and we cannot classify everything correctly right away. In total, \(6\) out of \(150\) observations are wrongly classified with this simple tree.
Let us also apply the tree classification to our cats and dogs example.
Figure 2.14: Decision tree for the cats and dogs dataset with depth 2.
|   | PP | PN |
|---|----|----|
| P | 18 | 3  |
| N | 2  | 17 |
If we compare our confusion matrix for the test cases to the one of the SVM, we get comparable results. In general, we can see that the first split is along \(PC_2\) (note that we do not use \(PC_1\) in the code and therefore \(x_0=PC_2\)) and our second split is along \(PC_4\). We used the same components before. The third split is along \(PC_5\), which we did not consider beforehand. Overall, the mean accuracy for our test set is 77.5%.
Exercise 2.6 (A tree on the moon) Recall the moons example from Section 1.3 and use a DecisionTreeClassifier classification to distinguish the clusters and plot the decision splits.
Play around with the parameters, e.g. min_samples_leaf = 5 and see how this influences the score for a test set.
Exercise 2.7 (Tree regression) We can use a decision tree for regression. Have a look at the docs and use the findings to fit the observations of Exercise 2.5 with various max_depth values and, alternatively, with no max_depth but limiting min_samples_leaf=10.
Sensitivity to rotation and initial state
Due to the nature of decision trees and the way they split observations along feature axes (i.e. by axis-parallel lines), they are sensitive to rotation. Furthermore, the tree construction involves randomness, as the feature used for a split can be chosen at random.
(a) Clear split along the middle for a vertical split.
(b) More complicated structure for the rotated observation.
(c) Different initial random state for the rotated observations.
(d) Correction of the rotated observations via PCA and scaling, resulting in an easy split.
Figure 2.15: Illustration of the sensitivity to rotation of decision trees. Note both trees split perfectly.
In Figure 2.15 (a) we see the split for the original dataset (random numbers in \([-0.5, 0.5]^2\) with the classes separated at \(x_1 > 0\)). A simple line, i.e. one split, is enough to make the separation. If we rotate the dataset by 45° we see a much more complicated separation line in Figure 2.15 (b). Furthermore, Figure 2.15 (c) shows that the random_state has an influence as well. Finally, if we apply scaling and PCA to the data we more or less end up at our original sample again, see Figure 2.15 (d), showcasing the power of these feature extraction techniques once more.
Note that all the trees classify perfectly; only the structure of the split is not as easy to recognize as it could be.
As we can see, trees are very sensitive to the observations and the random state. In order to counteract this phenomenon we can compute multiple trees and average the results. Such a combination of trees is called an ensemble and leads to the next topic.
2.5 Ensemble learning and random forests
The notion of the wisdom of the crowd suggests that the averaged decision of multiple people results in a better decision/judgement than the decision of a single person. Of course there is a lot of statistics involved if we want to compute exactly how the result is influenced, whether we should use the same weight for each classification, how to handle the notion of experts, and much more.
Nevertheless, we can use this concept in our context to create so-called ensemble methods; the process itself is called ensemble learning. With this concept we can combine already good classification methods and get a better result than the best included classification method.
If we train a multitude of decision trees on various (random) subsets of our observations, we can combine the predictions of the individual trees to an ensemble prediction. The resulting method is called a random forest. This very simple process allows us to generate very powerful classification methods.
There are several different approaches for the combination of such methods; we only discuss them briefly, see (Geron 2022, chap. 7) for a more detailed discussion.
Important
All of the below discussed methods and approaches can be found in the sklearn.ensemble module.
Definition 2.3 (Voting classifiers) If we have a set of classifiers \(C=\{c_1, \ldots, c_n\}\) of various kinds (even another ensemble classifier like a random forest is welcome), we can simply make a prediction with each, resulting in \(r_1, \ldots, r_n\). If we now select the class which occurs most often, i.e. the mode of the predictions, we get a new classifier. This is called hard voting.
If all our classifiers can not only produce a prediction but also a probability for that prediction, we can create a soft voting classifier. All we need to do is average the predicted probabilities \(p_1, \ldots, p_n\); this gives us new probabilities for our ensemble classifier.
Figure 2.16: Illustration of the difference between hard and soft voting for an ensemble method. For the three shown classifiers the class 0 is the most common. When moving to probabilities, the mean also predicts 0, where more convinced classifiers get a higher weight.
Definition 2.4 (Bagging and Pasting) For bagging and pasting the idea is different. Instead of influencing the output, these approaches influence how the training of a set of (potentially identical) classifiers is manipulated, to achieve overall better results.
Bagging (bootstrap aggregating) uses sampling with replacement, i.e. the same observation can end up multiple times in the training set of the same classifier. Pasting uses sampling without replacement, and therefore an observation can be used by multiple classifiers but at most once per classifier.
To predict, we can use hard or soft voting from above.
Figure 2.17: Illustration of the random sampling for bagging and pasting in ensemble classifiers.
With these sampling methods it is possible to use the out-of-bag observations (everything that is not used for a particular training) for evaluation of the trained classifier. This is called out-of-bag evaluation.
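A brief sketch of bagging with out-of-bag evaluation; the Iris data and all parameter values are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# bagging: sampling with replacement (bootstrap=True); pasting would use bootstrap=False
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100, max_samples=0.8,
    bootstrap=True, oob_score=True, random_state=6020,
)
bag_clf.fit(X, y)
print(bag_clf.oob_score_)  # out-of-bag evaluation
```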
Definition 2.5 (Random Patches and Random Subspaces) For bagging and pasting it is also possible to sample features and not only observations, i.e. what to look at. This results in a random subset of input features for training each classifier. It is especially useful for high-dimensional data inputs such as images, as it can speed up the learning process.
We call such methods random patches method if we sample both, training observations and training features.
On the other hand, if we keep the training observations fixed and only sample the training features, the resulting method is called a random subspace method.
In scikit-learn this can be achieved by manipulating the arguments max_features, bootstrap_features, and max_samples in the BaggingClassifier class.
Definition 2.6 (Random Forest) An ensemble of decision trees, (usually) trained via bagging, is called a random forest.
With a random forest it is quite easy to find out what features are important for the overall result. In scikit-learn this is automatically computed for the sklearn.ensemble.RandomForestClassifier class.
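A minimal sketch using the wine dataset (an arbitrary choice for illustration):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

wine = load_wine()
forest = RandomForestClassifier(n_estimators=500, random_state=6020)
forest.fit(wine.data, wine.target)

# relative importance of each feature, the values sum to 1
for name, score in zip(wine.feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```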
Definition 2.7 (Boosting) Boosting, or sometimes hypothesis boosting, is the process of using ensemble methods to combine several weak learners into a strong learner.
The general idea is to train the classifiers sequentially, where each classifier tries to correct the results of its predecessor, which helps to achieve overall better results.
The most common methods are called AdaBoost (adaptive boosting) and gradient boosting.
Definition 2.8 (Stacking) The general idea of stacking is to use a blender for combining the results of our ensemble methods. The blender is not just a linear combination like for soft/hard voting but another classifier/model that performs a hopefully better combination. Of course we can stack this approach and produce multiple layers before we combine them into a single result.
Figure 2.18: Illustration of an \(m\) layer stacking with various classifiers in each layer.
Exercise 2.8 (Ensemble methods) We follow the example of (Geron 2022, chap. 7) to explore the various possibilities for ensemble methods.
For our dataset we use the moons example:
```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.30, random_state=6020)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=6020)
```
Use the VotingClassifier class with an LDA, an SVM, and a tree for classification. Make sure to set random_state for each, to have reproducible results. Get the overall score for the test set, as well as the individual scores for the included classifiers.
```python
for name, clf in voting.named_estimators_.items():
    print(f"{name} = {clf.score(X_test, y_test)}")
```
Switch to soft voting for your ensemble classifier and see how this influences the results.
Create a BaggingClassifier with 500 trees and an appropriate max_samples value. Report the score for this new classifier; also report the out-of-bag score via .oob_score_.
Directly create a RandomForestClassifier with 500 trees and an appropriate value for max_leaf_nodes.
Create a StackingClassifier with an LDA, an SVM, and a tree for classification and a random forest as the final blending step, and report the score of this method.
Train a random forest for the Fisher Iris dataset and check .feature_importances_ to get an insight into the importance of each feature.
Train a random forest for our dogs and cats dataset in raw and wavelet form and check .feature_importances_ to get an insight into the importance of each feature of the PCA.
2.6 Multiclass Classification
We mainly focused on the classification of two classes in the last sections, but obviously not all problems consist of only two classes. Some of the methods discussed, like random forests, support multiclass classification out of the box. For others, there are several approaches to create a multiclass classifier out of a binary classifier.
Definition 2.9 (One vs. the Rest (OvR) or One vs. All (OvA)) The one versus the rest (OvR) or one versus all (OvA) strategy is to train a binary classifier for each class and always interpret all other classes as the others or the rest class. To classify an observation you get the scores for each classifier and select the one with the highest score.
For the Fisher Iris dataset this would result in three classifiers (one for each iris species), for the MNIST dataset in ten (one for each digit).
Definition 2.10 (One vs. One (OvO)) The one versus one (OvO) strategy is to train a binary classifier for each pair of classes, building up a set of classifiers; for \(n\) classes we get \(\tfrac{n (n-1)}{2}\) classifiers. To classify an observation you get the result from each classifier and select the class with the most duels won.
For the Fisher Iris dataset this would result in three classifiers, for the MNIST dataset in 45.
The advantage of OvO over OvR is that each classifier only needs to be trained on a subset and not with the entire dataset. This is especially useful for algorithms that do not scale well, like SVMs.
Tip
In scikit-learn the framework automatically realizes that we use a binary classifier for a multiclass problem and it will select OvR or OvO automatically, depending on the algorithm.
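If we want to force a particular strategy instead of relying on this automatic choice, scikit-learn provides explicit wrappers; a small sketch (the SVC and the Iris data are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovr_clf = OneVsRestClassifier(SVC()).fit(X, y)
ovo_clf = OneVsOneClassifier(SVC()).fit(X, y)

print(len(ovr_clf.estimators_))  # 3 binary classifiers: one per class
print(len(ovo_clf.estimators_))  # n(n-1)/2 = 3 binary classifiers: one per pair
```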
Create a confusion matrix for more than two classes with your results and interpret the results.
Brunton, Steven L., and J. Nathan Kutz. 2022. Data-Driven Science and Engineering - Machine Learning, Dynamical Systems, and Control. 2nd ed. Cambridge: Cambridge University Press.