In this section we collect more detailed explanations to help clear up some topics that are only lightly touched upon in the main notes.
A.1 Wavelet decomposition for cats and dogs
In the introduction of the Clustering and Classification section we discuss how to use the wavelet transformation to transform the images of cats and dogs into a different basis. Here are the details of how this is performed, using cat zero as an example.
We use the Haar wavelet and we only need one level of transformation. As usual, we obtain four images, each at half the resolution, that represent the decomposition: a downsampled version of the original image, one highlighting the vertical features, one highlighting the horizontal features, and one highlighting the diagonal features.
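The decomposition described above can be sketched directly with NumPy. This is a minimal, hypothetical implementation (the notes may use a wavelet library instead); it computes one level of the 2-D Haar transform by combining the four pixels of each 2×2 block:

```python
import numpy as np

def haar2d(img):
    """One level of the 2-D Haar transform; each output is half the resolution."""
    a = img[0::2, 0::2].astype(float)  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2].astype(float)  # top-right
    c = img[1::2, 0::2].astype(float)  # bottom-left
    d = img[1::2, 1::2].astype(float)  # bottom-right
    ll = (a + b + c + d) / 2  # downsampled approximation
    lh = (a + b - c - d) / 2  # horizontal features
    hl = (a - b + c - d) / 2  # vertical features
    hh = (a - b - c + d) / 2  # diagonal features
    return ll, lh, hl, hh
```

Each of the four returned images has half the width and half the height of the input, matching the description above.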
Figure A.4: The horizontal highlights of the image.
Figure A.5: The diagonal highlights of the image.
For our purposes, only the vertical and horizontal features are of interest, and we combine these two images. To make sure the features are highlighted optimally, we need to rescale the images before combining them. For this we use a function similar to MATLAB's wcodemat.
import numpy as np

def rescale(data, nb):
    """Rescale data to the integer levels 1..nb, similar to MATLAB's wcodemat."""
    x = np.abs(data)
    x = x - np.min(x)
    x = nb * x / np.max(x)
    x = 1 + np.fix(x)
    x[x > nb] = nb
    return x
Note that the resulting image has only one fourth of the pixels of the original image. We can also visualize the transformation steps as shown in Figure A.8.
Figure A.8: Workflow to get from the original image to the wavelet transformed version.
A.2 Precision/Recall trade-off
In Section 2.2 we discuss the performance topics and come across the so-called precision/recall trade-off.
Let's remind ourselves of the definitions:
Recall, or true positive rate (TPR), is the fraction of relevant instances that are retrieved, i.e. true positives over all actual positives\[
\operatorname{recall} = \frac{TP}{P} = \frac{TP}{TP + FN}.
\]
Precision, on the other hand, is the fraction of relevant instances among all retrieved instances, i.e. true positives over the sum of true positives and false positives. \[
\operatorname{precision} = \frac{TP}{TP + FP}.
\]
In order to understand why precision and recall influence each other we need to understand how our classifier works.
Internally each observation given to the classifier is fed into a decision function that returns a score.
The score lives on some scale and, in the default setting, everything above zero is counted as a match; choosing a different threshold changes this. See Figure A.9. In the presented example the precision ranges from \(71\%\) to \(100\%\), while at the same time the recall ranges from \(100\%\) down to \(60\%\).
Figure A.9: Some representatives and their score and three different thresholds and the corresponding results for precision and recall.
Code that provides the basis for the above figure.
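The original listing is not reproduced here; a minimal sketch of how such a figure can be produced might look as follows. The scores and labels below are made up for illustration and are not the notes' actual data:

```python
import numpy as np

# hypothetical decision-function scores and true labels (1 = positive class)
scores = np.array([-3.5, -2.0, -1.0, -0.5, 0.2, 0.8, 1.5, 2.2, 3.0, 4.1])
labels = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])

def precision_recall(scores, labels, threshold):
    """Compute precision and recall when matching everything at or above threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0
    recall = tp / (tp + fn)
    return precision, recall

for t in (-1.5, 0.0, 1.0):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold {t:+.1f}: precision {p:.2f}, recall {r:.2f}")
```

Sweeping the threshold over the whole range of scores yields the curves in Figures A.10 and A.11: raising the threshold tends to increase precision and decrease recall.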
Figure A.10: Precision and recall vs. the score of the decision function.
Figure A.11: Precision vs. recall.
With the help of the precision vs. recall curve we can select a threshold appropriate for our classification, i.e. balance precision and recall as we see fit and as our classification allows.
A.3 Details on softmax
For Neural Networks (NNs) a common activation function is the so called softmax function, defined as \[
\sigma: \mathbb{R}^n \to (0, 1)^n, \quad \sigma(x)_i = \frac{\exp(x_i)}{\sum_j{\exp(x_j)}}.
\]
In this section we try to explain in more detail what it does and what it is used for.
This particular activation function is most commonly used as the output layer of an NN for a multi-class classification problem, as is the case in our dogs vs. cats example.
Loosely speaking, the main idea is that it transforms a vector into probabilities. By scaling via the sum of all exponentiated entries we end up with a total sum of 1. This makes us independent of the actual scale the previous layers of the network worked with and always lands us in the interval \((0, 1)\). The resulting probabilities can be interpreted as the network's confidence in the classification, per class.
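A direct NumPy sketch of the definition above (subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result, since it cancels in the quotient):

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability; the shift cancels in the quotient
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # all entries lie in (0, 1)
print(probs.sum())  # the entries sum to 1
```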
Let's look at an example: assume we have a simple NN with only two layers, the last being our softmax layer.
Figure A.12: A small NN for classifying hand written digits, we only distinguish between 5, 6 and all others.
We use the handwritten digits from Section A.2 as our example, where we only have three classes: 5, 6, and else. Here else should be understood as neither 5 nor 6. We classify the images from right to left. The output of our first layer is a number with no apparent scaling; after the softmax we can read off the most likely class suggested by our NN.
Figure A.13: From several images to the output of the NN.
As we can see, the exponential scaling makes sure that the classes separate quite well, and we can easily distinguish between the three of them. We can also see when the scores are very close and the NN is not sure which class is correct.
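The two situations above can be reproduced with small made-up score vectors (these are illustrative values, not the actual outputs of the network in Figure A.13):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

# confident case: one score dominates after exponential scaling
print(softmax(np.array([4.0, 1.0, 0.0])))
# uncertain case: scores are close, so the probabilities are nearly uniform
print(softmax(np.array([1.1, 1.0, 0.9])))
```

In the first case one class receives most of the probability mass; in the second, the three probabilities are nearly equal, signalling that the network is unsure.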