Linear classification
The approach will have two major components: a score function that maps the raw data to class scores, and a loss function that quantifies the agreement between the predicted scores and the ground truth labels. We will then cast this as an optimization problem in which we will minimize the loss function with respect to the parameters of the score function.
Parameterized mapping from images to label scores
We have a training dataset of images \(x_i \in R^D\), each associated with a label \(y_i\). Here \(i = 1 \dots N\), where \(N\) is the number of examples, and \(y_i \in \{ 1 \dots K \}\), where \(K\) is the number of classes. We will now define the score function \(f: R^D \mapsto R^K\) that maps the raw image pixels to class scores.
Linear classifier
It can be defined as:

\(f(x_i, W, b) = W x_i + b\)

where the matrix \(W\) (of size \([K \times D]\)) holds the weights and the vector \(b\) (of size \([K \times 1]\)) holds the biases.
Every row of \(W\) is effectively a classifier for one of the classes.
Interpretation of linear classifiers as template matching. Another interpretation for the weights \(W\) is that each row of \(W\) corresponds to a template (or sometimes also called a prototype) for one of the classes. Another way to think of it is that we are still effectively doing Nearest Neighbor, but instead of having thousands of training images we are only using a single image per class (although we will learn it, and it does not necessarily have to be one of the images in the training set).
Bias trick
A commonly used trick is to combine the two sets of parameters into a single matrix that holds both of them by extending the vector \(x_i\) with one additional dimension that always holds the constant \(1\) - a default bias dimension. With the extra dimension, the new score function simplifies to a single matrix multiply:

\(f(x_i, W) = W x_i\)

where \(W\) is now of size \([K \times (D+1)]\), with the bias vector folded in as its last column.
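For concreteness, here is a minimal numpy sketch of the bias trick on made-up shapes (the dimensions and variable names are illustrative, not taken from the lecture code):

```python
import numpy as np

D, K = 3072, 10              # assumed input dimensionality and number of classes
x_i = np.random.randn(D)     # a single (already flattened) example
W = np.random.randn(K, D) * 0.001
b = np.zeros(K)

# Original form: separate weight matrix and bias vector.
scores = W.dot(x_i) + b

# Bias trick: append a constant 1 to x_i and fold b into an extra column of W.
x_ext = np.append(x_i, 1.0)            # shape (D + 1,)
W_ext = np.hstack([W, b[:, None]])     # shape (K, D + 1)
scores_ext = W_ext.dot(x_ext)          # single matrix multiply

assert np.allclose(scores, scores_ext)
```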
Image data processing
It is important to center your data by subtracting the mean from every feature. In the case of images, this corresponds to computing a mean image across the training images and subtracting it from every image to get images where the pixels range from approximately [-127 … 127]. Further common preprocessing is to scale each input feature so that its values range from [-1, 1] (divide by standard deviation).
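A hedged sketch of this preprocessing step, using randomly generated placeholder data in place of a real image dataset:

```python
import numpy as np

# Hypothetical raw pixel data, values in [0, 255], one flattened image per row.
X_train = np.random.randint(0, 256, size=(500, 3072)).astype(np.float64)
X_test = np.random.randint(0, 256, size=(100, 3072)).astype(np.float64)

# Compute the mean image from the training data only ...
mean_image = X_train.mean(axis=0)

# ... and subtract it from every image; pixels now lie roughly in [-127, 127].
X_train -= mean_image
X_test -= mean_image

# Optionally scale each feature to roughly [-1, 1] by its standard deviation.
std = X_train.std(axis=0) + 1e-8   # small constant to avoid division by zero
X_train /= std
X_test /= std
```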
Loss function
We are going to measure how well the predicted class scores agree with the ground truth labels with a loss function (sometimes also referred to as the cost function or the objective).
Multiclass Support Vector Machine loss
As a first example we will develop a commonly used loss called the Multiclass Support Vector Machine (SVM) loss. The SVM loss is set up so that the SVM “wants” the correct class for each image to have a score higher than the incorrect classes by some fixed margin \(\Delta\).
The score function takes the pixels and computes the vector \(f(x_i,W)\) of class scores, which we will abbreviate to \(s\) (short for scores). For example, the score for the j-th class is the j-th element: \(s_j = f(x_i, W)_j\). The Multiclass SVM loss for the i-th example is then formalized as follows:

\(L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)\)
Example. Suppose that we have three classes that receive the scores \(s=[13,-7,11]\), and that the first class is the true class (i.e. \(y_i=0\)). Also assume that \(\Delta = 10\). The expression above sums over all incorrect classes \(j \neq y_i\), so we get two terms:

\(L_i = \max(0, -7 - 13 + 10) + \max(0, 11 - 13 + 10) = 0 + 8 = 8\)

The first term is zero because the correct class score (13) exceeds the second class score (-7) by more than the margin of 10. The second term contributes 8 because the correct class score exceeds the third class score (11) by only 2, which falls short of the required margin by 8.
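The two terms can be checked numerically; below is a minimal, unvectorized sketch of the per-example SVM loss (the function name is just for illustration):

```python
import numpy as np

def svm_loss_i(scores, y_i, delta=1.0):
    # Sum hinge terms over all incorrect classes j != y_i.
    loss = 0.0
    for j in range(len(scores)):
        if j == y_i:
            continue
        loss += max(0.0, scores[j] - scores[y_i] + delta)
    return loss

s = np.array([13.0, -7.0, 11.0])
print(svm_loss_i(s, y_i=0, delta=10.0))  # max(0, -10) + max(0, 8) = 8.0
```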
In summary, the SVM loss function wants the score of the correct class \(y_i\) to be larger than the incorrect class scores by at least \(\Delta\) (delta).
We can also rewrite the loss function in this equivalent form:

\(L_i = \sum_{j \neq y_i} \max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta)\)

where \(w_j\) is the j-th row of \(W\) reshaped as a column.
A last piece of terminology we’ll mention before we finish with this section is that the threshold at zero \(\max(0, -)\) function is often called the hinge loss. You’ll sometimes hear about people instead using the squared hinge loss SVM (or L2-SVM), which uses the form \(\max(0, -)^2\) that penalizes violated margins more strongly (quadratically instead of linearly).
Regularization
There is a subtlety with the loss above: a weight matrix \(W\) that correctly classifies every example (i.e. achieves zero loss) is not necessarily unique. For instance, any scaled version \(\lambda W\) with \(\lambda > 1\) also gives zero loss, because the scaling uniformly stretches all score differences.
We wish to encode some preference for a certain set of weights \(W\) over others to remove this ambiguity. We can do so by extending the loss function with a regularization penalty \(R(W)\). The most common regularization penalty is the squared L2 norm that discourages large weights through an elementwise quadratic penalty over all parameters:

\(R(W) = \sum_k \sum_l W_{k,l}^2\)
Including the regularization penalty completes the full Multiclass Support Vector Machine loss, which is made up of two components: the data loss and the regularization loss:

\(L = \underbrace{\frac{1}{N} \sum_i L_i}_\text{data loss} + \underbrace{\lambda R(W)}_\text{regularization loss}\)
Or expanding this out in its full form:

\(L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta) \right] + \lambda \sum_k \sum_l W_{k,l}^2\)
Where \(N\) is the number of training examples. As you can see, we append the regularization penalty to the loss objective, weighted by a hyperparameter \(\lambda\). There is no simple way of setting this hyperparameter and it is usually determined by cross-validation.
The most appealing property is that penalizing large weights tends to improve generalization, because it means that no input dimension can have a very large influence on the scores all by itself. (See the full lecture notes for an explanation of why this behavior arises.)
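Putting the pieces together, here is an illustrative vectorized sketch of the full SVM loss over a batch of examples; the argument names (e.g. `reg` for \(\lambda\)) are choices of this sketch rather than a fixed API:

```python
import numpy as np

def svm_full_loss(W, X, y, delta=1.0, reg=1e-3):
    """Average hinge loss over N examples plus L2 regularization reg * sum(W^2).

    W: array of shape (K, D), one row per class (bias trick assumed already applied)
    X: array of shape (N, D), one example per row
    y: array of shape (N,) holding the correct class index for each example
    """
    N = X.shape[0]
    scores = X.dot(W.T)                                # shape (N, K)
    correct = scores[np.arange(N), y][:, None]         # correct-class scores, shape (N, 1)
    margins = np.maximum(0, scores - correct + delta)  # hinge terms for every class
    margins[np.arange(N), y] = 0                       # do not count the j == y_i term
    data_loss = margins.sum() / N
    reg_loss = reg * np.sum(W * W)
    return data_loss + reg_loss
```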
Softmax classifier
The other popular choice is the Softmax classifier, which has a different loss function. Unlike the SVM which treats the outputs \(f(x_i,W)\) as (uncalibrated and possibly difficult to interpret) scores for each class, the Softmax classifier gives a slightly more intuitive output (normalized class probabilities) and also has a probabilistic interpretation that we will describe shortly. We will replace the hinge loss with a cross-entropy loss that has the form:

\(L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)\) or equivalently \(L_i = -f_{y_i} + \log \sum_j e^{f_j}\)
where we are using the notation \(f_j\) to mean the j-th element of the vector of class scores \(f\). The function \(f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}\) is called the softmax function: It takes a vector of arbitrary real-valued scores (in \(z\) ) and squashes it to a vector of values between zero and one that sum to one.
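As a small sketch, the softmax function and the resulting per-example loss can be written in a few lines of numpy (the scores and the class index below are made up):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)           # shift for numerical stability (see below)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

f = np.array([3.2, 5.1, -1.7])  # hypothetical class scores for one example
probs = softmax(f)              # values in (0, 1) that sum to 1
y_i = 0                         # assumed correct class index
L_i = -np.log(probs[y_i])       # cross-entropy loss for this example
```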
Information theory view. The cross-entropy between a “true” distribution \(p\) and an estimated distribution \(q\) is defined as:

\(H(p, q) = -\sum_x p(x) \log q(x)\)
The Softmax classifier is hence minimizing the cross-entropy between the estimated class probabilities ( \(q = e^{f_{y_i}} / \sum_j e^{f_j}\) as seen above) and the “true” distribution, which in this interpretation is the distribution where all probability mass is on the correct class (i.e. \(p=[0,\ldots 1,\ldots ,0]\) contains a single 1 at the \(y_i\) -th position.). Moreover, since the cross-entropy can be written in terms of entropy and the Kullback-Leibler divergence as \(H(p,q) = H(p) + D_{KL}(p||q)\) , and the entropy of the delta function \(p\) is zero, this is also equivalent to minimizing the KL divergence between the two distributions (a measure of distance). In other words, the cross-entropy objective wants the predicted distribution to have all of its mass on the correct answer.
Probabilistic interpretation. Looking at the expression, we see that

\(P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\)
can be interpreted as the (normalized) probability assigned to the correct label \(y_i\) given the image \(x_i\) and parameterized by \(W\). To see this, remember that the Softmax classifier interprets the scores inside the output vector \(f\) as the unnormalized log probabilities. Exponentiating these quantities therefore gives the (unnormalized) probabilities, and the division performs the normalization so that the probabilities sum to one. In the probabilistic interpretation, we are therefore minimizing the negative log likelihood of the correct class, which can be interpreted as performing Maximum Likelihood Estimation (MLE). A nice feature of this view is that we can now also interpret the regularization term \(R(W)\) in the full loss function as coming from a Gaussian prior over the weight matrix \(W\), where instead of MLE we are performing the Maximum a posteriori (MAP) estimation.
Practical issues: Numeric stability. When you’re writing code for computing the Softmax function in practice, the intermediate terms \(e^{f_{y_i}}\) and \(\sum_j e^{f_j}\) may be very large due to the exponentials, and dividing large numbers can be numerically unstable. But if we multiply the top and bottom of the fraction by a constant \(C\) and push it into the sum, we get the following (mathematically equivalent) expression:

\(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{C e^{f_{y_i}}}{C \sum_j e^{f_j}} = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}\)
We are free to choose the value of \(C\). This will not change any of the results, but we can use this value to improve the numerical stability of the computation. A common choice for \(C\) is to set \(\log C = -\max_j f_j\). This simply states that we should shift the values inside the vector \(f\) so that the highest value is zero.
```python
import numpy as np

f = np.array([123, 456, 789]) # example with 3 classes and each having large scores
p = np.exp(f) / np.sum(np.exp(f)) # Bad: Numeric problem, potential blowup

# instead: first shift the values of f so that the highest number is 0:
f -= np.max(f) # f becomes [-666, -333, 0]
p = np.exp(f) / np.sum(np.exp(f)) # safe to do, gives the correct answer
```
Possibly confusing naming conventions. To be precise, the SVM classifier uses the hinge loss, or also sometimes called the max-margin loss. The Softmax classifier uses the cross-entropy loss.
Softmax classifier provides “probabilities” for each class
The reason we put the word “probabilities” in quotes is that how peaky or diffuse these probabilities are depends directly on the regularization strength \(\lambda\) - which you are in charge of as input to the system. For example, suppose that the unnormalized log-probabilities for some three classes come out to be [1, -2, 0]. The softmax function would then compute:

\([1, -2, 0] \rightarrow [e^1, e^{-2}, e^0] = [2.71, 0.14, 1] \rightarrow [0.7, 0.04, 0.26]\)
Where the steps taken are to exponentiate and normalize to sum to one. Now, if the regularization strength \(\lambda\) were higher, the weights \(W\) would be penalized more and this would lead to smaller weights. For example, suppose that the weights became half as large ([0.5, -1, 0]). The softmax would now compute:

\([0.5, -1, 0] \rightarrow [e^{0.5}, e^{-1}, e^0] = [1.65, 0.37, 1] \rightarrow [0.55, 0.12, 0.33]\)
where the probabilities are now more diffuse. Moreover, in the limit where the weights go towards tiny numbers due to very strong regularization strength \(\lambda\), the output probabilities would be near uniform.
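To make the effect concrete, this short sketch recomputes the softmax for the two score vectors used above:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

print(softmax(np.array([1.0, -2.0, 0.0])))  # approx. [0.70, 0.04, 0.26] - peakier
print(softmax(np.array([0.5, -1.0, 0.0])))  # approx. [0.55, 0.12, 0.33] - more diffuse
```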