Support Vector Machines (SVMs)

Overview

Support vector machines (SVMs) are a type of supervised machine learning model used for classification tasks. SVMs involve quite a bit more math than the previous methods used in this project, and it is helpful to understand the basic math behind the model to get a better sense of what is going on behind the scenes. SVMs are linear separators, meaning they work by finding a line or hyperplane that separates the classes. The optimal hyperplane chosen by the model is the one that maximizes the margin, that is, the line that makes the distance between the classes the greatest. For example, look at Figure 1. The data in this figure is linearly separable, and the many lines drawn on the image show how many different possible hyperplanes exist. In fact, infinitely many exist, and the best one is the one that maximizes the margin. Figure 2 shows the hyperplane that maximizes the margin. In this image, w represents the weight vector and x is a data point. In some cases a bias term (b) is also added, so the decision boundary equation becomes wTx + b = 0. For two-class SVMs, the labels are represented as -1 and 1 (negative and positive), so the constraints for both margin hyperplanes can be written together as yi(wTxi + b) – 1 >= 0.

Figure 1: Linearly separable data with many possible hyperplanes
Figure 2: The best hyperplane that maximizes the margin
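
As a rough illustration of these ideas (not the code used for this project), the sketch below fits a linear SVM with scikit-learn on a small made-up 2D dataset and reads off the learned w and b that define the hyperplane wTx + b = 0. The toy points and the very large cost value are assumptions chosen only to approximate the hard-margin case.

```python
import numpy as np
from sklearn.svm import SVC

# Small made-up, linearly separable 2D dataset (illustration only)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.5],   # positive class
              [6.0, 1.0], [7.0, 2.0], [8.0, 1.5]])  # negative class
y = np.array([1, 1, 1, -1, -1, -1])

# A very large cost value approximates a hard-margin SVM
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # weight vector, perpendicular to the hyperplane
b = clf.intercept_[0]   # bias term
print("w =", w, "b =", b)
# Points with w.T @ x + b == 0 lie exactly on the decision boundary
```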

Let's look at some more figures to see how SVMs classify new data. Working through the math, the weight vector w is perpendicular to the separating hyperplane (in 2D, a boundary w1x1 + w2x2 + b = 0 has slope -w1/w2, and w points in the perpendicular direction). Suppose we have a new vector v and want to determine whether it belongs to the positive or negative class. We can do this by projecting v onto w. Figures 3 & 4 show that the sign of the dot product between w and v (plus the bias term) determines which class v falls into.

Figure 3: Example of predicting what class vector v is in
Figure 4: Figure showing that the sign of the dot product of the weight vector and the data vector determines the class
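
To make the projection idea concrete, here is a minimal sketch with made-up values for w and b (standing in for a trained SVM), showing that a new vector is classified by the sign of its dot product with w plus the bias:

```python
import numpy as np

# Made-up weight vector and bias standing in for a trained SVM
w = np.array([0.8, -0.5])
b = -0.2

def predict(v):
    """Classify v by the sign of w . v + b (i.e., the projection of v onto w)."""
    return 1 if np.dot(w, v) + b >= 0 else -1

print(predict(np.array([3.0, 1.0])))   # 0.8*3 - 0.5*1 - 0.2 = 1.7  -> positive class
print(predict(np.array([-1.0, 2.0])))  # -0.8 - 1.0 - 0.2 = -2.0    -> negative class
```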

Working through the math and proofs, the margin comes out to be margin = [(x+ – x-) dot w]/||w||, and its width ends up being 2/||w||. Figure 5 shows the primal and dual equations, which are used to find the maximum margin. For mathematical convenience, the goal is to minimize ½||w||^2 subject to the constraint yi(wTxi + b) – 1 >= 0 (the equations for the positive/negative margin hyperplanes); this is the primal problem. Lagrange multipliers are the standard tool for optimizing a function under constraints. After taking derivatives, setting them equal to zero, and substituting back, we end up with the dual equation shown in Figure 5. The dual equation is maximized to find the optimal margin, and the alphas in the equation are the Lagrange multipliers. It is important to note that the optimization depends only on the dot products of pairs of samples, which will be helpful moving forward.

Figure 5: Equations to get the maximum margin
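
Since Figure 5 only shows the equations, a small sketch may help connect them to code. Reusing the same toy data as the earlier sketch (an assumption for illustration), the margin width 2/||w|| can be read off a fitted linear SVM's weight vector:

```python
import numpy as np
from sklearn.svm import SVC

# Same toy data as the earlier sketch (illustration only)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.5],
              [6.0, 1.0], [7.0, 2.0], [8.0, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w = clf.coef_[0]

# The optimization minimizes (1/2)||w||^2, which maximizes the margin width 2/||w||
print("margin width =", 2.0 / np.linalg.norm(w))

# The support vectors are the points sitting on the margin boundaries
print("support vectors:\n", clf.support_vectors_)
```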

What happens when the data is not linearly separable? Take the XOR problem, shown in Figure 6, for example. As one can see, there is no way to draw a line that separates the two classes (the red and blue points represent different classes). This is where dot products and kernels come into play. Figure 7 shows the definition of a kernel, which is a function that returns the dot product of two vectors after they have been mapped into a higher-dimensional space. In other words, a kernel lets us compute the dot product of the transformed points without having to transform them ourselves. Figure 8 shows the data from Figure 6 after being transformed by a polynomial kernel with r = 0 and d = 2. Common examples of kernels are the polynomial and radial kernels, shown in Figures 9 & 10 respectively.

Figure 6: XOR problem, data is not linearly separable
Figure 7: Definition of a kernel function
Figure 8: XOR problem, now linearly separable after applying a kernel function to the data and transforming into 3D space
Figure 9: Polynomial kernel where r is the coefficient of the polynomial and d is the dimension
Figure 10: Radial kernel, where a and b are the data vectors and gamma is 1/(2σ^2)
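
A quick sketch (again, not part of the original analysis) shows the XOR problem in scikit-learn: a linear kernel cannot classify all four points, while a degree-2 polynomial kernel can, thanks to the implicit mapping into a higher-dimensional space. The cost value here is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# XOR pattern: the two classes cannot be separated by a straight line in 2D
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

linear_svm = SVC(kernel="linear").fit(X, y)
poly_svm = SVC(kernel="poly", degree=2, coef0=0, C=10).fit(X, y)

print("linear kernel:", linear_svm.predict(X))  # cannot get all four points right
print("poly kernel:  ", poly_svm.predict(X))    # separable after the implicit mapping
```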

To get a better understanding of how kernels work, let's look at an example of casting 2D points into 6D space using the polynomial kernel with r = 1 and d = 2. Figure 11 shows that expanding this kernel gives exactly the dot product of the two points after they are transformed into 6D space. Figure 12 shows that plugging the original points into the kernel equation leads to the same answer as the dot product between the transformed points. This is extremely useful: kernel functions let us find the dot product of the transformed points without ever constructing the transformation. This matters for SVMs because kernels are often needed, and it ties back to the fact that the dual equation depends only on dot products between pairs of samples.

Figure 11: Showing that the kernel is equal to the dot product of the transformed points
Figure 12: Proof that the kernel equation gives the same answer as the dot product between two points
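
The same check can be done numerically. The sketch below assumes the standard 6D feature map for the polynomial kernel with r = 1 and d = 2, and verifies on two made-up points that the kernel applied to the original 2D vectors matches the dot product of the explicitly transformed vectors:

```python
import numpy as np

def poly_kernel(x, y, r=1, d=2):
    """Polynomial kernel K(x, y) = (x . y + r)^d applied to the original points."""
    return (np.dot(x, y) + r) ** d

def phi(x):
    """Explicit 6D feature map matching the polynomial kernel with r = 1, d = 2."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

a = np.array([2.0, 3.0])
b = np.array([1.0, 1.0])

print(poly_kernel(a, b))   # (2*1 + 3*1 + 1)^2 = 36, no transformation needed
print(phi(a) @ phi(b))     # 36 again, computed the long way in 6D space
```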

What happens if the data you are working with has more than 2 classes? Unfortunately, multiple SVMs have to be created to handle this problem. Two common strategies for multi-class problems with SVMs are one vs one and one vs rest. Say you have classes 'A', 'B', and 'C'. For one vs one, 3 SVMs have to be created: one for A vs B, one for A vs C, and one for B vs C. In this scheme, the class that receives the most votes across the pairwise classifiers is the predicted class. For one vs rest, 3 different SVMs also have to be built: one for A vs not A, one for B vs not B, and one for C vs not C. Even though it is more work, there are options if the data contains more than 2 classes.
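
As a hedged sketch of the two strategies, using scikit-learn's wrappers on a synthetic 3-class dataset (neither of which comes from this project):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Small synthetic 3-class dataset (illustration only)
X, y = make_classification(n_samples=150, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

# One vs one: one SVM per pair of classes -> 3 classifiers for 3 classes
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

# One vs rest: one SVM per class vs everything else -> also 3 classifiers here
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

print(len(ovo.estimators_), "one-vs-one SVMs")   # 3 * (3 - 1) / 2 = 3
print(len(ovr.estimators_), "one-vs-rest SVMs")  # one per class = 3
```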

Data Prep

Since this is a supervised learning method, the data must have a label. When using SVMs, the goal is to see if visitor numbers from 2016-2022 can be used to predict the number of activities a given park offers. The dataset before cleaning is shown in Figure 13, with the full dataset being found here. Since the number of activities in each park wasn't an existing feature, it had to be created. The distribution of the number of activities offered in the parks is shown in Figure 14. The response must be a categorical label, so after looking at the histogram and summary statistics (the mean and median were both 15), any row with 15 or fewer activities offered was labeled 'Normal', and anything with more than 15 activities was labeled 'High'. This is now a two-class problem since there are only two potential classes a row can be assigned to.
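
A minimal sketch of how that label might be created, assuming a hypothetical column name num_activities and made-up counts (the real dataset's column name and values differ):

```python
import pandas as pd

# Hypothetical column name and made-up counts; the real dataset differs
parks = pd.DataFrame({"num_activities": [8, 15, 22, 30, 12]})

# 15 or fewer activities -> 'Normal', more than 15 -> 'High'
parks["activity_label"] = parks["num_activities"].apply(
    lambda n: "Normal" if n <= 15 else "High")
print(parks)
```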

One thing about SVMs is that the features have to be numeric. This is because, as seen in the overview, SVMs are built on math: there is no way to project non-numeric values into higher-dimensional space or to calculate dot products between vectors that don't contain numbers. Even if categories are encoded as numbers, those numbers carry no meaningful magnitude. Because of this, all of the qualitative features in the dataset from Figure 13 had to be removed. After cleaning, the final dataset is shown in Figure 15, with the full dataset being found here.
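
One simple way to drop the qualitative columns, sketched here with made-up column names and values rather than the project's actual data:

```python
import pandas as pd

# Made-up rows mixing qualitative and numeric columns (not the project's data)
parks = pd.DataFrame({"park_name": ["Park A", "Park B"],
                      "state": ["ME", "UT"],
                      "visitors_2016": [3300000, 4300000],
                      "visitors_2017": [3500000, 4500000]})

# Keep only numeric columns, since SVM features must be numeric
numeric_parks = parks.select_dtypes(include="number")
print(numeric_parks.columns.tolist())  # ['visitors_2016', 'visitors_2017']
```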

Since this is a supervised learning method, the data was split into training and testing sets, with 75% of the data going to the training set. These two sets are completely disjoint, which is important when doing supervised learning: separating the data this way helps avoid overfitting and gives a better picture of how well the model generalizes. To truly know how good a model is, it is best to test it on data it has never seen before. To randomly split the data, 75% of the rows were randomly sampled for the training set, and the remaining rows formed the testing set. Before building the SVMs, I normalized all of the data using min-max normalization so every feature is on the same scale. An example of what the training data looks like after normalization is shown in Figure 16. The counts of each label in the training set are shown in Figure 17, and as we can see they are fairly balanced, which is good.
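
A sketch of the split and min-max scaling using scikit-learn, with made-up stand-in data; the scaler here is fit on the training set and then applied to the test set, though the original analysis may have scaled the full dataset instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up stand-in for the visitor-count features and Normal/High labels
X = np.array([[1.2e6, 1.3e6], [4.5e6, 4.4e6], [0.8e6, 0.9e6], [3.1e6, 3.3e6],
              [2.0e6, 2.1e6], [5.2e6, 5.0e6], [0.5e6, 0.6e6], [2.8e6, 2.9e6]])
y = np.array(["Normal", "High", "Normal", "High",
              "Normal", "High", "Normal", "High"])

# 75% of rows are randomly sampled for training; the rest form the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42)

# Min-max normalization puts every feature on the same 0-1 scale
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # apply the training min/max to the test set
```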

Figure 13: Snippet of the full dataset before cleaning
Figure 14: Distribution of the number of activities offered in the parks
Figure 15: Snippet of the full dataset after cleaning
Figure 16: Snippet of training dataset after normalizing
Figure 17: Counts of each label in the training dataset

Code

All code used to build the SVMs can be found here: https://drive.google.com/drive/folders/1dxHQ4laAB2xWGb3yssrX6BoxhGTuZbbe?usp=drive_link. All work was done in Python.

Results

SVMs were built using three different kernels: linear, polynomial, and radial. For each kernel, several cost values were tried to see how they impacted the accuracy of the model. First, let's look at the linear SVMs. Three cost values were tried: 1, 5, and 10. As the cost value increased, the accuracy decreased. The accuracy for cost = 1 was 60%, the highest of the three linear SVMs. The confusion matrix for this model is shown in Figure 18. The model with cost = 5 had an accuracy of 57%, and the model with cost = 10 had a very low accuracy of about 39%.

Figure 18: Confusion matrix for linear SVM with cost=1
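
For reference, a sweep over kernels and cost values might look roughly like the sketch below; the project's actual code is in the linked Drive folder. It assumes the X_train_scaled, X_test_scaled, y_train, and y_test variables from the data prep sketch.

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

# Sweep kernels and cost values; assumes the scaled train/test split from the
# data prep sketch above
for kernel in ["linear", "poly", "rbf"]:
    for cost in [1, 5, 10]:
        model = SVC(kernel=kernel, C=cost, degree=2)  # degree only affects 'poly'
        model.fit(X_train_scaled, y_train)
        preds = model.predict(X_test_scaled)
        print(kernel, cost, round(accuracy_score(y_test, preds), 2))
        print(confusion_matrix(y_test, preds, labels=["Normal", "High"]))
```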

Next, let's move on to the polynomial SVMs. For this kernel, six different models were built: three with degree = 2 and three with degree = 3. For the degree = 2 models, the accuracy was 68% when cost = 1, 68% when cost = 5, and 67% when cost = 10. Overall, with degree = 2, the cost value didn't seem to change the accuracy of the model much. Moving on to the degree = 3 models, the accuracy was 63% when cost = 1, 68% when cost = 5, and 67% when cost = 10. Overall, the degree = 2 models had more consistent accuracies than the degree = 3 models, with the highest being 68%, and the polynomial kernel produced higher accuracies than the linear SVM. The confusion matrix for the model with degree = 2 and cost = 5 is shown in Figure 19. Consider 'Normal' to be the positive class and 'High' to be the negative class. Compared to Figure 18, the polynomial SVM has a higher accuracy but a lot more false positives than the linear SVM, which might still make the model in Figure 18 the better choice of the two, depending on which errors matter most.

Figure 19: Confusion matrix for polynomial SVM with cost = 5 and degree = 2

Lastly, we will examine the radial kernel. Again, three different models were built, each with a different cost. When cost = 1 the accuracy was 67%, when cost = 5 it was 68%, and when cost = 10 it was about 69%. An accuracy of 69% is the highest out of all of the models built, so the radial kernel with cost = 10 is considered the best SVM. The confusion matrix for that model is shown below in Figure 20. This matrix also shows a good balance of false positives and false negatives, which is good.

Figure 20: Confusion matrix for best radial SVM

To visualize the radial SVM, the data was reduced to 2D using principal component analysis (PCA). After reducing the dimensionality, Figure 21 shows a scatterplot of the test data color coded by label. Figure 22 shows what the radial SVM with cost = 10 predicted for the test dataset (note that the colors for the labels are switched). Overall, the predictions match the true values fairly well, but the model ran into difficulty around the clump of points centered near (0, 0).

Figure 21: Scatterplot of test dataset after PCA
Figure 22: Predicted test values from the radial SVM on the reduced data
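
A rough sketch of how the PCA visualization could be produced, assuming the scaled test set and a fitted radial model from the earlier sketches:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Reduce the scaled test features to 2D for plotting (assumes X_test_scaled,
# y_test, and a fitted radial `model` from the earlier sketches)
pca = PCA(n_components=2)
X_test_2d = pca.fit_transform(X_test_scaled)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, labels, title in [(axes[0], y_test, "True labels"),
                          (axes[1], model.predict(X_test_scaled), "Radial SVM predictions")]:
    for label in ["Normal", "High"]:
        mask = labels == label
        ax.scatter(X_test_2d[mask, 0], X_test_2d[mask, 1], label=label)
    ax.set_title(title)
    ax.legend()
plt.show()
```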

Conclusions

Overall, SVMs produced decent models for predicting whether a park has a normal or high number of activities based on visitation numbers. The highest performing model had an accuracy of 69%, and most of the other models had accuracies in the range of 63-68%. This is not awful but could definitely be better. It is possible that the dataset used to create these SVMs was not linearly separable, even after applying kernels. Based on this analysis, it seems that visitation numbers can predict the number of park activities with decent accuracy, but other predictors might have worked better.

One thing about SVMs is that they are hard to interpret, so it is hard to know whether a certain year carried more weight in the prediction than another. A decision tree might have worked better for this problem. I also feel that the data I have for this project wasn't a great fit for SVMs, since SVMs can only take numeric features. Most of my data was qualitative, so a lot of it had to be eliminated. To improve the accuracy of the SVMs, I think more numeric data on national parks would have to be gathered.