Logistic Regression

Overview

Logistic regression is a supervised learning technique used for classification and prediction. Logistic regression models a binary response (a variable with only two labels, such as yes/no or 0/1). The model starts with the linear combination z = w1x1 + w2x2 + … + wnxn + b = wTx + b, where w is a vector of weights, x is the input vector, b is the bias, and z is the log odds of belonging to class 1. Figure 1 shows an example of a dataset whose points fall into classes 0 and 1. As the figure shows, a linear regression model is not a good fit for this data, since the response is binary rather than continuous. Logistic regression fixes this issue by using something called the Sigmoid function. Figure 2 shows what this function looks like, and it is easy to see that it is a better fit for a binary classification problem. The Sigmoid function is defined as S(z) = 1 / (1 + e^-z) = e^z / (1 + e^z). It arises from the log-odds relationship: if p is the probability of class 1, then log(p/(1-p)) = wTx + b, where log(p/(1-p)) is the log odds, also known as the logit function. Solving for p gives p = S(z) = e^(wTx + b) / (1 + e^(wTx + b)), which always produces a value between 0 and 1. If S(z) >= 0.5, the data point is classified as class 1, and if S(z) < 0.5 it is classified as class 0.

Figure 1: Image showing linear regression is not a good fit for this binary problem
Figure 2: Image of the Sigmoid function
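
As a quick sketch, the Sigmoid mapping and the 0.5 decision threshold described above can be written as follows; the weights, bias, and input here are made up for illustration, not values from this report:

```python
import numpy as np

def sigmoid(z):
    """Map the linear combination z = w.T @ x + b to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and input vector (illustrative only)
w = np.array([0.8, -0.4])
b = 0.1
x = np.array([1.5, 2.0])

z = w @ x + b          # z = wTx + b = 0.5 for these values
p = sigmoid(z)         # probability of class 1
label = 1 if p >= 0.5 else 0
```

Note that sigmoid(0) is exactly 0.5, so the threshold S(z) >= 0.5 is equivalent to checking whether z = wTx + b is nonnegative.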

To solve for the weights, a loss function has to be minimized. For example, in linear regression the betas are estimated by minimizing the mean squared error (MSE). Because the Sigmoid function makes the prediction equation nonlinear, logistic regression instead minimizes cross-entropy (or log loss), which measures the difference between the predicted probabilities and the true labels. The equation for cross-entropy can be seen in Figure 3. To solve this optimization problem, the gradient of the cross-entropy function must be found. Figure 4 shows the partial derivatives of the cross-entropy function with respect to w and b, where m is the number of rows in the dataset. To fit the model, initial values for w and b are set and then updated on each iteration by stepping opposite the gradients shown in Figure 4 (for example, after each iteration the new b equals the old b minus the learning rate times the average error, sum(ŷ – y)/m). After a set number of iterations, the final values for w and b are kept and the logistic regression model is complete.

Figure 3: Cross-entropy
Figure 4: Partial derivatives of the loss functions for w and b
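
The gradient-descent loop described above can be sketched as below. The toy data, learning rate, and iteration count are illustrative and are not the values used later in this report:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, n_iters=1000):
    """Gradient descent on the cross-entropy loss.

    Gradients (matching Figure 4):
        dL/dw = X.T @ (y_hat - y) / m
        dL/db = sum(y_hat - y) / m
    """
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iters):
        y_hat = sigmoid(X @ w + b)        # current predicted probabilities
        dw = X.T @ (y_hat - y) / m        # gradient with respect to w
        db = np.sum(y_hat - y) / m        # gradient with respect to b
        w -= lr * dw                      # step opposite the gradient
        b -= lr * db
    return w, b

# Tiny separable toy dataset (one feature), illustrative only
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w, b = fit_logistic(X, y)
preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
```

On this toy data the fitted decision boundary settles near x = 1.5, so the loop recovers the separation between the two classes.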

There are a few disadvantages to logistic regression. One is that it cannot predict continuous variables. Another is that, although logistic regression does not require a linear relationship between the predictors and the response, the independent variables do need to be linearly related to the log odds of the response, and it may be hard to tell whether this relationship exists.

Data Prep

In order to perform binary logistic regression, the data has to be numeric and there have to be exactly two possible classes. The dataset that was used for logistic regression can be seen in Figure 5, with the full dataset being found here. For simplicity, I decided to build a model that only had two predictors. To find good predictors, several scatterplots were made to see what kinds of relationships existed. The response in this model was whether a park is a National Park. To encode this, every row where Park Type was National Park was labeled 1, and every other row was labeled 0. After making scatterplots, the log of the number of visitors in 2022 appeared to have a relationship with whether a park is a National Park. I also decided to create a new column: for each row, I added up the total number of activities offered in that park. A scatterplot of the relationship between the number of activities offered and whether a park is a National Park is shown in Figure 6. There appears to be some sort of relationship, so a logistic model might be a good fit.

Figure 5: Snippet of data before cleaning
Figure 6: Relationship between number of activities and if a park is a National Park
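
The label encoding and activity count described above might look like the following in pandas; the column names and sample rows here are hypothetical, since the actual dataset's headers may differ:

```python
import pandas as pd

# Hypothetical stand-in for the parks dataset (column names assumed)
df = pd.DataFrame({
    "Park Name": ["Acadia", "Blue Ridge Parkway", "Zion"],
    "Park Type": ["National Park", "Parkway", "National Park"],
    "Activities": ["Hiking, Camping, Biking", "Hiking", "Hiking, Camping"],
})

# Binary label: 1 if the park is a National Park, 0 otherwise
df["Label"] = (df["Park Type"] == "National Park").astype(int)

# Total number of activities, assuming a comma-separated activity list
df["Num Activities"] = df["Activities"].str.split(",").str.len()
```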

After picking log(2022 visitors) and the number of activities as predictors, I created a new dataset that contained only those two columns along with the label. Since these features are on very different scales, they were normalized using min-max normalization. Before building the model, the data was split into training and testing sets, with 30% of the data held out for testing. The labels were also very unbalanced (only 20% were 1), so a random subset of 60 rows labeled 0 was chosen to make the classes more balanced. A snippet of the clean dataset used for modeling is seen in Figure 7.

Figure 7: Snippet of clean dataset
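
The min-max normalization, class balancing, and train/test split described above could be sketched as below; the synthetic features and 20%/80% label mix stand in for the real parks data:

```python
import numpy as np

rng = np.random.default_rng(0)

def min_max(col):
    """Scale a column to the [0, 1] range."""
    return (col - col.min()) / (col.max() - col.min())

# Synthetic stand-ins: log(2022 visitors) and number of activities
X = rng.uniform([8.0, 0.0], [17.0, 50.0], size=(100, 2))
y = np.concatenate([np.ones(20, dtype=int), np.zeros(80, dtype=int)])  # ~20% class 1

# Min-max normalize each feature column
X = np.column_stack([min_max(X[:, 0]), min_max(X[:, 1])])

# Balance the classes: keep all 1s plus a random subset of 60 rows labeled 0
ones = np.flatnonzero(y == 1)
zeros = rng.choice(np.flatnonzero(y == 0), size=60, replace=False)
keep = np.concatenate([ones, zeros])
X_bal, y_bal = X[keep], y[keep]

# Hold out 30% of the balanced data for testing
n_test = int(0.3 * len(keep))
perm = rng.permutation(len(keep))
X_test, y_test = X_bal[perm[:n_test]], y_bal[perm[:n_test]]
X_train, y_train = X_bal[perm[n_test:]], y_bal[perm[n_test:]]
```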

Code

All code to complete logistic regression can be found here: https://drive.google.com/drive/folders/1pfBI4DJfOBiKkT6L0h-SVbtkXX0Vb6I4?usp=drive_link. All data cleaning and model building for logistic regression was done in Python.

Results

To create this logistic regression model, the initial weight vector was set to [1, 1] and the bias (b) was set to 0. The weights and bias were updated over 350 iterations. After running through that many iterations, the final weight vector was [4.51930423, -2.19552123] and the bias was -0.22767938437614457. The fitted Sigmoid function can be found in Figure 8, where x1 is the number of activities and x2 is log(visitors in 2022). This function gives the probability that a data point is in class 1. If S(z) < 0.5, the point is classified as class 0, and if S(z) >= 0.5, as class 1.

Figure 8: Sigmoid function used for predictions
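
Using the fitted parameters reported above, a prediction for a new data point might look like the following; the input values are hypothetical normalized features, chosen only for illustration:

```python
import numpy as np

# Fitted parameters reported above
w = np.array([4.51930423, -2.19552123])
b = -0.22767938437614457

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x):
    """x = [normalized number of activities, normalized log(2022 visitors)]."""
    p = sigmoid(w @ x + b)
    return int(p >= 0.5), p

# Hypothetical normalized input: many activities, moderate visitor count
label, prob = predict(np.array([0.9, 0.4]))
```

For this input, z = 4.519·0.9 − 2.196·0.4 − 0.228 is positive, so the point falls on the class-1 side of the decision boundary.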

After creating this model, it was tested on the testing data. It had an accuracy of 84%, with the confusion matrix shown in Figure 9. Overall, it did a pretty good job and did not misclassify many rows. I thought it would be interesting to plot the testing data on a 3D plot, colored by predicted class. This plot is seen in Figure 10. The two axes on the bottom represent the number of activities and log(2022 visits), and the vertical axis is the response. This plot shows the same information as the confusion matrix (when the label = 1, three points were predicted as 0, which is seen in the bottom-left corner of the confusion matrix), but it is interesting to see it in 3D. Overall, this relationship appears to be represented well by the logistic regression model.

Figure 9: Confusion matrix
Figure 10: 3D plot of testing data colored by predicted class
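
The accuracy and confusion matrix could be computed with a sketch like this; the labels below are illustrative, not the actual test-set results:

```python
import numpy as np

def confusion_matrix(y_true, y_pred):
    """2x2 matrix: rows = true class (0, 1), columns = predicted class (0, 1)."""
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Illustrative true and predicted labels
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)
accuracy = np.trace(cm) / cm.sum()  # correct predictions lie on the diagonal
```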

Conclusions

Overall, this logistic regression model, which includes the number of activities and log(2022 visitors), is a good predictor of whether a park is a National Park. The model's accuracy of about 84% is a good indication that a park's number of activities and 2022 visitor count are related to whether it is a National Park. To dig deeper, it would be interesting to build a separate model for every year with visitor data to see if the accuracy is similar. Based on the plots in Figures 6 and 10, it appears that parks that offer more activities have a greater probability of being National Parks. Being able to predict whether something is a National Park based on these two features gives insight into the characteristics of National Parks: they tend to have more visitors and offer more activities than other types of parks.