Naive Bayes - Isabel Beaulieu

Overview

Naive Bayes is a supervised learning algorithm for classification that is based on Bayes’ Theorem. In Naive Bayes, all predictors are assumed to be independent of one another given the class. To get a better understanding of Naive Bayes, it helps to review Bayes’ Theorem. Figure 1 shows Bayes’ Theorem, which states that the probability of event A occurring given that event B has occurred equals the probability of event B given event A multiplied by the probability of event A, divided by the probability of event B. It is important to note that P(B) cannot equal 0 in this formula.
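For reference, the statement of Bayes’ Theorem described in words above can be written as follows (standard notation, with A and B as in the text):

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) \neq 0
```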

How does Bayes’ Theorem relate to a Naive Bayes classifier? Multinomial Naive Bayes classifiers find the probability of each class given a row of data. If the class is c_i and the row of data is x1…xn, then the probability of c_i given the data is shown in Figure 2. In machine learning, the denominator is typically dropped since it is the same for all classes, which gives the expression in Figure 3. The classifier computes this quantity for every class and predicts the class for which it is largest.

Figure 2: Probability of a class given a set of features

Figure 3: Naive Bayes classifier formula
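As a sketch of what the two captions above describe, using the same symbols as the text (class c_i, features x_1 … x_n) and the independence assumption, the standard forms are shown below. The index j and the hat notation for the predicted class are notational choices made here, not taken from the original figures.

```latex
% Probability of a class given the features (independence assumed)
P(c_i \mid x_1, \dots, x_n)
  = \frac{P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i)}{P(x_1, \dots, x_n)}

% Dropping the denominator, which is the same for every class
\hat{c} = \arg\max_{c_i} \; P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i)
```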

Let’s look at an example in Figure 4 to get a better understanding. This pretend dataset has three features and two potential classes. The goal of a Naive Bayes classifier is to predict the class for a given set of features. If one wants to know what class a 27-year-old unmarried female will be in, the steps are shown below. The classifier predicts the class with the highest probability. In this example, the probability was higher for the class ‘Yes’, so that is the predicted class. This example also shows what happens when a feature value never occurs with a class. For example, P(Age = 27 | C = No) = 0, since nobody is 27 in any of the rows where C = No. When this happens, the entire product for that class is 0, no matter what the other probabilities are. In practice, it isn’t uncommon for this to occur. One way to fix this problem is smoothing, which replaces these zero probabilities with a small nonzero value. There are different ways to do this; one is the Laplace method, which adds a small count to every feature value, with an example shown in Figure 5.

Figure 4: Naive Bayes classifier example
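The calculation described above can be sketched in a few lines of Python. This is only an illustration: the project itself used R, the rows below are made up and are not the data in Figure 4, and the function name `class_scores` and its `laplace` argument are hypothetical. With `laplace=0` an unseen feature value zeroes out a class; with `laplace=1` every class keeps a small nonzero score.

```python
from collections import Counter

# Hypothetical toy rows: (age, married, sex) -> class. NOT the data in Figure 4.
ROWS = [
    ((27, "No", "F"), "Yes"),
    ((27, "Yes", "M"), "Yes"),
    ((35, "No", "F"), "Yes"),
    ((35, "Yes", "M"), "No"),
    ((41, "No", "F"), "No"),
    ((41, "Yes", "M"), "No"),
]


def class_scores(rows, query, laplace=0.0):
    """Return P(c) * prod_j P(x_j | c) for every class c (denominator dropped)."""
    class_counts = Counter(label for _, label in rows)
    # Number of distinct values seen for each feature, used by the smoothing term.
    distinct = [len({feats[j] for feats, _ in rows}) for j in range(len(query))]
    scores = {}
    for label, n_c in class_counts.items():
        score = n_c / len(rows)  # prior P(c)
        for j, value in enumerate(query):
            matches = sum(
                1 for feats, lab in rows if lab == label and feats[j] == value
            )
            # Laplace smoothing: add `laplace` to the count of every feature value.
            score *= (matches + laplace) / (n_c + laplace * distinct[j])
        scores[label] = score
    return scores


query = (27, "No", "F")
unsmoothed = class_scores(ROWS, query)
smoothed = class_scores(ROWS, query, laplace=1.0)
print(unsmoothed)
print(smoothed)
print("Predicted class:", max(smoothed, key=smoothed.get))
```

In this toy data nobody in class ‘No’ is 27, so the unsmoothed score for ‘No’ is exactly 0, while the smoothed score is small but positive.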

For this project, standard multinomial Naive Bayes will be used. There are also other types of Naive Bayes classifiers, one being Bernoulli. In multinomial Naive Bayes, the features can take on more than two values. For example, in Figure 5 age can take on any integer value. These feature values more or less represent frequencies. In Bernoulli Naive Bayes, the model only looks for the presence or absence of a feature. That is, the feature columns can only take on values of 0 or 1 (not present or present).
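To make the difference concrete, here is a minimal Python sketch (illustrative only; the helper name `to_presence` and the example counts are hypothetical). The same row is shown as counts, the kind of input a multinomial model works with, and as 0/1 indicators, the kind of input a Bernoulli model works with.

```python
def to_presence(counts):
    """Turn frequency-style features into presence/absence indicators (0 or 1)."""
    return [1 if c > 0 else 0 for c in counts]


# Hypothetical feature counts for one row.
row_counts = [3, 0, 1]
print(row_counts)               # multinomial-style input
print(to_presence(row_counts))  # Bernoulli-style input
```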

Data Prep

Since the Decision Trees didn’t produce the best results, different data was used for Naive Bayes. Since R is being used, both numeric and factor variables can be included in the models. Figure 6 shows the dataset containing region, park type, visitor numbers, and activities in each park before being cleaned, with the entire dataset found here. Instead of predicting visitor numbers as in Decision Trees, Naive Bayes will be used to predict the number of activities in each park. To do this, the number of activities offered in each park was counted, and the distribution of these counts is plotted in Figure 7. A park was considered to have a ‘Low’ number of activities if it offered fewer than 8, an ‘Average’ number if it offered between 8 and 25, and a ‘High’ number if it offered more than 25. For Naive Bayes, it is commonly recommended that numeric columns be integers, so all of the columns from 2016-2022 in Figure 6 were converted to integers. The park name and state columns were also removed. A snippet of the cleaned dataset is shown in Figure 8, with the full dataset found here. The dataset in Figure 8 will be used to build two different Naive Bayes models. One will include all of the features to predict how many activities each park has, and the other will use only the activities, region, and park type.

Figure 6: Snippet of full dataset before cleaning (non acre data)

Figure 7: Distribution of number of activities in parks

Figure 8: Non acre dataset after cleaning
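The labeling rule described above (fewer than 8 is ‘Low’, 8 to 25 is ‘Average’, more than 25 is ‘High’) can be sketched as a small function. This is an illustrative Python version, not the R code used in the project; the function name `activity_level` is hypothetical, and treating 8 and 25 as inclusive bounds of ‘Average’ is inferred from the other two thresholds.

```python
def activity_level(n_activities):
    """Map a park's activity count to the 'Low' / 'Average' / 'High' label."""
    if n_activities < 8:
        return "Low"
    if n_activities <= 25:
        return "Average"
    return "High"


# Hypothetical activity counts for a few parks.
for count in (3, 8, 25, 40):
    print(count, activity_level(count))
```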

Park acreage data will also be used to create two Naive Bayes models. Figure 9 shows the acreage data before cleaning, with the full dataset found here. This dataset contains acreage data for each park along with park type, region, visitation numbers, and activities offered. As above, the label to be classified for this dataset is the number of activities for each park (low, average, or high), calculated the same way as for the previous dataset. All of the visitor numbers and acreage metrics were converted to integers. The park name and state columns were removed as well. A snippet of the clean dataset is found in Figure 10, with the full version found here. The dataset in Figure 10 will be used to create two different models, both predicting the number of activities. The first model will contain all of the features, and the second model will contain all of the activity columns, park type, region, NPS.Fee.Acres, Subtotal.Federal.Acres, Private.Acres, and Gross.Area.Acres.

A total of four different Naive Bayes models were built. For each model, the data was split into training and testing sets, with 75% of the data going to the training set. These two sets are completely disjoint, which is important in supervised learning. Holding out a separate testing set makes overfitting easier to detect and gives a better picture of how well the model generalizes. To truly know how good a model is, it is best to test it on data it has never seen before. To split the data, 75% of the rows were randomly sampled from the dataset to form the training set, and the remaining rows formed the testing set. Figure 11 shows an example of one of the training sets and Figure 12 shows some of the testing data for that model. The row indexes show the two sets are disjoint. Figure 13 shows the counts of each label in each set. Overall they are pretty balanced and shouldn’t cause any issues when modeling. Different training and testing sets were used in each of the four models, but the same process was followed. For all models created, the parameter ‘laplace = 1’ was passed to the model to avoid the issue of zero probability.

Figure 13: Counts of each label in each set
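The splitting procedure described above can be sketched as follows. This is an illustrative Python version rather than the project’s R code; the function name `split_indices`, the seed, and the example labels are all hypothetical. It samples 75% of the row indexes for training, puts the rest in testing, and counts the labels in each set.

```python
import random
from collections import Counter


def split_indices(n_rows, train_fraction=0.75, seed=0):
    """Randomly split row indexes into disjoint training and testing lists."""
    rng = random.Random(seed)
    n_train = int(n_rows * train_fraction)
    train_idx = sorted(rng.sample(range(n_rows), n_train))
    train_set = set(train_idx)
    test_idx = [i for i in range(n_rows) if i not in train_set]
    return train_idx, test_idx


# Hypothetical labels for 12 parks.
labels = ["Low", "Average", "High"] * 4
train_idx, test_idx = split_indices(len(labels))

# The two sets share no rows.
assert not set(train_idx) & set(test_idx)

print("Training counts:", Counter(labels[i] for i in train_idx))
print("Testing counts: ", Counter(labels[i] for i in test_idx))
```

Fixing the seed makes the split reproducible; changing it gives a different random split, as was done for each of the four models.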

Code

The code used to clean the data and build the Naive Bayes models is located here: https://drive.google.com/drive/folders/1g2q17_zLHc-dSm3C7nIUxv870mIjB28o?usp=drive_link. R was used for both cleaning and model building.

Results

First, let’s look at the models built using the dataset in Figure 8. The first model contained region, park type, and all of the activities as predictors. The confusion matrix for this model is shown in Figure 14, and the model had an accuracy of 88%, which is great! Figure 15 shows a box plot of the probabilities the model predicted for each class when that class was chosen. For example, when ‘Average’ was selected by the model, the mean probability for that class was about 30%. It will be interesting to compare this plot to that of a model that didn’t perform as well.

Figure 14: Confusion matrix for model using region, park type, and activities as features

Figure 15: Box plot of probabilities the model predicted for when it predicted each class
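As a reminder of how a confusion matrix and an accuracy figure relate, here is a minimal Python sketch. The labels below are made up for illustration and are not the project’s predictions; the function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of labels occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of rows where the predicted label matches the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical labels, for illustration only.
actual = ["Low", "Low", "Average", "Average", "High", "High", "Average", "Low"]
predicted = ["Low", "Low", "Average", "Low", "High", "High", "Average", "Average"]

matrix = confusion_matrix(actual, predicted)
for (a, p), n in sorted(matrix.items()):
    print(f"actual={a:8s} predicted={p:8s} count={n}")
print("Accuracy:", accuracy(actual, predicted))
```

Accuracy is the sum of the diagonal of the confusion matrix (actual equals predicted) divided by the total number of rows.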

The next model built using the same dataset contained all of the same predictors as above, but also the visitor numbers for the years 2016-2022 as predictors. The accuracy for this model was 68% and the confusion matrix can be seen in Figure 16. Now we will move on to building models using the data from Figure 10. The first model built using this data contained all acre metrics, region, park type, all visitation metrics, and all activities as predictors. The accuracy for this model was about 53% and the confusion matrix is shown in Figure 17.

Figure 16: Confusion matrix for model using region, park type, activities, and visitor numbers as features

Figure 17: Confusion matrix using all of the acre data

The last model also used some of the acre data. Instead of using all of the predictors from the previous model, this one contained NPS.Fee.Acres, Subtotal.Federal.Acres, Private.Acres, Gross.Area.Acres, region, park type, and all of the activities as features. Only some of the acre metrics were picked because some of the columns are dependent on one another (for example, Gross.Area.Acres is the sum of all of the previous acre columns). The accuracy for this model was about 58%, with the confusion matrix shown in Figure 18. The probabilities for each class selected by the model are shown in Figure 19. This plot is interesting because, as the matrix shows, a lot of ‘Average’ labels were misclassified as ‘Low’. In Figure 19, the probability assigned to ‘Low’ was sometimes high. This could mean that the model was overly confident that a park was ‘Low’ when it was in fact ‘Average’. By comparison, the model shown in Figures 14 and 15 had a much higher accuracy, and those figures give no indication of the model being overly confident in wrong predictions.

Figure 18: Confusion matrix for simplified acre model

Figure 19: Box plot of probabilities the model predicted for when it predicted each class
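The probabilities summarized in these box plots are the model’s posterior probabilities for the class it picked. As a sketch of where such a number comes from, the unnormalized class scores can be divided by their sum. This is illustrative Python, not the project’s R code; the scores and function names below are hypothetical.

```python
def posterior(scores):
    """Normalize unnormalized class scores so they sum to 1."""
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


def predicted_with_confidence(scores):
    """Return the predicted class and the probability assigned to it."""
    probs = posterior(scores)
    label = max(probs, key=probs.get)
    return label, probs[label]


# Hypothetical unnormalized scores for one park.
scores = {"Low": 2.0, "Average": 1.0, "High": 1.0}
print(predicted_with_confidence(scores))
```

A high probability attached to a wrong prediction is what the text above calls being overly confident.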

Looking at the best model, shown in Figure 14, let’s see if there are any patterns in the predicted classes. Figure 20 shows bar plots of the count of park types associated with each predicted class for the best model. This is interesting as it gives some insight into the types of parks that are associated with having a ‘Low’, ‘Average’, or ‘High’ number of activities. Most national parks have a lot of activities compared to the rest, and national historic parks and sites tend to have far fewer than average. To see how this relates to visitor numbers, see Figure 21. For the parks in each of the predicted categories, the mean number of visitors in 2022 was calculated. It appears that having a low number of activities is associated with a higher number of visitors. This was an unexpected result. For parks that fall into other park types, though, having more activities is associated with more visitors.

Figure 20: Bar graphs showing the counts of park types that were associated with each predicted class for the best model

Figure 21: Mean 2022 visitor numbers for parks classified as each label
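The summary behind a plot like Figure 21 is a grouped mean: visitor numbers are grouped by predicted class and averaged. A minimal Python sketch, with made-up numbers and a hypothetical function name, looks like this.

```python
from collections import defaultdict


def mean_by_class(predicted, visitors):
    """Average the visitor numbers of the parks assigned to each predicted class."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, value in zip(predicted, visitors):
        totals[label] += value
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


# Hypothetical predicted classes and 2022 visitor numbers, for illustration only.
predicted = ["Low", "Low", "High"]
visitors_2022 = [100, 300, 50]
print(mean_by_class(predicted, visitors_2022))
```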

Conclusions

Naive Bayes produced better classification models than Decision Trees, but that was most likely due to the different response variable used. Separating the parks into those offering a ‘Low’, ‘Average’, or ‘High’ number of activities produced pretty balanced data. The best performing model had an accuracy of 88%, and its features were park type, region, and whether or not the park offered each activity. Based on the accuracy, it is safe to say that these features do a good job of predicting the number of activities.

Looking deeper behind the scenes of the best model in Figures 20 & 21, we can see a pattern among parks that offer different numbers of activities. For example, parks in the national historic park or national memorial category offered the fewest activities, yet in 2022 these parks saw the most visitors on average. There is more to be explored here. Is it because they are more accessible, in terms of what they offer or how easy they are to get to? This was an unexpected result, and it tells us a lot about whether activities are the main driver of visitation numbers. The next step would be to look at which activities are most popular in these types of parks and compare them to those of the other park types. Association rule mining (ARM) might be helpful in this case. Overall, Naive Bayes worked well and allowed a lot to be learned regarding activities and visitor numbers.