Decision Trees

Overview

A decision tree is a supervised learning method that builds a classifier in the form of a directed tree structure. To be considered a tree, it must include a root node (zero incoming edges), internal nodes (exactly one incoming edge and two or more outgoing edges), and leaf/terminal nodes (one incoming edge and no outgoing edges). Each leaf node is assigned a class label. An example is shown in Figure 1, which is easy to follow and makes it clear how the model arrives at a decision. That tree shows, for a specific dataset, how to determine whether someone has a high or low risk of heart disease. In addition to being interpretable, decision trees are inexpensive to construct, fast at classifying new data, and robust to missing values. Since this is a supervised learning method, labels for the data are required when building the model.

One complication with decision trees is that there can be an effectively infinite number of trees for a given dataset. When does this occur? Consider Figure 2. This dataset has three predictors and a binary class label. One of the predictors, taxable income, is continuous, so there are infinitely many ways to split on it. Since we cannot enumerate an infinite number of trees, the next section describes how decision trees are actually constructed.

Figure 1: Example decision tree for determining risk of heart disease
Figure 2: Example data

How does a human or computer choose the root node, or decide how an internal node should be split? We are always looking for the 'optimal split' at each node, and split quality can be assessed with three related quantities: entropy, gini impurity, and information gain. First, let's define entropy.

Entropy measures the uncertainty within a group of observations and is one criterion a decision tree can use to decide how to split the data. Consider a dataset that has 'C' classes. The formula for entropy is shown in Figure 3, where p_i is the probability of class i. For two classes (using log base 2), entropy ranges from 0 to 1; more generally, its maximum is log2(C). An entropy of 0 indicates that all of the data belongs to the same class, and the maximum value indicates the data is spread evenly across all of the possible classes. The 'best split' is the one whose resulting nodes have the lowest entropy, and the value of entropy should generally decrease as one moves down the tree.

Figure 3: Entropy formula for C classes
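To make the formula concrete, here is a minimal R sketch (the class counts are made up for illustration) that computes the entropy of a node from its class counts:

```r
# Entropy of a node from a vector of class counts.
# p_i = count_i / total; classes with zero count contribute nothing.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]              # avoid log2(0)
  -sum(p * log2(p))
}

entropy(c(5, 5))   # evenly split between two classes -> 1 (the binary maximum)
entropy(c(10, 0))  # pure node -> 0
```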

Next is gini, or gini impurity. Gini measures the probability of misclassifying a randomly chosen observation if it were labeled at random according to the class distribution in the node. Again consider a dataset with 'C' classes. The formula for gini is shown in Figure 4, again with p_i being the probability of class i. Unlike entropy, gini ranges from 0 to 0.5 for two classes (more generally, its maximum is 1 - 1/C). As with entropy, lower values of gini are favored over higher ones. The two measures give similar results in practice, but gini is computationally cheaper because it avoids the logarithm. Figure 5 shows the difference between gini and entropy.

Figure 4: Gini formula for C classes
Figure 5: Visual of gini vs. entropy
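A matching R sketch for gini (again with made-up counts) looks like this:

```r
# Gini impurity of a node from a vector of class counts.
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

gini(c(5, 5))   # evenly split between two classes -> 0.5 (the binary maximum)
gini(c(10, 0))  # pure node -> 0
```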

The last thing to cover is information gain, which measures the strength or quality of a given split in a decision tree. It compares the impurity of the parent node before the split to the weighted impurity of the child nodes after the split; the bigger the difference, the better the split. The formula for information gain is shown in Figure 6. In this equation, "I" represents the entropy or gini (or other impurity measure) of a node, N is the number of rows of data at the parent node, N(v_j) is the number of rows of data in child node v_j, and k is the number of child nodes produced by the split.

Figure 6: Information gain equation
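Using the entropy and gini helpers defined above, a small R function (a sketch, not taken from the project's code) implements this weighted comparison:

```r
# Information gain of a split: parent impurity minus the weighted average
# impurity of the child nodes. `impurity` can be gini or entropy (defined above).
info_gain <- function(parent_counts, child_counts_list, impurity = gini) {
  n <- sum(parent_counts)
  weighted_children <- sum(sapply(child_counts_list, function(cc) {
    (sum(cc) / n) * impurity(cc)
  }))
  impurity(parent_counts) - weighted_children
}
```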

That is a lot of new information, and it might be hard to see how the pieces work together to create and assess the quality of a decision tree. Let's look at a small example using gini and information gain. Figure 7 expands on the heart disease risk example. On the left is the dataset, which includes the count of rows for each label and the gini of the dataset. Say we are building a tree for this dataset and want the root node to split on either age or exercise, but aren't sure which to pick. The middle image shows one way to split the dataset by age, with new ginis calculated for each node and the information gain for that split. The image on the right shows the alternative, one way the dataset could be split on exercise (note that this could be split differently since exercise has three options, so this is just one example). Again the gini for each node and the information gain of the split are calculated. Since the split on exercise has a higher information gain than the one on age, that is the split to go with! This makes sense when you look at the data in each node, as the nodes from the exercise split are closer to being pure than those from the age split.

Figure 7: Example of gini and information gain
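As a quick illustration of the same mechanics (the counts below are hypothetical, not the actual numbers from Figure 7), the helper functions above can score two candidate splits:

```r
# Hypothetical class counts for a parent node and two candidate splits.
parent         <- c(High = 6, Low = 4)
split_age      <- list(c(High = 4, Low = 3), c(High = 2, Low = 1))
split_exercise <- list(c(High = 5, Low = 1), c(High = 1, Low = 3))

info_gain(parent, split_age)        # ~0.004: children look almost like the parent
info_gain(parent, split_exercise)   # ~0.16: much purer children, so this split wins
```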

Data Prep

For building decision trees, R was used because its tree implementations accept both qualitative (factor) and quantitative features directly, whereas the common Python implementations require categorical features to be encoded numerically first. For supervised learning, the response has to be a label. In this project, one goal is to get a better understanding of what impacts visitation numbers in the parks, so the visitor numbers for the year 2022 were converted into a factor variable. To start, it was important to look at the distribution of visitor numbers in 2022, shown in Figure 8. Looking at this histogram, it is clear the data is skewed. To adjust for this, the log of the visitor numbers was considered, shown in Figure 9. This distribution looks much more normal and is easier to convert to a factor variable. The values were binned into three levels: 'Low', 'Average', and 'High'. After looking at summary statistics in addition to the histogram in Figure 9, it was decided that if the log of visits was less than or equal to 9, it would be considered 'Low'; if it was greater than 9 and less than or equal to about 13.5, it would be considered 'Average'; and anything above 13.5 would be considered 'High'.

When making training and testing sets, 75% of the data was randomly placed in the training set. For the training set, the counts of rows in each label can be seen in Figure 10. There are mixed opinions on whether the label counts in supervised learning should be balanced. In this project, it is reasonable to proceed with unbalanced labels since information might be lost if the data were rebalanced. Also, this project isn't as high stakes as, say, a model for medical diagnosis, so unbalanced labels are acceptable. If the accuracy isn't as high as we want, though, this decision will be reconsidered.

Figure 8: Distribution of 2022 visitation numbers
Figure 9: Distribution of the log of 2022 visitation numbers
Figure 10: Counts of the number of rows for each label in the training set.
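A minimal sketch of this labeling step is below; `parks` and `visits_2022` are placeholder names for the actual data frame and column, not the project's exact code:

```r
# Log-transform the skewed 2022 visitation counts.
parks$log_visits <- log(parks$visits_2022)

# Bin the log visits into the three labels described above:
# (-Inf, 9] = Low, (9, 13.5] = Average, (13.5, Inf) = High.
parks$visit_level <- cut(
  parks$log_visits,
  breaks = c(-Inf, 9, 13.5, Inf),
  labels = c("Low", "Average", "High")
)

table(parks$visit_level)   # label counts, comparable to Figure 10
```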

To prepare the data correctly, some variables had to be removed. A snippet of the dataset before cleaning is shown in Figure 11, with the full dataset being found here. For one decision tree, the predictors were all of the activities, the region, and the park type; a snippet of this is shown in Figure 12, with the full dataset being found here. For the other decision tree, only the activities were included as predictors; a snippet of this is seen in Figure 13, with the full dataset being found here. For both of the datasets in Figures 12 & 13, all of the predictors had to be converted to factor variables. By building decision trees on these two different datasets, it should be easier to see whether park type and region, or activities alone, have a bigger impact on how many people visited in 2022.

Figure 11: Snippet of full dataset before cleaning
Figure 12: Snippet of clean dataset with region and park type
Figure 13: Snippet of clean dataset with only activities
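Converting the predictors to factors can be done in one pass; the column handling below is illustrative (it assumes the response column is named `visit_level`), not the project's exact code:

```r
# Treat every column except the response as a categorical predictor.
predictor_cols <- setdiff(names(parks), "visit_level")
parks[predictor_cols] <- lapply(parks[predictor_cols], factor)

str(parks)   # confirm all predictors are now factors
```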

As stated above, 75% of the data was used for the training set, and 25% was set aside for testing. These two sets are completely disjoint, which is important in supervised learning. Separating the data into two sets helps avoid overfitting and gives a better picture of how well the model generalizes; to truly know how good a model is, it is best to test it on data it has never seen before. To split the data, 75% of the rows were randomly sampled for the training set, and the remaining rows formed the testing set. Three different models were built below, each using different training and testing sets, but a snippet of one training/testing pair is shown in Figure 14. As one can see, the row indices are not the same, and the two sets are disjoint.

Figure 14: Snippet of the training set (left) and testing set (right) used
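A simple way to make such a split in R, kept as a sketch (the seed is arbitrary; the project's exact seed is unknown):

```r
set.seed(1)                                       # for reproducibility
n <- nrow(parks)
train_idx <- sample(seq_len(n), size = floor(0.75 * n))

train <- parks[train_idx, ]    # 75% of rows, randomly chosen
test  <- parks[-train_idx, ]   # the remaining 25%, disjoint from train
```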

Code

The code used to clean the data and create the decision trees is located here: https://drive.google.com/drive/folders/18hEMRwBNDoWP5GOYWWK3qsth5n81f8Le?usp=drive_link. R was used for both cleaning and model building.

Results

First, let's analyze the trees built using the dataset in Figure 12. The first tree fit was a basic decision tree using the 'class' method with no other arguments. The resulting tree is shown in Figure 15 and the variable importance is shown in Figure 16. When looking at the tree in Figure 15, branches on the left indicate 'yes' and branches on the right indicate 'no' (every tree built follows this layout). This tree shows that park type is the most important factor in determining 2022 visitor levels, followed by some activities; region appears to have no impact. The tree has a clear weakness, though: it never predicts 'Low', even though 3 parks in the testing dataset are labeled 'Low'. The confusion matrix is seen in Figure 17, and this tree has an accuracy of 76%.

Figure 15: Basic decision tree using park type, region, and activities
Figure 16: Variable importance for basic decision tree
Figure 17: Confusion matrix for basic decision tree
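The report's exact call isn't shown, but the 'class' method wording suggests the rpart package; a sketch of the fit and evaluation under that assumption (with placeholder variable names) is:

```r
library(rpart)
library(rpart.plot)

# Basic classification tree with rpart's default settings.
basic_tree <- rpart(visit_level ~ ., data = train, method = "class")

rpart.plot(basic_tree)           # tree diagram, as in Figure 15
basic_tree$variable.importance   # variable importance, as in Figure 16

# Evaluate on the held-out test set.
pred <- predict(basic_tree, newdata = test, type = "class")
table(Predicted = pred, Actual = test$visit_level)   # confusion matrix (Figure 17)
mean(pred == test$visit_level)                       # accuracy
```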

To try to build a better tree, several model arguments were altered, such as the complexity parameter (cp), the splitting criterion (parms), and the minimum split size (minsplit). The tree is shown in Figure 18. One can see that there are many more branches and internal nodes. It looks a bit too complex, but at first pass there were some predictions of 'Low', which is encouraging. The variable importance is in Figure 19 and is consistent with the basic model: park type is the most important. Lastly, the confusion matrix is in Figure 20, and the accuracy for this tree is around 69%, lower than the simpler one. This makes sense, as the cp value for this tree was .01, which is very low and probably led to overfitting (hence how crowded Figure 18 is). So even though this model did predict some rows as 'Low', those predictions were incorrect, and the simpler model appears to be better.

Figure 18: More complex decision tree using park type, region and activities
Figure 19: Variable importance for more complex tree
Figure 20: Confusion matrix for more complex tree
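A sketch of what this more heavily tuned fit might look like; aside from cp = 0.01, the specific values (and the choice of entropy as the splitting criterion) are assumptions for illustration:

```r
# Allow a bushier tree: small cp and minsplit, entropy-based splits.
complex_tree <- rpart(
  visit_level ~ ., data = train, method = "class",
  parms   = list(split = "information"),           # entropy instead of gini (assumed)
  control = rpart.control(cp = 0.01, minsplit = 5) # minsplit value is assumed
)

complex_tree$variable.importance   # compare with Figure 19
```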

Now we will move on to analyzing the dataset shown in Figure 13, which includes only the activities. This should help clarify which activities are associated with a park's 2022 visitation level. Using the same process as above, the first tree is a simple decision tree with no added arguments in the model. The tree for this model is shown in Figure 21, and the variable importance is shown in Figure 22. The confusion matrix is seen in Figure 23, and this model's accuracy is about 65%. Like the basic tree created above, this one also does not predict any 'Low' values, and its accuracy is lower than that of the tree including park type.

Figure 21: Basic decision tree using only activities
Figure 22: Variable importance for basic decision tree
Figure 23: Confusion matrix for basic tree

Lastly, let's build a more complex tree for this dataset. It was created similarly to the complex tree above, but with the cp value changed from .01 to .03. The results for this tree are seen in Figures 24-26. The accuracy is about 67%, and it also does not predict 'Low' for any park; still, it performs a little better than the model in Figure 21. Since park type had high importance in the first two decision trees, a final decision tree was built using only park type and region as predictors from the dataset in Figure 12. The decision tree for this model is in Figure 27 and the confusion matrix is seen in Figure 28. This tree had an accuracy of about 77%, making it the highest performing tree (only 1% higher than the tree in Figure 15). Looking at the decision tree, it makes sense that parks that are not battlefields, historic sites, or memorials (among other types) have higher visitation numbers. It also makes sense that, among the more popular park types, those located in the Alaska region don't have as high visitor numbers.

Figure 24: More complex decision tree using only activities
Figure 25: Variable importance for more complex tree
Figure 26: Confusion matrix for more complex tree
Figure 27: Decision tree using only region and park type
Figure 28: Confusion matrix for tree with only park type and region
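A sketch of that final fit, under the same rpart assumption and with placeholder column names `park_type` and `region`:

```r
# Tree using only park type and region as predictors.
ptr_tree <- rpart(visit_level ~ park_type + region, data = train, method = "class")

pred_ptr <- predict(ptr_tree, newdata = test, type = "class")
table(Predicted = pred_ptr, Actual = test$visit_level)  # confusion matrix (Figure 28)
mean(pred_ptr == test$visit_level)                      # reported accuracy was ~77%
```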

To see if balancing the data would improve the accuracy of the tree in Figure 27, a balanced dataset was created. Figure 29 shows a much better balance of labels in the training and testing sets. After repeating the process used to create the tree in Figure 27 with the balanced data, the accuracy surprisingly went down by about 14%. This suggests the patterns in the data are complex and there may not be enough data for the model to learn them properly. It could also mean that the available features simply don't contain strong enough patterns to confidently classify visitor numbers as low, average, or high.

Figure 29: Counts of each label in the training (top) and testing dataset (bottom)
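The report does not say exactly how the balancing was done; one common approach is to downsample each class to the size of the smallest class, sketched below:

```r
# Downsample every class in the training set to the smallest class's size.
min_n <- min(table(train$visit_level))
balanced_idx <- unlist(lapply(levels(train$visit_level), function(lbl) {
  sample(which(train$visit_level == lbl), size = min_n)
}))
train_balanced <- train[balanced_idx, ]

table(train_balanced$visit_level)   # roughly equal counts, as in Figure 29
```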

Conclusions

Overall, the best performing decision tree was created using just park type and region as predictors, with the tree using those same attributes plus the activities performing second best. Based on the tree in Figure 27, if a park is a National Parkway, National River, National Lakeshore, National Seashore, National Recreation Area, National Park, or National Preserve and is not in the Alaska region, it is classified as having high visitor numbers. Of course, all of these trees were built from the log of 2022 visitor numbers, but if trends stay the same, they should remain reasonably accurate for future years. If more time allowed, it would be interesting to build trees for every year we have data for to get a more complete picture.

Based on the trees built, the models will not classify any park as having low visitor numbers unless they are allowed to overfit. I would view these trees as useful mainly for distinguishing whether or not a park has high visitor numbers. Looking at variable importance, park type appears to be the most important factor in determining visitor numbers, and region doesn't seem to matter much at all (except for the Alaska region in the simplest tree, Figure 27). Even though activities were not as good a way to determine visitation numbers, a park that offers biking or horse trekking is more likely to have higher-than-average visitor numbers.

Since balancing the data did not improve accuracy, the features in this dataset might not be well suited to predicting visitor numbers for 2022. Perhaps this dataset should use a different label, which will be explored with Naive Bayes.