Overview
Association rule mining (ARM) is an unsupervised machine learning method that assists in data discovery. Its purpose is to evaluate transactions and find associations between the items they contain. ARM is commonly done in R, and it requires transaction data, which is discussed further in the Data Prep section. ARM finds rules in the dataset, which are ‘if-then’ associations between items. For example, the rule “Cheese => Bread” describes how likely someone who buys cheese is to also buy bread.
For each rule created in ARM, support, confidence, and lift measures are displayed. Figure 1 shows the difference between these measures. Consider the rule “X => Y”. Support measures how often X and Y appear together relative to the total number of transactions in the dataset. Support is not a good measure to use if an item, say X, is rare, since it is computed against the total number of transactions. If an item is very rare, confidence is a better measure. Confidence measures how often X and Y appear together relative to the transactions that contain X. Because the transactions that contain X can never outnumber the total number of transactions, confidence is always greater than or equal to support.
The last measure is lift. Lift compares the confidence of the rule to how often Y appears in the dataset overall, which shows how meaningful the support and confidence are for a given rule. If the lift is equal to 1, X and Y are independent and therefore are not associated. If the lift is less than 1, X and Y are negatively correlated and the rule isn’t relevant. Only rules with a lift greater than 1 are considered positively correlated and interesting. It is possible for a rule to have high support and confidence but a low lift, typically because the items are common throughout the dataset.
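The three measures can be illustrated with a small worked example. The sketch below uses the standard definitions (support as the share of transactions containing both sides of the rule, confidence as support divided by the support of the left hand side, and lift as confidence divided by the support of the right hand side). The transactions and the function names are made up for illustration and are not taken from the park dataset or from the R output.

```python
# Toy transactions (hypothetical data, not from the park dataset).
transactions = [
    {"Cheese", "Bread"},
    {"Cheese", "Bread", "Milk"},
    {"Cheese", "Milk"},
    {"Bread"},
    {"Milk"},
]


def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Support of lhs and rhs together, relative to the support of lhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


def lift(lhs, rhs, transactions):
    """Confidence of the rule relative to how common rhs is overall."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)


# Measures for the rule "Cheese => Bread".
sup = support({"Cheese", "Bread"}, transactions)
conf = confidence({"Cheese"}, {"Bread"}, transactions)
lft = lift({"Cheese"}, {"Bread"}, transactions)
print(f"support={sup:.3f} confidence={conf:.3f} lift={lft:.3f}")
```

In this toy example, cheese and bread appear together in 2 of 5 transactions (support 0.4), and 2 of the 3 cheese transactions also contain bread (confidence of about 0.67). Bread appears in 3 of 5 transactions overall, so the lift is about 1.11, which is slightly above 1.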
How does R compute metrics for all of these rules? It would take a long time to look at every possible rule and the metrics associated with it. That is where the apriori algorithm comes in. Looking at Figure 2, the apriori algorithm starts at the top and works its way down. If an itemset does not meet the minimum support threshold, then all supersets of that itemset also won’t meet the threshold and are therefore skipped. For example, if the itemset {Cheese, Bread} is found to be infrequent (meaning that few transactions contain both cheese and bread), then the itemset {Cheese, Bread, Milk} will also be infrequent, so rules such as “Cheese => Bread, Milk” never need to be evaluated. Figure 2 shows how many possible itemsets there are when there are only five distinct items. The dataset used with ARM in this project contains over 60 items, so this algorithm is needed to make the analysis feasible.
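The pruning idea can be sketched in a few lines of Python. This is a minimal illustration of level-wise frequent itemset search, not the implementation R uses. The function name `frequent_itemsets` and the toy transactions are assumptions made for this example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: only build larger itemsets from frequent smaller ones."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the minimum support.
        level = {}
        for candidate in current:
            sup = sum(candidate <= t for t in transactions) / n
            if sup >= min_support:
                level[candidate] = sup
        frequent.update(level)

        # Build candidates of size k + 1. A candidate is kept only if every
        # one of its k-item subsets is frequent (the apriori pruning step).
        k += 1
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in level for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        current = sorted(candidates, key=sorted)
    return frequent


# Toy transactions (hypothetical data, not from the park dataset).
transactions = [
    {"Cheese", "Bread"},
    {"Cheese", "Bread", "Milk"},
    {"Cheese", "Milk"},
    {"Bread"},
    {"Milk"},
]

freq = frequent_itemsets(transactions, min_support=0.4)
for itemset, sup in freq.items():
    print(sorted(itemset), sup)
```

With a minimum support of 0.4, {Bread, Milk} is infrequent because it appears in only one of the five transactions. As a result, {Bread, Cheese, Milk} is never counted at all, since one of its subsets already failed the threshold.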
In this project, ARM will be used to look at the associations between different park activity offerings. For example, if a park offers a certain activity such as Hiking, is it more likely to also offer Stargazing? It will also be interesting to see whether certain regions are associated with certain types of activities.
Data Prep
As stated above, ARM requires transaction data. Transaction data is different from record data and can be represented in basket, single, or matrix format. Examples of transaction data can be seen in Figure 3. For ARM, a transaction ID/row number is not needed. It is also important that there are no numeric values in the dataset. R and Python will run ARM with numeric values in the dataset, but doing so is incorrect, so this needs to be fixed before continuing. Since ARM is an unsupervised method, labels are not normally included. One nice thing about ARM, though, is that a label can be included in a transaction to see how it is associated with other items. For this project, ARM will be done without a label to see associations between different activities offered in parks, and then again with the region label to see if region is associated with any specific activities.
Currently, the dataset is in record format with labels. A snippet is shown in Figure 4, with the full dataset found here. As stated above, two transaction files will be created: one with all of the activities in each park, and one with all of the activities in each park plus the region of the park. To get the data into basket format for the first file, all of the columns that are not an activity name are removed. Then, for each row, the name of every activity column with a value of 1 is added to the transaction for that row. A snippet of the final transaction file is shown in Figure 5, with the full file found here. The same process was followed for the second transaction file, except that the region of each park was included in each transaction, as shown in Figure 6, with the entire dataset found here.
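The conversion from record format to basket format described above can be sketched as follows. This is only an illustration of the idea and not the project’s actual cleaning code. The column names, the three sample rows, and the helper `to_baskets` are all hypothetical, and the example reads from and writes to in-memory text rather than the real files.

```python
import csv
import io

# Hypothetical record-format snippet; the real dataset has over 60 activity columns.
record_csv = """Park,Region,Hiking,Stargazing,Shopping
Park A,Intermountain,1,1,0
Park B,Northeast,0,0,1
Park C,Northeast,1,0,1
"""

# Columns that are not activity names (assumed for this example).
NON_ACTIVITY_COLUMNS = {"Park", "Region"}


def to_baskets(rows, label_column=None):
    """Turn 0/1 activity columns into one list of activity names per row."""
    baskets = []
    for row in rows:
        basket = [
            column
            for column, value in row.items()
            if column not in NON_ACTIVITY_COLUMNS and value == "1"
        ]
        if label_column is not None:
            basket.append(row[label_column])  # optionally keep a label such as region
        baskets.append(basket)
    return baskets


rows = list(csv.DictReader(io.StringIO(record_csv)))
activity_baskets = to_baskets(rows)
region_baskets = to_baskets(rows, label_column="Region")

# Basket format: one transaction per line, comma separated, no ID column.
basket_text = io.StringIO()
csv.writer(basket_text).writerows(activity_baskets)
print(basket_text.getvalue())
```

Calling `to_baskets` once without a label and once with `label_column="Region"` mirrors the two transaction files: one with activities only, and one with activities plus the region of each park.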
Code
The code used to clean the data and perform ARM is located here: https://drive.google.com/drive/folders/1Fyly60j8Kk_jt6T0NB_2aiaMvu2cG7gi?usp=drive_link. Python was used to clean the data and R was used to perform ARM.
Results
First, here are the results for just the activities (regions not included). The apriori algorithm requires minimum support and confidence thresholds. Given that there are about 300 transactions and over 60 different activities that could be offered at a park, higher thresholds were needed. Using a minimum support of 50% and a minimum confidence of 75% produced about 50 rules, which is more than enough for analysis. To start, let’s look at the top 15 rules based on support (Figure 7). Guided tours, junior ranger program, shopping, bookstore and park store, and hiking are the only activities that appear in these rules. This makes sense, especially after looking at Figure 8, which shows how often the activities occur in the dataset: the top 15 rules by support include only the top 5 activities. One thing to keep in mind about support is that a rule made up of frequent items will have a high support even if it is not a relevant or meaningful rule. In this case, all of the lifts are above one, so all of these rules happen to be meaningful.
Now let’s look at the top 15 rules by confidence, shown in Figure 9. The first thing to note is that there are rules with a confidence of one. This means that parks that have every activity on the left hand side (denoted as ‘lhs’ in the R output) also have the activity on the right hand side (denoted as ‘rhs’) 100% of the time in this dataset. For example, all the parks in this dataset that have bookstore and park store and guided tours also have shopping as an activity. It is also interesting that the lifts of the top rules by support are all above one, but the lifts of the top rules by confidence are higher. Figure 10 shows the top 15 rules sorted by lift, which again is a measure of how relevant the rules are. Notice that the top two rules are the same whether sorting by confidence or by lift. Figure 11 is a network visualization of the top 15 rules sorted by lift.
Lastly, it is time to examine which activities certain regions are associated with. To sort through the dataset, the same thresholds as above were used (minimum support of 50% and minimum confidence of 75%). Based on the earlier exploratory data analysis, the two most frequent regions are Intermountain and Northeast, so these will be the areas of focus. Figure 12 and Figure 13 show the top 15 rules sorted by lift with the left hand side set to Northeast Region and Intermountain Region, respectively. It is interesting to see rules with more niche activities compared to the general dataset. For example, if a park is in the Northeast region, it is more likely to have geocaching, followed by live music and golfing. By comparison, parks in the Intermountain region are more likely to have canyoneering, followed by jet skiing and water skiing. Northeastern parks are likely to have saltwater swimming, while intermountain parks are more likely to have rock climbing. All of this makes sense given the geography of these regions (an intermountain park wouldn’t have saltwater swimming since the Intermountain region is landlocked). Figure 14 is a network visualization of Figure 12. Since every rule has the same left hand side, the network looks very different from the one in Figure 11.
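Fixing the left hand side of a rule to a region and ranking the results by lift can be sketched as below. This is a simplified illustration that only considers rules with a single item on each side. It is not the R code used for the analysis, and the function `rules_with_fixed_lhs` and the four toy transactions are hypothetical.

```python
def rules_with_fixed_lhs(transactions, lhs_item, min_support=0.0, min_confidence=0.0):
    """Build rules 'lhs_item => activity' and sort them by lift, highest first."""
    n = len(transactions)
    with_lhs = [t for t in transactions if lhs_item in t]
    if not with_lhs:
        return []

    # Every other item that appears alongside the fixed left hand side.
    candidates = {item for t in with_lhs for item in t} - {lhs_item}

    rules = []
    for rhs in sorted(candidates):
        both = sum(rhs in t for t in with_lhs)
        sup = both / n                       # share of all transactions with lhs and rhs
        conf = both / len(with_lhs)          # share of lhs transactions that also have rhs
        rhs_sup = sum(rhs in t for t in transactions) / n
        lft = conf / rhs_sup                 # confidence relative to how common rhs is
        if sup >= min_support and conf >= min_confidence:
            rules.append(
                {"lhs": lhs_item, "rhs": rhs, "support": sup, "confidence": conf, "lift": lft}
            )
    return sorted(rules, key=lambda rule: rule["lift"], reverse=True)


# Toy transactions with a region label included (hypothetical data).
transactions = [
    {"Northeast Region", "Hiking", "Saltwater Swimming"},
    {"Northeast Region", "Saltwater Swimming"},
    {"Intermountain Region", "Hiking", "Rock Climbing"},
    {"Intermountain Region", "Hiking"},
]

northeast_rules = rules_with_fixed_lhs(transactions, "Northeast Region")
for rule in northeast_rules:
    print(rule["lhs"], "=>", rule["rhs"], round(rule["lift"], 2))
```

In this toy data, saltwater swimming only appears in Northeast transactions, so its rule has the highest lift, while hiking is common everywhere and its rule has a lift below 1.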
Conclusion
Doing ARM on the national park activities dataset was helpful in understanding whether certain activities are present together in the same park. When looking at the top 15 rules sorted by support, confidence, and lift, most of the activities contained in the rules were also among the top 20 most frequent activities. As a result, not much new information was learned about the activities in general. It might be interesting later on to look at whether parks that don’t follow some of the top 15 rules by lift have vast differences in visitation numbers compared to the parks that do follow the rules. One thing that ARM pointed out that would have been hard to see otherwise was the two rules that had a confidence of 1. It is now known that, for all of the national parks in this dataset, if the park has bookstore and park store and guided tours as activities, it will also have shopping.
The analysis done by fixing specific regions as the left hand side of the rule gave a lot more insight than the general analysis. Doing this was a great way to see the different types of activities that are commonly found in parks in different regions. Many of the lift measures for the top 15 rules for the Northeast and Intermountain regions were higher than the lift measures for the top 15 rules for the general activities. These two regions shared only one common activity in the top 15, which was auto off-roading. This ARM analysis with regions will help down the road when looking at visitation trends for different regions. If a certain region has a lot more visitors than the rest, that could be an indication that people favor certain activities more.