Clustering

Overview

Clustering is an unsupervised machine learning technique used to discover whether groups or categories exist in a dataset and, if they do, what those groups are. It can be used to create labels for the data, which can then be used to predict or classify new data. Clustering is also a way to reduce dimensionality in a dataset. There are many different clustering techniques, but two common ones are partitional and hierarchical clustering.

Partitional clustering divides the data into non-overlapping groups: each row of the data belongs to exactly one cluster, and each cluster must contain at least one point. One example is k-means clustering. The user chooses k, and k initial cluster centroids are randomly selected. Next, every point is assigned to the closest centroid according to a chosen distance metric. Figure 1 shows examples of different distance metrics used for clustering algorithms. After points are assigned, the cluster centroids are recalculated, the distances from each point to the new centroids are computed, and the points are reassigned. This process repeats until little to no change occurs. Figure 2 shows the final result of a k-means clustering algorithm, and a small sketch of this loop follows the figures below.

Figure 1: Examples of different distance metrics
Figure 2: Before and after of k-means clustering
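
To make the loop above concrete, here is a minimal k-means sketch in Python with NumPy. The toy data, the Euclidean metric, and the stopping rule are illustrative assumptions, not the project's actual code (which is linked in the Code section).

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Pick k initial centroids at random from the data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign every point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned points
        #    (empty-cluster handling omitted for brevity).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs; k = 2 should recover them.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```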

Hierarchical clustering differs from partitional clustering in that it creates a hierarchy of clusters. The user does not tell the computer how many clusters there should be; the structure emerges from the algorithm. There are two ways to do hierarchical clustering: agglomerative (AGNES) and divisive (DIANA). AGNES is a bottom-up approach in which each row is initially its own cluster. At each step, the two clusters that are closest to each other are combined into one cluster, and this process is repeated. DIANA, on the other hand, is a top-down approach: it starts with all points in the same cluster, which is then split into two clusters, and this process is repeated. Figure 3 shows a dendrogram, which is how hierarchical clustering is visualized, and it illustrates the difference between AGNES and DIANA.

Figure 3: Dendrogram of hierarchical clustering

Hierarchical clustering can use several different linkage methods, including complete/maximum linkage, single/minimum linkage, average/mean linkage, centroid linkage, and Ward's minimum variance method. Complete/maximum linkage computes all pairwise dissimilarities between two clusters and takes the largest value as the distance, which leads to more compact clusters. Single/minimum linkage takes the smallest pairwise dissimilarity as the distance, which leads to looser clusters. Average/mean linkage uses the average of the pairwise dissimilarities; the cluster shapes this produces vary with the data. Centroid linkage computes the dissimilarity between the centroids of the two clusters. Lastly, Ward's method minimizes the total within-cluster variance: the two clusters whose merge increases it the least are combined, creating more compact clusters. The method used in this project will be whichever yields the best results. The sketch below spells out the first four definitions.
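
Here the linkage definitions are computed on two tiny made-up clusters; the points and the Euclidean base metric are assumptions for illustration only.

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])   # toy cluster A
B = np.array([[4.0, 3.0], [5.0, 3.0]])   # toy cluster B

pairwise = cdist(A, B)   # all pairwise point-to-point dissimilarities

complete = pairwise.max()    # complete/maximum linkage: largest value
single   = pairwise.min()    # single/minimum linkage: smallest value
average  = pairwise.mean()   # average/mean linkage: mean value
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # centroid linkage

# Ward's method instead merges the pair of clusters whose union yields
# the smallest increase in total within-cluster variance.
```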

For this project, clustering will be done on the acreage measures and visitation numbers for the parks to see whether the resulting clusters resemble the existing labels. For example, there are 7 different regions, so it will be interesting to see whether hierarchical clustering and k-means clustering separate the dataset into 7 clusters. It will also be interesting to see whether parks that are clustered together offer similar activities. There is also a dataset that contains word counts for each park's description; it will be interesting to see how that data clusters and whether similar park types land in the same cluster. In Python, k-means will use Euclidean distance; in R, hierarchical clustering will use cosine similarity as the distance metric. Both metrics are illustrated below.
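
As a quick reference, both metrics reduce to one line each; the vectors here are made up.

```python
import numpy as np

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 1.0])

euclidean = np.linalg.norm(x - y)   # metric for k-means in Python
cosine_distance = 1 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # metric in R
```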

Data Prep

Clustering requires unlabeled, fully numeric data. For the text data, k-means will be done in Python. The current form of this dataset is shown in Figure 4, with the full version found here. Before clustering, the park name and park type label have to be removed; after doing this, the data looks like Figure 5 (this version was not saved as a CSV, just updated in the Python code). In addition to the text data, k-means will also be done on acreage and visitation data. This dataset has many features, several of them categorical, which need to be removed. The data before cleaning is shown in Figure 6, with the entire dataset found here. After removing labels and keeping only the acre and visitor columns, the final dataset looks like Figure 7, with the entire dataset found here. The hierarchical clustering in R uses the dataset from Figure 7, so its cleaning process is the same. In both Python and R, all variables were converted to numeric if they weren't already. A sketch of this cleaning step follows the figures below.

Figure 4: Snippet of word count dataset before cleaning
Figure 5: Cleaned text data to be used for k-means
Figure 6: Snippet of acre/visitor data before cleaning
Figure 7: Snippet of acre/visitor data cleaned for clustering
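
A rough sketch of that cleaning step follows; the column names and values below are hypothetical stand-ins, not the real dataset.

```python
import pandas as pd

# Stand-in for the raw acre/visitor table (Figure 6); values are made up.
df = pd.DataFrame({
    "Park Name": ["Park A", "Park B"],
    "Park Type": ["National Park", "National Monument"],
    "Region": ["Northeast", "Intermountain"],
    "Gross Area Acres": [1000.0, 2000.0],
    "Visitors 2022": [50000, 75000],
})

# Set the labels aside so clusters can be compared to them later.
labels = df[["Park Name", "Park Type"]]

# Drop the label/categorical columns, keeping only acre and visitor features.
numeric = df.drop(columns=["Park Name", "Park Type", "Region"])

# Make sure everything that remains really is numeric.
numeric = numeric.apply(pd.to_numeric, errors="coerce")
```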

Code

All code to complete clustering can be found here: https://drive.google.com/drive/folders/1YbRsHKfP9CMmFwnwVHK0qCImETsNQvnI?usp=drive_link. K-means was done in Python and hierarchical clustering was done in R.

Results

The first thing to analyze is the k-means clustering done on the text data. Since this dataset had 50 columns, principal component analysis (PCA) was used before k-means to scale the data and reduce it to two dimensions. There were 46 unique park types in this data, but many of them had a frequency of one, so k was originally estimated to be between 7 and 15 clusters. To find the optimal number of clusters, the silhouette score was used. The silhouette score is a measure of clustering fit, with a higher score indicating a better fit. The plot of the silhouette scores for k = 5 through 45 is shown in Figure 8, and it indicates that k = 11 is the optimal number of clusters for this dataset (a sketch of this selection step follows the figures below). After choosing k, k-means was run for 35 iterations, and the result of the final iteration is shown in Figure 9. The results look reasonably good, although some points appear to be clustered out of place. To see whether park descriptions differ by park type, the original labels were compared to the clusters, as shown in Figure 10. As mentioned before, there were 46 different park types, but a lot of them occurred only once or twice in the dataset, so to make visualization easier, all park types that appeared fewer than 10 times were re-labeled as 'other'. Figure 10 shows that some clusters, such as 0, 1, 9, and 10, contain a mix of different park types; compare these to cluster 4, where the majority of the data points are of type 'National Park'.

Figure 8: Silhouette scores for different values of k
Figure 9: Final k-means clustering result when k = 11 for the text data
Figure 10: Counts of park types in each cluster for text data
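
A minimal sketch of that selection step, assuming scikit-learn and a random stand-in for the word-count matrix (the real data is shown in Figure 5):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

word_counts = np.random.rand(400, 50)   # stand-in for the word-count matrix

# Scale, then reduce to two principal components.
X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(word_counts))

# Fit k-means for k = 5..45 and record the silhouette score for each fit.
scores = {k: silhouette_score(X2, KMeans(n_clusters=k, n_init=10,
                                         random_state=0).fit_predict(X2))
          for k in range(5, 46)}

plt.plot(list(scores), list(scores.values()))   # the peak gives the chosen k
plt.xlabel("k"); plt.ylabel("silhouette score"); plt.show()
```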

Next, k-means was run again, this time on the acreage & visitation data. To make visualization easier, PCA was again used to scale the data and reduce it to two dimensions. Not knowing what to expect, the silhouette method was used again to determine the optimal k. It suggests the best number of clusters for this dataset is 3, as seen in Figure 11: the silhouette score for k = 3 was 0.80, for k = 4 it was 0.79, and for k = 5 it was 0.73. There is no real difference between 3 and 4, but 3 will be used since its score is the highest. The clustering results alongside the actual labels are shown in Figures 12 & 13; there doesn't appear to be any association between park type or region and which cluster a point falls into. To understand more about what kind of data belonged to each cluster, the full dataset was examined cluster by cluster (a sketch of this step follows the figures below). Figures 14, 15 & 16 show samples of visitation numbers for data in clusters 0, 1, and 2 respectively. Looking at this data, it is easy to see that the data was likely clustered based on these values: visitation numbers in cluster 0 are smaller than those in cluster 1, and both are smaller than those in cluster 2.

Figure 11: Silhouette scores for various values of k for the visitor/acre data
Figure 12: Counts of park types in each cluster for acre/visitor data
Figure 13: Counts of regions in each cluster for acre/visitor data
Figure 14: Sample of visitation numbers for data in cluster 0
Figure 15: Sample of visitation numbers for data in cluster 1
Figure 16: Sample of visitation numbers for data in cluster 2
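
A minimal sketch of that inspection step, again with stand-in data and hypothetical column names:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Stand-in for the cleaned acre/visitor table from Data Prep.
numeric = pd.DataFrame(np.random.rand(100, 6),
                       columns=[f"Visitors {y}" for y in range(2017, 2023)])

# Same pipeline as before: scale, reduce to 2-D, fit k-means with k = 3,
# then attach the cluster labels back onto the table.
X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(numeric))
numeric["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)

# Eyeball a few rows per cluster, as in Figures 14-16.
for c in range(3):
    print(numeric[numeric["cluster"] == c].head())
```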

Lastly, hierarchical clustering was done on the acre/visitor data in R to see how it would compare with the k-means results above. Since PCA was used above, it was also used here to normalize the data and reduce it to two dimensions. To determine the number of clusters, a few different hierarchical methods were tested: Ward's method, single linkage, and complete linkage (an analogous sketch of this step appears below). The dendrograms for these methods are shown in Figures 17-19. Each method could potentially suggest a different number of optimal clusters, and the red lines in each image mark where the tree would be cut. For Ward's method, 4 clusters appeared best; for single linkage it was 3; and for complete linkage it was harder to tell, but 4 seemed most appropriate. Compared to k-means picking 3 as the optimal number of clusters, hierarchical clustering produces fairly similar results.

To learn more about the clusters, the data was filtered according to the clusters created by Ward's method and some interesting visuals were created. First, visitation numbers in 2022 and total park area in acres were compared across clusters, shown in Figures 20 & 21 respectively. These two boxplots show that parks in cluster 1 drew the most visitors and parks in cluster 2 the fewest. Looking at acres, parks in clusters 1 and 4 are bigger than those in 2 and 3. Figures 22 & 23 show how the clusters relate to region and park type. Almost all of the parks in the Alaskan region appear in cluster 4, and most of the Southeast region appears in cluster 2. Moving on to park type, cluster 2 contains all of the National battlefields, battlefield parks, and historic sites, as well as a majority of the historical parks; it also holds a very large percentage of the national monuments. This makes sense, as those are the least visited and smallest parks. Looking at cluster 1 in Figure 23, it is mainly National Parks, which also explains why this cluster saw more visitors in 2022 than any other.
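The project ran this step in R; as a rough analogue in the same language as the other sketches here, the same workflow with SciPy on stand-in 2-D PCA scores:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X2 = np.random.rand(100, 2)   # stand-in for the 2-D PCA scores

# Build and plot a dendrogram for each method, as in Figures 17-19.
for method in ("ward", "single", "complete"):
    Z = linkage(X2, method=method)
    dendrogram(Z)
    plt.title(method); plt.show()

# Cut the Ward tree into 4 clusters (the count read off the dendrogram).
clusters = fcluster(linkage(X2, method="ward"), t=4, criterion="maxclust")
```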

For fun, I thought it would be interesting to look at a few activities to see how they related to the clusters. Figures 24-26 show the counts of parks that do and don't offer mountain climbing, surfing, and shopping, respectively. Only clusters 1 & 4 contain parks that offer mountain climbing (which might help explain why more people visit the parks in cluster 1). Looking at surfing, cluster 2 doesn't contain any parks that offer this activity. Lastly, looking at shopping, all clusters have parks that offer shopping, but the proportion of parks that don't is much higher in cluster 2 than in the other clusters.

Figure 17: Dendrogram for Ward's method
Figure 18: Dendrogram for single linkage
Figure 19: Dendrogram for complete linkage
Figure 20: Boxplots of visitor numbers in 2022 for parks in each cluster
Figure 21: Boxplots of total acres for parks in each cluster
Figure 22: Counts of parks by region in each cluster
Figure 23: Count of parks by park type for each cluster
Figure 24: Count of parks that do/don’t offer mountain climbing in each cluster
Figure 25: Counts of parks that do/don’t offer surfing in each cluster
Figure 26: Counts of parks that do/don’t offer shopping in each cluster

Conclusions

Doing both k-means and hierarchical clustering was a great way to uncover similarities within the data. The first takeaway was that the words in the parks' descriptions don't seem to differ drastically by park type. All of the descriptions are different, of course, but in the clustering graph it was hard to see distinct groups. There was more to learn from clustering the visitor/acre data: both methods concluded that the data can reasonably be clustered into about 3 or 4 clusters. The hierarchical clustering in R made it easy to see that parks in clusters 1 and 3 had more visitors in 2022 than those in clusters 2 and 4. One of the main goals of this project is to determine what impacts the number of people that visit each park, and with 3-4 distinct clusters, figuring out what drives visitation may be easier if later analysis keeps the cluster labels attached to the data. Building plots of different activities was very insightful, and further analysis with more activities might lead to a conclusion on which activities impact visitation the most. Overall, the data can be clustered, and in both methods the main deciding factor is the number of visitors to a park.