Initial exploratory data analysis showed what data was available and some of the patterns and trends in it. The most commonly offered activities include the junior ranger program, guided tours, shopping, and bookstore and park store. The parks offering the most activities are Great Sand Dunes National Park & Preserve, with a little over 50 activities, followed by Olympic National Park and Grand Teton National Park. Of the 10 parks offering the most activities, 7 were in the Intermountain or Pacific West regions, which may suggest that these regions offer more because of the geological makeup of these parts of the country. Across all of the parks recognized by the National Park Service (NPS), National Monuments and National Historic Sites are the most common park types. Among National Parks, Great Smoky Mountains National Park was the most visited in multiple years by a wide margin. Parks that had many visitors in 2021 also tended to have many visitors in 2022. Overall, Blue Ridge Parkway was the most visited park in both of these years. Comparing the top 5 most visited National Monuments with the top 5 most visited National Parks in 2022, the National Parks saw far more visitors.
Methods such as clustering and association rule mining (ARM) provided insight into relationships between parks that were not obvious or labeled. When clustering was performed on the acre and visitor data, the parks were grouped not by park type or region but by the number of visitors. The data appears to contain groups of parks with a low, an average, and a high number of visitors. When the clustering was done in Python, park type and region did not appear to play into visitor levels. When the clustering was done in R, there did seem to be a difference in visitation depending on region and park type, which led to further exploration later in the project. The clustering results guided how the decision trees were set up. The decision tree models were built to predict the number of visitors in 2022 from many different factors. When only region and park type were used as predictors, park type seemed to be the better predictor of the two. If a park is not a National Parkway, National River, National Lakeshore, National Seashore, National Recreation Area, National Park, or National Preserve and is not located in Alaska, it is predicted to have higher visitation numbers. In another tree, park type was also the deciding factor. When the activities offered were added, they appeared to play a bigger role than region. Horse trekking and biking were two of the more influential activities when predicting visitor numbers. This could mean that the clusters found another pattern that didn’t relate to park type or region. Overall, building the decision trees showed that park type and region can be somewhat helpful in predicting visitation.
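To make the low, average, and high visitor groups concrete, the following is a minimal sketch of k-means clustering on visitor counts alone. It is not the project's actual code: the function name kmeans_1d, the deterministic starting centroids, and the ten visitor counts are all assumptions made up for illustration.

```python
from statistics import mean


def kmeans_1d(values, k=3, iterations=50):
    """Plain Lloyd's algorithm on a 1-D list; returns (centroids, labels).

    Illustrative helper, not from the project. Requires k >= 2.
    """
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [ordered[(len(ordered) - 1) * i // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(mean(members) if members else centroids[c])
        if updated == centroids:
            break  # Converged: no centroid moved.
        centroids = updated
    return centroids, labels


# Made-up annual visitor counts (in thousands) for ten hypothetical parks.
visitors = [12, 30, 45, 20, 900, 1100, 1000, 4500, 5200, 4800]
centroids, labels = kmeans_1d(visitors, k=3)
for count, label in zip(visitors, labels):
    print(count, "-> cluster", label)
```

With these made-up counts the three clusters line up with low, middle, and high visitation. Cross-tabulating such labels against park type or region is one way to check whether the clusters track those attributes.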
What else can activities tell us? ARM suggested that some of the less common activities depend on region. Parks located in the Northeast were more likely to offer activities such as geocaching, live music, and golfing. By comparison, parks in the Intermountain region were more likely to offer canyoneering, jet skiing, and rock climbing. Clustering also showed that activities such as mountain climbing and surfing are offered in parks that are visited more. Moving on to supervised learning methods, we were able to learn a lot about the relationship between activities and the number of visitors in 2022. In Naive Bayes, the goal was to predict the number of activities a park offered from different combinations of predictors such as region, park type, visitors, acre data, and activity data. The acre data was not very useful in any supervised method: models that included it had worse accuracy than models that did not. Region, park type, and sometimes visitor numbers were better predictors of the number of activities. In Naive Bayes, using only region and park type was better than also including visitor numbers in the model. For SVMs, however, numerical data was needed, so visitor numbers were the only predictors used. The SVMs did a decent job of predicting the number of activities, but the models that included park type and region had better accuracies. The conclusion is that region and park type are better at predicting the number of activities than visitor numbers are.
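As an illustration of how an ARM rule such as "Northeast, therefore geocaching" is scored, the sketch below computes support, confidence, and lift for a single rule. The helper rule_metrics and the eight park "transactions" are hypothetical and invented for this example; they are not the project's data or code.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Illustrative helper; each transaction is a set of items for one park.
    """
    n = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_consequent = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    if n_antecedent == 0 or n_consequent == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_both / n                   # share of parks matching the whole rule
    confidence = n_both / n_antecedent     # P(consequent | antecedent)
    lift = confidence / (n_consequent / n)  # > 1 means a positive association
    return support, confidence, lift


# Made-up parks: region plus offered activities, encoded as item sets.
parks = [
    {"region=Northeast", "geocaching", "live music"},
    {"region=Northeast", "geocaching", "hiking"},
    {"region=Northeast", "geocaching", "golfing"},
    {"region=Northeast", "hiking"},
    {"region=Intermountain", "canyoneering", "hiking"},
    {"region=Intermountain", "rock climbing", "hiking"},
    {"region=Intermountain", "jet skiing"},
    {"region=Intermountain", "geocaching", "hiking"},
]

support, confidence, lift = rule_metrics(
    parks, {"region=Northeast"}, {"geocaching"}
)
print(support, confidence, lift)
```

A lift above 1 indicates that the activity appears with the region more often than it would if the two were independent, which is the sense in which a rule says an activity is more likely in a region.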
One main goal of the project was to see whether park type could be predicted from the data. Does the number of activities offered dictate what type of park something is? This was explored with logistic regression. For this model, the goal was to predict whether a park was a National Park or not. It turns out that the visitor numbers in 2022 and the number of activities offered are very good at predicting whether a park is a National Park! The higher the visitor count and the more activities offered, the more likely a park is to be a National Park. If time had allowed, it would have been interesting to build another model to predict a different park type, such as National Historic Parks. During the Naive Bayes analysis, National Historic Parks were found to offer a low number of activities but, surprisingly, to have a high number of visitors. Given this different relationship between the predictors, it would be interesting to see whether such a model would be as accurate as the one predicting National Parks.
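The logistic regression idea can be sketched as follows: a binary "is it a National Park" label modeled from standardized visitor numbers and activity counts, fitted here with plain gradient descent. This is only an illustration under stated assumptions. The eight parks are made up, and fit_logistic, standardize, and the learning-rate settings are hypothetical choices, not the project's model or results.

```python
import math
from statistics import mean, pstdev


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, targets, lr=0.5, epochs=1000):
    """Batch gradient descent on the average log loss; returns (weights, bias)."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, targets):
            # Prediction error for this park: predicted probability minus label.
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Made-up data for eight hypothetical parks (not the project's data).
visitors_millions = [4.0, 3.0, 5.0, 2.5, 0.2, 0.5, 0.1, 0.8]
activity_counts = [40, 35, 45, 30, 8, 12, 5, 15]
is_national_park = [1, 1, 1, 1, 0, 0, 0, 0]

rows = list(zip(standardize(visitors_millions), standardize(activity_counts)))
weights, bias = fit_logistic(rows, is_national_park)
predictions = [
    int(sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) >= 0.5)
    for x in rows
]
print(weights, bias, predictions)
```

In this toy setup both fitted weights come out positive, mirroring the finding that more visitors and more activities raise the predicted probability of a park being a National Park. Real data would need a held-out test set to judge accuracy.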
The main takeaways from this project are that region and park type are helpful in predicting the number of visitors and the number of activities offered. With so many different activities, it is hard to pinpoint which few are best at determining visitor numbers. It is also hard to tell for sure whether certain activities play a major part in visitor numbers at all. Some activities are very common across all parks, while others seem to be particular to certain regions. If a tourist were really interested in a specific activity, this information could guide them toward a region to focus on visiting. A next step would be to see whether there are activities specific to certain park types. As stated in the introduction, this sort of analysis could help NPS prepare for visitors in certain regions or parks. Based on this analysis, National Parks tend to have more visitors and activities, so they seem to require the most attention in terms of staffing and upkeep. The NPS website offers a lot of data, and this project only scratched the surface. There are many other variables that could be examined for each park in the system. More numeric data would have been helpful in this project. The acre data did not offer much insight, which left only the visitor statistics. If this project is continued, it would be interesting to see whether data is available on the revenue brought in, the number of employees, the age of each park, or similar measures. I think more numeric data would have improved some of the analysis by providing more options for some of the supervised methods.