DataPrep_EDA - Isabel Beaulieu

All code and raw data downloaded for this project can be found here: https://drive.google.com/drive/folders/1dihNrMlG-SF1ms2_IHsZ87tafh1Xr0hS?usp=sharing

The first data source was the National Park Service (NPS) website. Both its API and a downloadable pre-made table were used. This website was chosen because its API provides extensive information not only about national parks but also about other park units (such as national memorials and national monuments). Data was available on each park’s designation, the state(s) it is in, the activities offered, and its description. The data gathered from this source will be used to find similarities between parks and to identify whether there is a relationship between visitation numbers and other factors. First, here is the data gathered from the API.

NPS API:

The API endpoint used was “https://developer.nps.gov/api/v1/parks”. In addition to supplying the API key, the limit parameter was changed from its default. Figure 1 shows the code used to get the data.
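As a rough illustration of such a request (not the code from Figure 1), the sketch below builds the URL with Python's standard library. The limit value of 500, the query parameter name api_key, and the top-level "data" key in the response are assumptions, not details given in this report.

```python
import json
import urllib.parse
import urllib.request

PARKS_ENDPOINT = "https://developer.nps.gov/api/v1/parks"


def build_parks_url(api_key, limit=500):
    """Return the request URL with the API key and a raised result limit."""
    # limit=500 is an assumed value; the report only says the limit was changed.
    query = urllib.parse.urlencode({"limit": limit, "api_key": api_key})
    return f"{PARKS_ENDPOINT}?{query}"


def fetch_parks(api_key, limit=500):
    """Download park records; assumes they sit under a top-level "data" key."""
    with urllib.request.urlopen(build_parks_url(api_key, limit)) as response:
        payload = json.load(response)
    return payload["data"]


# "YOUR_KEY" is a placeholder. No request is sent until fetch_parks is called.
print(build_parks_url("YOUR_KEY"))
```

Keeping the URL construction separate from the download makes it easy to inspect the request before spending an API call.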

The first step with this data was to build a file containing the park name, designation, states, and activities offered in the park. Figure 2 is a snippet of the raw data, and Figure 3 shows the cleaned dataset. The best way to keep a record of each activity a park offered was to make each possible activity its own column, stored as a factor variable: the column is marked ‘1’ if the park offers that activity and ‘0’ if it does not. The resulting CSV in Figure 3 also contains a column called ‘Region’, obtained from the NPS annual visits table shown in Figure 5.

Figure 3: Cleaned data from NPS API that includes the park, park type, states, activities offered and region of the park
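The activity encoding described above can be sketched as follows. This is a minimal illustration, assuming each raw record has a "name" and a list of activity names; those field names and the sample records are made up for the example.

```python
def one_hot_activities(parks):
    """Turn each park's activity list into one 0/1 column per activity."""
    # Collect every activity seen in any park so all rows share the same columns.
    all_activities = sorted({a for park in parks for a in park["activities"]})
    rows = []
    for park in parks:
        offered = set(park["activities"])
        row = {"Park": park["name"]}
        for activity in all_activities:
            row[activity] = 1 if activity in offered else 0
        rows.append(row)
    return all_activities, rows


# Made-up records for illustration only; not taken from the NPS data.
sample = [
    {"name": "Park A", "activities": ["Hiking", "Junior Ranger Program"]},
    {"name": "Park B", "activities": ["Junior Ranger Program"]},
]
columns, table = one_hot_activities(sample)
for row in table:
    print(row)
```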

In addition to the file of activities each park offers, the park descriptions are also of interest. A count vectorizer was used to create a bag of words from the description of each park. The raw data is the same as in Figure 2, and a snapshot of the clean data is shown in Figure 4.

Figure 4: Cleaned dataset of park description bag of words. There also is a column for ‘Park Type’ that isn’t shown.
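To show what a bag of words looks like, here is a small standard-library stand-in for a count vectorizer. It is not the vectorizer used for Figure 4, and the tokenization rule (lowercase alphabetic tokens of two or more letters) and the sample descriptions are assumptions for the example.

```python
import re
from collections import Counter


def bag_of_words(descriptions):
    """Return (vocabulary, count rows) for a list of description strings."""
    # Lowercase the text and keep alphabetic tokens of two or more letters.
    tokenized = [re.findall(r"[a-z]{2,}", text.lower()) for text in descriptions]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for words that do not occur in this description.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Made-up descriptions for illustration only.
docs = ["A canyon carved by a river", "River trails and canyon views"]
vocab, matrix = bag_of_words(docs)
print(vocab)
print(matrix)
```

Each row corresponds to one park and each column to one word, which is the same shape as the cleaned dataset described above.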

NPS Annual Visits Table:

Since this project hopes to answer questions related to park visits, a reliable source for the number of visitors to each park was needed, and the NPS itself had the best data. The table was downloaded from this website: https://irma.nps.gov/Stats/SSRSReports/National%20Reports/Annual%20Visitation%20and%20Record%20Year%20by%20Park%20(1904%20-%20Last%20Calendar%20Year). For the parameters, all regions, all parks, and all park types were selected. The raw data is shown in Figure 5. This data required a lot of cleaning. Many of the year columns had missing values, since many parks did not exist until more recently. Only year columns with at least 95% of their values present were kept, which left the years 2016-2022. All rows that still had any missing values were then removed. This was judged the best option because the parks with missing values were not very popular (they had low visitor numbers in other years) and enough data remains without them.
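The two-stage filter (drop sparse year columns, then drop incomplete rows) can be sketched like this. It assumes rows are dictionaries and that a missing value appears as None or an empty string; the sample data is synthetic.

```python
def keep_complete_years(rows, year_columns, min_share=0.95):
    """Keep year columns with enough data, then drop rows still missing a value."""
    if not rows:
        return [], []
    kept_years = []
    for year in year_columns:
        present = sum(1 for row in rows if row.get(year) not in (None, ""))
        if present / len(rows) >= min_share:
            kept_years.append(year)
    complete_rows = [
        row for row in rows
        if all(row.get(year) not in (None, "") for year in kept_years)
    ]
    return kept_years, complete_rows


# Synthetic example: "2015" is half empty, "2016" is missing for one park only.
rows = []
for i in range(20):
    rows.append({
        "Park": f"Park {i}",
        "2015": "" if i < 10 else "100",
        "2016": "" if i == 0 else "200",
    })
years, cleaned = keep_complete_years(rows, ["2015", "2016"])
print(years, len(cleaned))
```

Filtering columns first matters: if rows were dropped first, almost every park would be removed because of the sparse early years.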

There were also inconsistencies in this dataset. For example, in the second row of Figure 5, the park name is “Aniakchak NM & PRES”, while in Figure 3 the same park is named “Aniakchak National Monument & Preserve”. In order to merge these two datasets, abbreviations had to be expanded to their full form. Commas in the numbers were removed. For each year, there were no outlier values, but some parks had ‘0’ visitors for one year while having a significant number of visitors in the years before and after, which suggests that ‘0’ is an incorrect value. All parks with a value of ‘0’ for any year were removed (only 5 parks had to be removed). The cleaned data is shown in Figure 6.

Figure 5: Snippet of NPS visitation data
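The cleaning steps for the visitation table can be sketched as three small helpers. The abbreviation map below only contains entries that appear in this report (NM, PRES, NP); a real run would need the full list of abbreviations found in the data, and the sample rows are made up.

```python
ABBREVIATIONS = {
    "NM": "National Monument",
    "PRES": "Preserve",
    "NP": "National Park",
}


def expand_park_name(name):
    """Replace whole-word designation abbreviations with their full form."""
    return " ".join(ABBREVIATIONS.get(word, word) for word in name.split())


def parse_visits(text):
    """Convert a count such as '1,234' to the integer 1234."""
    return int(text.replace(",", ""))


def drop_zero_years(rows, year_columns):
    """Remove parks that report 0 visitors in any of the given years."""
    return [row for row in rows if all(row[year] != 0 for year in year_columns)]


print(expand_park_name("Aniakchak NM & PRES"))

# Made-up rows: Park B has a suspicious 0 in 2021 and is removed.
rows = [
    {"Park": "Park A", "2021": parse_visits("1,200"), "2022": parse_visits("1,300")},
    {"Park": "Park B", "2021": parse_visits("0"), "2022": parse_visits("9,000")},
]
print(drop_zero_years(rows, ["2021", "2022"]))
```

Matching on whole words avoids rewriting letters that merely happen to occur inside a longer park name.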

After gathering all of the data, a final CSV was created that includes region, park name, park type, visitor numbers from 2016-2022, states, and the activities offered. The final dataset can be seen in Figure 7.

Figure 7: Cleaned data with all information from visitor data and NPS API
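The merge behind the final CSV can be sketched as an inner join on the park name. The key column name "Park" and the sample rows are assumptions for the example; the function also returns the rows that found no match so they can be stored separately instead of being lost.

```python
def inner_join_on_park(activity_rows, visit_rows, key="Park"):
    """Join two lists of dict rows on the park name.

    Returns the merged rows plus the activity rows with no match in the
    visitation rows.
    """
    visits_by_park = {row[key]: row for row in visit_rows}
    merged, unmatched = [], []
    for row in activity_rows:
        match = visits_by_park.get(row[key])
        if match is None:
            unmatched.append(row)
        else:
            # Visitation columns first, then the activity columns.
            merged.append({**match, **row})
    return merged, unmatched


# Made-up rows for illustration only.
activities = [
    {"Park": "Park A", "Hiking": 1},
    {"Park": "Park C", "Hiking": 0},
]
visits = [{"Park": "Park A", "Region": "Intermountain", "2022": 1300}]
merged, unmatched = inner_join_on_park(activities, visits)
print(merged)
print(unmatched)
```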

NPS Data EDA:

After gathering all of the data from the NPS website, some initial analysis was done to see what the data contained. Looking at Figure 8, the activity offered in the most parks is the Junior Ranger Program, available in a little over 80% of parks. Figure 9 shows that Great Sand Dunes National Park & Preserve offers the most activities, a little over 50. The graph is also colored by region, showing that 4 of the top 10 parks with the most activities are in the Intermountain region. The API data covered more parks than the annual visitation data. Because the region column came from the annual visitation data frame, parks missing from the visitation dataset did not receive a region when the two were merged. All parks without a region were left out of the final dataset shown in Figure 7 to simplify the analysis. Many of the parks without regions were less popular, so little data was lost. A copy of all parks and activities without regions is stored in case more data is needed during analysis. Figure 10 shows what percentage of the dataset each region makes up. About a quarter of the parks are in the Intermountain region, followed by the Northeast region with a fifth of the parks.

Figure 9: Top 10 parks with the most activities. Colored by region

Figure 10: Pie chart of regions in the dataset
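The summaries behind these three figures can be sketched as follows, assuming rows shaped like the final dataset (a "Park" column, a "Region" column, and one 0/1 column per activity). The column names and the four sample rows are made up for the example.

```python
from collections import Counter


def activity_share(rows, activity_columns):
    """Percentage of parks offering each activity, highest first."""
    shares = {
        activity: 100 * sum(row[activity] for row in rows) / len(rows)
        for activity in activity_columns
    }
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


def top_parks_by_activity_count(rows, activity_columns, n=10):
    """The n parks with the most activities, as (park, region, count) tuples."""
    counted = [
        (row["Park"], row["Region"], sum(row[a] for a in activity_columns))
        for row in rows
    ]
    return sorted(counted, key=lambda item: item[2], reverse=True)[:n]


def region_share(rows):
    """Percentage of parks in each region."""
    counts = Counter(row["Region"] for row in rows)
    return {region: 100 * count / len(rows) for region, count in counts.items()}


# Made-up rows for illustration only.
acts = ["Hiking", "Junior Ranger Program"]
rows = [
    {"Park": "Park A", "Region": "Intermountain", "Hiking": 1, "Junior Ranger Program": 1},
    {"Park": "Park B", "Region": "Intermountain", "Hiking": 0, "Junior Ranger Program": 1},
    {"Park": "Park C", "Region": "Northeast", "Hiking": 1, "Junior Ranger Program": 1},
    {"Park": "Park D", "Region": "Northeast", "Hiking": 0, "Junior Ranger Program": 0},
]
print(activity_share(rows, acts))
print(top_parks_by_activity_count(rows, acts, n=1))
print(region_share(rows))
```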

Moving on to the annual visits data, there are some interesting findings. Figure 11 shows the top 10 most visited parks in 2022, with the Blue Ridge Parkway on top, followed by Golden Gate National Recreation Area. The 2021 data, shown in Figure 12, was also examined to see whether the same parks were in the top 10 as in 2022. Blue Ridge Parkway remains the most visited, and the following two parks are the same but in a different order. Grand Canyon was in the top 10 in 2022 but not in 2021.

Figure 11: Top 10 most visited parks in 2022

Figure 12: Top 10 most visited parks in 2021
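The year-to-year comparison of the top 10 can also be made explicit in code. The sketch below assumes rows with a "Park" column and one numeric column per year; the sample uses made-up parks and a top 2 instead of a top 10 to stay short.

```python
def top_n(rows, year, n=10):
    """Names of the n most visited parks in the given year, most visited first."""
    ranked = sorted(rows, key=lambda row: row[year], reverse=True)
    return [row["Park"] for row in ranked[:n]]


def compare_years(rows, year_a, year_b, n=10):
    """Split the top-n lists into parks in both years and parks in only one."""
    a, b = set(top_n(rows, year_a, n)), set(top_n(rows, year_b, n))
    return {"both": a & b, f"only_{year_a}": a - b, f"only_{year_b}": b - a}


# Made-up visitor counts for illustration only.
rows = [
    {"Park": "Park A", "2021": 900, "2022": 950},
    {"Park": "Park B", "2021": 800, "2022": 400},
    {"Park": "Park C", "2021": 300, "2022": 700},
]
print(top_n(rows, "2022", n=2))
print(compare_years(rows, "2021", "2022", n=2))
```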

To focus on specific park types and their visitation statistics, a frequency table of the top 5 park types was made, shown in Figure 13. Let’s look at visitor statistics for national monuments, since they are tied with national historic sites as the most common park type, and also for national parks, since they are the main focus of this project. Looking at the top national monuments in 2022 in Figure 14, Castle Clinton National Monument was the most visited, followed by Muir Woods. Looking at the top national parks in 2022 in Figure 15, Great Smoky Mountains led by a wide margin. The remaining top parks all had very similar visitation statistics. It will be interesting to see in further analysis whether the top parks share any characteristics, or whether Great Smoky Mountains has different characteristics than the other top parks.

National Monument	64
National Historic Site	64
National Park	54
National Historic Park	43
National Memorial	16

Figure 13: Top 5 park types

Figure 14: Top 5 most visited National Monuments in 2022

Figure 15: Top 5 most visited National Parks in 2022
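A frequency table of park types and a per-type ranking can be sketched as below. The column names "Park Type" and "Park" and the sample rows are assumptions for the example, not the real dataset.

```python
from collections import Counter


def park_type_frequencies(rows, n=5):
    """The n most common park types as (type, count) pairs."""
    return Counter(row["Park Type"] for row in rows).most_common(n)


def top_in_type(rows, park_type, year, n=5):
    """The n most visited parks of one type as (park, visitors) pairs."""
    of_type = [row for row in rows if row["Park Type"] == park_type]
    ranked = sorted(of_type, key=lambda row: row[year], reverse=True)
    return [(row["Park"], row[year]) for row in ranked[:n]]


# Made-up rows for illustration only.
rows = [
    {"Park": "Park A", "Park Type": "National Park", "2022": 500},
    {"Park": "Park B", "Park Type": "National Park", "2022": 900},
    {"Park": "Park C", "Park Type": "National Monument", "2022": 700},
]
print(park_type_frequencies(rows))
print(top_in_type(rows, "National Park", "2022"))
```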

Since Great Smoky Mountains was the most visited national park in 2022, let’s look at its visitor trends from 2016-2022. Figure 16 shows that the number of visitors increased from 2016-2019. There were fewer visitors in 2020 than in 2019, but visitation increased again from 2020-2021. There were fewer visitors in 2022 than in 2021, but still more than at pre-COVID-19 levels.

Figure 16: Visitor trends from 2016-2022 for Great Smoky Mountains NP
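One way to put numbers on a trend like this is the percent change from each year to the next. The sketch below uses placeholder counts chosen only to show the calculation; they are not the visitor numbers for Great Smoky Mountains.

```python
def year_over_year(visits_by_year):
    """Return {year: percent change from the previous year in the data}."""
    years = sorted(visits_by_year)
    changes = {}
    for previous, current in zip(years, years[1:]):
        before, after = visits_by_year[previous], visits_by_year[current]
        changes[current] = 100 * (after - before) / before
    return changes


# Placeholder counts, not NPS figures.
example = {2019: 200, 2020: 150, 2021: 180}
changes = year_over_year(example)
print(changes)
```

A negative value marks a drop relative to the previous year, which makes dips such as the one in 2020 easy to spot without reading them off a chart.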

NPS also provides data on park acreage, which could be useful for seeing whether park size has any impact on the number of visitors or the types of activities offered. The data was downloaded from https://www.nps.gov/subjects/lwcf/acreagereports.htm, and the report used was the Quarterly Acreage Report from December 31, 2022. A snapshot of the raw data is shown in Figure 17. To prepare the data for merging with the data in Figure 7, the ‘Area Name’ column was renamed to ‘Park Name’, abbreviations were expanded, and the all-uppercase park names were converted to match the names in Figure 7. The Region and State columns were deleted since they already appear in the table being merged with. After cleaning the data and doing an inner join, the dataset was saved as a CSV.
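The name cleanup for the acreage report can be sketched as below. It assumes the raw names look like "GREAT SMOKY MOUNTAINS NP" (all uppercase with a trailing abbreviation). The abbreviation map only holds entries seen in this report, and the small-word list is a hypothetical addition to keep words such as "of" in lowercase.

```python
ABBREVIATIONS = {
    "NM": "National Monument",
    "PRES": "Preserve",
    "NP": "National Park",
}
SMALL_WORDS = {"OF", "THE", "AND"}  # assumed list; extend as needed


def clean_area_name(area_name):
    """Expand abbreviations and convert an uppercase name to title-style case."""
    words = []
    for position, word in enumerate(area_name.split()):
        if word in ABBREVIATIONS:
            words.append(ABBREVIATIONS[word])
        elif word in SMALL_WORDS and position > 0:
            words.append(word.lower())
        else:
            words.append(word.capitalize())
    return " ".join(words)


def prepare_acreage_row(row):
    """Rename 'Area Name' to 'Park Name' and drop the Region and State columns."""
    dropped = ("Area Name", "Region", "State")
    cleaned = {key: value for key, value in row.items() if key not in dropped}
    cleaned["Park Name"] = clean_area_name(row["Area Name"])
    return cleaned


print(clean_area_name("GREAT SMOKY MOUNTAINS NP"))
```

Expanding abbreviations before changing case avoids turning "NP" into "Np", which would no longer match the names in the other dataset.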

After the merge, this dataset had only 217 rows, compared to 299 rows in Figure 7, so instead of altering the dataset made earlier, a new one was made. It contains all of the data that appears in Figure 7 as well as the data from the acreage dataset. The final product is shown in Figure 18. While cleaning this data, commas had to be removed from the numbers. It was also important to check that the first 3 columns summed to ‘Subtotal Federal Acres’, that ‘Other Public Acres’ and ‘Private Acres’ summed to ‘Subtotal Non-Federal Acres’, and that ‘Subtotal Federal Acres’ and ‘Subtotal Non-Federal Acres’ summed to ‘Gross Acre Area’. Everything was consistent, and no outliers or incorrect values were detected.
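The three consistency checks can be sketched as one function. The names of the first three (federal) columns are not given in this report, so they are passed in as a parameter and the example uses the hypothetical names "Federal A", "Federal B", and "Federal C". The tolerance of 0.01 acres is also an assumption, meant to absorb rounding.

```python
def acreage_problems(row, federal_parts, tolerance=0.01):
    """Return the names of the totals that do not match the sum of their parts."""
    checks = [
        (federal_parts, "Subtotal Federal Acres"),
        (["Other Public Acres", "Private Acres"], "Subtotal Non-Federal Acres"),
        (["Subtotal Federal Acres", "Subtotal Non-Federal Acres"], "Gross Acre Area"),
    ]
    problems = []
    for parts, total in checks:
        if abs(sum(row[part] for part in parts) - row[total]) > tolerance:
            problems.append(total)
    return problems


# Made-up acreage values; the three "Federal" column names are hypothetical.
federal = ["Federal A", "Federal B", "Federal C"]
good = {
    "Federal A": 100.0, "Federal B": 20.0, "Federal C": 5.0,
    "Subtotal Federal Acres": 125.0,
    "Other Public Acres": 10.0, "Private Acres": 15.0,
    "Subtotal Non-Federal Acres": 25.0,
    "Gross Acre Area": 150.0,
}
bad = dict(good, **{"Gross Acre Area": 999.0})
print(acreage_problems(good, federal))
print(acreage_problems(bad, federal))
```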

Air Quality Data:

Another goal of this project is to see whether the number of people visiting parks has an impact on air quality. There isn’t one source that has air quality data for every national park, but the U.S. Department of the Interior tracks air quality metrics at certain national parks. It would be too time consuming to download separate datasets for every national park with data, so the focus will be on the parks in Figure 15, since they were the most visited in 2022 (Great Smoky Mountains, Grand Canyon, Zion, Rocky Mountain & Acadia). The link to access the data is: https://ard-request.air-resource.com/data.aspx. The sites that data files were downloaded for are: ACAD-MH, GRCA-AS, GRSM-CM, ZION-DM, and ROMO-LP. The dates used were 1/1/2016 to 12/31/2022. The parameter selected was hourly ozone. A separate file was downloaded for each of the five sites, each looking like Figure 19. After deleting some rows in the Numbers app on my MacBook (all rows above the DATE_TIME and ZION-DW_O3_PPB header row, since they don’t contain any relevant information) and dealing with incorrect values in Python, the cleaned data for Zion is shown in Figure 20. The same procedure was followed for all sites.
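The header rows were removed by hand here, but the same step could be scripted. The sketch below is an alternative, not the procedure used in this project. It assumes the file is comma-separated and that the real header row starts with DATE_TIME. The preamble lines and timestamp format in the sample are made up.

```python
import csv
import io


def read_ozone_file(handle, value_column):
    """Skip preamble lines until the DATE_TIME header, then parse the rest as CSV."""
    lines = handle.read().splitlines()
    for index, line in enumerate(lines):
        if line.startswith("DATE_TIME"):
            break
    else:
        raise ValueError("no DATE_TIME header row found")
    reader = csv.DictReader(lines[index:])
    return [(row["DATE_TIME"], float(row[value_column])) for row in reader]


# Made-up file contents; only the two column names come from the report.
sample = io.StringIO(
    "Example preamble line\n"
    "Another preamble line\n"
    "DATE_TIME,ZION-DW_O3_PPB\n"
    "01/01/2016 00:00,31\n"
    "01/01/2016 01:00,-999\n"
)
readings = read_ozone_file(sample, "ZION-DW_O3_PPB")
print(readings)
```

With a real download, the handle would come from open(path, newline="") and the value column would change per site.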

To identify incorrect values, summary statistics were produced. A handful of rows had a value of “-999” for ozone concentration. Since ozone concentration cannot be below 0, all rows with a value of -999 were considered incorrect. Comparing the variance of the column with and without these values showed a large difference, so the best option was to remove all rows with a value of -999. For Zion, 98.9% of the data was retained after removing the incorrect values, and there were still over 60,000 time stamps. After removing the incorrect values, summary statistics for each park were examined again, and no further incorrect values or outliers were detected. An example of the yearly ozone average for Zion National Park is shown in Figure 21. Is air quality in Zion impacted by the number of visitors the park had in that year or the previous year? Figure 22 shows the number of visitors each year in Zion. It is hard to tell whether air quality is impacted by visitors. The graphs seem to follow the same general trend, but a regression model might be a better way of deciding whether the number of visitors impacts air quality.
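The removal of the -999 values, the retention percentage, and the yearly averages can be sketched as follows. The timestamp format "%m/%d/%Y %H:%M" is an assumption about the downloaded files, and the readings are made up.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

MISSING = -999  # sentinel value treated as an incorrect reading


def drop_missing(readings):
    """Remove sentinel readings and report the percentage of rows retained."""
    kept = [(stamp, value) for stamp, value in readings if value != MISSING]
    retained_pct = 100 * len(kept) / len(readings)
    return kept, retained_pct


def yearly_average(readings, date_format="%m/%d/%Y %H:%M"):
    """Average ozone value per calendar year."""
    by_year = defaultdict(list)
    for stamp, value in readings:
        by_year[datetime.strptime(stamp, date_format).year].append(value)
    return {year: mean(values) for year, values in sorted(by_year.items())}


# Made-up hourly readings in ppb.
readings = [
    ("01/01/2016 00:00", 30.0),
    ("01/01/2016 01:00", -999.0),
    ("01/01/2017 00:00", 40.0),
    ("01/01/2017 01:00", 50.0),
]
kept, retained = drop_missing(readings)
print(retained)
print(yearly_average(kept))
```

Computing the averages only after the sentinel rows are gone matters, because a single -999 would pull a yearly mean far below any physically possible value.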