Time Performance Data RStudio Project
Description
Use an R Notebook to produce your solutions to the following questions. Make sure to interpret your answers.
Create a dataset for the Bay Area similar to the existing nycflights13 dataset. The task is to find the original website for the Airline On-Time Performance Data and create a dataset called baflights20.
Questions to answer:
These questions should be answered in order, in your R Notebook.
1) The data can be downloaded from the US Government website Airline On-Time Performance Data. What government agency hosts this website, and how can you download the data? Download the data for the Bay Area airports (SFO, OAK, SJC) for the available months in 2020. Try to download the same columns as in the flights data frame in nycflights13. Can you do this? If not, what can you download? What must be done to produce the same variables and data for the Bay Area airports?
2) Now use the anyflights R package (also available through CRAN) to create the same data frames as in the nycflights13 dataset, but for the Bay Area airports in 2020. Name the dataset baflights20. The anyflights package currently has some open issues on GitHub, so the function that downloads the data does not work on all platforms (this might have been fixed in version 0.3.1). Run the fs::dir_ls("data") command to check that the files are in the data subdirectory.
3) Once you have downloaded your data, develop your code using the first month of data. The last step will be to include all of the data and perform an overall analysis for 2020. The data includes all flights that departed from the Bay Area airports: San Francisco (SFO), Oakland (OAK), and San Jose (SJC). How many departing flights were there in January 2020? How many departing flights were there from each airport in January 2020?
4) Compare the variables that are available in the baflights20 flights data frame with the variables in the nycflights13 flights data frame. Make a table of the variables that are in both datasets, with a description of each variable (an abbreviated codebook). Hint: In RStudio see Help > RMarkdown Quick Reference > Tables. Report any differences in the variables.
5) Which month had the highest proportion of cancelled flights? Which month had the lowest? Interpret any seasonal pattern.
6) Which plane made the most trips from Bay Area airports in 2020? Plot the number of trips over the year.
7) What is the oldest plane (identified by the tailnum variable) that flew from Bay Area airports? How many of the planes that flew from the Bay Area are included in the planes table?
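As a starting point for the download, counting, and cancellation questions above, here is a minimal sketch rather than a reference solution. It assumes that the dplyr package is installed, that anyflights() accepts a vector of airport codes together with a year and a dir argument (check ?anyflights for your installed version), and that a cancelled flight is recorded with a missing dep_time, as in nycflights13. The small toy_flights table is made up purely so the code runs offline; replace it with the flights data frame from baflights20 once the download works.

```r
library(dplyr)

# Guarded download: set to TRUE to fetch the real data (needs network access).
# The argument names are an assumption; verify them with ?anyflights.
run_download <- FALSE
if (run_download) {
  dir.create("data", showWarnings = FALSE)
  baflights20 <- anyflights::anyflights(c("SFO", "OAK", "SJC"), 2020, dir = "data")
  print(fs::dir_ls("data"))
}

# Hypothetical stand-in using column names from nycflights13::flights.
toy_flights <- tibble(
  month    = c(1, 1, 1, 1, 2, 2),
  origin   = c("SFO", "SFO", "OAK", "SJC", "SFO", "OAK"),
  dep_time = c(900, NA, 1230, 1545, NA, NA)
)

# Departures in January, overall and per airport.
january <- filter(toy_flights, month == 1)
n_january <- nrow(january)
by_airport <- count(january, origin, name = "n_flights")

# Proportion of cancelled flights per month
# (assumption: a missing dep_time marks a cancelled flight).
cancelled_by_month <- toy_flights %>%
  group_by(month) %>%
  summarise(prop_cancelled = mean(is.na(dep_time)), .groups = "drop") %>%
  arrange(desc(prop_cancelled))

print(n_january)
print(by_airport)
print(cancelled_by_month)
```

Developing on one month first, as the assignment suggests, keeps each step quick to check by eye before the same pipeline is applied to all of 2020.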
Answer the following questions for the Bay Area 2020 data. (Most of the questions asked cannot be answered due to the lack of data.)
8) What is the distribution of temperature in July? Identify any important outliers in the values of the wind_speed variable. What is the relationship between dewp and humid? What is the relationship between precip and visib? (There should be enough data to look for outliers in the wind_speed variable.)
9) How many planes have a missing date of manufacture? What are the five most common manufacturers? Has the distribution of manufacturers changed over time? (There should be enough data to look at visib over the months.)
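The outlier and manufacturer questions can be approached along the following lines. This is a sketch under stated assumptions, not the required method: it assumes dplyr is installed, that the weather and planes tables use the nycflights13 column names (wind_speed, year, manufacturer), and it uses the 1.5 * IQR rule as one common convention for flagging outliers. The toy_weather and toy_planes tables are invented so the code runs without the real data.

```r
library(dplyr)

# Hypothetical stand-ins; replace with the weather and planes tables of baflights20.
toy_weather <- tibble(
  month      = c(7, 7, 7, 7, 7, 7),
  wind_speed = c(5, 6, 7, 8, 9, 1000)
)
toy_planes <- tibble(
  tailnum      = c("N1", "N2", "N3", "N4", "N5"),
  year         = c(1998, NA, 2005, NA, 2015),
  manufacturer = c("BOEING", "BOEING", "AIRBUS", "EMBRAER", "BOEING")
)

# Flag wind_speed values more than 1.5 * IQR beyond the quartiles.
q <- quantile(toy_weather$wind_speed, c(0.25, 0.75), na.rm = TRUE)
fence <- 1.5 * (q[[2]] - q[[1]])
wind_outliers <- filter(
  toy_weather,
  wind_speed < q[[1]] - fence | wind_speed > q[[2]] + fence
)

# Planes with no recorded year of manufacture.
n_missing_year <- sum(is.na(toy_planes$year))

# Most common manufacturers: count, sort by frequency, keep at most five.
top_manufacturers <- toy_planes %>%
  count(manufacturer, sort = TRUE) %>%
  head(5)

print(wind_outliers)
print(n_missing_year)
print(top_manufacturers)
```

A flagged value is only a candidate: whether a wind speed is a recording error or a real extreme still needs to be interpreted in your answer.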