# === Load Packages === #
library(data.table)
library(dplyr) # mainly for the %>% operator
Lecture 2: Exercise Problems and Solutions
1 Exercise 1
Let’s use the flights data from the nycflights13 package. Run the following code to load the data.
flights <- data.table(nycflights13::flights)
- Find the flight company with the longest departure delay. (Hint: use the max() function to find the maximum value of the dep_delay column.)
- Subset the information of flights that headed to MSP (Minneapolis-St Paul International Airport) in February. Let’s name it “msp_feb_flights”. How many flights are there?
- Calculate the median and the interquartile range (\(IQR = Q3 − Q1\)) of arr_delay for the flights in the msp_feb_flights dataset, along with the number of flights, grouped by carrier. Which carrier has the most variable arrival delays?
IQR = Q3 − Q1 (the difference between the 75th percentile and the 25th percentile). Use the quantile() function to calculate the quantiles.
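The IQR definition above, and the role of missing values in max() and quantile(), can be sketched on a small vector. The numbers below are made up for illustration and are not taken from the flights data; the variable names are hypothetical.

```r
# Toy arrival delays in minutes (made-up values, including one missing value)
delays <- c(-5, 0, 3, 10, 42, NA)

# max() returns NA when the input contains NA, unless na.rm = TRUE
max_naive <- max(delays)
max_clean <- max(delays, na.rm = TRUE)

# quantile() needs na.rm = TRUE when the input contains NA
q <- quantile(delays, probs = c(0.25, 0.75), na.rm = TRUE)

# IQR = Q3 - Q1, computed from the two quantiles
iqr_manual <- unname(q["75%"] - q["25%"])

# Base R's IQR() uses the same default quantile type, so it should agree
iqr_builtin <- IQR(delays, na.rm = TRUE)
```

Because dep_delay and arr_delay in the flights data contain missing values, the same na.rm = TRUE argument is likely to matter when working through the parts above.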
# === Part 1 === #
# dep_delay contains NA (e.g., cancelled flights), so drop them when taking the max
flights[dep_delay == max(dep_delay, na.rm = TRUE), .(carrier)]
# === Part 2 === #
msp_feb_flights <- flights[dest=="MSP" & month==2L]
nrow(msp_feb_flights)
# === Part 3 === #
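# na.rm = TRUE is needed because arr_delay can be missing
msp_feb_flights[, .(
  median = median(arr_delay, na.rm = TRUE),
  IQR = quantile(arr_delay, 0.75, na.rm = TRUE) - quantile(arr_delay, 0.25, na.rm = TRUE),
  n_flights = .N
), by = carrier]
2 Exercise 2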
We will continue to use the flights data from the previous exercise.
If you were selecting an airport simply based on on-time departure percentage, which NYC airport would you choose to fly out of? To address this question, first define a new variable that indicates on-time departure. An on-time departure can be defined as a departure delay of less than or equal to 0. Then, calculate the on-time departure rate for each airport.
#| autorun: false
flights <- data.table(nycflights13::flights)
flights[, on_time := dep_delay <= 0] %>%
.[, .(on_time_rate = mean(on_time, na.rm = TRUE)), by = origin]
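The reason mean() gives a rate here is that a logical vector is treated as 0/1, so its mean is the share of TRUE values. A minimal sketch on a hypothetical toy table (the airport codes "AAA" and "BBB" and the delays are made up, not taken from nycflights13):

```r
library(data.table)

# Hypothetical toy data: two airports, one missing departure delay
toy <- data.table(
  origin    = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  dep_delay = c(-2, 5, NA, 0, -1)
)

# TRUE when the flight left on time or early; NA stays NA
toy[, on_time := dep_delay <= 0]

# Mean of a logical = share of TRUE values; na.rm = TRUE skips the missing flight
rates <- toy[, .(on_time_rate = mean(on_time, na.rm = TRUE)), by = origin]

# Sort so the airport with the highest on-time rate comes first
rates[order(-on_time_rate)]
```

Note that na.rm = TRUE drops flights with a missing delay from both the numerator and the denominator; whether that is the right treatment for cancelled flights is a modeling choice.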
3 Exercise 3
For this exercise problem, we will use the Journals data from the AER package. First, load the data and convert it to a data.table object using the setDT() function (or as.data.table()). Take a look at the data. Also, type ?Journals to see the description of the data.
#| autorun: false
# If you have not installed the package, run the following code
# install.packages("AER")
# load the package
library(AER)
# load the data from AER
data("Journals")
# To see the descriptions of the data,
# type `?Journals` in the console
?Journals
setDT(Journals)
- Calculate the average number of pages and submission delay for the entire dataset.
- Show the title, citations, price, and subs columns for the top 5 journals (title) with the highest number of citations (citations). (Hint: use the order() function to sort the data by citations in descending order.)
- This dataset was created in the year 2000. Calculate the age (age) of each journal by subtracting the start year (foundingyear) of the journal from 2000. Select the columns price, subs, citations, pages, and age. Use that data to create a correlation matrix between those variables using the cor() function. (Hint: use this syntax: cor(data).) Can you find anything interesting from the correlation matrix?
# === Part 1 === #
Journals[, .(
avg_pages = mean(pages, na.rm = TRUE),
avg_delay = mean(submission_delay, na.rm = TRUE)
)]
# === Part 2 === #
Journals[order(-citations),] %>%
.[, .(title, citations, price, subs)] %>%
.[1:5]
# === Part 3 === #
Journals[, age := 2000 - foundingyear] %>%
.[, .(price, subs, citations, pages, age)] %>%
cor(.)
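To help read the output of cor(), here is a small sketch on a hypothetical table with known relationships. The columns x, y, and z are made up for illustration and are unrelated to the Journals data.

```r
library(data.table)

# Hypothetical toy data: y rises exactly with x, z falls exactly with x
toy <- data.table(
  x = c(1, 2, 3, 4),
  y = c(2, 4, 6, 8),
  z = c(4, 3, 2, 1)
)

# cor() accepts a data frame of numeric columns and returns a symmetric matrix
cm <- cor(toy)

# Rounding makes a larger matrix easier to scan
round(cm, 2)
```

Each cell is the pairwise Pearson correlation between the row and column variables, so the diagonal is always 1. If the selected columns contained missing values, passing use = "complete.obs" to cor() would be one way to handle them.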