Lecture 2: Exercise Problems and Solutions

# === Load Packages === #
library(data.table)
library(dplyr) #mainly for %>% operator

1 Exercise 1

Let’s use the flights data from the nycflights13 package. Run the following code to load the data.

flights <- data.table(nycflights13::flights)

1. Find the flight company (carrier) with the longest departure delay. (Hint: use the max() function to find the maximum value of the dep_delay column.)
2. Subset the flights that headed to MSP (Minneapolis-St Paul International Airport) in February. Let’s name it “msp_feb_flights”. How many flights are there?
3. Calculate the median and interquartile range (\(IQR = Q3 - Q1\)) of arr_delay, along with the number of flights, for each carrier in the msp_feb_flights dataset. Which carrier has the most variable arrival delays?

Hints

IQR = Q3 − Q1 (the difference between the 75th percentile and the 25th percentile). Use the quantile() function to calculate the quantiles.
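The hint above can be tried on a small vector before applying it to the flights data. The following is a minimal sketch using a made-up vector x (an assumption for illustration, not part of the exercise data). It also compares the result with base R's IQR() function, which computes the same quantity.

```r
# Toy vector with one missing value (illustrative only, not from the flights data)
x <- c(1, 2, 3, 4, NA)

# quantile() stops with an error on NA input unless na.rm = TRUE is set
q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)

# IQR = Q3 - Q1
iqr_manual <- q[2] - q[1]

# Base R's IQR() should give the same value
iqr_builtin <- IQR(x, na.rm = TRUE)

print(c(manual = iqr_manual, builtin = iqr_builtin))
```

Setting names = FALSE keeps the result free of the "75%" label that quantile() would otherwise attach.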

# === Part 1 === #
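# dep_delay contains missing values, so na.rm = TRUE is needed for max() to return a number
flights[dep_delay == max(dep_delay, na.rm = TRUE), .(carrier)]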

# === Part 2 === #
msp_feb_flights <- flights[dest=="MSP" & month==2L]
nrow(msp_feb_flights)

# === Part 3 === #
# na.rm = TRUE guards against missing arr_delay values
msp_feb_flights[, .(
  median = median(arr_delay, na.rm = TRUE),
  IQR = quantile(arr_delay, 0.75, na.rm = TRUE) - quantile(arr_delay, 0.25, na.rm = TRUE),
  n_flights = .N
), by = carrier]
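One detail worth checking in Part 1 is how max() treats missing values. The sketch below uses a hypothetical three-row table named toy (not the flights data). It shows that max() returns NA when the column contains an NA, in which case the filter matches no rows. Passing na.rm = TRUE avoids this.

```r
library(data.table)

# Hypothetical toy data; the NA mimics a flight with no recorded departure delay
toy <- data.table(
  carrier   = c("AA", "UA", "DL"),
  dep_delay = c(10, NA, 45)
)

# Without na.rm, max() is NA, so the comparison is NA for every row
# and data.table selects nothing
no_match <- toy[dep_delay == max(dep_delay), .(carrier)]

# With na.rm = TRUE, the row with the largest observed delay is returned
longest <- toy[dep_delay == max(dep_delay, na.rm = TRUE), .(carrier)]

print(nrow(no_match))
print(longest)
```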

2 Exercise 2

We will continue to use the flights data from the previous exercise.

If you were selecting an airport based solely on on-time departure percentage, which NYC airport would you choose to fly out of? To address this question, first define a new variable that indicates on-time departure. An on-time departure can be defined as a departure delay of less than or equal to 0. Then, calculate the on-time departure rate for each airport.
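The indicator approach works because the mean of a logical vector equals the share of TRUE values. Here is a minimal sketch on hypothetical toy data. The airports "A" and "B" and their delays are made up for illustration.

```r
library(data.table)

# Hypothetical toy data: two airports, a few departure delays in minutes
toy <- data.table(
  origin    = c("A", "A", "A", "B", "B"),
  dep_delay = c(-3, 0, 12, 5, NA)
)

# Logical indicator: TRUE when the flight left on time or early
toy[, on_time := dep_delay <= 0]

# The mean of a logical vector is the share of TRUE values;
# na.rm = TRUE drops flights with no recorded delay
rates <- toy[, .(on_time_rate = mean(on_time, na.rm = TRUE)), by = origin]
print(rates)
```

Whether flights with a missing delay should be dropped or counted as not on time is a judgment call. This sketch drops them.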

flights <- data.table(nycflights13::flights)

flights[, on_time := dep_delay <= 0] %>%
  .[, .(on_time_rate = mean(on_time, na.rm = TRUE)), by = origin]

3 Exercise 3

For this exercise problem, we will use the Journals data from the AER package. First, load the data and convert it to a data.table object using the setDT() function (or as.data.table()). Take a look at the data. Also, type ?Journals to see the description of the data.
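The two conversion functions behave differently: setDT() converts the object in place, while as.data.table() returns a new object. The sketch below illustrates this on two hypothetical data frames, df1 and df2, which are not the Journals data.

```r
library(data.table)

# Hypothetical toy data frames (not the Journals data)
df1 <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))
df2 <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

# as.data.table() returns a new data.table and leaves df1 unchanged
dt1 <- as.data.table(df1)

# setDT() converts df2 itself, by reference, without making a copy
setDT(df2)

print(is.data.table(df1))
print(is.data.table(dt1))
print(is.data.table(df2))
```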

# If you have not installed the package, run the following code
# install.packages("AER")

# load the package
library(AER)
# load the data from AER
data("Journals")

# To see the descriptions of the data, 
# type `?Journals` in the console
?Journals

setDT(Journals)

1. Calculate the average number of pages and submission delay for the entire dataset.
2. Show the title, citations, price, and subs columns for the top 5 journals (title) with the highest number of citations (citations). (Hint: use the order() function to sort the data by citations in descending order.)
3. This dataset was created in the year 2000. Calculate the age (age) of each journal by subtracting the founding year (foundingyear) of the journal from 2000. Select the columns price, subs, citations, pages, and age. Use that data to create a correlation matrix between those variables using the cor() function. (Hint: use this syntax: cor(data).) Can you find anything interesting in the correlation matrix?
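The two hints above (sorting with order() and calling cor() on a table) can be sketched on hypothetical toy data. The table toy and its columns are made up for illustration, with y and z built so that their correlations with x are known in advance.

```r
library(data.table)

# Hypothetical toy data (not the Journals data)
toy <- data.table(
  name = c("a", "b", "c", "d", "e"),
  x    = c(1, 2, 3, 4, 5),
  y    = c(2, 4, 6, 8, 10),  # y = 2 * x, perfectly positively correlated with x
  z    = c(5, 4, 3, 2, 1)    # z = 6 - x, perfectly negatively correlated with x
)

# Sort by x in descending order and keep the top 2 rows
top2 <- toy[order(-x)][1:2, .(name, x)]
print(top2)

# cor() needs numeric columns only, so select them before calling it
cor_mat <- cor(toy[, .(x, y, z)])
print(cor_mat)
```

The result of cor() is a symmetric matrix with ones on the diagonal, so each pairwise correlation appears twice.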

# === Part 1 === #
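# Note: the AER help page for Journals does not appear to list a submission_delay
# column; check names(Journals) and adjust the column name if this line fails
Journals[, .(
  avg_pages = mean(pages, na.rm = TRUE),
  avg_delay = mean(submission_delay, na.rm = TRUE)
)]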

# === Part 2 === #
Journals[order(-citations),] %>%
  .[, .(title, citations, price, subs)] %>%
  .[1:5]
  
# === Part 3 === #
Journals[, age := 2000 - foundingyear] %>%
  .[, .(price, subs, citations, pages, age)] %>%
  cor(.)