Day 2: Data wrangling with data.table

Department of Applied Economics, University of Minnesota

Shunkei Kakimoto

Learning Objectives

    1. To be able to use the basic data wrangling skills with the data.table package:
    1. Learn how to use %>% operator of the magrittr package.


Note

  • %>% operator is not an essential topic, but it is useful if you know it.


Reference

Today’s outline:

  1. Data manipulation with data.table
  2. %>% operator
  3. Exercise Problems
  4. Appendix

Introduction to data.table

What is data.table?

  • data.table is a package in R that provides an enhanced version of data.frame.
    • It is designed to be fast and memory efficient.


data.table vs dplyr

There is another package called dplyr that is also popular for data wrangling. But data.table is much faster than dplyr especially for large-scale data manipulation tasks.

  • See this for the speed comparison of dplyr and data.table.
  • This website compares dplyr vs data.table side by side.
    • If you already know dplyr syntax, this website would be helpful to understand data.table syntax.
  1. Let’s use flights data, which is obtained from nycflights13.


  1. To use the functionalities of the data.table package, we need to convert the data class to data.table class.
  • use setDT() function from data.table package to convert data.frame class to data.table class.

General data.table syntax

The general form of data.table syntax is

  • i: specifies which rows to be subsetted
  • j: specifies the operations on selected columns
  • by: specifies the variable to be used as groups by which operations specified in j are implemented


Simply put,

Take data.table (named DT), subset/reorder rows using i, then calculate j, grouped by by.


Note

  • You can do lots of tasks by combining i, j, and by arguments!
  • This is a stark difference from dplyr package, which requires you to use a specific function for a specific task.

Using data.table syntax, we will see how to:

  • subset rows
  • select columns, compute on the selected columns, create a new column
  • perform aggregations by group

1. Subset Rows

  • data.table syntax: DT[i, j, by]
  • To subset rows, use condition in i expression. (e.g., DT[colA == condition1,])

Example

Let’s subset rows where carrier is “AA” (American Airlines)


In this code, we used:

  • i: to subset rows with origin == "JFK" & month == 6L
  • j: no action (all columns)
  • by: no action (no grouping)

Please Note

The important thing to remember is that you do any tasks related to rows in i.


Example

Sort flights first by month in ascending order, and then by day in descending order:

2. Select Columns

  • data.table syntax: DT[i, j, by]
  • To select columns, use j argument.

Example: Suppose we want to select dep_time column. Since we don’t subset rows, we leave i argument blank.


Note

  • If we wrap the variables (column names) within list(), which ensures that a data.table is returned.

  • .() is a shorthand for list() in data.table syntax.

  • Important: In data.table, each column is internally defiled as a list. As long as you use .() (or list()) in j expression, each element of the list will be converted to a column and results in the resulting data.table.


You can select multiple columns just like you did to select a single column.

3. Compute on Columns


Basics

  • data.table syntax: DT[i, j, by]
  • j not only allows you to select columns but also to compute on columns.


Example

Let’s count the number of trips which have had total delay < 0 (i.e., total day = dep_delay + arr_delay).

In this code, we used:

  • i: no action (all rows)
  • j: to take the sum of the logical vector arr_delay + dep_delay < 0
  • by: no action (no grouping)

Since, we skip the i expression, you need to put a comma before j expression.

3. Compute on Columns of the Subsetted Rows

  • data.table syntax: DT[i, j, by]
  • Using i and j expressions together, you can perform calculations on the selected columns of the subsetted rows.


Example

How many flights were there with “JFK” as the origin airport in June?


In this code, we used:

  • i : to select rows where origin airport equals “JFK”, and month equals 6.
  • j : to count the number of rows in the subsetted data.
  • `by : no action (no grouping)
  • You can provide names to the value we calculated if you want.
  • Recall that .() is a shorthand for list() in data.table syntax (see this slide). You can name multiple elements like you name each element in the list object.


Example

How many flights were there with “JFK” as the origin airport in June, and what is the average departure delay?

  1. Calculate the average arrival and departure delays for all flights with “JFK” as the origin airport in the month of August.
  • Hint:
    • you only need to use the following variables: origin, month, arr_delay, dep_delay
    • to calculate the average, use mean() function.


  1. Calculate the average arrival and departure delay for all flights with “JFK” as the origin airport in the month of August.

4. Create a New Column

  • data.table syntax: DT[i, j, by]
  • In j expression, you can add a new column to an existing data table using := operator.
    • Here, : represents the fixed values and = represents the assignment of values. So, they together represent the assignment of fixed values.


Syntax

# === Add one column === #
DT[, "new_column_name" := .(valueA)]

# or you can drop the quotes and `.()` for convenience
DT[, new_column_name := valueA]

Using the following simple example dataset, let’s create another column c by adding a and b column.


Important Rule:

:= creates new columns by updating the data in-place (by reference). Thus, the original data is altered.

4. Create a New Column (continued)

Here is how you define multiple variables at the same time.

Caution

  • data.table’s := operator doesn’t allow referencing recently-added/changed variables within the same [-expression. To do what you want, you need another [.

Example

  • Let’s create new columns: (1) c by adding a and b, and (2) d by dividing c by a.

Using i and j expressions together, you can change the column values for rows that satisfy certain conditions.


Example:


Tip: copy() function

  • If you want to keep the original data, you can create a copy of the dataset using data.table::copy() function.

  • The object created by copy() is independent of the original dataset in the sense that actions on one of them do not affect the other.

Create two new columns in the flights data:

  • total_delay:the sum of dep_delay and arr_delay.
  • speed: the ratio of distance to air_time (i.e, distance/air_time.)

5. Perform Aggregations by Group (Grouped Operations)

  • data.table syntax: DT[i, j, by]
  • To perform grouped operations, use by argument.


Example: Let’s find the number of flights by origin.


In this code, we used

  • i: no action (all rows)
  • j: count the number of rows in each group defined by by argument
  • by: group the data by origin

Nothing special. Just provide multiple columns to by argument.


Example: Find the average time of departure delay and arrival delay by carrier and origin.

Together with i argument, you can perform grouped operations for select observations.


Example 1: Get the number of flights for each origin airport for carrier code “AA” (American Airlines).

Note

  • i: subset rows where carrier is “AA”
  • j: count the number of rows in each group defined by by argument
  • by: group the data by origin


Example 2: Find the number of flights by origin and month for carrier code “AA” (American Airlines).

  1. For each month and each carrier, calculate the total number of flights, average departure delay, and average arrival delay.


  1. (Optional) Define seasons (Winter: Dec-Feb, Spring: Mar-May, Summer: Jun-Aug, Fall: Sep-Nov) and summarize the total number of flights, average departure delay, and average arrival delay for each season and each carrier.
  1. For each month and each carrier, calculate the total number of flights, average departure delay, and average arrival delay. .


  1. (Optional) Define seasons (Winter: Dec-Feb, Spring: Mar-May, Summer: Jun-Aug, Fall: Sep-Nov) and summarize the total number of flights, average departure delay, and average arrival delay for each season and each carrier.

Note: I used fcase() function of data.table package to define seasons. fcase() is like case_when() in dplyr. It is useful when you want to define a variable that takes different values based on conditions.

Summary

Until now, we have learned the basic operations of the data.table package.

Don’t mind the details. Here are the things that I want you to know:

  • The general data.table syntax: DT[i, j, by]. i for rows, j for columns, and by for groups.
  • Any tasks related to rows can be done in i expression.
    • You can subset rows by specifying conditions in i expression.
  • Any tasks related to columns can be done in j expression.
    • You can select rows, compute on columns (don’t forget to use .()), and create new columns with :=.
  • Finally, you can perform grouped operations by specifying group in by argument.


With these skills, you can do many tasks related to data manipulation with the data.table package.

Next, we will see a few advanced topics:

  • Reshaping Data
  • Merging Multiple Datasets
  • (and the %>% operator if we have time).

6. Reshape Data

Data often comes in two formats: wide or long.

Example:

Long data:

Here, each state (observational unit) is repeated several times in the first column.

      state  year yield rainfall
     <char> <num> <num>    <num>
1:   Kansas  2019   200       14
2:   Kansas  2020   240       15
3: Nebraska  2019   210       15
4: Nebraska  2020   220       16
5:     Iowa  2019   220       20
6:     Iowa  2020   230       21
7: Illinois  2019   190       24
8: Illinois  2020   150       15

Wide data

Here, each state (observational unit) appears only once in the first column, and their annual record of yield and rainfall are spread across the columns.

Key: <state>
      state yield_2019 yield_2020 rainfall_2019 rainfall_2020
     <char>      <num>      <num>         <num>         <num>
1: Illinois        190        150            24            15
2:     Iowa        220        230            20            21
3:   Kansas        200        240            14            15
4: Nebraska        210        220            15            16


  • We can convert one format to another using dcast() and melt() functions of data.table package.

Let’s use the following small dataset to understand how dcast() and melt() work.

  • Use dcast() function converts long form to wide form

Basic Syntax:

  • LHS: set of id variables (variables (columns) that you don’t want change).
  • RHS: set of variables to be used as the column index.
  • value.var: set of variables whose values will be filled to cast.


Example:

Suppose that we want to collect all yield and rainfall observations corresponding to each state under the same row.


Tip

  • Before writing your code to reshape the data, it is important to imagine what your desired data format looks like.
  • I often sketch out an example.
    • This helps me to understand what variables I need to use as LHS, RHS, and value.var.
  • Use melt() function to convert wide form to long form

Basic Syntax:

  • id.vars: the set of id variables (variables (columns) that you don’t want change).
  • measure.vars: the set of columns you want to collapse (or combine) together.
  • value.name: (optional) the name of the new column that will store the values of the variables in measure.vars, the default is value


Example:

Let’s get back to the original data format yield_data_long from yield_data_wide.

Note however that year information from the variable names are lost. In the resulting dataset, variable == 1 and variable == 2 correspond to 2019 and 2020, respectively. So, you need an additional step to recover the original long data format.

In my personal experience, I often need to reshape data when:

1. I want to aggregate and summarize data efficiently.

  • Example: yield_data_long is easier to calculate the average yield and rainfall by state than yield_data_wide.


2. I want to visualize data efficiently.

  • We will see some examples when we learn data visualization with ggplot2 package.

Using the following long-form data named long_dt, can you get yield_data_long?


7. Merge Multiple Datasets

It is very common that you have data stored in separate files, and you need to combine them before you conduct any statistical analysis.


Example

If you are interested in how weather affects crop yield, you want to have weather and production data in a single dataset. However, since weather and production data are usually obtained from a different source of data (e.g., yield data from USDA-NASS, and weather data from NASS), they are stored in two separate data files.


You can use the merge() function from the data.table package to merge two datasets.

Basic Syntax:

  • x, y: data tables.
  • by, which specifies variables that let you merge two datasets.
  • all.x = TRUE means that all rows from data1 are maintained in the merged dataset, and only matching rows from data2 are included (this is equivalent to left_join() in dplyr).

Note: The order of the datasets matter.

Let’s play around with the merge() function using the following small data.

Data 1

Data2

  • Note that states included in yield_data and weather_data are slightly different. This is to show how the merge() function works when there are unmatched rows in the two datasets.
  • To merge these two datasets, state works because we need to use state as a key column.

(1) merge weather_data to yield_data, keeping all rows from yield_data.


(2) merge yield_data to weather_data, keeping all rows from weather_data.


(3) If you want to keep all rows from both datasets, you can set all = TRUE.

  1. In the flights data, the carrier column contains two-letter codes for airlines. Let’s translate these codes into the full name of the airline.

Airline data from nycflights13 package contains the full name of the airline corresponding to the two-letter code. The following code loads the airline data.


Merge flights and airlines data, keeping all rows from the flights data. Which variable should be used as a key column?

  • Obviously, flights data is the main data, so we should keep all rows from the flights data.

  • The key column should be carrier because it is the common variable in both datasets, and it gives one-to-one correspondence between the two datasets.

Run the following code to create two datasets: yield_data and weather_data.


Merge these two datasets, keeping all rows from yield_data. Which variable should be used as a key column?

  • Here you should use state and year as key columns.

%>% operator

Motivation

  • In R, you need to assign the result of each operation to a new object if you want to use the result in the subsequent process.
  • But sometimes, some objects are just intermediate results that you don’t need to keep.

Example

Let’s create flights_mini data from flights data of nycflights13 package in the data.table format. Look at the following code:

The first three lines yield intermediate results to make the final flight_mini, and you don’t need to keep those.


You can create flights_mini without using those intermediate steps with the chaining operation in data.table package, but it’s hard to read!

Introduction

  • %>% a special symbol in R, called a pipe operator. It comes from the magrittr package.
  • It’s a powerful tool to write linear sequence of operations in a more readable way.


Note

When you load the dplyr package, magrittr package is automatically loaded as well. So, you don’t need to load the magrittr package separately to use %>%.

%>% takes the output of the code on its left and feeds it as the first argument to the function on its right.


Example 1

fun1(input1) 

is the same as

input1 %>% fun1()


Example 2

output1 <- fun1(input1)
output2 <- fun2(output1)

is the same as

output2 <- fun1(input1) %>% fun2()


Shortcut for the pipe operator %>%

  • In RStudio, hit Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac)

Suppose you have more than one argument for the function like this:

fun(x1, x2, x3)


Then

z %>% fun(x2, x3)

is equivalent to

fun(z, x2, x3)


Important

That is, by default, an R object that precedes the piping operator (%>%) becomes the first argument of the function that comes after the piping operator.

What if you want to use the object defined before %>% as the second or third argument of the subsequent function?

You can refer the preceding object by . in the subsequent function.

Example

Tip

  • Whenever you use %>%, I recommend you always use . in the subsequent function to explicitly denote the destination of the object defined before %>% even if it is the first argument.

Example

Without %>%

With %>%


Important

Note that the order of execution is the same as the order in which the functions are written.

Summary:

The topics in the second part of this lecture were a bit advanced. You don’t need to remember the syntax of all functions we learned today.

  • For example, I still cannot remember which function (decast() or melt()) reshapes the data from wide to long. I always need to google it.

But, I want you to know the following things at least.


Key Takeaways

  1. You can reshape data with dcast() and melt() functions. Depending on the situation, the wide or long format may be more suitable for data analysis.

  2. You can merge data with merge() function. To do so, you need to have a common key column in both datasets.

You don’t need to use %>%operator, but it would be convenient if you know how to use it.

Exercise Problems

Exercise 1

  1. Find the flight company with the longest departure delay. (Hint: use max() function to find the maximum value of dep_delay column)

  2. Subset the information of flights that headed to MSP (Minneapolis-St Paul International Airport) in February. Let’s name it “msp_feb_flights”. How many flights are there?

  3. Calculate the median, interquartile range (\(IQR = Q3 − Q1\)) for arr_delays of flights in in the msp_feb_flights dataset and the number of flights, grouped by carrier. Which carrier has the most variable arrival delays?

Hints

IQR = Q3 − Q1 (the difference between the 75th percentile and the 25th percentile.) Use quantile() function to calculate the quantiles.

Exercise 2

If you were selecting an airport simply based on on-time departure percentage, which NYC airport would you choose to fly out of? To address this question, first, define a new variable which indicates on-time departure. On-time-departure can be defined as a departure delay of less than or equal to 0. Then, calculate the on-time departure rate for each airport.


Exercise 3

For this exercise problem, we will use journal data from the AER package. First, load the data and convert it to data.table object using setDT function (or. as.data.table()). Take a look at the data. Also, type ?journal to see the description of the data.


  1. Calculate the average number of pages and submission delay for the entire dataset.
  2. Show the title, citations, price, and subs columns for the top 5 journals (title) with the highest number of citations (citations). (Hint: use order() function to sort the data by citations in descending order.)
  3. This dataset is created in the year 2000. Calculate the age (age) of each journal by subtracting the start year (foundingyear) of the journal from 2000. Select the columns, price, subs, citations, and pages, and age. Use that data to create a correlation matrix between those variables using the cor() function. (Hint: use this syntax: cor(data)). Can you find anything interesting from the correlation matrix?

Appendix

Useful functions

  • .N
  • copy()
  • setnames()
  • order()
  • shift()
  • duplicated(): find duplicates
  • unique(): find unique observations
  • fcase() is like case_when() in dplyr.

fcase()

  • fcase() is like case_when() in dplyr. It is useful when you want to define a variable that takes different values based on conditions.

  • fcase() function returns the first value for which the corresponding condition is TRUE. If no condition is TRUE, it returns the default value.


Example: Define seasons (Winter: Dec-Feb, Spring: Mar-May, Summer: Jun-Aug, Fall: Sep-Nov)