data.table is a package in R that provides an enhanced version of data.frame.
It is designed to be fast and memory efficient.
data.table vs dplyr
There is another package called dplyr that is also popular for data wrangling. But data.table is much faster than dplyr especially for large-scale data manipulation tasks.
See this for the speed comparison of dplyr and data.table.
This website compares dplyr vs data.table side by side.
If you already know dplyr syntax, this website would be helpful to understand data.table syntax.
Let’s use flights data, which is obtained from nycflights13.
To use the functionalities of the data.table package, we need to convert the data class to data.table class.
use setDT() function from data.table package to convert data.frame class to data.table class.
General data.table syntax
The general form of data.table syntax is
i: specifies which rows to be subsetted
j: specifies the operations on selected columns
by: specifies the variable to be used as groups by which operations specified in j are implemented
Simply put,
Take data.table (named DT), subset/reorder rows using i, then calculate j, grouped by by.
Note
You can do lots of tasks by combining i, j, and by arguments!
This is a stark difference from dplyr package, which requires you to use a specific function for a specific task.
Using data.table syntax, we will see how to:
subset rows
select columns, compute on the selected columns, create a new column
Example: Suppose we want to select dep_time column. Since we don’t subset rows, we leave i argument blank.
Note
If we wrap the variables (column names) within list(), which ensures that a data.table is returned.
.() is a shorthand for list() in data.table syntax.
Important: In data.table, each column is internally defiled as a list. As long as you use .() (or list()) in j expression, each element of the list will be converted to a column and results in the resulting data.table.
You can select multiple columns just like you did to select a single column.
3. Compute on Columns
Basics
data.table syntax: DT[i, j, by]
j not only allows you to select columns but also to compute on columns.
Example
Let’s count the number of trips which have had total delay < 0 (i.e., total day = dep_delay + arr_delay).
In this code, we used:
i: no action (all rows)
j: to take the sum of the logical vector arr_delay + dep_delay < 0
by: no action (no grouping)
Since, we skip the i expression, you need to put a comma before j expression.
Using i and j expressions together, you can perform calculations on the selected columns of the subsetted rows.
Example
How many flights were there with “JFK” as the origin airport in June?
In this code, we used:
i : to select rows where origin airport equals “JFK”, and month equals 6.
j : to count the number of rows in the subsetted data.
`by : no action (no grouping)
You can provide names to the value we calculated if you want.
Recall that .() is a shorthand for list() in data.table syntax (see this slide). You can name multiple elements like you name each element in the list object.
Example
How many flights were there with “JFK” as the origin airport in June, and what is the average departure delay?
Here is how you define multiple variables at the same time.
Caution
data.table’s := operator doesn’t allow referencing recently-added/changed variables within the same [-expression. To do what you want, you need another [.
Example
Let’s create new columns: (1) c by adding a and b, and (2) d by dividing c by a.
Using i and j expressions together, you can change the column values for rows that satisfy certain conditions.
Example:
Tip: copy() function
If you want to keep the original data, you can create a copy of the dataset using data.table::copy() function.
The object created by copy() is independent of the original dataset in the sense that actions on one of them do not affect the other.
For each month and each carrier, calculate the total number of flights, average departure delay, and average arrival delay.
(Optional) Define seasons (Winter: Dec-Feb, Spring: Mar-May, Summer: Jun-Aug, Fall: Sep-Nov) and summarize the total number of flights, average departure delay, and average arrival delay for each season and each carrier.
For each month and each carrier, calculate the total number of flights, average departure delay, and average arrival delay. .
(Optional) Define seasons (Winter: Dec-Feb, Spring: Mar-May, Summer: Jun-Aug, Fall: Sep-Nov) and summarize the total number of flights, average departure delay, and average arrival delay for each season and each carrier.
Note: I used fcase() function of data.table package to define seasons. fcase() is like case_when() in dplyr. It is useful when you want to define a variable that takes different values based on conditions.
Summary
Until now, we have learned the basic operations of the data.table package.
Don’t mind the details. Here are the things that I want you to know:
The general data.table syntax: DT[i, j, by]. i for rows, j for columns, and by for groups.
Any tasks related to rows can be done in i expression.
You can subset rows by specifying conditions in i expression.
Any tasks related to columns can be done in j expression.
You can select rows, compute on columns (don’t forget to use .()), and create new columns with :=.
Finally, you can perform grouped operations by specifying group in by argument.
With these skills, you can do many tasks related to data manipulation with the data.table package.
Here, each state (observational unit) appears only once in the first column, and their annual record of yield and rainfall are spread across the columns.
We can convert one format to another using dcast() and melt() functions of data.table package.
Let’s use the following small dataset to understand how dcast() and melt() work.
Use dcast() function converts long form to wide form
Basic Syntax:
LHS: set of id variables (variables (columns) that you don’t want change).
RHS: set of variables to be used as the column index.
value.var: set of variables whose values will be filled to cast.
Example:
Suppose that we want to collect all yield and rainfall observations corresponding to each state under the same row.
Tip
Before writing your code to reshape the data, it is important to imagine what your desired data format looks like.
I often sketch out an example.
This helps me to understand what variables I need to use as LHS, RHS, and value.var.
Use melt() function to convert wide form to long form
Basic Syntax:
id.vars: the set of id variables (variables (columns) that you don’t want change).
measure.vars: the set of columns you want to collapse (or combine) together.
value.name: (optional) the name of the new column that will store the values of the variables in measure.vars, the default is value
Example:
Let’s get back to the original data format yield_data_long from yield_data_wide.
Note however that year information from the variable names are lost. In the resulting dataset, variable == 1 and variable == 2 correspond to 2019 and 2020, respectively. So, you need an additional step to recover the original long data format.
In my personal experience, I often need to reshape data when:
1. I want to aggregate and summarize data efficiently.
Example: yield_data_long is easier to calculate the average yield and rainfall by state than yield_data_wide.
2. I want to visualize data efficiently.
We will see some examples when we learn data visualization with ggplot2 package.
It is very common that you have data stored in separate files, and you need to combine them before you conduct any statistical analysis.
Example
If you are interested in how weather affects crop yield, you want to have weather and production data in a single dataset. However, since weather and production data are usually obtained from a different source of data (e.g., yield data from USDA-NASS, and weather data from NASS), they are stored in two separate data files.
You can use the merge() function from the data.table package to merge two datasets.
Basic Syntax:
x, y: data tables.
by, which specifies variables that let you merge two datasets.
all.x = TRUE means that all rows from data1 are maintained in the merged dataset, and only matching rows from data2 are included (this is equivalent to left_join() in dplyr).
Let’s play around with the merge() function using the following small data.
Data 1
Data2
Note that states included in yield_data and weather_data are slightly different. This is to show how the merge() function works when there are unmatched rows in the two datasets.
To merge these two datasets, state works because we need to use state as a key column.
(1) merge weather_data to yield_data, keeping all rows from yield_data.
(2) merge yield_data to weather_data, keeping all rows from weather_data.
(3) If you want to keep all rows from both datasets, you can set all = TRUE.
In the flights data, the carrier column contains two-letter codes for airlines. Let’s translate these codes into the full name of the airline.
Airline data from nycflights13 package contains the full name of the airline corresponding to the two-letter code. The following code loads the airline data.
Merge flights and airlines data, keeping all rows from the flights data. Which variable should be used as a key column?
Obviously, flights data is the main data, so we should keep all rows from the flights data.
The key column should be carrier because it is the common variable in both datasets, and it gives one-to-one correspondence between the two datasets.
%>% a special symbol in R, called a pipe operator. It comes from the magrittr package.
It’s a powerful tool to write linear sequence of operations in a more readable way.
Note
When you load the dplyr package, magrittr package is automatically loaded as well. So, you don’t need to load the magrittr package separately to use %>%.
%>% takes the output of the code on its left and feeds it as the first argument to the function on its right.
Example 1
fun1(input1)
is the same as
input1 %>%fun1()
Example 2
output1 <-fun1(input1)output2 <-fun2(output1)
is the same as
output2 <-fun1(input1) %>%fun2()
Shortcut for the pipe operator %>%
In RStudio, hit Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac)
Suppose you have more than one argument for the function like this:
fun(x1, x2, x3)
Then
z %>%fun(x2, x3)
is equivalent to
fun(z, x2, x3)
Important
That is, by default, an R object that precedes the piping operator (%>%) becomes the first argument of the function that comes after the piping operator.
What if you want to use the object defined before %>% as the second or third argument of the subsequent function?
You can refer the preceding object by . in the subsequent function.
Example
Tip
Whenever you use %>%, I recommend you always use . in the subsequent function to explicitly denote the destination of the object defined before %>% even if it is the first argument.
Example
Without %>%
With %>%
Important
Note that the order of execution is the same as the order in which the functions are written.
Summary:
The topics in the second part of this lecture were a bit advanced. You don’t need to remember the syntax of all functions we learned today.
For example, I still cannot remember which function (decast() or melt()) reshapes the data from wide to long. I always need to google it.
But, I want you to know the following things at least.
Key Takeaways
You can reshape data with dcast() and melt() functions. Depending on the situation, the wide or long format may be more suitable for data analysis.
You can merge data with merge() function. To do so, you need to have a common key column in both datasets.
You don’t need to use %>%operator, but it would be convenient if you know how to use it.
Find the flight company with the longest departure delay. (Hint: use max() function to find the maximum value of dep_delay column)
Subset the information of flights that headed to MSP (Minneapolis-St Paul International Airport) in February. Let’s name it “msp_feb_flights”. How many flights are there?
Calculate the median, interquartile range (\(IQR = Q3 − Q1\)) for arr_delays of flights in in the msp_feb_flights dataset and the number of flights, grouped by carrier. Which carrier has the most variable arrival delays?
Hints
IQR = Q3 − Q1 (the difference between the 75th percentile and the 25th percentile.) Use quantile() function to calculate the quantiles.
If you were selecting an airport simply based on on-time departure percentage, which NYC airport would you choose to fly out of? To address this question, first, define a new variable which indicates on-time departure. On-time-departure can be defined as a departure delay of less than or equal to 0. Then, calculate the on-time departure rate for each airport.
For this exercise problem, we will use journal data from the AER package. First, load the data and convert it to data.table object using setDT function (or. as.data.table()). Take a look at the data. Also, type ?journal to see the description of the data.
Calculate the average number of pages and submission delay for the entire dataset.
Show the title, citations, price, and subs columns for the top 5 journals (title) with the highest number of citations (citations). (Hint: use order() function to sort the data by citations in descending order.)
This dataset is created in the year 2000. Calculate the age (age) of each journal by subtracting the start year (foundingyear) of the journal from 2000. Select the columns, price, subs, citations, and pages, and age. Use that data to create a correlation matrix between those variables using the cor() function. (Hint: use this syntax: cor(data)). Can you find anything interesting from the correlation matrix?