Day 2: Data Wrangling with the `data.table` package

The R package data.table provides the data.table class, an extension of the data.frame class with many additional features designed for handling large datasets. It is built for speed and memory efficiency. The dplyr package (another popular R package for data wrangling) uses a separate function for each task. In contrast, data.table uses one compact and consistent syntax, DT[i, j, by] (which rows, what to compute, grouped by what), so you can do almost everything with just a few commands.
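As a rough illustration of that syntax, here is a minimal sketch. The `flights` table and its column names are made up for this example and are not part of the lecture materials.

```r
library(data.table)

# Hypothetical toy data, invented for illustration only
flights <- data.table(
  carrier = c("AA", "AA", "UA", "UA", "DL"),
  origin  = c("JFK", "LGA", "JFK", "JFK", "LGA"),
  delay   = c(10, -5, 30, 0, 15)
)

# General form: DT[i, j, by]
#   i  : which rows to keep      (origin == "JFK")
#   j  : what to compute         (mean of delay)
#   by : grouped by what         (carrier)
jfk_mean <- flights[origin == "JFK", .(mean_delay = mean(delay)), by = carrier]
print(jfk_mean)
```

A single bracket call filters, summarizes, and groups in one step, which is why a few commands cover most wrangling tasks.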

In this lecture, we will learn basic data wrangling skills with the data.table package. I will also briefly introduce the %>% operator of the magrittr package as a tool to make your code more concise and readable.
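To give a first impression of what the %>% operator does, here is a small sketch. The `scores` table is a hypothetical example, not data used in the lecture.

```r
library(data.table)
library(magrittr)

# Hypothetical toy data, invented for illustration only
scores <- data.table(
  student = c("a", "b", "c", "d"),
  score   = c(55, 80, 72, 91)
)

# Without the pipe: nested calls must be read from the inside out
nested <- round(mean(scores[score >= 60, score]), 1)

# With the pipe: each step passes its result to the next, left to right
piped <- scores[score >= 60, score] %>%
  mean() %>%
  round(1)

# A data.table can also be piped into another bracket call,
# using the dot (.) as a placeholder for the left-hand side
piped_dt <- scores %>%
  .[score >= 60] %>%
  .[, .(mean_score = mean(score))]
```

All three versions compute the same average. The piped versions list the steps in the order they are carried out.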

Learning Objectives

To be able to perform basic data wrangling tasks with the data.table package:
- subset rows
- select and compute on columns
- rename columns
- perform aggregations by group
- merge multiple datasets
- reshape data from wide to long and from long to wide

To be able to use the %>% operator of the magrittr package.
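The data.table tasks listed above can be sketched roughly as follows. The `sales` and `stores` tables and all column names are hypothetical and chosen only for illustration. The lecture slides may use different data and variations of these calls.

```r
library(data.table)

# Hypothetical toy tables, invented for illustration only
sales <- data.table(
  store   = c("A", "A", "B", "B"),
  year    = c(2020, 2021, 2020, 2021),
  revenue = c(100, 120, 80, 95),
  cost    = c(70, 75, 60, 65)
)
stores <- data.table(store = c("A", "B"), region = c("East", "West"))

# 1. Subset rows: the condition goes in the first argument (i)
sales_2021 <- sales[year == 2021]

# 2. Select and compute on columns: use .() in the second argument (j)
profit_only <- sales[, .(store, year, profit = revenue - cost)]
# := adds a column to the existing table by reference (no copy is made)
sales[, profit := revenue - cost]

# 3. Rename columns (also by reference)
setnames(sales, old = "revenue", new = "rev")

# 4. Aggregate by group
by_store <- sales[, .(total_profit = sum(profit)), by = store]

# 5. Merge two datasets on a shared key column
merged <- merge(sales, stores, by = "store")

# 6. Reshape: wide-to-long with melt(), long-to-wide with dcast()
long <- melt(
  sales,
  id.vars       = c("store", "year"),
  measure.vars  = c("rev", "cost"),
  variable.name = "item",
  value.name    = "amount"
)
wide <- dcast(long, store + year ~ item, value.var = "amount")
```

Note that `:=` and `setnames()` modify `sales` in place, so no reassignment with `<-` is needed for those steps.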

Preparation

This is the official website for the data.table package. In particular, I recommend reading the Introduction section.
Section 18 (Pipes) of the book R for Data Science is a good introduction to the %>% operator of the magrittr package.

Lecture Slides

Click here for Lecture 2’s slides.

Exercise problems: Exercise problems for Lecture 2. Solutions are attached.


Supplementary Materials

data.table homepage.

Using .SD for Data Analysis (We will not cover this topic in class, but knowing how to use .SD in data.table is useful.)
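For a quick idea of what .SD does, here is a minimal sketch with a hypothetical table `dt`. Inside a data.table call, .SD ("Subset of Data") refers to the data for the current group, restricted to the columns named in .SDcols.

```r
library(data.table)

# Hypothetical toy data, invented for illustration only
dt <- data.table(
  group = c("x", "x", "y", "y"),
  a     = c(1, 2, 3, 4),
  b     = c(10, 20, 30, 40)
)

# Apply mean() to every column listed in .SDcols, separately for each group
means <- dt[, lapply(.SD, mean), by = group, .SDcols = c("a", "b")]
print(means)
```

This avoids writing out `mean(a)`, `mean(b)`, and so on when the same function is applied to many columns.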

The data.table package supports multithreading, which lets it process large datasets efficiently by using multiple CPU cores at the same time. To enable multithreading in data.table, you may need to do some extra configuration on your computer (at least for Mac users). Follow this instruction.
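Once the package is installed, you can check and adjust the number of threads from within R. This is a minimal sketch; the value 2 is an arbitrary choice for illustration, and on a machine where multithreading is not available data.table will keep using a single thread regardless of the requested number.

```r
library(data.table)

# Report how many threads data.table is currently using;
# verbose = TRUE also prints details about the multithreading setup
getDTthreads(verbose = TRUE)

# Request 2 threads (an arbitrary example value);
# setDTthreads() returns the previous setting invisibly
old_threads <- setDTthreads(2)
current_threads <- getDTthreads()

# Restore the previous setting
setDTthreads(old_threads)
```

If `getDTthreads()` keeps returning 1 on a multi-core machine, that is a sign the configuration step described above is still needed.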