Day 2: Data Wrangling with the data.table package
The R package data.table extends the data.frame class with additional functionality designed to handle large datasets quickly and memory-efficiently. Unlike dplyr (another popular R package for data wrangling), which provides a separate function for each task, data.table expresses most operations through one consistent syntax, DT[i, j, by], so you can do almost everything with just a few commands.
In this lecture, we will learn basic data wrangling skills with the data.table package. I will also briefly introduce the %>% operator of the magrittr package as a tool to make your code more concise and readable.
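To give a feel for what "a few commands" means, here is a minimal sketch of the general form DT[i, j, by], followed by the same operation chained with %>%. The table dt and its columns (id, group, value) are made up for illustration and are not taken from the lecture slides.

```r
library(data.table)
library(magrittr)

# Toy data (hypothetical, for illustration only)
dt <- data.table(
  id    = 1:6,
  group = c("a", "a", "b", "b", "c", "c"),
  value = c(10, 20, 30, 40, 50, 60)
)

# i: subset rows
big <- dt[value > 20]

# j: compute on columns
total <- dt[, sum(value)]

# by: aggregate by group
means <- dt[, .(mean_value = mean(value)), by = group]

# i, j, and by combined in a single call
means_big <- dt[value > 20, .(mean_value = mean(value)), by = group]

# The same steps written as a pipeline; "." stands for the
# object passed along by %>%
piped <- dt %>%
  .[value > 20] %>%
  .[, .(mean_value = mean(value)), by = group]
```

Both means_big and piped should contain the same rows; the pipe version simply spells out the steps one at a time, which some people find easier to read.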
Learning Objectives
- To be able to use basic data wrangling skills with the data.table package:
  - subset rows
  - select and compute on columns
  - rename columns
  - perform aggregations by group
  - merge multiple datasets
  - reshape wide-to-long and long-to-wide
- To be able to use the %>% operator of the magrittr package.
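As a rough preview of the renaming, merging, and reshaping objectives, the sketch below uses setnames(), merge(), melt(), and dcast() from data.table. The tables yield and region and all of their values are hypothetical; the slides may use different data and arguments.

```r
library(data.table)

# Toy data (hypothetical, for illustration only)
yield <- data.table(
  county = c("A", "B", "C"),
  y2020  = c(150, 160, 170),
  y2021  = c(155, 158, 175)
)
region <- data.table(
  county = c("A", "B", "C"),
  state  = c("NE", "NE", "KS")
)

# Rename columns in place (no copy of the table is made)
setnames(yield, old = c("y2020", "y2021"), new = c("2020", "2021"))

# Merge two datasets on a shared column
merged <- merge(yield, region, by = "county")

# Reshape wide-to-long: one row per county-year
long <- melt(
  merged,
  id.vars       = c("county", "state"),
  variable.name = "year",
  value.name    = "yield"
)

# Reshape long-to-wide: back to one column per year
wide <- dcast(long, county + state ~ year, value.var = "yield")
```

Here long should have one row per county and year, and wide should recover the layout of merged.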
Preparation
- This is the official website for the data.table package. In particular, I recommend you take a look at the Introduction section.
- Section 18: Pipes in the book R for Data Science is a good introduction to the %>% operator of the magrittr package.
Lecture Slides
Click here for Lecture 2’s slides.
Exercise problems: Exercise problems for Lecture 2. Solutions are attached.
Supplementary Materials
Using .SD for Data Analysis (We will not cover this topic in class, but it is useful to know how to use .SD in data.table.)
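For a quick idea of what .SD does, here is a minimal sketch. Inside j, .SD refers to the subset of data for the current group, and .SDcols limits which columns it contains. The table dt and its columns are hypothetical.

```r
library(data.table)

# Toy data (hypothetical, for illustration only)
dt <- data.table(
  group = c("a", "a", "b", "b"),
  x     = c(1, 2, 3, 4),
  y     = c(10, 20, 30, 40)
)

# Apply the same function to several columns within each group
means <- dt[, lapply(.SD, mean), by = group, .SDcols = c("x", "y")]

# Take the first row of each group
first_rows <- dt[, .SD[1], by = group]
```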
The data.table package supports multithreading, which lets it handle large datasets efficiently by using multiple CPU cores simultaneously. Setting up multithreading in data.table requires some configuration on your computer (at least for Mac users). Follow these instructions.
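Once the setup is done, you can check and adjust the number of threads from within R. The sketch below uses getDTthreads() and setDTthreads() from data.table; the choice of 2 threads is arbitrary and only for illustration. If data.table was installed without OpenMP support, I would expect it to keep reporting a single thread no matter what you request.

```r
library(data.table)

# How many threads is data.table currently using?
n_before <- getDTthreads()

# Print details, including whether OpenMP support is available
getDTthreads(verbose = TRUE)

# Request 2 threads (arbitrary value; data.table caps the request
# at what your machine and installation allow)
setDTthreads(2)
n_now <- getDTthreads()

# Restore the previous setting
setDTthreads(n_before)
```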