Department of Applied Economics, University of Minnesota
To run a code chunk, press command + enter (Mac) or Control + enter (Windows) on your keyboard. Alternatively, you can click the “Run Code” button on the top left corner of the code chunk.
Note
Rules
R is an object-oriented programming (OOP) language, which basically means: “Everything is an object and everything has a name.”
You can assign information (numbers, characters, data) to an object with <- or = (e.g., object_name <- value) and reuse it later.
x <- 1 assigns 1 to an object called x.
If you assign information to an object name you have already used, the existing object with that name will be overwritten.
Once objects are created, you can evaluate them to see what’s inside.
Example
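As a minimal sketch of the rules above (the object names x and y are only illustrative), assignment, evaluation, and overwriting look like this:

```r
# Assign the value 1 to an object called x
x <- 1

# Evaluate the object to see what is inside
x

# Reuse the object in a later calculation
y <- x + 2

# Assigning to a name that already exists overwrites the old value
x <- "now a character"
x
```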
Rules
Object names cannot contain spaces; use _ or . in the object name instead.
Rules
R has a lot of packages that provide additional functions and data. To use the functions in the package:
You need to install the package with install.packages("package_name"). (You need to do this only once.)
Whenever you want to use the functions in the package in the current R session, you need to load the package with library(package_name).
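The two steps can be sketched as follows. The example uses the tools package only because it ships with R and therefore needs no installation; that choice is an assumption made for illustration, and the install line is left commented out because it needs an internet connection.

```r
# Step 1 (only once): install the package
# install.packages("package_name")

# Step 2 (every R session): load the package
library(tools)

# file_ext() is a function provided by the tools package
ext <- file_ext("flight.rds")
ext
```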
You will see the error could not find function "xxxx" if you forget to load the package.
These are the basic data elements in R.
| Data Type | Description | Example |
|---|---|---|
| numeric | General number, can be integer or decimal. | 5.2, 3L (the L makes it integer) |
| character | Text or string data. | "Hello, R!" |
| logical | Boolean values. | TRUE, FALSE |
| integer | Whole numbers. | 2L, 100L |
| complex | Numbers with real and imaginary parts. | 3 + 2i |
| raw | Raw bytes. | charToRaw("Hello") |
| factor | Categorical data. Can have ordered and unordered categories. | factor(c("low", "high", "medium")) |
Note
Use class() or is.XXX() to examine the data types.
You can convert one type of data to another type of data using the as.XXX() function.
Three conversion functions that are used often are:
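For instance, assuming the three functions meant here are as.numeric(), as.character(), and as.factor(), a quick sketch of checking and converting types could look like this (object names are illustrative):

```r
num <- 5.2
int <- 3L
chr <- "10"

# Examine data types
class(num)         # "numeric"
class(int)         # "integer"
is.character(chr)  # TRUE

# Convert between types
chr_as_num <- as.numeric(chr)    # character -> numeric
num_as_chr <- as.character(num)  # numeric -> character
size <- as.factor(c("low", "high", "medium", "low"))  # character -> factor

chr_as_num + 1
levels(size)
```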
Rules
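Logical values are TRUE, FALSE (and NA, which means “not available” or “undefined value”).
Logical values are typically produced by comparison operators: < (less than), > (greater than), <= (less-than-or-equal), >= (greater-than-or-equal), == (equal), and != (not-equal).
In arithmetic, TRUE is treated as 1 and FALSE is treated as 0.
Comparison operators: ==, !=, >, <, >=, <=.
Logical operators: & (and), | (or), and ! (not).
Key points
At this point, you should be able to:
Check the data type with the class() function.
Convert one data type to another with the as.XXX() function.
Depending on how the data is stored, R has several types of data structures.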
| Data Structure | Description | Creation Function | Example |
|---|---|---|---|
| Vector | One-dimensional; Holds elements of the same type. | c() | c(1, 2, 3, 4) |
| Matrix | Two-dimensional; Holds elements of the same type. | matrix() | matrix(1:4, ncol=2) |
| Array | Multi-dimensional; Holds elements of the same type. | array() | array(c(1:12), dim = c(2, 3, 2)) |
| List | Can hold elements of different types. | list() | list(name="John", age=30, scores=c(85, 90, 92)) |
| Data Frame | Like a table; Each column can hold different data types. This is the common data structure. | data.frame() | data.frame(name=c("John", "Jane"), age=c(30, 25)) |
Note
Here, we focus on how to create and how to use each of the data structures.
Basics
You can retrieve single or multiple elements of a vector by indexing with [] brackets. Inside [], you simply provide another vector containing the position of the element you want to extract.
If a vector has names, you can also use the name of the element to extract it.
To modify a specific element, you can assign a new value to the position you want to modify.
Example
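A short sketch of the basics above, using a made-up named vector called scores:

```r
# A named numeric vector (values are illustrative)
scores <- c(math = 85, english = 90, science = 92)

# Extract by position
scores[2]        # the second element
scores[c(1, 3)]  # the first and third elements

# Extract by name
scores["english"]

# Modify a specific element by assigning to its position
scores[2] <- 95
scores
```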
Example
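Indexing also works with a logical vector: elements in TRUE positions are kept and elements in FALSE positions are dropped. A minimal sketch, with an illustrative vector v:

```r
v <- c(3, 8, 1, 10, 5)

# A comparison returns a logical vector of the same length
is_large <- v > 4
is_large

# Use the logical vector as an index
large_values <- v[is_large]

# TRUE counts as 1, so sum() counts how many elements satisfy the condition
n_large <- sum(is_large)

# Conditions can be combined with & (and) and | (or)
middle_values <- v[v > 4 & v < 10]

# Logical indexing can also be used to modify elements
v[v < 4] <- 0
v
```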
The following figure explains the mechanism of logical indexing.

The following code randomly samples 30 numbers from a uniform distribution between 0 and 1, and stores the result in x.
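A minimal sketch of such code, assuming runif() is used; the seed value is an arbitrary choice added here so that the draw is reproducible:

```r
# Fix the random seed (arbitrary value, an assumption for reproducibility)
set.seed(123)

# Draw 30 numbers from a uniform distribution between 0 and 1
x <- runif(30, min = 0, max = 1)
x
```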
Questions
… x.
… x to 0.
If an element of x is larger than \(0.9\), replace it with \(1\).
The matrix() function is used to create a matrix.
Syntax
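A sketch of the general form, using the placeholder name vector_data for the input vector:

```r
vector_data <- 1:6

# matrix(data, nrow, ncol): values are filled column by column by default
m_by_col <- matrix(data = vector_data, nrow = 2, ncol = 3)
m_by_col

# Giving only ncol is enough here because length(vector_data) is a multiple of 3
m_auto <- matrix(vector_data, ncol = 3)

# byrow = TRUE fills the matrix row by row instead
m_by_row <- matrix(vector_data, nrow = 2, byrow = TRUE)
m_by_row
```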
Note
You need to provide vector_data and the number_of_rows and number_of_columns.
If the length of vector_data is a multiple of number_of_columns (or number_of_rows), R will automatically figure out how many rows (or columns) are needed.
Use byrow = TRUE to fill the matrix by row. By default, the values in vector_data are filled by column.
Again, you can access the elements of a matrix using [] brackets. But now you can specify both the row index and the column index.
Example
You can add column names and row names to a matrix using colnames() and rownames() functions. If a matrix has column names and row names, you can use the names as the index.
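A short sketch of both kinds of indexing on a small illustrative matrix m (the row and column names are made up):

```r
m <- matrix(1:6, nrow = 2)

# Positional indexing: m[row, column]
m[1, 2]  # element in row 1, column 2
m[2, ]   # all of row 2
m[, 3]   # all of column 3

# Add row and column names
rownames(m) <- c("a", "b")
colnames(m) <- c("x", "y", "z")

# Names can now be used as the index
m["a", "y"]
m[, c("x", "z")]
```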
Use the following matrix:
A data.frame class object is like a matrix, but each column can store a different type of data.
Syntax
Example
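A minimal sketch of creating a data.frame; the data frame name students and all of its values are made up for illustration:

```r
# General form: data.frame(column_name_1 = vector_1, column_name_2 = vector_2, ...)
students <- data.frame(
  name = c("John", "Jane", "Alex"),
  age = c(30, 25, 41),
  passed = c(TRUE, TRUE, FALSE)
)

# Each column keeps its own data type
str(students)

# Without column names, R generates names automatically
unnamed <- data.frame(c(1, 2), c("a", "b"))
names(unnamed)
```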
If column names are not provided, R will assign column names to those columns automatically.
Again, you can access the elements of a data.frame using [ ] (brackets) operator. But you need to specify the row and column index like you did in the matrix.
As an index vector, you can use a vector of logical values, column names, and positional index.
You can also extract specific column values using $ or [[ ]] operator.
$ and [[ ]] can only select a single column and return it as a vector, whereas [ ] can select multiple columns (see ?"$", ?"[" and ?"[[" for the differences between these operators).
Inside [[ ]] you provide the column name as a character.
Why do we need this? As you’ll see later, a vector is the most common input when doing basic calculations in R (mean, sum, sqrt, etc.), and operating on a whole vector at once is usually fast.
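The operators above can be compared side by side. This is a sketch on a made-up data frame called students:

```r
students <- data.frame(
  name = c("John", "Jane", "Alex"),
  age = c(30, 25, 41)
)

# [row, column] with positional indices
students[1, 2]

# Rows selected with a logical vector, all columns kept
older <- students[students$age > 26, ]

# Multiple columns selected by name: the result is still a data.frame
students[, c("name", "age")]

# $ and [[ ]] return a single column as a vector
ages_dollar <- students$age
ages_bracket <- students[["age"]]

# A vector can be passed directly to functions such as mean()
mean(ages_dollar)
```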
You can add a new column to a data.frame object using the $ operator.
Syntax
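A minimal sketch of the form data_frame$new_column <- vector_data, with made-up column names and values:

```r
students <- data.frame(name = c("John", "Jane"), age = c(30, 25))

# Add a new column whose length matches the number of rows
students$score <- c(85, 90)

# A single value is repeated for every row
students$cohort <- 2024

students
```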
The vector_data to be added should have the same length as the number of rows in the data.frame. A shorter vector is recycled only if its length divides the number of rows evenly; otherwise R returns an error.
We will use the built-in dataset mtcars for this exercise. Run the following code to load the data.
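A minimal way to do this, assuming the standard data() call for built-in datasets:

```r
# Load the built-in mtcars dataset into the workspace
data(mtcars)

# Look at the first few rows and the dimensions
head(mtcars)
dim(mtcars)
```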
Extract the rows for the cars in row numbers 1, 5, and 10 using numeric indexing.
Add a new column to the mtcars data frame called power_to_weight_ratio, which is calculated as the ratio of horsepower (hp) to weight (wt).
Create a new data frame called efficient_cars that contains cars with mpg greater than 20 and power-to-weight ratio less than 5.
(Optional) Sort the efficient_cars data frame by the power_to_weight_ratio column in ascending order and display the result. [Hints: (1) Use the order() function to sort the data frame. (2) Use order(efficient_cars$power_to_weight_ratio) as an index vector.]
The with() and within() functions are useful when you want to do some operation on a data.frame.
The with(data_frame, expression) function evaluates an expression, such as function(column1), in the context of a data.frame.
The within() function is similar to with(), but it returns a modified copy of the data.frame, which you can assign back to keep the changes.
With these functions, you can avoid repeatedly typing the data frame name and the $ sign.
Example
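A short sketch using the built-in mtcars data; the new column name wt_kg is made up for illustration (wt is recorded in units of 1000 lbs):

```r
# with(): refer to columns directly instead of writing mtcars$mpg, mtcars$cyl
mpg_4cyl <- with(mtcars, mean(mpg[cyl == 4]))

# The same calculation with $ for comparison
mpg_4cyl_dollar <- mean(mtcars$mpg[mtcars$cyl == 4])

# within(): returns a copy of the data frame with the new column added
cars_kg <- within(mtcars, {
  wt_kg <- wt * 453.592
})

head(cars_kg)
```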
A list in R can store elements of different types and sizes, including numbers, characters, vectors, matrices, data frames, and even other lists.
You can create a list using the list() function.
You can access list elements using $ and [ ] or [[ ]] brackets.
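A minimal sketch of creating a list and accessing its elements, with made-up contents:

```r
person <- list(name = "John", age = 30, scores = c(85, 90, 92))

# [ ] returns a list containing the selected elements
scores_as_list <- person["scores"]
class(scores_as_list)

# [[ ]] and $ return the element itself
scores_vector <- person[["scores"]]
class(scores_vector)

person$name
person[[2]]
```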
The [ ] operator returns a list of the selected elements.
The [[ ]] and $ operators return any single element as it is. $ can be used only when the list has names.
Here are the key points I want you to know:
Key Points
How to create a vector, matrix, data.frame, and list in R.
How to access their elements with the [ ], $, and [[ ]] operators.
For the calculation of remainder and quotient, you don’t need to remember the operator.
Arithmetic operations on vectors are performed element-wise (element by element in the same position).
If you want to do matrix multiplication, you need to use the %*% operator. Otherwise, it will be element-wise multiplication.
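The difference is easiest to see on small inputs. A sketch with illustrative values:

```r
# Remainder and integer quotient
7 %% 3   # remainder: 1
7 %/% 3  # quotient: 2

# Vector arithmetic is element-wise
a <- c(1, 2, 3)
b <- c(10, 20, 30)
a + b  # 11 22 33
a * b  # 10 40 90

# Matrices: * is element-wise, %*% is matrix multiplication
A <- matrix(1:4, nrow = 2)
B <- matrix(5:8, nrow = 2)
elementwise <- A * B
matrix_product <- A %*% B

elementwise
matrix_product
```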
R has two native data formats: .Rdata (or .rdata) and .Rds (or .rds).
.Rdata is used to save multiple R objects.
.Rds is used to save a single R object.
.Rdata format
load("path_to_Rdata_file")
save(object_name, file = "path_to_Rdata_file")
.Rds format
readRDS("path_to_Rds_file")
saveRDS(object_name, file = "path_to_Rds_file")
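A round trip through both formats can be sketched as below. The objects are made up, and temporary files stand in for real paths so the example does not touch any project folder.

```r
# Temporary file paths (stand-ins for real paths)
rds_path <- tempfile(fileext = ".rds")
rdata_path <- tempfile(fileext = ".Rdata")

scores <- c(85, 90, 92)
label <- "exam"

# .Rds: one object per file; readRDS() returns it, so you choose the name
saveRDS(scores, file = rds_path)
scores_copy <- readRDS(rds_path)

# .Rdata: several objects per file; load() restores them under their original names
save(scores, label, file = rdata_path)
rm(scores, label)
load(rdata_path)

scores
label
```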
To access the data file, you need to provide the path to the file (the location of the data file).
Example
Suppose that I want to load flight.rds in the Data folder. On my computer, the full path (i.e., absolute path) to the file is /Users/shunkeikakimoto/Dropbox/git/R_summer_2024/Data/flight.rds.
Problems
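You can check the current working directory with the getwd() function.
By default, R (e.g., when you open a standalone .R file) uses your home directory as the working directory.
Use setwd() to designate a directory as the working directory:
Example 1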
In my case, I set the working directory to the Data folder.
Now, R will look for the data file in the Data folder by default. So, I can load the data using relative path, not absolute path.
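The mechanism can be sketched without relying on any particular computer's folders. Here a temporary folder named Data and a file named demo.rds are hypothetical stand-ins for a real data folder and data file.

```r
# Create a stand-in Data folder and save a small file in it
data_dir <- file.path(tempdir(), "Data")
dir.create(data_dir, showWarnings = FALSE)
saveRDS(data.frame(id = 1:3), file.path(data_dir, "demo.rds"))

# Check the current working directory and remember it
old_wd <- getwd()

# Set the working directory to the Data folder
setwd(data_dir)

# A relative path is now resolved inside the Data folder
demo <- readRDS("demo.rds")

# Restore the previous working directory
setwd(old_wd)

demo
```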
Problems
setwd() relies on an absolute file path, which might vary by person (e.g., some people save the folder in Dropbox, others use Google Drive).
setwd() does not solve the second problem completely (i.e., if you are working with a team, the path to the data file is different for each person).
“R experts keep all the files associated with a project together — input data, R scripts, analytical results, figures. This is such a wise and common practice that RStudio has built-in support for this via projects.” - R for Data Science Ch 8.4
RStudio Projects
An RStudio project is a way to organize your work.
Once an R project is loaded, it automatically sets current working directory to the folder where that .Rproj file is saved (you don’t need to use setwd()!).
As long as the folder structure in the project folder is the same (relative path from the folder containing .Rproj file), you can share the code involving data loading with your team members.
Follow the steps illustrated in this document: R for Data Science Ch 8.4
Open the .Rproj file via Finder (Mac) or File Explorer (Windows).
Check the working directory with the getwd() function. You will see the path to the project folder.
Load the flight.rds data file with the readRDS() function.
Data often comes in other formats, such as .csv, .xls(x), and .dta.
Use read.csv() to read a .csv file.
Use read_excel() from the readxl package to read data sheets from an .xls(x) file.
Use the read.dta13() function from the readstata13 package to read a Stata data file (.dta).
Use the import() function of the rio package
The import() function from the rio package might be the most convenient way to load data in various formats.
Unlike read.csv() and read.dta13(), which specialize in reading a specific type of file, import() can load data from various sources.
In the Data folder, the flight data is saved in three different formats: flight.csv, flight.dta, and flight.xlsx. Let’s load the data using the import() function in RStudio.
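Before trying this on the flight files, the idea can be sketched on a small temporary .csv file. The file and its contents are made up, and the rio call is guarded so the code also runs when rio is not installed.

```r
# Write a small illustrative .csv file to a temporary location
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:2, carrier = c("AA", "DL")), csv_path, row.names = FALSE)

# Format-specific reader from base R
from_base <- read.csv(csv_path)

# General-purpose reader: import() picks the reader from the file extension
if (requireNamespace("rio", quietly = TRUE)) {
  from_rio <- rio::import(csv_path)
  print(from_rio)
}

from_base
```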
Save data in the .rds format.
saveRDS(object_name, path_to_save)
Reasons
If you work only in R, there is little reason to save the data in formats other than .rds.
The .rds format is more efficient in terms of saving and loading the data.
Check the size of the flight data files in different formats. Which one is the smallest?
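One way to compare file sizes is file.size(). The sketch below uses the built-in mtcars data and temporary files rather than the flight files, so the numbers it prints say nothing about the flight data itself.

```r
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# Save the same data in two formats
write.csv(mtcars, csv_path)
saveRDS(mtcars, rds_path)

# Compare the file sizes in bytes
sizes <- file.size(c(csv_path, rds_path))
names(sizes) <- c("csv", "rds")
sizes

# .rds keeps the R object intact, including column types and row names
restored <- readRDS(rds_path)
all.equal(restored, mtcars)
```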
Let’s try!
… flight data in the Data folder.
Important
An RStudio project (.Rproj) is a useful tool to organize your work. As long as the folder structure under the .Rproj is the same, you can share the code involving data loading with your team members.
To load data, use the readRDS() function for the .Rds (.rds) format.
Use the import() function from the rio package for various formats.
To save data, use the .rds format and the saveRDS() function.
Create a sequence of numbers from 20 to 50 and name it x. Let’s change the numbers that are multiples of 3 to 0.
sample() is commonly used in Monte Carlo simulation in econometrics. Run the following code to create r. What does it do? Use ?sample to find out what the function does.
Calculate the mean and standard deviation of r without using mean() and sd().
… r. (Use the which() function. Run ?which to find out what the function does.)
… r that are larger than 50.
… r that are larger than 40 and smaller than 60.
… r that are smaller than 20 or larger than 70.
Load nscg17small.dta. You can find the data in the Data folder.
The summary() function is useful. Use summary() to create a table of the descriptive statistics for salary. You’ll provide the salary column to summary() as a vector.
Calculate the standard score of hours worked (hrswk variable): \[Z = \frac{x - \mu}{\sigma}\], where \(Z = \text{standard score}\), \(x =\text{observed value}\), \(\mu = \text{mean of sample}\), and \(\sigma = \text{standard deviation of the sample}\).

| Function | Description |
|---|---|
| length() | get the length of a vector or list object |
| nrow(), ncol() | get the number of rows or columns |
| dim() | get the dimensions of the data |
| rbind(), cbind() | combine R objects by rows or columns |
| colMeans(), rowMeans() | calculate the mean of each column or row |
| with(), within() | avoid typing $ every time you access a column of the data.frame |
| ifelse() | create a binary variable |
| paste(), paste0() | concatenate strings |
| Function | Description |
|---|---|
| sum(), mean(), var(), sd(), cov(), cor(), max(), min(), abs(), round() | sum, mean, variance, standard deviation, covariance, correlation, maximum, minimum, absolute value, rounding |
| log() and exp() | logarithms and exponentials |
| sqrt() | compute the square root of a number |
| seq() | generate a sequence of numbers |
| sample() | randomly sample from a vector |
| rnorm() | generate random numbers from a normal distribution |
| runif() | generate random numbers from a uniform distribution |
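A few of the functions from the tables above in action. This is only a sketch: the inputs are made up and the seed is an arbitrary choice for reproducibility.

```r
set.seed(1)  # arbitrary seed (an assumption)

# Random numbers
normal_draws <- rnorm(5)   # 5 draws from a standard normal distribution
uniform_draws <- runif(5)  # 5 draws from a uniform distribution on [0, 1]

# Sequences and vector length
s <- seq(2, 10, by = 2)
length(s)

# ifelse(): create a two-category variable from a condition
level <- ifelse(s > 5, "high", "low")

# paste0(): concatenate strings without a separator
ids <- paste0("id_", 1:3)

# cbind(), dim(), colMeans()
m <- cbind(a = 1:3, b = 4:6)
dim(m)
col_means <- colMeans(m)

level
ids
col_means
```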