Lecture 1: Exercise Problems and Solutions

1 Exercise: Vector

The following code randomly samples 30 numbers from a uniform distribution between 0 and 1, and stores the result in x.

# Run this code to work on the exercise problems.
set.seed(3746)
x <- runif(n = 30, min = 0, max = 1)
x # see what's inside x

Questions

Extract the 10th and the 15th elements of x.
Extract elements larger than $0.5$.
Replace the 10th and the 15th elements of x with 0.
If an element of x is larger than $0.9$, replace it with $1$.
Count the elements larger than $0.6$.

# === Part 1 === #
x[c(10, 15)]

# === Part 2 === #
x[x > 0.5]

# === Part 3 === #
x[c(10, 15)] <- 0

# === Part 4 === #
x[x > 0.9] <- 1

# === Part 5 === #
sum(x > 0.6)
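
Part 5 works because R treats TRUE as 1 and FALSE as 0 inside sum(). The following is a minimal sketch of that idea on a small hand-made vector; the vector v and the other object names are illustrative assumptions, not part of the exercise.

```r
# Illustrative vector (hypothetical values, not the sampled x above)
v <- c(0.2, 0.95, 0.61, 0.4, 0.7)

# A comparison returns a logical vector of the same length as v
is_large <- v > 0.6

# sum() counts the TRUE values because TRUE is coerced to 1
n_large <- sum(is_large)

# mean() of the same logical vector gives the share of TRUE values
share_large <- mean(is_large)

# ifelse() is an alternative to Part 4 that returns a new vector
# and leaves v itself unchanged
v_capped <- ifelse(v > 0.9, 1, v)
```

The same logical vector can be reused for counting, for computing a share, and for subsetting (v[is_large]).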

2 Exercise: Matrix

Use the following matrix:

set.seed(3746)
num <- runif(n = 30, min = 0, max = 1)
mat <- matrix(data = num, nrow = 6)
colnames(mat) <- c("A", "B", "C", "D", "E")
rownames(mat) <- c("a", "b", "c", "d", "e", "f")
mat # see what's inside mat

Extract the element in the 2nd row and 3rd column.
Extract the 2nd row.
Subset the rows where column “A” is larger than 0.5. (Use logical indexing.)

# === Part 1 === #
mat[2, 3]

# === Part 2 === #
mat[2, ]

# === Part 3 === #
mat[mat[, "A"] > 0.5, ]
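
One detail worth knowing about Part 3: when only one row satisfies the condition, R simplifies the result to a vector unless drop = FALSE is given. The sketch below uses a small hypothetical matrix m (not the mat from the exercise) to show the difference.

```r
# Small illustrative matrix (hypothetical values); matrix() fills column by column
m <- matrix(c(0.9, 0.1, 0.2,
              0.3, 0.4, 0.5),
            nrow = 3,
            dimnames = list(c("a", "b", "c"), c("A", "B")))

# Only row "a" has A > 0.5, so the default subsetting collapses to a named vector
one_row_vec <- m[m[, "A"] > 0.5, ]

# drop = FALSE keeps the result as a matrix with one row
one_row_mat <- m[m[, "A"] > 0.5, , drop = FALSE]

is.matrix(one_row_vec)
dim(one_row_mat)
```

Using drop = FALSE is a common choice when later code expects a matrix regardless of how many rows match.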

3 Exercise: Data Frame

We will use the built-in dataset mtcars for this exercise. Run the following code to load the data.

# --- Load data --- #
data(mtcars)
?mtcars # open the help page describing the mtcars dataset

# --- Take a look at the data --- #
# head() function shows the first several rows of the data
head(mtcars)

Extract the rows corresponding to the cars with the row numbers 1, 5, and 10 using numeric indexing.
Add a new column to the mtcars data frame called power_to_weight_ratio, which is calculated as the ratio of horsepower (hp) to weight (wt).
Create a new data frame called efficient_cars that contains cars with mpg greater than 20 and power-to-weight ratio less than 5.
(Optional) Sort the efficient_cars data frame by the power_to_weight_ratio column in ascending order and display the result. [Hints: (1) Use the order() function to sort the data frame. (2) Use order(efficient_cars$power_to_weight_ratio) as an index vector.]

# === Part 1 === #
mtcars[c(1, 5, 10), ]

# === Part 2 === #
mtcars$power_to_weight_ratio <- mtcars$hp / mtcars$wt

# === Part 3 === #
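efficient_cars <- mtcars[mtcars$mpg > 20 & mtcars$power_to_weight_ratio < 5, ]
# Note: wt is measured in 1000 lbs, so hp / wt is larger than 5 for every car in mtcars
# and this subset has zero rows. Check with nrow(efficient_cars).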

# === Part 4 === #
efficient_cars[order(efficient_cars$power_to_weight_ratio), ]
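
The order() pattern from Part 4 can be sketched on a subset that keeps only the mpg condition, an assumption made here so the example is guaranteed to have rows to sort. The names high_mpg, asc, and desc are illustrative.

```r
data(mtcars)
mtcars$power_to_weight_ratio <- mtcars$hp / mtcars$wt

# Inspect the range of the ratio before picking a threshold
range(mtcars$power_to_weight_ratio)

# Keep cars with mpg greater than 20 (illustrative subset)
high_mpg <- mtcars[mtcars$mpg > 20, ]

# order() returns the row positions that would sort the column;
# using them as a row index reorders the data frame
asc <- high_mpg[order(high_mpg$power_to_weight_ratio), ]

# decreasing = TRUE reverses the direction
desc <- high_mpg[order(high_mpg$power_to_weight_ratio, decreasing = TRUE), ]

head(asc)
```

Indexing with order() does not modify high_mpg itself; assign the result to a name if the sorted version is needed later.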

4 Exercise: Vector, Comprehensive

Create a sequence of numbers from 20 to 50 and name it x. Let’s change the numbers that are multiples of 3 to 0.
sample() is commonly used in Monte Carlo simulation in econometrics. Run the following code to create r. What does it do? Use ?sample to find out what the function does.

set.seed(12345) # fixes the random seed so the result is reproducible
r <- sample(1:100, size=20, replace = TRUE)

Compute the mean and standard deviation (SD) of vector r without using mean() and sd().
Find the position of the maximum value of vector r. (Use the which() function; run ?which to see what it does.)
Extract the values of r that are larger than 50.
Extract the values of r that are larger than 40 and smaller than 60.
Extract the values of r that are smaller than 20 or larger than 70.

# === Part 1 === #
x <- 20:50
# The `:` operator is the most basic way to create a sequence of numbers, but it always uses a step of 1.
# The seq() function is more flexible. For example, you can create a sequence of numbers from 20 to 50, incremented by 0.5.
# x <- seq(from = 20, to = 50, by = 0.5)
x[x %% 3 == 0] <- 0

# === Part 2 === #
# In this code, sample() draws a random sample of 20 numbers (size = 20) from the integers 1 to 100 (x = 1:100) with replacement (replace = TRUE).

# === Part 3 === #
# mean
mean_r <- sum(r) / length(r)
# SD
sd_r <- sqrt(sum((r - mean_r)^2) / (length(r) - 1))

# === Part 4 === #
max_index <- which(r == max(r))

# === Part 5 === #
r_50 <- r[r > 50]

# === Part 6 === #
r_40_60 <- r[r > 40 & r < 60]

# === Part 7 === #
r_20_70 <- r[r < 20 | r > 70]
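
A quick way to check the hand-written mean and SD from Part 3 is to compare them with the built-in functions. The sketch below also contrasts which(r == max(r)) from Part 4 with which.max(), which returns only the first position when the maximum is tied. The names all_max and first_max are illustrative.

```r
set.seed(12345)
r <- sample(1:100, size = 20, replace = TRUE)

# Manual mean and sample SD (denominator n - 1, as in sd())
mean_r <- sum(r) / length(r)
sd_r <- sqrt(sum((r - mean_r)^2) / (length(r) - 1))

# all.equal() compares numbers up to a small numerical tolerance
all.equal(mean_r, mean(r))
all.equal(sd_r, sd(r))

# which() with a comparison returns every position holding the maximum
all_max <- which(r == max(r))

# which.max() returns only the first such position
first_max <- which.max(r)
```

If the maximum appears only once, both approaches return the same single position.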

5 Exercise: Data Frame, Comprehensive

Load the file nscg17small.dta. You can find the data in the Data folder.
- This data is a subset of the National Survey of College Graduates (NSCG) 2017, which collects data on the educational and occupational characteristics of college graduates in the United States.
Each row corresponds to a unique respondent. Let’s create a new column called “ID”. There are various ways to create an ID column. Here, let’s create an ID column that starts from 1 and increments by 1 for each row.
To take a quick look at the summary statistics of a specific column, the summary() function is useful. Use summary() to display the descriptive statistics for salary, passing the salary column to summary() as a vector.
Create a new variable in your data that represents the z-score of the hours worked (use the hrswk variable): \[Z = (x - \mu)/\sigma,\] where $Z$ is the standard score, $x$ is the observed value, $\mu$ is the sample mean, and $\sigma$ is the sample standard deviation.
Calculate the share of observations in your data sample with above average hours worked.

# === Part 1 === #
library(rio)
nscg17 <- import("Data/nscg17small.dta")

# === Part 2 === #
nscg17$ID <- 1:nrow(nscg17)

# === Part 3 === #
summary(nscg17$salary)

# === Part 4 === #
nscg17$z_hrswk <- (nscg17$hrswk - mean(nscg17$hrswk)) / sd(nscg17$hrswk)
# or using with() function, you can write the code more concisely
# nscg17$z_hrswk2 <- with(nscg17, (hrswk - mean(hrswk)) / sd(hrswk))

# Note: For parts 2 and 4, you can use the within() function to create new columns more concisely.
# nscg17 <- 
#   within(
#     nscg17, {
#       ID <- 1:nrow(nscg17)
#       z_hrswk <- (hrswk - mean(hrswk)) / sd(hrswk)
#     }
#   )

# === Part 5 === #
# create a logical vector that indicates whether the hours worked is above average
above_avg_hrswk <- with(nscg17, z_hrswk > mean(z_hrswk)) # you can get the same result by using `hrswk`.
# subset the data
nscg17_above_avg_hrswk <- nscg17[above_avg_hrswk, ]
# calculate the share of observations with above average hours worked
share_above_avg_hrswk <- nrow(nscg17_above_avg_hrswk) / nrow(nscg17)
share_above_avg_hrswk
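
The share in Part 5 can also be computed directly as the mean of a logical vector, without subsetting the data frame. The sketch below uses a small hypothetical data frame toy in place of nscg17; whether hrswk in the real data contains missing values is not stated here, so na.rm = TRUE is included as an assumption to show how NA values would be handled.

```r
# Hypothetical stand-in for nscg17 with one missing value
toy <- data.frame(hrswk = c(40, 45, 50, 35, 60, 40, NA))

# Without na.rm = TRUE, a single NA would make mean() and sd() return NA
avg_hrs <- mean(toy$hrswk, na.rm = TRUE)
sd_hrs <- sd(toy$hrswk, na.rm = TRUE)

# z-score of hours worked
toy$z_hrswk <- (toy$hrswk - avg_hrs) / sd_hrs

# TRUE where hours worked are strictly above the average
above_avg <- toy$hrswk > avg_hrs

# Share of non-missing observations above the average
share_above <- mean(above_avg, na.rm = TRUE)
share_above
```

Because the z-score is a positive linear transformation of hrswk, comparing z_hrswk with its own mean selects the same observations as comparing hrswk with its mean.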