#/*--------------------------------*/
#' ## Part 1
#/*--------------------------------*/
# --- Create a function --- #
circle_area <- function(radius){
return(pi * radius^2)
}
# --- Test --- #
circle_area(3)
#/*--------------------------------*/
#' ## Part 2
#/*--------------------------------*/
# --- Create a function --- #
calc_mad <- function(x) {
# Remove missing values
x <- na.omit(x)
# Calculate the median of the vector
med <- median(x)
# Calculate the absolute deviations from the median
abs_dev <- abs(x - med)
# Return the median of the absolute deviations
return(median(abs_dev))
}
# --- Test --- #
test <- c(2, 4, 4, 6, 8)
calc_mad(test)
#/*--------------------------------*/
#' ## Part 3
#/*--------------------------------*/
# --- Create a function --- #
count_na <- function(vec){
return(sum(is.na(vec)))
}
# --- Test --- #
test <- c(1, 2, NA, 4, 5, NA)
count_na(test)
Lecture 5 - Exercise Problems: Solutions
1 User-Defined Functions
1.1 Exercise 1
Write a function (you can name it whatever you want) to calculate the area of a circle with a given radius. The function should return the area of the circle. Use pi, which is a built-in constant for the value of \(\pi\) in R.
Write a function to count the number of NA values in a given vector. (Hint: use the is.na() function.)
Write a function called calc_mad that calculates the Median Absolute Deviation (MAD) of a numeric vector. The MAD is a robust measure of variability, defined as the median of the absolute deviations from the median. (Hint: use the median() function to calculate the median of a vector. Use the abs() function to calculate the absolute value of a vector.)
1.2 Exercise 2 (A Bit Advanced)
You’re a data expert at a store chain. The company needs to study its monthly sales growth to plan better. They expect sales to grow by a fixed percentage each month. Your job is to create an R function that shows sales growth over a year.
For sales growth, use the following formula:
\[S_t = S_0 \times (1 + g)^{t-1}\]
where \(S_t\) is the sales in month \(t\), \(S_0\) is the starting sales, and \(g\) is the growth rate.
Create a function called monthly_sales_growth with the following three inputs:
- initial_sales: Starting sales (in thousands of dollars).
- growth_rate: Monthly growth rate (as a decimal, like 0.03 for 3% growth).
- months: How many months to predict (usually 12 for a year).
The function should return a vector of numbers (or, even better, a data.frame or data.table with two columns, e.g., month and sales, showing the expected sales for each month).
Part 1
monthly_sales_growth <- function(initial_sales, growth_rate, months){
sales <- rep(0, months)
for (i in 1:months){
sales[i] <- initial_sales * (1 + growth_rate)^(i - 1)
}
return(sales)
}
If you don’t want to use a for loop, you can use the following vectorized code instead:
monthly_sales_growth <- function(initial_sales, growth_rate, months) {
# Generate a sequence of months
month_seq <- 1:months
# Calculate the sales growth for each month
sales <- initial_sales * (1 + growth_rate) ^ (month_seq - 1)
return(sales)
}
# --- Test --- #
monthly_sales_growth(initial_sales = 50, growth_rate = 0.03, months = 12)
2 Loops
2.1 Exercise 1 (Basics)
2.1.1 Problem 1
In econometrics classes, we use the rnorm() function a lot! It is a function that generates random numbers from a normal distribution. See ?rnorm for more details.
The basic syntax is rnorm(n, mean = 0, sd = 1), where n is the number of random numbers you want to generate, mean is the mean of the normal distribution, and sd is the standard deviation of the normal distribution. So rnorm(n, mean = 0, sd = 1) generates n random numbers from a standard normal distribution.
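Before using it inside a loop, it may help to see rnorm() on its own. A minimal sketch (the seed value 123 is arbitrary, chosen only for reproducibility):

```r
# Fix the random seed so the draws are reproducible
set.seed(123)
# Draw 5 numbers from a standard normal distribution
rnorm(5, mean = 0, sd = 1)
```

Running this twice with the same seed reproduces the same five numbers.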
Generate 1000 random numbers from a standard normal distribution, calculate the mean of the numbers (use the mean() function), and print the result. Repeat this process 10 times using a for loop.
for(i in 1:10){
random_numbers <- rnorm(1000)
mean_random_numbers <- mean(random_numbers)
print(mean_random_numbers)
}
2.1.2 Problem 2
You can nest the for loop inside another for loop. For example,
# Outer loop
for (i in 1:3) {
# Inner loop
for (j in 1:2) {
print(paste("i =", i, "j =", j))
}
}
Using the above code as a reference, fill in the following empty 3 x 3 matrix with the sum of the row and column indices.
The output should look like this:
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 3 4 5
[3,] 4 5 6
Here is the empty 3 x 3 matrix.
empty_matrix <- matrix(NA, nrow = 3, ncol = 3)
for (i in 1:3){
for (j in 1:3){
empty_matrix[i, j] <- i + j
}
}
2.2 Exercise 2
Using the for loop, calculate the sum of the first n numbers for n = 1, 2, ..., 10. For example, the sum of the first 3 numbers (n=3) is 1 + 2 + 3 = 6. Save the results in a vector object.
# --- Create an empty vector --- #
output_storage <- rep(0, 10)
# --- loops --- #
for(i in 1:10){
output_storage[i] <- sum(1:i)
}
2.3 Exercise 3
The Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding ones (e.g., 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, …). Write a function that generates the first n numbers in the Fibonacci sequence. (Use a for loop inside the function.) For example, when n = 5, the function should return c(0, 1, 1, 2, 3). For simplicity, let’s assume that \(n \ge 3\) (you don’t need to consider the case where n = 1 or 2).
# --- Define a function --- #
# The sequence starts with 0, 1; e.g., when n = 3 the function should return 0, 1, 1 (recall we assume n >= 3)
gen_fibonacci_seq <- function(n){
empty_storage <- rep(0, n)
empty_storage[2] <- 1
for(i in 3:n){
empty_storage[i] <- empty_storage[i-1] + empty_storage[i-2]
}
return(empty_storage)
}
# --- test --- #
gen_fibonacci_seq(6)
2.4 Exercise 4: Functions and Loops (Advanced, Interesting Problem!)
Pythagorean triples are sets of three positive integers \((a, b, c)\) that satisfy the equation \(a^2 + b^2 = c^2\); they are named after the ancient Greek mathematician Pythagoras.
Let’s take this concept further. Suppose Pythagoras challenges you to find all possible Pythagorean triples where \(a\) and \(b\) are less than or equal to a given number \(n\). To address this problem, let’s create an R function that will produce all such triples.
- Write a function that takes one argument n, an integer, representing the maximum value for a and b. The function should return a data frame with columns a, b, and c, containing all Pythagorean triples where \(b \leq a \leq n\) and \(a^2 + b^2 = c^2\).
- Consider using nested loops to iterate through all possible values of a and b up to n.
- Use the sqrt() function to calculate the potential value of c, and check if it’s an integer.
- Use the floor() function to round down the value of c.
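To see how the hinted integer check behaves before writing the full function, here is a minimal sketch (the triples used are just illustrative examples):

```r
# sqrt() returns a numeric value; comparing it with floor() detects whole numbers
c1 <- sqrt(3^2 + 4^2) # 5, so (3, 4, 5) is a Pythagorean triple
c2 <- sqrt(2^2 + 3^2) # about 3.61, so (2, 3) does not form a triple
c1 == floor(c1) # TRUE
c2 == floor(c2) # FALSE
```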
There are many ways to do this. I will show you two approaches: one using loops and the other using vectorized operations. Then, I will show you the speed comparison between the two approaches.
#/*--------------------------------*/
#' ## Approach 1 (loops)
#/*--------------------------------*/
find_pythagorean_triples <- function(n){
# --- prepare an empty storage --- #
triples_storage_dt <-
data.frame(a = integer(), b = integer(), c = integer())
for(a in 1:n){
# for example a = 4
for(b in 1:a){
# for example b = 3
c <- sqrt(a^2 + b^2)
if(c == floor(c)){
res_triples_dt <- data.frame(a = a, b = b, c = c)
triples_storage_dt <- rbind(triples_storage_dt, res_triples_dt)
}
}
}
return(triples_storage_dt)
}
# --- test --- #
find_pythagorean_triples(n = 5)
#/*--------------------------------*/
#' ## Approach 2 (Vectorized)
#/*--------------------------------*/
library(data.table)
library(dplyr)
find_pythagorean_triples_vec <- function(n){
# for example, n = 5
# --- create all combinations of (a, b) --- #
abc_all_dt <-
CJ(a = 1:n, b = 1:n) %>%
.[, c := sqrt(a^2 + b^2)] %>%
# find a row where c is an integer
.[c == floor(c), ]
return(abc_all_dt)
}
# NOTE: CJ() is a function from the data.table package that creates all combinations of two vectors. (It is similar to the expand.grid() function in base R.)
# --- test --- #
find_pythagorean_triples_vec(n = 5)
#/*--------------------------------*/
#' ## Speed Competition
#/*--------------------------------*/
system.time(find_pythagorean_triples(n = 10^4))
system.time(find_pythagorean_triples_vec(n = 10^4))
# Lesson: Vectorized operations are much faster than loops in R!
2.5 Exercise 5: Combining Datasets using a Loop
In the “corn_yield_by_states” folder, you will find corn yield datasets (2000-2022) by state. Leveraging your expertise in R programming, use a for loop (or foreach()) to load each dataset and combine them into a single dataset.
Hint: You can use the list.files() function. It is a built-in function in R that returns a character vector of file paths in the specified folder. The syntax is list.files(path = "path to the folder", full.names = TRUE). The full.names argument, when set to TRUE, returns the full file paths instead of just the file names. Then, you can use those file paths to load the data. You need to specify the path to the target folder in the path argument.
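If you want to try list.files() before touching the course data, here is a self-contained sketch that writes one .rds file to a temporary folder and lists it (the file name "example.rds" is made up for illustration):

```r
# Create a small .rds file in a temporary folder
tmp_dir <- tempdir()
saveRDS(data.frame(x = 1), file.path(tmp_dir, "example.rds"))
# List the full paths of all .rds files in that folder
file_paths <- list.files(path = tmp_dir, pattern = "\\.rds$", full.names = TRUE)
file_paths
```

Each element of file_paths can then be passed to readRDS().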
Keep in mind there are various ways to approach this problem!
Make sure that you open the R project for this course. If so, the following code should work on your computer.
There are many approaches. The easiest one is to use the foreach() function (in my opinion).
# --- Get a list of paths to the target files --- #
ls_path_yield <- list.files(path = "Data/corn_yield_by_states", full.names = TRUE)
# --- Create an empty data.frame as a storage --- #
storage_yield_dt <- data.frame()
# --- for loop --- #
for (i in 1:length(ls_path_yield)){
# Load a dataset using the i-th path in ls_path_yield
temp_dt <- readRDS(ls_path_yield[[i]])
# Combine temp_dt and storage_yield_dt by row using rbind().
# This way, storage_yield_dt is updated in each iteration.
storage_yield_dt <- rbind(storage_yield_dt, temp_dt)
}
Another similar approach: You can use list() or vector() to store the datasets in the loop.
# --- Get a list of paths to the target files --- #
ls_path_yield <- list.files(path = "Data/corn_yield_by_states", full.names = TRUE)
# --- Create an empty list as a storage --- #
ls_yield_dt <- list()
# --- for loop --- #
for (i in 1:length(ls_path_yield)){
# Load a dataset using the i-th path in ls_path_yield
temp_dt <- readRDS(ls_path_yield[[i]])
# Save it in the storage object
ls_yield_dt[[i]] <- temp_dt
}
# To combine the datasets in the list, use the rbindlist() function from the data.table package.
yield_dt <- rbindlist(ls_yield_dt)
# --- Get a list of paths to the target files --- #
ls_path_yield <- list.files(path = "Data/corn_yield_by_states", full.names = TRUE)
# --- foreach loop --- #
library(foreach)
yield_dt <- foreach(file_path_i = ls_path_yield, .combine = rbind) %do% {
# Load a dataset
temp_dt <- readRDS(file_path_i)
return(temp_dt)
}
The foreach() function returns the value of the last expression of each iteration by default. In this case, temp_dt is returned in each iteration anyway, so return(temp_dt) is unnecessary. Once all iterations are completed, the per-iteration outputs are combined by row (because I specified .combine = rbind).
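To see the .combine behavior in isolation, here is a minimal sketch (it assumes the foreach package is installed):

```r
library(foreach)
# Each iteration evaluates to i^2; .combine = c concatenates the results into a vector
squares <- foreach(i = 1:3, .combine = c) %do% {
  i^2
}
squares # 1 4 9
```

Swapping .combine = c for rbind would stack the per-iteration results by row instead, which is what the solution above relies on.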
3 Monte Carlo Simulation
3.1 Exercise 1: (Weak) Law of Large Numbers
The weak law of large numbers states that the sample mean converges (in probability) to the population mean as the sample size increases. In other words, the sample mean estimates the population mean more accurately as the sample size increases.
\[\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{p} E[X]\]
Let’s check this using a Monte Carlo simulation! We compare the distribution of the sample mean for different sample sizes. Let’s compare two sample sizes: \(n=100\) and \(n=1000\).
Process
step 1. Using a normal distribution with mean \(\mu = 5\) and standard deviation \(\sigma = 10\), generate random numbers for \(n=100\) and \(n=1000\), e.g., rnorm(n = 10, mean = 5, sd = 10).
step 2. Compute the sample mean for each sample dataset, and save the results.
Repeat steps 1 and 2 \(B=1000\) times.
Finally,
- Plot histograms of the sample means obtained from the two samples.
set.seed(1843)
# --- Parameters --- #
mu <- 5
sd <- 10
B <- 1000
# --- Create an empty object to save the output --- #
sample_mean_100 <- rep(0, B)
sample_mean_1000 <- rep(0, B)
# --- for loop --- #
for (i in 1:B){
# --- Generate random numbers --- #
sample_100 <- rnorm(n = 100, mean = mu, sd = sd)
sample_1000 <- rnorm(n = 1000, mean = mu, sd = sd)
# --- Compute sample mean --- #
sample_mean_100[i] <- mean(sample_100)
sample_mean_1000[i] <- mean(sample_1000)
}
# --- Plot --- #
library(ggplot2)
ggplot() +
geom_histogram(
aes(x = sample_mean_100, fill = "Sample size: 100"),
alpha = 0.7
) +
geom_histogram(
aes(x = sample_mean_1000, fill = "Sample size: 1000"),
alpha = 0.7
) +
geom_vline(
xintercept = mu,
linetype = "dashed"
) +
theme_bw()
3.2 Exercise 2: Two estimators to estimate the population mean?
Suppose you’re interested in estimating the unknown population mean of men’s heights (i.e., \(\mu\)) in the US. We have randomly sampled data with the size of \(n=1000\). Let \(X_i\) denote the individual \(i\)’s height in the sample. How should we use the sample data to estimate the population mean?
Your friends suggested two different estimators:
Estimator 1. Use the sample mean: \(\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i\)
Estimator 2. Use only the first observation: \(X_1\)
Theoretically, both estimators are unbiased (i.e., if we repeat the estimation process many times, drawing a different sample each time, the average of the estimates will be equal to the population mean):
\[\begin{align*} E[\bar{X}_n] &= E \left[\frac{1}{n} \sum_{i=1}^{n} X_i \right] = \frac{1}{n} E \left[\sum_{i=1}^{n} X_i \right] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n \cdot \mu = \mu \\ E[X_1] &= \mu \end{align*}\]
Questions:
- Is it true that both estimators are correctly estimating the population mean, on average?
- Which one is more accurate in estimating the population mean?
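Before running the simulation, a quick analytical note (an added sketch, assuming the \(X_i\) are i.i.d. with variance \(\sigma^2\)): although both estimators are unbiased, their variances differ.
\[\begin{align*} Var(\bar{X}_n) &= Var\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(X_i) = \frac{\sigma^2}{n} \\ Var(X_1) &= \sigma^2 \end{align*}\]
So the sample mean’s variance shrinks as \(n\) grows, while the single-observation estimator’s does not.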
Using Monte Carlo simulation, let’s examine these questions!
- Repeat the following processes 1000 times:
step 1. Draw \(n = 1000\) random numbers with known mean \(\mu\) and standard deviation \(\sigma\). This will be the sample data.
step 2. Get (1) the mean of the sample and (2) the value of the first observation, and save the results.
- You now have 1000 estimates of the population mean for each of estimator 1 and estimator 2. Compute the mean for each estimator. Are they both close to the true population mean? Compute the variance of the estimates. Which one has the smaller variance?
If you could also visually compare the distribution of estimates from the two estimators, that would be great!
# --- Load packages --- #
library(data.table)
library(ggplot2)
# --- Parameter setting --- #
B = 1000
n = 1000
mu = 170 #(cm)
# --- Prepare a storage object --- #
res_data <-
data.table(
x_bar = rep(0, B),
x_single = rep(0, B)
)
# --- Monte Carlo simulation --- #
for(i in 1:B){
# --- Create sample data (a chi-squared distribution with df = mu has mean mu) --- #
data <- rchisq(n, df = mu)
# --- Compute the sample mean --- #
x_bar <- mean(data)
# --- Get the 1st value --- #
x_single <- data[1]
# --- store the results --- #
res_data[i, "x_bar"] <- x_bar
res_data[i, "x_single"] <- x_single
}
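The two questions can be answered numerically from the res_data object filled in by the loop above; a short sketch:

```r
# Average of the 1000 estimates: both should be close to the true mean (mu = 170)
mean(res_data$x_bar)
mean(res_data$x_single)
# Variance of the estimates: x_bar should be far smaller than x_single
var(res_data$x_bar)
var(res_data$x_single)
```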
# === Visualization === #
# Overlay the histograms of the two estimators (you could also reshape the data to long format first, but that is not necessary here)
vis_sampling <-
ggplot(res_data) +
geom_histogram(aes(x = x_bar, fill = "x_bar"), alpha = 0.5) +
geom_histogram(aes(x = x_single, fill = "x_single"), alpha = 0.5) +
labs(title="Sample size 1000")+
theme_bw() +
# modify x-label and legend name
labs(x = expression(hat(mu)), fill = "Estimator")