[,1] [,2] [,3]
[1,] 2 3 4
[2,] 3 4 5
[3,] 4 5 6
Department of Applied Economics, University of Minnesota
In the previous lecture, we learned how to do regression analysis using R, which is a fundamental skill for econometric analysis.
Today, we will learn how to code Monte Carlo simulations in R. Monte Carlo simulations are a very important tool in learning econometrics and statistics. With Monte Carlo simulations, you can test any kind of statistical theory or property, which is very fun and useful!!
Note
Before we dive into the Monte Carlo simulation, we need to review some key R operations: + for loop function.
You can define your own functions. The beauty of creating your own functions is that you can reuse the same code over and over again without having to rewrite it. A function is more useful when the task is longer and more complicated
Example Situations
You can define your own functions using the function() function.
General Syntax
Note
function_name), what kind of inputs the function takes (arg1, arg2, etc.), and how the function processes using the given input objects.return() function is used to return the output of the function. By default, the output defined in the last line of the function is returned.1. A simple function
2. A function with multiple outputs
You can set default values for function arguments by argument = value.
Example:
If you have multiple functions or a long function, you might want to save the function in a separate file and load it when you need it.
For example:
.R file (.Rmd, etc.).source() function.Let’s practice this on your Rstudio!
Write a function (you can name it whatever you want) to calculate the area of a circle with a given radius. The function should return the area of the circle. Use pi, which is a built-in constant for the value of \(\pi\) in R.
Write a function to count the number of NA values in a given vector. (Hint: use the is.na() function.)
Write a function called calc_mad that calculates the Median Absolute Deviation (MAD) of a numeric vector. The MAD is a robust measure of variability, defined as the median of the absolute deviations from the median.(Hint: use the median() function to calculate the median of a vector. use the abs() function to calculate the absolute value of a vector.)
You’re a data expert at a store chain. The company needs to study its monthly sales growth to plan better. They expect sales to grow by a fixed percentage each month. Your job is to create an R function that shows sales growth over a year.
For sales growth, use the following formula:
\[S_t = S_0 \times (1 + g)^{t-1}\]
, where \(S_t\) is the sales in month \(t\) , \(S_0\) is the starting sales, and \(g\) is the growth rate.
Create a function called monthly_sales_growth with the following three inputs:
initial_sales: Starting sales (in thousands of dollars).growth_rate: Monthly growth rate (as a decimal, like 0.03 for 3% growth).months: How many months to predict (usually 12 for a year).The function should give back a vector of numbers (or it would be even better if you could show in a data.frame or data.table in which two columns, e.g., month and sales, show the expected sales for each month.)
Using Loop is useful when you want to repeat the same task (but with a slight change in parameters) over and over again.
Common Situations
While there are several looping commands in R (e.g., foreach, lapply, etc.), we will use the for loop function, as it is the most basic and widely used looping function in R.
The for loop function is a built-in R function. The syntax of the for loop is very simple.
Syntax:
You need to define the components in the function: (i) variable (ii) collection of objects, (iii) the code to be executed in each iteration.
Note
variable takes a value from the collection_of_objects in order and the code inside the loop is executed using the value of variable.collection_of_objects can be a vector or a list object.
1. Print the numbers from 1 to 5.
Variable i takes each number in the sequence 1:5 in order and print the value of i.
2. Print characters in a list.
Variable x takes each character in the list list("I", "like", "cats") in order and print the value of x.
3. Calculate the mean of each element in a list.
Can you tell me what’s going on in the following code?
In econometric class, we use the rnorm() function a lot! It is a function that generates random numbers from a normal distribution. See ?rnorm for more details.
The basic syntax is rnorm(n, mean = 0, sd = 1), where n is the number of random numbers you want to generate, mean is the mean of the normal distribution, and sd is the standard deviation of the normal distribution. So rnorm(n, mean =0, sd = 1) generates n random numbers from a standard normal distribution.
Generate 1000 random numbers from a standard normal distribution and calculate the mean the numbers (use mean() function), and print the results. Repeat this process 10 times using the for loop.
You can nest the for loop inside another for loop. For example,
Using the above code as a reference, fill in the following empty 3 x 3 matrix with the sum of the row and column indices.
The output should look like this:
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 3 4 5
[3,] 4 5 6
Unlike R functions we have seen so far, for loop does not have a return value. It just iterates the process we defined in the loop.
Let’s do some experiments:
Every round of the loop, the variables defined inside the loop are updated.
x <- for (i in 1:3){print(i)} does not work).Example
Suppose you want to cube each number in the sequence 1:5.
Note
What if we want to have multiple outputs from the loop and combine them into a single dataset?
Example
Let’s generate 100 random numbers from a standard normal distribution and calculate the mean and the standard deviation of numbers. Repeat this process 10 times using the for loop and save the results in a dataset.
for loop, calculate the sum of the first n numbers for n = 1, 2, ..., 10. For example, the sum of the first 3 numbers (n=3) is 1 + 2 + 3 = 6. Save the results in a vector object.n numbers in the Fibonacci sequence. (You use the for loop function inside the function.) For example, when n = 5, the function should return c(0, 1, 1, 2, 3). For simplicity, let’s assume that \(n \ge 3\) (You don’t need to consider the case where n = 1 or 2).NOTE: It’s okay if you cannot solve this problem! See the solutions. I showed two approaches to solve this problem, and did speed comparison between the two approaches.
Pythagorean triples are sets of three positive integers \((a, b, c)\) that satisfy the equation \(a^2 + b^2 = c^2\), who are named after the ancient Greek mathematician Pythagoras.
Let’s take this concept further. Suppose Pythagoras challenges you to find all possible Pythagorean triples where \(a\) and \(b\) are less than or equal to a given number \(n\). To address this problem, let’s create an R function that will produce all such triples.
n, an integer, representing the maximum value for a and b. The function should return a data frame with columns a, b, and c, containing all Pythagorean triples where \(b \leq a \leq n\) and \(a^2 + b^2 = c^2\).Hints:
Reference: Pythagorean Triples
Up to this point, as long as you understand the following points, you are good to go!
function() to define a simple function yourself.for loop (i.e., syntax, which argument you need to define).Monte Carlo simulation is a technique to approximate a likelihood of possible outcomes (e.g., predictions, estimates) from a model by iteratively running the model on artificially created datasets. In every iteration, the datasets are randomly sampled from the assumed data generating process, so it varies every iteration.
In econometrics, the Monte Carlo simulation is used to evaluate the performance of a statistical procedure or the validity of theories in a realistic setting.
For example
Suppose that a researcher came up with a new estimator to estimate the coefficients of a regression model.
Think about the following example.
Example
This kind of experiment is modeled by the binomial distribution. According to the theory, it is predicted that
Is it true? Let’s check this using a Monte Carlo simulation!
Monte Carlo Simulation: Steps
step 1: Specify the data generating process.
step 2: Repeat:
step 3: compare your estimates with the true parameter
Let’s start writing code for a single iteration to get an idea of the Monte Carlo simulation process in R.
Weak law of large number states that the sample mean converges (in probability) to the population mean as the sample size increases. In other words, the sample mean more accurately estimates the population mean as the sample size increases.
\[\bar{X}_n = \frac{1}{n}\sum_{n=1}^{n} X_i E[X] \xrightarrow{p} E[X]\]
Lets check this using Monte Carlo simulation! We compare the distribution of sample mean with different sample size. Let’s compare two sample sizes: \(n=100\) and \(n=1000\).
Process
Repeat 1 and 2 for \(B=1000\) times.
Using a normal distribution with mean \(\mu = 5\) and \(sd = 10\), generate random numbers for \(n=100\) and \(n=1000\). e.g. rnorm(n = 10, mean = 5, sd = 10).
Compute sample mean for each sample data, and save them.
Finally,
Suppose you’re interested in estimating the unknown population mean of men’s heights (i.e., \(\mu\)) in the US. We have randomly sampled data with the size of \(n=1000\). Let \(X_i\) denote the individual \(i\)’s height in the sample. How should we use the sample data to estimate the population mean?
Your friends suggested two different estimators:
Estimator 1. Use the sample mean: \(\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i\)
Estimator 2. Use only the first observation: \(X_1\)
Theoretically, both estimators are unbiased estimators (i.e., if we repeat the estimation process many times on a different sample data every time, the average of the estimates will be equal to the population mean):
\[\begin{align*} E[\bar{X}_n] &= E \left[\frac{1}{n} \sum_{i=1}^{n} \right] = \frac{1}{n} E \left[\sum_{i=1}^{n} X_i \right] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n \cdot \mu = \mu \\ E[X_1] &= \mu \end{align*}\]
Questions:
Using Monte Carlo simulation, let’s examine these questions!
step 1. Draw \(n = 1000\) random numbers with known mean \(\mu\) and standard deviation \(\sigma\). This will be the sample data.
step 2. Get (1) the mean of the sample and (2) the value of the first observation, and save the results.
If you could also visually compare the distribution of estimates from the two estimators, that would be great!
The foreach function is a function of the foreach package. It is used to iterate the same process over and over again, similar to the for loop function.
Basic Syntax
While there are some differences, the basic syntax of the foreach function is pretty much similar to the for loop function.
foreach(variable = collection_of_objects) %do% {
the code to be executed in each iteration
return(output)
}Note
for loop and foreach function:
= instead of in.%do% operator.foreach function has a return value, while for loop does not. By default, the output is returned as a list.(* foreach function also supports parallel processing. (we will not cover this in this class.))
Using the for loop and foreach function, let’s calculate the square of the numbers from 1 to 10, respectively.
foreach
for loop
By default, foreach function returns each iteration’s result as a list. But you can choose the format of the output by using the .combine argument.
.combine = c combines each iteration’s result as as a vector (like c() to create a vector)..combine = rbind combines each iteration’s result by row..combine = cbind combines each iteration’s result by column.The last two options are used when the output is a matrix or a data.frame (or data.table).
Example
.combine options in the following code.