Day 5: Functions, Loops, and Monte Carlo Simulations

Department of Applied Economics, University of Minnesota

Shunkei Kakimoto

Introduction

In the previous lecture, we learned how to do regression analysis using R, which is a fundamental skill for econometric analysis.
Today, we will learn how to code Monte Carlo simulations in R. Monte Carlo simulations are a very important tool for learning econometrics and statistics. With them, you can check many statistical theories and properties numerically, which is both fun and useful!

Note

Before we dive into the Monte Carlo simulation, we need to review a key R operation: the for loop.

Although we won’t use this in today’s Monte Carlo simulation, we’ll also review how to write your own R functions because it’s a useful skill.

Learning Objectives

to be able to write code for your own R functions.
to be able to write code for a Monte Carlo simulation using the loop function.

Reference

Section 19 Functions
Section 20 Iteration in R for Data Science

Today’s Outline

User-Defined Functions
- Exercise Problems
Loops
Introduction to Monte Carlo Simulations
- Demonstration
- Exercise Problems

User-Defined Functions

You can define your own functions. The beauty of creating your own functions is that you can reuse the same code over and over again without having to rewrite it. A function becomes more useful as the task gets longer and more complicated.

Example Situations

When you want to automate the process of data cleaning.
When you run a complicated simulation or a resampling method, such as bootstrapping or a Monte Carlo simulation.

You can define your own functions using the function() function.

General Syntax

function_name <- function(arg1, arg2, ...){
  code to be executed

  return(output)
}

Note

You need to define the function name (function_name), the inputs the function takes (arg1, arg2, etc.), and the code that processes those inputs.
The return() function is used to return the output of the function. If you omit return(), the function returns the value of the last expression evaluated in its body.

1. A simple function
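A minimal sketch of a simple function, following the general syntax above. The name square and its argument x are illustrative choices, not names from the lecture.

```r
# A function with one input that returns a single value.
square <- function(x) {
  result <- x^2
  return(result)
}

square(4)  # 16
```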

2. A function with multiple outputs
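One common way to return several outputs is to bundle them in a named list. The sketch below assumes that approach; the function name summarize_vec is hypothetical.

```r
# Return two outputs (mean and standard deviation) as a named list.
summarize_vec <- function(x) {
  out <- list(mean = mean(x), sd = sd(x))
  return(out)
}

res <- summarize_vec(c(1, 2, 3, 4, 5))
res$mean  # extract one output by name
res$sd
```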

You can set default values for function arguments by argument = value.

Example:
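A small sketch of a default argument value; power_of and exponent are illustrative names. When the caller does not supply exponent, the default of 2 is used.

```r
# `exponent` defaults to 2 unless the caller overrides it.
power_of <- function(x, exponent = 2) {
  x^exponent
}

power_of(3)                # uses the default: 3^2
power_of(3, exponent = 3)  # overrides the default: 3^3
```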

Load Functions from a File

If you have multiple functions or a long function, you might want to save the function in a separate file and load it when you need it.

For example:

Save the function code in a separate script file (typically a .R file).
Load the function using the source() function.
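The two steps can be sketched as follows. To keep the example self-contained, it writes a one-line script to a temporary folder; in practice you would create the .R file yourself. The file name my_functions.R and the function to_percent are hypothetical.

```r
# Step 1: save a function definition in a .R file
# (here written programmatically to a temporary folder).
script_path <- file.path(tempdir(), "my_functions.R")
writeLines("to_percent <- function(x) x * 100", script_path)

# Step 2: load the function by running the script with source().
source(script_path)

to_percent(0.25)
```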

Let’s practice this in RStudio!

Exercise

Problem 1
Problem 2 (optional)

Write a function (you can name it whatever you want) to calculate the area of a circle with a given radius. The function should return the area of the circle. Use pi, which is a built-in constant for the value of \(\pi\) in R.
Write a function to count the number of NA values in a given vector. (Hint: use the is.na() function.)
Write a function called calc_mad that calculates the Median Absolute Deviation (MAD) of a numeric vector. The MAD is a robust measure of variability, defined as the median of the absolute deviations from the median. (Hint: use the median() function to calculate the median of a vector, and the abs() function to calculate the absolute values of a vector.)

You’re a data expert at a store chain. The company needs to study its monthly sales growth to plan better. They expect sales to grow by a fixed percentage each month. Your job is to create an R function that shows sales growth over a year.

For sales growth, use the following formula:

\[S_t = S_0 \times (1 + g)^{t-1}\]

where \(S_t\) is the sales in month \(t\), \(S_0\) is the starting sales, and \(g\) is the growth rate.

Create a function called monthly_sales_growth with the following three inputs:

initial_sales: Starting sales (in thousands of dollars).
growth_rate: Monthly growth rate (as a decimal, like 0.03 for 3% growth).
months: How many months to predict (usually 12 for a year).

The function should return a vector of numbers (or, even better, a data.frame or data.table with two columns, e.g., month and sales, showing the expected sales for each month).

Loops

Why loop?

A loop is useful when you want to repeat the same task (with a slight change in parameters) over and over again.

Common Situations

Downloading the data from the web iteratively.
- When you want to download the ag-production data from USDA-NASS, you are limited to 50,000 records per query, so you need to download the data repeatedly until you have everything you need.
- USDA crop scale data, NOAA weather data, etc.
Loading multiple data files in a folder.
Running the same regression analysis for multiple datasets.
Running simulations or resampling methods, such as bootstrapping or Monte Carlo simulations.

While there are several looping commands in R (e.g., foreach, lapply), we will use the for loop, as it is the most basic and widely used looping construct in R.

For loops

Basics
Examples

The for loop is a built-in control-flow construct in R, and its syntax is very simple.

Syntax:

for (variable in collection_of_objects){
  the code to be executed in each iteration
}

You need to define three components: (i) the variable, (ii) the collection of objects, and (iii) the code to be executed in each iteration.

Note

In each iteration, variable takes a value from collection_of_objects in order, and the code inside the loop is executed using that value of variable.
collection_of_objects can be a vector or a list object.
- e.g., a sequence of numbers or characters, a list of datasets, etc.

1. Print the numbers from 1 to 5.

Variable i takes each number in the sequence 1:5 in order, and the loop prints the value of i.
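A minimal sketch of this loop:

```r
# i takes the values 1, 2, 3, 4, 5 in turn; each one is printed.
for (i in 1:5) {
  print(i)
}
```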

2. Print characters in a list.

Variable x takes each character in the list list("I", "like", "cats") in order, and the loop prints the value of x.
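A minimal sketch of looping over a list of characters:

```r
# x takes "I", then "like", then "cats".
for (x in list("I", "like", "cats")) {
  print(x)
}
```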

3. Calculate the mean of each element in a list.

Can you tell me what’s going on in the following code?
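One possible version of such code, looping over the positions of a list rather than over its elements. The list num_list and its contents are made up for illustration.

```r
# A list of three numeric vectors of different lengths.
num_list <- list(c(1, 2, 3), c(10, 20), c(5, 5, 5, 5))

# seq_along(num_list) gives 1, 2, 3, so i indexes each element in turn.
for (i in seq_along(num_list)) {
  print(mean(num_list[[i]]))
}
```

The loop prints 2, 15, and 5, the mean of each vector in the list.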

Exercise

Exercise 1
Exercise 2 (nested loop)

In econometrics classes, we use the rnorm() function a lot! It is a function that generates random numbers from a normal distribution. See ?rnorm for more details.

The basic syntax is rnorm(n, mean = 0, sd = 1), where n is the number of random numbers you want to generate, mean is the mean of the normal distribution, and sd is the standard deviation of the normal distribution. So rnorm(n, mean = 0, sd = 1) generates n random numbers from a standard normal distribution.
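A short usage sketch of rnorm(). The seed value and the parameter values are arbitrary; set.seed() only makes the draws reproducible.

```r
set.seed(123)  # arbitrary seed, so the same numbers are drawn every run

z <- rnorm(5)                      # 5 draws from N(0, 1), the defaults
h <- rnorm(5, mean = 170, sd = 7)  # 5 draws from a normal with mean 170, sd 7

z
h
```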

Generate 1000 random numbers from a standard normal distribution, calculate the mean of the numbers (use the mean() function), and print the result. Repeat this process 10 times using a for loop.

You can nest the for loop inside another for loop. For example,
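A minimal sketch of a nested loop (the index names i and j and the counter n_iter are illustrative). For each value of the outer variable i, the inner loop runs through every value of j.

```r
n_iter <- 0  # counts how many times the inner code runs

for (i in 1:2) {      # outer loop
  for (j in 1:3) {    # inner loop: runs fully for each i
    print(paste0("i = ", i, ", j = ", j))
    n_iter <- n_iter + 1
  }
}

n_iter  # 2 x 3 = 6 iterations in total
```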

Using the above code as a reference, fill in the following empty 3 x 3 matrix with the sum of the row and column indices.

The output should look like this:

     [,1] [,2] [,3]
[1,]    2    3    4
[2,]    3    4    5
[3,]    4    5    6

For loops: How to Save the loop output?

Introduction
Basics

Unlike the R functions we have seen so far, a for loop does not return its results (the loop itself evaluates to NULL). It just iterates the process we defined in the loop.

Let’s do some experiments:

Experiment 1
Experiment 2

Every round of the loop, the variables defined inside the loop are updated.

You cannot capture the results by assigning the loop to a variable directly (e.g., x <- for (i in 1:3){print(i)} runs the loop, but x ends up as NULL).
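Two small experiments along these lines, with illustrative variable names. The first shows that a variable assigned inside the loop is overwritten each round. The second shows that the loop itself evaluates to NULL.

```r
# Experiment 1: y is overwritten in every iteration,
# so only the value from the last round survives.
for (i in 1:3) {
  y <- i * 10
}
y

# Experiment 2: assigning the loop itself stores NULL, not the results.
x <- for (i in 1:3) {
  i * 10
}
is.null(x)
```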

To save the results of the loop, you need to create an empty object before the loop and save the output in that object in each iteration. (You did this in Exercise 2!)
- The object can be a vector, a list, a matrix, or a data frame (or data.table), depending on the type of the output you want to save.

Example

Suppose you want to cube each number in the sequence 1:5.
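A minimal sketch, using a pre-allocated numeric vector as the storage object (the name cubes is illustrative):

```r
# Create an empty (all-zero) numeric vector of length 5 before the loop.
cubes <- numeric(5)

for (i in 1:5) {
  cubes[i] <- i^3  # store the result of iteration i in position i
}

cubes
```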

Note

Since the output of each iteration is a single number, a vector is a good choice for the storage object. (Alternatively, you can use a list object.)

Multiple Outputs

What if we want to have multiple outputs from the loop and combine them into a single dataset?

Example

Let’s generate 100 random numbers from a standard normal distribution and calculate the mean and the standard deviation of numbers. Repeat this process 10 times using the for loop and save the results in a dataset.
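One way to do this is to pre-allocate a data.frame with one row per iteration and fill in one row per round. This is a sketch under that assumption; the names B and results and the seed are arbitrary.

```r
set.seed(123)  # arbitrary seed for reproducibility

B <- 10  # number of iterations

# Storage object: one row per iteration, one column per output.
results <- data.frame(iter = 1:B, mean = NA_real_, sd = NA_real_)

for (i in 1:B) {
  x <- rnorm(100)             # 100 draws from N(0, 1)
  results$mean[i] <- mean(x)  # save the first output
  results$sd[i] <- sd(x)      # save the second output
}

results
```

An alternative design is to collect a small data.frame per iteration in a list and combine them after the loop with do.call(rbind, ...).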

Exercise

Problem 1
Problem 2 (optional)

Using the for loop, calculate the sum of the first n numbers for n = 1, 2, ..., 10. For example, the sum of the first 3 numbers (n=3) is 1 + 2 + 3 = 6. Save the results in a vector object.

The Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding ones (e.g., 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, …). Write a function that generates the first n numbers in the Fibonacci sequence. (Use a for loop inside the function.) For example, when n = 5, the function should return c(0, 1, 1, 2, 3). For simplicity, let’s assume that \(n \ge 3\) (you don’t need to consider the case where n = 1 or 2).

NOTE: It’s okay if you cannot solve this problem! See the solutions, which show two approaches to this problem and compare their speed.

Pythagorean triples are sets of three positive integers \((a, b, c)\) that satisfy the equation \(a^2 + b^2 = c^2\). They are named after the ancient Greek mathematician Pythagoras.

Let’s take this concept further. Suppose Pythagoras challenges you to find all possible Pythagorean triples where \(a\) and \(b\) are less than or equal to a given number \(n\). To address this problem, let’s create an R function that will produce all such triples.

Write a function that takes one argument n, an integer, representing the maximum value for a and b. The function should return a data frame with columns a, b, and c, containing all Pythagorean triples where \(b \leq a \leq n\) and \(a^2 + b^2 = c^2\).

Hints:

Consider using nested loops to iterate through all possible values of a and b up to n.
Use the sqrt() function to calculate the potential value of c, and check if it’s an integer.
Use the floor() function to round down the value of c.

Reference: Pythagorean Triples

Check Point

Up to this point, as long as you understand the following points, you are good to go!

You know how to use function() to define a simple function yourself.
You know how to use a for loop (i.e., its syntax and which components you need to define).
You know that you need to prepare an empty object to save the output of the loop.

Introduction to Monte Carlo Simulations

What is it?

Monte Carlo simulation is a technique for approximating the distribution of possible outcomes (e.g., predictions, estimates) from a model by repeatedly running the model on artificially created datasets. In every iteration, the dataset is randomly sampled from the assumed data generating process, so it varies from iteration to iteration.

The incorporation of randomness is the key feature of Monte Carlo simulation. It mimics the randomness of real-world phenomena.

So, how is the Monte Carlo simulation used in Econometrics?

In econometrics, the Monte Carlo simulation is used to evaluate the performance of a statistical procedure or the validity of theories in a realistic setting.

For example

Suppose that a researcher came up with a new estimator to estimate the coefficients of a regression model.

An estimator (e.g., sample mean, standard error, OLS) is a function of random variables (the sample data); therefore, it is also a random variable.
A random variable has its own probability distribution.
So, to understand the performance of the estimator (e.g., unbiasedness and efficiency), we need to examine the properties of the probability distribution of the estimator.
We use Monte Carlo simulation to approximate the probability distribution of the estimator!

The data at hand is (usually) just a small portion of the whole population.
In this world, everything is random and uncertain. In terms of econometric analysis, the data you get is just one realization of a random process. If you drew another sample, you would get a different result. Because the data are random, the result of the estimation is also a random variable.
This uncertainty, or randomness, is called sampling variability.
In statistics, any random variable is assumed to have some probability distribution.
Monte Carlo simulation mimics this randomness using a random number generator and approximates the probability distribution of the estimator.
Monte Carlo simulation is used in a variety of fields such as physics, finance, and engineering, as well as in econometrics and statistics.

Example: Binomial Distribution

Think about the following example.

Example

Suppose that we flip a coin \(n=10\) times and count the number of heads. Let’s denote the number of heads \(X\).
The coin is not fair, however. The probability of getting a head is \(p= Pr[heads] = 1/3\).
Suppose that you repeat this experiment \(1000\) times. What are the mean and the variance of \(X\)?

This kind of experiment is modeled by the binomial distribution. According to the theory, it is predicted that

Mean of \(X\) is \(E[X] = np = 10 \times 1/3 \approx 3.33\)
Variance of \(X\) is \(Var[X] = np(1-p) = 10 \times 1/3 \times 2/3 \approx 2.22\)

Is it true? Let’s check this using a Monte Carlo simulation!

Monte Carlo Simulation: Steps

step 1: Specify the data generating process.

You need to pick a specific probability distribution to generate a random number.

step 2: Repeat:

step 2.1: generate a (pseudo) random sample data based on the data generating process.
step 2.2: get an outcome you are interested in based on the generated data.

step 3: compare your estimates with the true parameter

Demonstration: Binomial Distribution

A Single Iteration
Multiple Iterations

Let’s start writing code for a single iteration to get an idea of the Monte Carlo simulation process in R.
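A sketch of a single iteration for the coin-flip example, assuming each flip is drawn with rbinom() as a Bernoulli trial (1 = heads, 0 = tails). The variable names and the seed are illustrative.

```r
set.seed(1)  # arbitrary seed for reproducibility

n <- 10     # number of coin flips in one experiment
p <- 1 / 3  # probability of heads

# Steps 1 and 2.1: generate one artificial dataset of n flips.
flips <- rbinom(n, size = 1, prob = p)

# Step 2.2: the outcome of interest is the number of heads.
x <- sum(flips)
x
```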

We want to repeat this 1000 times.
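A sketch of the full simulation: the single iteration is wrapped in a for loop, and the number of heads from each round is saved in a pre-allocated vector. The names B and heads and the seed are illustrative.

```r
set.seed(1)  # arbitrary seed for reproducibility

B <- 1000   # number of Monte Carlo iterations
n <- 10     # coin flips per experiment
p <- 1 / 3  # probability of heads

heads <- numeric(B)  # storage for the number of heads in each iteration

for (b in 1:B) {
  flips <- rbinom(n, size = 1, prob = p)  # one artificial dataset
  heads[b] <- sum(flips)                  # outcome of interest
}

# Step 3: compare with the theoretical values np and np(1 - p).
mean(heads)
var(heads)
```

With 1000 iterations, the simulated mean and variance should land close to, but not exactly on, the theoretical values.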

Exercise Problem

Exercise Problem: (Weak) Law of Large Numbers

Instructions
Solution

The weak law of large numbers states that the sample mean converges (in probability) to the population mean as the sample size increases. In other words, the sample mean estimates the population mean more accurately as the sample size increases.

\[\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{p} E[X]\]

Let’s check this using Monte Carlo simulation! We compare the distributions of the sample mean for different sample sizes. Let’s compare two sample sizes: \(n=100\) and \(n=1000\).

Process

Repeat steps 1 and 2 \(B=1000\) times:

1. Using a normal distribution with mean \(\mu = 5\) and standard deviation \(\sigma = 10\), generate random numbers for \(n=100\) and \(n=1000\), e.g., rnorm(n = 100, mean = 5, sd = 10).
2. Compute the sample mean for each sample and save it.

Finally,

Plot histograms of the sample means obtained from the two samples.

Exercise Problem: Two estimators to estimate the population mean?

Instructions
Hint
Solution

Suppose you’re interested in estimating the unknown population mean of men’s heights (i.e., \(\mu\)) in the US. We have randomly sampled data with the size of \(n=1000\). Let \(X_i\) denote the individual \(i\)’s height in the sample. How should we use the sample data to estimate the population mean?

Your friends suggested two different estimators:

Estimator 1. Use the sample mean: \(\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i\)

Estimator 2. Use only the first observation: \(X_1\)

Theoretically, both estimators are unbiased (i.e., if we repeat the estimation process many times on a different sample each time, the average of the estimates will equal the population mean):

\[\begin{align*} E[\bar{X}_n] &= E \left[\frac{1}{n} \sum_{i=1}^{n} X_i \right] = \frac{1}{n} E \left[\sum_{i=1}^{n} X_i \right] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n \cdot \mu = \mu \\ E[X_1] &= \mu \end{align*}\]

Questions:

Is it true that both estimators are correctly estimating the population mean, on average?
Which one is more accurate in estimating the population mean?

Using Monte Carlo simulation, let’s examine these questions!

Repeat the following processes 1000 times:

step 1. Draw \(n = 1000\) random numbers with known mean \(\mu\) and standard deviation \(\sigma\). This will be the sample data.

step 2. Get (1) the mean of the sample and (2) the value of the first observation, and save the results.

The previous iterations produce 1000 estimates of the population mean for estimator 1 and estimator 2, respectively. Compute the means for each estimator. Are they both close to the true population mean? Compute the variance of the estimates. Which one has a smaller variance?

If you could also visually compare the distribution of estimates from the two estimators, that would be great!

Appendix

foreach Function

Basics
Example
Change the Output Format

The foreach function comes from the foreach package. It is used to iterate the same process over and over again, similar to the for loop.

Basic Syntax

While there are some differences, the basic syntax of the foreach function is very similar to that of the for loop.

foreach(variable = collection_of_objects) %do% {
  the code to be executed in each iteration

  return(output)
}

Note

Differences between for loop and foreach function:
- use = instead of in.
- You need to use %do% operator.
- foreach function has a return value, while for loop does not. By default, the output is returned as a list.

(The foreach function also supports parallel processing, but we will not cover this in this class.)

Using the for loop and foreach function, let’s calculate the square of the numbers from 1 to 10, respectively.

foreach

for loop
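A side-by-side sketch of the two approaches, assuming the foreach package is installed. The object names are illustrative. Note that foreach returns its results directly (as a list by default), while the for loop needs a storage object.

```r
library(foreach)  # assumes the foreach package is installed

# foreach: the results of all iterations are returned as a list.
squares_foreach <- foreach(i = 1:10) %do% {
  i^2
}

# for loop: results must be saved in a pre-allocated object.
squares_for <- numeric(10)
for (i in 1:10) {
  squares_for[i] <- i^2
}

unlist(squares_foreach)  # flatten the list to compare with the vector
squares_for
```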

By default, foreach function returns each iteration’s result as a list. But you can choose the format of the output by using the .combine argument.

.combine = c combines each iteration’s result as a vector (like c() to create a vector).
.combine = rbind combines each iteration’s result by row.
.combine = cbind combines each iteration’s result by column.

The last two options are used when the output is a matrix or a data.frame (or data.table).

Example

Try different .combine options in the following code.
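A sketch of the three .combine options on a small example, assuming the foreach package is installed. Each iteration of the last two loops returns a length-2 vector, so rbind stacks the results as rows and cbind places them side by side as columns.

```r
library(foreach)  # assumes the foreach package is installed

# .combine = c: results are concatenated into a vector.
as_vector <- foreach(i = 1:3, .combine = c) %do% {
  i^2
}

# .combine = rbind: each iteration's result becomes a row.
by_row <- foreach(i = 1:3, .combine = rbind) %do% {
  c(i, i^2)
}

# .combine = cbind: each iteration's result becomes a column.
by_col <- foreach(i = 1:3, .combine = cbind) %do% {
  c(i, i^2)
}

as_vector
dim(by_row)  # 3 rows, 2 columns
dim(by_col)  # 2 rows, 3 columns
```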