Simulating Generalized Roy Models
Christophe Bruneel-Zupanc
Last modified: 2024-07-22
To run simulation exercises and check the quality of our estimators, `simul_data` simulates generalized Roy models with semi-IVs. This note describes the exact model simulated by the function. It allows for quite flexible models with very general treatment effect heterogeneity, but it can also simulate models with homogeneous treatment effects, or even more standard models where the semi-IVs are valid IVs.
The Generalized Roy Model
This function simulates a generalized Roy model as described in Bruneel-Zupanc (2024).
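Potential Outcomes. The potential outcomes (e.g., earnings) are given by:
$$Y_0 = \delta_0 + \beta_0 W_0 + \gamma_0' X + U_0,$$
$$Y_1 = \delta_1 + \beta_1 W_1 + \gamma_1' X + U_1,$$
where $W_0$ and $W_1$ are the observed semi-IVs excluded from $Y_1$ and $Y_0$ respectively, $X = (X^b, X^c)$ is a vector of a binary ($X^b$, e.g., location) and a continuous ($X^c$, e.g., education of the parents) observable covariate, and $U_0$ and $U_1$ are unobservable errors. The coefficients $(\delta_d, \beta_d, \gamma_d)$ are set through `param_y0` and `param_y1`.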
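As a minimal base-R sketch of the potential outcome equations described above (not the package's internal code), the snippet below builds both potential outcomes and checks the exclusion of $W_0$ from $Y_1$. The helper name `make_potential_outcomes`, the independent draws for the semi-IVs, and the standard normal errors are assumptions made only for illustration. The coefficient vectors follow the `param_y0`/`param_y1` ordering: intercept, semi-IV effect, state effect, parental education effect.

```r
# Hypothetical helper (not part of semiIVreg):
# Y_d = intercept + b_W * W_d + b_state * state + b_educ * educ + U_d
make_potential_outcomes <- function(n, param_y0, param_y1) {
  state <- rbinom(n, size = 1, prob = 0.4)  # binary covariate (assumed Bernoulli)
  educ  <- rnorm(n, mean = 0, sd = 2)       # continuous covariate (assumed normal)
  w0 <- rnorm(n)                            # semi-IVs, drawn independently for simplicity
  w1 <- rnorm(n)
  u0 <- rnorm(n)                            # unobservable errors (assumed standard normal)
  u1 <- rnorm(n)
  y0 <- param_y0[1] + param_y0[2] * w0 + param_y0[3] * state + param_y0[4] * educ + u0
  y1 <- param_y1[1] + param_y1[2] * w1 + param_y1[3] * state + param_y1[4] * educ + u1
  data.frame(state, educ, w0, w1, y0, y1)
}

set.seed(1)
po <- make_potential_outcomes(1e5, param_y0 = c(3.2, 0.8, 0, 0),
                              param_y1 = c(3.6, 0.5, 0, 0))
coef(lm(y0 ~ w0, data = po))  # slope close to 0.8: W0 enters Y0 directly
coef(lm(y1 ~ w0, data = po))  # slope close to 0: W0 is excluded from Y1
```

The second regression is what makes $W_0$ a semi-IV rather than an IV: it is excluded from $Y_1$ only, while it still shifts $Y_0$.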
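Selection Problem. We only observe the outcome
$$Y = D Y_1 + (1 - D) Y_0,$$
where $D$ represents the (binary) treatment decision (e.g., education choice), given by the following selection rule:
$$D = \mathbb{1}\{D^* > 0\}, \qquad D^* = g(W_0, W_1, X) - V,$$
where $g$ is an index that is linear in $W_0$, $W_1$, their interaction $W_0 W_1$, and $X$, with coefficients set through `param_p`, and $V$ is the main unobservable probability shock: the higher $V$, the less likely one is to be treated. Note that we normalize $V$ to get the normalized probability shock $U_D = F_V(V)$, where $F_V$ is the cumulative distribution function of $V$. $U_D$ can be interpreted as unobserved resistance to treatment. The closer $U_D$ is to 0, the more likely the individual is to be treated.
This specification implies that the probability of treatment is given by:
$$P(D = 1 \mid W_0, W_1, X) = F_V\big(g(W_0, W_1, X)\big).$$
Thus, ceteris paribus, the lower $U_D$, the higher the probability of treatment: $D = 1$ exactly when $U_D$ falls below this probability.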
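The link between the selection rule, the normalized shock, and the treatment probability can be checked numerically. The sketch below is base R only and rests on stated assumptions: a latent index entering as $D^* = \text{index} - V$, a normal shock $V$ with an illustrative variance of 1.3, and an index with only a constant, $W_0$ and $W_1$ terms. None of this is taken from the `simul_data` source.

```r
set.seed(2)
n  <- 1e5
w0 <- rnorm(n)
w1 <- rnorm(n)
sd_v <- sqrt(1.3)                   # illustrative standard deviation of V (assumption)
v  <- rnorm(n, sd = sd_v)           # unobserved shock V
index <- 0 - 0.7 * w0 + 0.7 * w1    # assumed linear index: constant, W0 and W1 terms
d  <- as.integer(index - v > 0)     # assumed selection rule D = 1{index - V > 0}

u_d <- pnorm(v, sd = sd_v)          # normalized shock U_D = F_V(V), uniform on [0, 1]
p   <- pnorm(index, sd = sd_v)      # P(D = 1 | W0, W1) = F_V(index)

# The latent-index rule and the rule D = 1{U_D < P} select the same individuals
same_rule <- all(d == as.integer(u_d < p))

# A probit recovers the index coefficients up to the scale of V
probit <- glm(d ~ w0 + w1, family = binomial(link = "probit"))
est <- unname(coef(probit)) * sd_v  # rescaled: close to (0, -0.7, 0.7)
est
```

Because a probit normalizes the shock variance to one, the raw coefficients estimate the index divided by the standard deviation of $V$, which is why they are rescaled before comparing them with the index parameters.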
Unobservables specification
The specification of the unobservables depends on the model type.
Heterogeneous treatment effects
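For the general heterogeneous treatment effect model, we have:
$$(U_0, U_1)' \sim \mathcal{N}\left(0, \begin{pmatrix} \sigma^2_{U_0} & \sigma_{U_0 U_1} \\ \sigma_{U_0 U_1} & \sigma^2_{U_1} \end{pmatrix}\right), \qquad C \sim \mathcal{N}(0, \sigma^2_C), \qquad V = -(U_1 - U_0 - C),$$
where $C$ is an unobserved cost of treatment and the variances and covariance are set through `param_error`.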
library(semiIVreg)
# Example of a general model with heterogeneous treatment effects
N = 100000; set.seed(1234)
model_type = "heterogenous"
param_error = c(1, 1, 0.6, 0.5) # var_u0, var_u1, cov_u0u1, var_cost (the mean cost = constant in D*) # if heterogenous
param_Z = c(0, 0, 0, 0, 1.5, 1.5, 0.9) # meanW0 state0, meanW1 state0, meanW0 state1, meanW1 state1, varW0, varW1, covW0W1
param_p = c(0, -0.7, 0.7, 0, 0, 0) # constant, alphaW0, alphaW1, alphaW0W1, effect of state, effect of parent educ
param_y0 = c(3.2, 0.8, 0, 0) # intercept, effect of W0, effect of state, effect of parent educ
param_y1 = c(3.2+0.4, 0.5, 0, 0) # intercept (the +0.4 = average treatment effect, since W0 and W1 have mean zero), effect of W1, effect of state, effect of parent educ
param_genX = c(0.4, 0, 2) # probability state=1 (instead of 0), mean_parenteduc, sd_parenteduc (parenteduc drawn as continuous)
data = simul_data(N, model_type, param_y0, param_y1, param_p, param_Z, param_genX, param_error)
Note that this is the specification used to simulate the `roydata` dataset available in the package, which can be loaded using `data(roydata)`.
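To see what the heterogeneous error structure implies, here is a minimal base-R sketch. It assumes, as in a standard Roy model, that the selection shock is minus the unobserved net gain from treatment, $V = -(U_1 - U_0 - C)$, with a cost $C$ drawn independently of $(U_0, U_1)$. This form and the variable names are assumptions for illustration and are not taken from the `simul_data` source; the numerical values match `param_error` above.

```r
set.seed(3)
n <- 1e5
var_u0 <- 1; var_u1 <- 1; cov_u0u1 <- 0.6; var_cost <- 0.5  # values of param_error above
Sigma <- matrix(c(var_u0, cov_u0u1, cov_u0u1, var_u1), nrow = 2)

# Correlated (U0, U1): rows of z %*% chol(Sigma) have covariance matrix Sigma
z  <- matrix(rnorm(2 * n), ncol = 2)
u  <- z %*% chol(Sigma)
u0 <- u[, 1]
u1 <- u[, 2]
cost <- rnorm(n, sd = sqrt(var_cost))   # unobserved cost, assumed independent of (U0, U1)

v <- -(u1 - u0 - cost)                  # assumed Roy-type selection shock
var_v    <- var(v)                      # theory: var_u0 + var_u1 - 2 * cov_u0u1 + var_cost = 1.3
cov_gain <- cov(u1 - u0, v)             # theory: -var(U1 - U0) = -0.8
c(var_v = var_v, cov_gain = cov_gain)
```

The negative covariance between the unobserved gain $U_1 - U_0$ and $V$ is what generates selection on gains in this sketch: individuals with low resistance to treatment tend to be those who benefit most from it.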
Homogeneous treatment effects
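For the restricted homogeneous treatment effect model:
$$U_0 = U_1 = U, \qquad (U, V)' \sim \mathcal{N}\left(0, \begin{pmatrix} \sigma^2_U & \sigma_{UV} \\ \sigma_{UV} & \sigma^2_V \end{pmatrix}\right),$$
where the variances and covariance are set through `param_error`.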
In both cases, $V$ is normally distributed, such that the selection equation is a probit model.
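Covariates and Semi-IVs Specification. The covariates are generated by
$$X^b \sim \text{Bernoulli}(p_X), \qquad X^c \sim \mathcal{N}(\mu_X, \sigma_X^2),$$
with $(p_X, \mu_X, \sigma_X)$ set through `param_genX`.
The semi-IVs are $D$-specific and are given by:
$$(W_0, W_1)' \mid X^b \sim \mathcal{N}\left(\begin{pmatrix} \mu_{W_0}(X^b) \\ \mu_{W_1}(X^b) \end{pmatrix}, \begin{pmatrix} \sigma^2_{W_0} & \sigma_{W_0 W_1} \\ \sigma_{W_0 W_1} & \sigma^2_{W_1} \end{pmatrix}\right),$$
where the means $\mu_{W_0}(X^b)$ and $\mu_{W_1}(X^b)$ depend on the binary covariate $X^b$. These moments are set through `param_Z`.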
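The covariate and semi-IV draws described above can be sketched in base R as follows. The parameter ordering is the one given in the comments of the examples in this note (`param_genX`, `param_Z`); the bivariate normal draw via a Cholesky factor is an assumption about how such draws could be produced, not a description of the package internals.

```r
set.seed(4)
n <- 1e5
param_genX <- c(0.4, 0, 2)               # P(state = 1), mean and sd of parental education
param_Z <- c(0, 0, 0, 0, 1.5, 1.5, 0.9)  # means of (W0, W1) by state, varW0, varW1, covW0W1

state <- rbinom(n, size = 1, prob = param_genX[1])           # binary covariate
educ  <- rnorm(n, mean = param_genX[2], sd = param_genX[3])  # continuous covariate

Sigma_w <- matrix(c(param_Z[5], param_Z[7], param_Z[7], param_Z[6]), nrow = 2)
noise <- matrix(rnorm(2 * n), ncol = 2) %*% chol(Sigma_w)    # mean-zero bivariate normal

# State-specific means: elements 1-2 apply when state == 0, elements 3-4 when state == 1
w0 <- ifelse(state == 1, param_Z[3], param_Z[1]) + noise[, 1]
w1 <- ifelse(state == 1, param_Z[4], param_Z[2]) + noise[, 2]

round(cov(cbind(w0, w1)), 2)   # close to variances 1.5 and covariance 0.9
mean(state)                    # close to 0.4
```

Setting different means by state (for instance a nonzero `param_Z[3]`) would make the semi-IVs correlated with the binary covariate, which is why that covariate should then be controlled for in estimation.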
# Model with homogeneous treatment effects: param_error has a different structure here.
library(semiIVreg)
N = 10000; set.seed(1234)
model_type = "homogenous"
param_error = c(1, 1.5, -0.6) # var_u, var_v, cov_uv # if homogenous
param_Z = c(0, 0, 0, 0, 1.5, 1.5, 0.9) # meanW0 state0, meanW1 state0, meanW0 state1, meanW1 state1, varW0, varW1, covW0W1
param_p = c(0, -0.5, 0.5, 0, 0, 0) # constant, alphaW0, alphaW1, alphaW0W1, effect of state, effect of parent educ
param_y0 = c(3.2, 0.8, 0, 0) # intercept, effect of W0, effect of state, effect of parent educ
param_y1 = c(3.2+0.4, 0.5, 0, 0) # intercept (the +0.4 = average treatment effect, since W0 and W1 have mean zero), effect of W1, effect of state, effect of parent educ
param_genX = c(0.4, 0, 2) # probability state=1 (instead of 0), mean_parenteduc, sd_parenteduc (parenteduc drawn as continuous)
data = simul_data(N, model_type, param_y0, param_y1, param_p, param_Z, param_genX, param_error)
This is the specification used to simulate the `roydata2` dataset available in the package, which can be loaded using `data(roydata2)`.
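The homogeneous case can be illustrated with a short base-R sketch. It assumes that both potential outcomes share a single error $U$ that is correlated with the selection shock $V$, and that selection follows $D = \mathbb{1}\{\text{index} - V > 0\}$; both points are assumptions for illustration, with numerical values borrowed from the example above and the semi-IVs drawn independently for simplicity.

```r
set.seed(5)
n <- 1e5
Sigma_uv <- matrix(c(1, -0.6, -0.6, 1.5), nrow = 2)  # var_u, cov_uv, cov_uv, var_v
uv <- matrix(rnorm(2 * n), ncol = 2) %*% chol(Sigma_uv)
u <- uv[, 1]
v <- uv[, 2]

w0 <- rnorm(n)                                  # semi-IVs, independent draws for simplicity
w1 <- rnorm(n)
y0 <- 3.2 + 0.8 * w0 + u                        # the same error U enters both outcomes (assumed)
y1 <- 3.6 + 0.5 * w1 + u
d  <- as.integer(-0.5 * w0 + 0.5 * w1 - v > 0)  # assumed selection rule
y  <- ifelse(d == 1, y1, y0)                    # observed outcome

# Individual effects depend on the semi-IVs only, not on the unobservables
gap <- max(abs((y1 - y0) - (0.4 + 0.5 * w1 - 0.8 * w0)))  # numerically zero
naive <- coef(lm(y ~ d))[["d"]]                 # naive comparison of treated and untreated
c(gap = gap, naive = naive)
```

Even without effect heterogeneity in the unobservables, the naive comparison is well above the average effect of 0.4 in this sketch, because $U$ and $V$ are correlated and treated individuals are selected on $V$.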
Simulating Standard IV MTE Models
`simul_data` can also simulate models in which the semi-IVs are valid IVs, as used to estimate Marginal Treatment Effects (MTE), by setting the effect of each semi-IV on its respective outcome to zero. The coefficients can be adjusted to mimic the Roy models of Heckman, Urzua, and Vytlacil (2006) or Heckman and Vytlacil (2007). Small adjustments inside the function allow mimicking the simulation of Andresen (2018) (`mtefe` in Stata), but with only 2 regions (state).
# Example of generalized Roy Model where the semi-IVs are valid IVs
N = 50000; set.seed(1234)
model_type = "heterogenous"
param_error = c(1, 1, 0.6, 0.5) # var_u0, var_u1, cov_u0u1, var_cost (the mean cost = constant in D*) # if heterogenous
param_Z = c(0, 0, 0, 0, 1.5, 1.5, 0.9) # meanW0 state0, meanW1 state0, meanW0 state1, meanW1 state1, varW0, varW1, covW0W1
param_p = c(0, -0.7, 0.7, 0, 0, 0) # constant, alphaW0, alphaW1, alphaW0W1, effect of state, effect of parent educ
param_y0 = c(3.2, 0, 0, 0) # intercept, effect of W0 (zero: W0 is a valid IV), effect of state, effect of parent educ
param_y1 = c(3.2+0.4, 0, 0, 0) # intercept (the +0.4 = average treatment effect), effect of W1 (zero: W1 is a valid IV), effect of state, effect of parent educ
param_genX = c(0.4, 0, 2) # probability state=1 (instead of 0), mean_parenteduc, sd_parenteduc (parenteduc drawn as continuous)
data = simul_data(N, model_type, param_y0, param_y1, param_p, param_Z, param_genX, param_error)
param_y0[2] # W0 is a valid IV because no direct effect on Y0
#> [1] 0
param_y1[2] # W1 is a valid IV because no direct effect on Y1
#> [1] 0
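As a quick check of what "valid IV" buys, the base-R sketch below simulates a simplified version of this design and applies a textbook Wald/IV ratio. It deliberately uses a homogeneous effect of 0.4 so that the IV estimand equals the average effect; with heterogeneous effects, as in the example above, an IV ratio would instead recover a weighted average of marginal treatment effects. The selection rule $D = \mathbb{1}\{\text{index} - V > 0\}$ and all variable names are assumptions for illustration.

```r
set.seed(6)
n <- 2e5
Sigma_uv <- matrix(c(1, -0.6, -0.6, 1.5), nrow = 2)  # var_u, cov_uv, cov_uv, var_v
uv <- matrix(rnorm(2 * n), ncol = 2) %*% chol(Sigma_uv)
u <- uv[, 1]
v <- uv[, 2]

w0 <- rnorm(n)
w1 <- rnorm(n)
d  <- as.integer(-0.7 * w0 + 0.7 * w1 - v > 0)  # W0 and W1 shift treatment (assumed rule)
y  <- 3.2 + 0.4 * d + u                         # no direct effect of W0 or W1 on Y: valid IVs

iv_w1 <- cov(y, w1) / cov(d, w1)   # Wald/IV ratio using W1 as instrument, close to 0.4
naive <- coef(lm(y ~ d))[["d"]]    # naive OLS, biased by selection on U
c(iv_w1 = iv_w1, naive = naive)
```

If `w1` had a direct effect on the outcome, as a semi-IV does, this ratio would no longer be centered on the treatment effect, which is the situation the semi-IV estimators are designed for.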