
To run simulation exercises and check the quality of our estimators, simul_data simulates generalized Roy models with semi-IVs. This note describes the exact model that the function simulates. It allows for quite flexible models with very general treatment effect heterogeneity, but it can also simulate models with homogeneous treatment effects, or more standard models in which the semi-IVs are valid IVs.

The Generalized Roy Model

This function simulates a generalized Roy model as described in Bruneel-Zupanc (2024).

Potential Outcomes. The potential outcomes (e.g., earnings) are given by:

$$
Y_0 = \delta_{0} + \beta_{0} W_0 + X \beta_{0X} + U_0,
$$

$$
Y_1 = \delta_{1} + \beta_{1} W_1 + X \beta_{1X} + U_1,
$$

where $W_0$ and $W_1$ are the observed semi-IVs, excluded from $Y_1$ and $Y_0$ respectively, $X = (X_1, X_2)$ is a vector of observable covariates, one binary ($X_1$, e.g., location) and one continuous ($X_2$, e.g., education of the parents), and $U_0, U_1$ are unobservable errors.

Selection Problem. We only observe the outcome

$$
Y = (1-D) Y_0 + D Y_1,
$$
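The outcome equations and the observed-outcome rule can be sketched in base R as follows. This does not call simul_data: the coefficient values are illustrative, the inputs are drawn independently for simplicity, and the treatment indicator is a random placeholder rather than the model's selection rule.

```r
set.seed(1)
n <- 5

# Hypothetical inputs: semi-IVs, covariates and errors drawn independently
W0 <- rnorm(n); W1 <- rnorm(n)
X1 <- rbinom(n, 1, 0.4); X2 <- rnorm(n, 0, 2)
U0 <- rnorm(n); U1 <- rnorm(n)

# Illustrative coefficients: intercept, own semi-IV, X1, X2
Y0 <- 3.2 + 0.8 * W0 + 0 * X1 + 0 * X2 + U0   # W1 is excluded from Y0
Y1 <- 3.6 + 0.5 * W1 + 0 * X1 + 0 * X2 + U1   # W0 is excluded from Y1

# Placeholder treatment indicator (random here, NOT the model's selection rule)
D <- rbinom(n, 1, 0.5)

# Observed outcome: Y0 for the untreated, Y1 for the treated
Y <- (1 - D) * Y0 + D * Y1
```

Only one of the two potential outcomes is ever observed for a given individual, which is the selection problem the estimators have to deal with.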

where $D$ represents the (binary) treatment decision (e.g., education choice), given by the following selection rule:

$$
\begin{aligned}
D^* &= g(W_0, W_1, X) - V \\
&= ( \alpha + \alpha_0 W_0 + \alpha_1 W_1 + \alpha_2 W_0 \times W_1 + \alpha_{X_1} X_1 + \alpha_{X_2} X_2) - V, \\
\text{ with } \quad D &= \mathbb{I}(D^* > 0),
\end{aligned}
$$

where $V$ is the main unobservable selection shock: since $V$ enters $D^*$ negatively, the higher $V$, the less likely one is to be treated. We normalize $U_D = F_V(V)$, where $F_V$ is the cumulative distribution function of $V$, to get the normalized shock $U_D \sim \mathcal{U}(0, 1)$. $U_D$ can be interpreted as unobserved resistance to treatment: the closer $U_D$ is to 0, the more likely the individual is to be treated.

This specification yields that the probability of treatment is given by:

$$
\textrm{Pr}(D=1 | W_0, W_1, X) = \textrm{Pr}(V < g(W_0, W_1, X)).
$$

Thus, ceteris paribus, the higher $g$, the higher the probability of treatment.
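The selection rule can be illustrated with a short base-R sketch. It is not the package's internal code: the alpha values are illustrative, and V is assumed here to be standard normal and independent of the observables. It compares the share of treated individuals among those with a low and a high index g.

```r
set.seed(2)
n <- 1e5

W0 <- rnorm(n); W1 <- rnorm(n)
X1 <- rbinom(n, 1, 0.4); X2 <- rnorm(n, 0, 2)

# Illustrative selection coefficients: constant, W0, W1, W0*W1, X1, X2
alpha <- c(0, -0.7, 0.7, 0, 0, 0)
g <- alpha[1] + alpha[2] * W0 + alpha[3] * W1 + alpha[4] * W0 * W1 +
  alpha[5] * X1 + alpha[6] * X2

# Assumption for this sketch: V is standard normal, independent of (W0, W1, X)
V <- rnorm(n)
D <- as.integer(g - V > 0)   # D = 1 exactly when V < g

# Normalized shock U_D = F_V(V), uniform on (0, 1) when F_V is the CDF of V
U_D <- pnorm(V)

# Treatment share below and above the median of g
share_low  <- mean(D[g < median(g)])
share_high <- mean(D[g >= median(g)])
c(low_g = share_low, high_g = share_high)
```

Individuals with a larger index g are treated more often, and treated individuals are those with a small U_D, in line with its reading as resistance to treatment.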

Unobservables specification

The specification of the unobservables depends on the model type.

Heterogenous treatment effects

For the general heterogeneous treatment effect model, we have:

$$
\begin{pmatrix} U_0 \\ U_1 \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2_{U0} & \sigma_{U0U1} \\ \sigma_{U0U1} & \sigma^2_{U1} \end{pmatrix} \right),
$$

$$
C \sim \mathcal{N}(\mu_{\text{cost}}, \sigma^2_{\text{cost}}),
$$

$$
V = -(U_1 - U_0 - C).
$$
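A minimal base-R sketch of these draws, not the package's implementation. It assumes that the cost C is independent of (U0, U1) and has mean zero; neither is stated above. Under independence, the variance of V is var_u0 + var_u1 - 2 cov_u0u1 + var_cost, which the simulation can be compared against.

```r
set.seed(3)
n <- 2e5

# Example values: var_u0, var_u1, cov_u0u1, var_cost
param_error <- c(1, 1, 0.6, 0.5)
Sigma <- matrix(c(param_error[1], param_error[3],
                  param_error[3], param_error[2]), nrow = 2)

# Bivariate normal draws via the Cholesky factor (base R only)
Z <- matrix(rnorm(2 * n), ncol = 2)
U <- Z %*% chol(Sigma)      # rows are (U0, U1) with covariance Sigma
U0 <- U[, 1]; U1 <- U[, 2]

# Assumptions of this sketch: cost independent of (U0, U1), with mean 0
C <- rnorm(n, mean = 0, sd = sqrt(param_error[4]))

V <- -(U1 - U0 - C)

# Implied variance of V under independence, next to the simulated one
var_V_theory <- param_error[1] + param_error[2] - 2 * param_error[3] + param_error[4]
c(theory = var_V_theory, simulated = var(V))
```

Because V contains U1 - U0, selection into treatment is correlated with the individual gain, which is what generates heterogeneous treatment effects along the resistance to treatment.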

library(semiIVreg)
#> KernSmooth 2.23 loaded
#> Copyright M. P. Wand 1997-2009
#> Loading required package: zoo
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:data.table':
#> 
#>     yearmon, yearqtr
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric
# Example of general model with heterogenous treatment effects
N = 100000; set.seed(1234)
model_type = "heterogenous"
param_error = c(1, 1, 0.6, 0.5) # var_u0, var_u1, cov_u0u1, var_cost (the mean cost = constant in D*) # if heterogenous
param_Z = c(0, 0, 0, 0, 1.5, 1.5, 0.9) # meanW0 state0, meanW1 state0, meanW0 state1, meanW1 state1, varW0, varW1, covW0W1
param_p = c(0, -0.7, 0.7, 0, 0, 0) # constant, alphaW0, alphaW1, alphaW0W1, effect of state, effect of parent educ
param_y0 = c(3.2, 0.8, 0, 0) # intercept, effect of W0, effect of state, effect of parent educ
param_y1 = c(3.2+0.4, 0.5, 0, 0) # intercept (the +0.4 = average treatment effect), effect of W1, effect of state, effect of parent educ
param_genX = c(0.4, 0, 2) # probability state=1 (instead of 0), mean_parenteduc, sd_parenteduc (parenteduc drawn as continuous)

data = simul_data(N, model_type, param_y0, param_y1, param_p, param_Z, param_genX, param_error)

Note that this is the specification used to simulate the roydata dataset available in the package, which can be loaded using data(roydata).

Homogenous treatment effect

For the restricted homogeneous treatment effect model:

$$
\begin{pmatrix} U \\ V \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} 0 \\ \mu_{V} \end{pmatrix}, \begin{pmatrix} \sigma^2_{U} & \sigma_{UV} \\ \sigma_{UV} & \sigma^2_{V} \end{pmatrix} \right),
$$

$$
U_0 = U_1 = U.
$$

In both cases, $V$ is normally distributed, so the selection equation is a probit model.
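To make the probit form explicit, write $\mu_V$ and $\sigma^2_V$ for the mean and variance of $V$, and $\Phi$ for the standard normal CDF. Normality of $V$ then gives:

$$
\textrm{Pr}(D=1 | W_0, W_1, X) = \Phi\left( \frac{g(W_0, W_1, X) - \mu_V}{\sigma_V} \right), \qquad U_D = F_V(V) = \Phi\left( \frac{V - \mu_V}{\sigma_V} \right).
$$

In the homogeneous case, $\mu_V$ and $\sigma^2_V$ are specified directly. In the heterogeneous case they follow from $V = -(U_1 - U_0 - C)$: if one assumes that $C$ is independent of $(U_0, U_1)$, which is an assumption of this remark rather than something stated above, then

$$
\mu_V = \mu_{\text{cost}}, \qquad \sigma^2_V = \sigma^2_{U0} + \sigma^2_{U1} - 2 \sigma_{U0U1} + \sigma^2_{\text{cost}}.
$$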

Covariates and Semi-IVs Specification. The covariates are generated by

$$
X_1 \sim \text{Bernoulli}(p_{X_1}) \text{ and } X_2 \sim \mathcal{N}(\mu_{X_2}, \sigma^2_{X_2}).
$$

The semi-IVs are $X_1$-specific and are given by:

$$
\begin{pmatrix} W_0 \\ W_1 \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_{W0,x_1} \\ \mu_{W1,x_1} \end{pmatrix}, \begin{pmatrix} \sigma^2_{W0} & \sigma_{W0W1} \\ \sigma_{W0W1} & \sigma^2_{W1} \end{pmatrix} \right),
$$

where the means $\mu_{W0,x_1}$ and $\mu_{W1,x_1}$ depend on the binary covariate $X_1=x_1$.
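The covariate and semi-IV draws can be sketched in base R as follows. This is an illustration rather than the package's code: the ordering of param_genX and param_Z follows the comments in the examples of this note, and the Cholesky-based draw is one of several equivalent ways to generate a bivariate normal.

```r
set.seed(4)
n <- 1e5

# Example values: Pr(X1 = 1), mean of X2, sd of X2
param_genX <- c(0.4, 0, 2)
X1 <- rbinom(n, 1, param_genX[1])
X2 <- rnorm(n, param_genX[2], param_genX[3])

# Example values: mean W0 | X1=0, mean W1 | X1=0, mean W0 | X1=1, mean W1 | X1=1,
# var W0, var W1, cov(W0, W1)
param_Z <- c(0, 0, 0, 0, 1.5, 1.5, 0.9)
Sigma_W <- matrix(c(param_Z[5], param_Z[7],
                    param_Z[7], param_Z[6]), nrow = 2)

# Mean-zero bivariate normal draws, then shift by the X1-specific means
E <- matrix(rnorm(2 * n), ncol = 2) %*% chol(Sigma_W)
W0 <- ifelse(X1 == 1, param_Z[3], param_Z[1]) + E[, 1]
W1 <- ifelse(X1 == 1, param_Z[4], param_Z[2]) + E[, 2]

# Sample covariance matrix of the semi-IVs, to compare with Sigma_W
round(cov(cbind(W0, W1)), 2)
```

Setting different means for the two values of X1 makes the semi-IVs correlated with the binary covariate, while the covariance matrix is common to both groups.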

# Model with homogenous treatment effects - not the same param_error to specify. 
library(semiIVreg)
N = 10000; set.seed(1234)
model_type = "homogenous"
param_error = c(1, 1.5, -0.6) # var_u, var_v, cov_uv # if homogenous
param_Z = c(0, 0, 0, 0, 1.5, 1.5, 0.9) # meanW0 state0, meanW1 state0, meanW0 state1, meanW1 state1, varW0, varW1, covW0W1
param_p = c(0, -0.5, 0.5, 0, 0, 0) # constant, alphaW0, alphaW1, alphaW0W1, effect of state, effect of parent educ
param_y0 = c(3.2, 0.8, 0, 0) # intercept, effect of W0, effect of state, effect of parent educ
param_y1 = c(3.2+0.4, 0.5, 0, 0) # intercept (the +0.4 = average treatment effect), effect of W1, effect of state, effect of parent educ
param_genX = c(0.4, 0, 2) # probability state=1 (instead of 0), mean_parenteduc, sd_parenteduc (parenteduc drawn as continuous)

data = simul_data(N, model_type, param_y0, param_y1, param_p, param_Z, param_genX, param_error)

This is the specification used to simulate the roydata2 dataset available in the package, which can be loaded using data(roydata2).

Simulating Standard IV MTE Models

This function can also simulate standard IV models used to estimate Marginal Treatment Effects, by setting the effect of each semi-IV on its respective outcome to zero. The coefficients can be adjusted to mimic the Roy models of Heckman, Urzua, and Vytlacil (2006) or Heckman and Vytlacil (2007). Small adjustments inside the function allow mimicking the simulation of Andresen (2018) (mtefe in Stata), but with only 2 regions (state).

# Example of generalized Roy Model where the semi-IVs are valid IVs
N = 50000; set.seed(1234)
model_type = "heterogenous"
param_error = c(1, 1, 0.6, 0.5) # var_u0, var_u1, cov_u0u1, var_cost (the mean cost = constant in D*) # if heterogenous
param_Z = c(0, 0, 0, 0, 1.5, 1.5, 0.9) # meanW0 state0, meanW1 state0, meanW0 state1, meanW1 state1, varW0, varW1, covW0W1
param_p = c(0, -0.7, 0.7, 0, 0, 0) # constant, alphaW0, alphaW1, alphaW0W1, effect of state, effect of parent educ
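param_y0 = c(3.2, 0, 0, 0) # intercept, effect of W0 (zero: W0 is a valid IV), effect of state, effect of parent educ
param_y1 = c(3.2+0.4, 0, 0, 0) # intercept (the +0.4 = average treatment effect), effect of W1 (zero: W1 is a valid IV), effect of state, effect of parent educ
param_genX = c(0.4, 0, 2) # probability state=1 (instead of 0), mean_parenteduc, sd_parenteduc (parenteduc drawn as continuous)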

data = simul_data(N, model_type, param_y0, param_y1, param_p, param_Z, param_genX, param_error)

param_y0[2] # W0 is a valid IV because it has no direct effect on Y0
#> [1] 0
param_y1[2] # W1 is a valid IV because it has no direct effect on Y1
#> [1] 0

References

Andresen, Martin Eckhoff. 2018. “Exploring Marginal Treatment Effects: Flexible Estimation Using Stata.” The Stata Journal 18 (1): 118–58.
Bruneel-Zupanc, Christophe. 2024. “Don’t (Fully) Exclude Me, It’s Not Necessary! Identification with Semi-IVs.” https://arxiv.org/abs/2303.12667.
Heckman, James J., Sergio Urzua, and Edward Vytlacil. 2006. “Understanding Instrumental Variables in Models with Essential Heterogeneity.” The Review of Economics and Statistics 88 (3): 389–432.
Heckman, James J., and Edward J. Vytlacil. 2007. “Chapter 71 Econometric Evaluation of Social Programs, Part II: Using the Marginal Treatment Effect to Organize Alternative Econometric Estimators to Evaluate Social Programs, and to Forecast Their Effects in New Environments.” In Handbook of Econometrics, edited by James J. Heckman and Edward E. Leamer, 6:4875–5143. Elsevier. https://doi.org/10.1016/S1573-4412(07)06071-0.