Let us compare the performance of the semi-IV estimator under different violations of the standard IV assumptions. This note mimics the Homogenous Treatment Effects Appendix of Bruneel-Zupanc (2024).

Simulating data

We simulate generalized Roy models using the simul_data() function. See the function's documentation for details about the model. Depending on the chosen parameters, we can simulate models with homogenous or heterogenous treatment effects, as well as models where the semi-IVs are in fact valid IVs. That is what we will do here. In every simulation we do not include covariates (all their effects are set to 0), but these can easily be included.

Model with homogenous treatment effects

Simulate data

# Specification 1
library(semiIVreg)
#> KernSmooth 2.23 loaded
#> Copyright M. P. Wand 1997-2009
#> Loading required package: zoo
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:data.table':
#> 
#>     yearmon, yearqtr
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric
N = 10000; set.seed(1234)
model_type = "homogenous"
param_error = c(1, 1, 0.6) # var_u, var_v, cov_uv # if homogenous
param_Z = c(0, 0, 0, 0, 1, 1, 0.5) # meanW0 state0, meanW1 state0, meanW0 state1, meanW1 state1, varW0, varW1, covW0W1
param_p = c(0, -0.7, 0.7, 0, 0, 0) # constant, alphaW0, alphaW1, alphaW0W1, effect of state, effect of parent educ
param_y0 = c(3.2, 0.8, 0, 0) # intercept, effect of Wd, effect of state, effect of parent educ;
param_y1 = c(3.2+0.4, 0.5, 0, 0) # the +0.4 = average treatment effect; effect of W1, effect of state, effect of parent educ;
param_genX = c(0.4, 0, 2) # probability state=1 (instead of 0), mean_parenteduc, sd_parenteduc (parenteduc drawn as continuous)
# Remark: mean_V and the constant in the probit are playing the same role; normalize one to zero.
param = list(param_error, param_Z, param_p, param_y0, param_y1, param_genX, model_type)

data = simul_data(N, model_type, param_y0, param_y1, param_p, param_Z, param_genX, param_error)

This is a model with homogenous treatment effects (conditional on W_0, W_1).

True unobserved homogenous treatment effects

Here, at (W_0, W_1) = (0, 0), the effect is always 0.4 (see the parameters). To see this, let us use the true unobserved potential outcomes (observed only thanks to the simulation).

data$true_TE = data$y1 - data$y0
summary(lm(true_TE ~ w0 + w1, data))
#> 
#> Call:
#> lm(formula = true_TE ~ w0 + w1, data = data)
#> 
#> Residuals:
#>        Min         1Q     Median         3Q        Max 
#> -1.065e-13 -2.090e-16 -2.000e-18  1.980e-16  1.602e-13 
#> 
#> Coefficients:
#>               Estimate Std. Error            t value Pr(>|t|)    
#> (Intercept)  4.000e-01  1.956e-17  20453045425901668   <2e-16 ***
#> w0          -8.000e-01  2.248e-17 -35595035443326064   <2e-16 ***
#> w1           5.000e-01  2.259e-17  22130078856903196   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.956e-15 on 9997 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 6.463e+32 on 2 and 9997 DF,  p-value: < 2.2e-16
true_param = c(param_y0[1], param_y1[1] - param_y0[1], param_y0[2], param_y1[2]); true_param
#> [1] 3.2 0.4 0.8 0.5
# constant y0, constant y1 - constant y0, effects of w0 on y0, effects of w1 on y1.

This is the true model that we would like to recover. Note that the fit is perfect because the error term is exactly the same in both potential outcomes, U_0 = U_1 = U. If we had an additional error term uncorrelated with the rest, we would estimate the same coefficients, only with more noise.
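To see mechanically why the fit is perfect, here is a minimal self-contained sketch (base R, hypothetical data independent of simul_data(), reusing the parameter values above): with a single shared error U, y1 - y0 is an exact linear function of (w0, w1), so lm() recovers the coefficients with essentially zero residuals.

```r
# Minimal by-hand illustration (toy data, not simul_data output):
# with U_0 = U_1 = U, the difference y1 - y0 is deterministic given (w0, w1).
set.seed(1)
n  <- 5000
w0 <- rnorm(n); w1 <- rnorm(n); U <- rnorm(n)
y0 <- 3.2 + 0.8 * w0 + U          # the same U enters both potential outcomes
y1 <- 3.2 + 0.4 + 0.5 * w1 + U
fit <- lm(I(y1 - y0) ~ w0 + w1)   # perfect fit: residuals are numerically zero
round(coef(fit), 4)               # 0.4, -0.8, 0.5 exactly
```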

Naive OLS

naive_ols = lm(y ~ d + w0 + w1, data); summary(naive_ols)
#> 
#> Call:
#> lm(formula = y ~ d + w0 + w1, data = data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.0172 -0.6411 -0.0147  0.6318  4.1403 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  3.83027    0.01410  271.58   <2e-16 ***
#> d           -0.54983    0.02117  -25.98   <2e-16 ***
#> w0           0.20181    0.01185   17.04   <2e-16 ***
#> w1           0.45476    0.01189   38.24   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9406 on 9996 degrees of freedom
#> Multiple R-squared:  0.3026, Adjusted R-squared:  0.3024 
#> F-statistic:  1446 on 3 and 9996 DF,  p-value: < 2.2e-16

Obviously, the naive OLS estimator is biased, because there is endogeneity: U is correlated with D and Y. It is severely biased downward, estimating the treatment effect at about -0.55 instead of the true 0.4. This is because individuals with high U have a higher Y but select themselves less into D=1. Indeed, U is positively correlated with V (cf. the covariance parameter in param_error), and the higher V, the less likely one is to select D=1, so E(U | D=1) < E(U | D=0).

mean(data$U1[which(data$d == 1)]); mean(data$U0[which(data$d==0)])
#> [1] -0.3762545
#> [1] 0.3864966
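This sorting mechanism can be reproduced in a self-contained sketch (base R, hypothetical toy parameters mirroring param_error, not simul_data()): with cov(U, V) > 0, high-U individuals draw high V and thus select less into D=1, so the naive OLS coefficient falls below the true effect of 0.4.

```r
# Self-contained sketch of negative sorting (illustrative toy data only)
set.seed(2)
n <- 20000
U <- rnorm(n)
V <- 0.6 * U + sqrt(1 - 0.6^2) * rnorm(n)   # cov(U, V) = 0.6, var(V) = 1
d <- as.numeric(rnorm(n) > V)               # higher V => less likely treated
y <- 3.2 + 0.4 * d + U                      # true homogenous effect: 0.4
est <- c(naive_ols   = unname(coef(lm(y ~ d))["d"]),
         sorting_gap = mean(U[d == 1]) - mean(U[d == 0]))
round(est, 3)  # naive OLS well below 0.4; sorting gap E(U|D=1)-E(U|D=0) < 0
```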

Wrongly assuming that the semi-IVs are valid IVs

What if we assume that W_0 and W_1 are valid IVs (i.e., that they have no direct effect on their respective potential outcomes)?

library(ivreg) # remark: ivreg is not otherwise required by semiIVreg, only in this vignette.
valid_iv = ivreg(y ~ d | w0 + w1 + w0:w1, data=data); summary(valid_iv)
#> 
#> Call:
#> ivreg(formula = y ~ d | w0 + w1 + w0:w1, data = data)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -3.909987 -0.756706 -0.003831  0.746416  4.813215 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  3.56922    0.02676 133.387   <2e-16 ***
#> d           -0.02044    0.04905  -0.417    0.677    
#> 
#> Diagnostic tests:
#>                   df1  df2 statistic p-value    
#> Weak instruments    3 9996     886.4  <2e-16 ***
#> Wu-Hausman          1 9997      96.0  <2e-16 ***
#> Sargan              2   NA    2604.4  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.124 on 9998 degrees of freedom
#> Multiple R-Squared: 0.003449,    Adjusted R-squared: 0.003349 
#> Wald test: 0.1736 on 1 and 9998 DF,  p-value: 0.6769

The estimated coefficient is also completely biased here, because the IV estimation assumes the semi-IVs have no direct effect on the outcomes while they do.

semi-IV estimation with semiivreg

Let us now see how the semi-IV estimator performs. To impose homogenous treatment effects in the estimation, we specify est_method="homogenous". This performs a sieve-like estimation with an additional constraint that restricts the MTE to be constant, by imposing constraints on the control functions κ_0(p) and κ_1(p).

semiiv = semiivreg(y~d|w0|w1, data, est_method="homogenous", plotting=FALSE)
summary_coeff = semiiv$estimate$est # extract the coefficients from the homogenous TE specification
summary_coeff[1:4,] # only print the first 4 coefficients, the other correspond to the control function of P
#>      Variable  Estimate  Std_Error    t_value       p_value
#> 1 (Intercept) 3.2453691 0.02408485 134.747305  0.000000e+00
#> 2        I(d) 0.3255329 0.04497677   7.237801  4.895871e-13
#> 3 I(1 - d):w0 0.7834206 0.01431262  54.736343  0.000000e+00
#> 4     I(d):w1 0.5194180 0.01432695  36.254620 1.689759e-270
true_param
#> [1] 3.2 0.4 0.8 0.5

The estimated coefficients are very close to the truth, with relatively small standard errors here, even though the sample size is modest (10000 observations).

Be cautious though: the standard errors are computed without taking into account that the propensity score is itself estimated in a first stage. This can be corrected with semiivreg_boot(). But typically, if the first stage is well estimated, the bias in the standard errors is small, as visible below.

semiivboot = semiivreg_boot(y~d|w0|w1, data, Nboot=200, est_method="homogenous", plotting=FALSE) # reduce the number of bootstrap simulations for speed
#> Bandwidth and MTR/MTE estimation on main sample... 
#> Estimating first stage... First stage estimated.       
#> 2nd stage: Estimating MTR and MTE... 
#> Estimation complete.                                             
#> 
#> Bandwidth and MTR/MTE estimation on main sample: Done. 
#> Bootstrap Progress: 200/200
boot_se = semiivboot$estimate$coeff$std_error[1:4]
res = as.data.frame(cbind(summary_coeff[1:4,1:3], boot_se)); colnames(res) = c("Variable", "Estimate", "wrong analytic SE", "Bootstrapped SE")
res
#>      Variable  Estimate wrong analytic SE Bootstrapped SE
#> 1 (Intercept) 3.2453691        0.02408485      0.02632599
#> 2        I(d) 0.3255329        0.04497677      0.04930933
#> 3 I(1 - d):w0 0.7834206        0.01431262      0.01379810
#> 4     I(d):w1 0.5194180        0.01432695      0.01631556
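Conceptually, semiivreg_boot() resamples the data and re-runs both stages on each bootstrap sample, so the first-stage estimation noise propagates into the second-stage standard errors. A generic base-R skeleton of this idea (a hypothetical toy two-stage estimator, not the package's internals):

```r
# Bootstrap of a two-stage estimator (probit first stage, then a
# control-function OLS second stage); toy data, for illustration only.
set.seed(3)
n  <- 2000
w  <- rnorm(n)
d  <- as.numeric(w + rnorm(n) > 0)
y  <- 1 + 0.5 * d + rnorm(n)
dat <- data.frame(y, d, w)
two_stage <- function(df) {
  p <- fitted(glm(d ~ w, binomial(link = "probit"), data = df))  # 1st stage
  unname(coef(lm(y ~ d + p, data = df))["d"])                    # 2nd stage
}
boot_est <- replicate(200, two_stage(dat[sample(n, n, replace = TRUE), ]))
boot_se <- sd(boot_est)  # bootstrap SE that accounts for the estimated 1st stage
boot_se
```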

Alternative semi-IV strategy: based on IV-quantile regression

As described in Bruneel-Zupanc (2024), there is another general nonparametric identification strategy with semi-IVs, building on the IV-quantile regression (IVQR) framework of Chernozhukov and Hansen (2005). In the model with homogenous treatment effects, this strategy requires that some interaction W_0 × W_1 be relevant for the selection into treatment. The intuition is that this interaction serves as an IV for the treatment D within this framework. The nice feature of this strategy is that it can be implemented with standard IV estimation, e.g., ivreg() in R.

semiivqr = ivreg(y~d+I(1-d):w0 + I(d):w1|w0+w1+w0:w1, data=data)
summary(semiivqr)
#> Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
#>      arithmetic operators in their names;
#>   the printed representation of the hypothesis will be omitted
#> 
#> Call:
#> ivreg(formula = y ~ d + I(1 - d):w0 + I(d):w1 | w0 + w1 + w0:w1, 
#>     data = data)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -21.7424  -4.4342  -0.0106   4.5084  23.1987 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)    8.663     18.635   0.465    0.642
#> d            -10.561     37.440  -0.282    0.778
#> I(1 - d):w0   -4.091     16.779  -0.244    0.807
#> I(d):w1        5.428     16.869   0.322    0.748
#> 
#> Diagnostic tests:
#>                                 df1  df2 statistic p-value    
#> Weak instruments (d)              3 9996     886.4  <2e-16 ***
#> Weak instruments (I(1 - d):w0)    3 9996    3691.7  <2e-16 ***
#> Weak instruments (I(d):w1)        3 9996    3560.3  <2e-16 ***
#> Wu-Hausman                        3 9993     107.9  <2e-16 ***
#> Sargan                            0   NA        NA      NA    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6.212 on 9996 degrees of freedom
#> Multiple R-Squared: -29.42,  Adjusted R-squared: -29.43 
#> Wald test: 28.43 on 3 and 9996 DF,  p-value: < 2.2e-16

In this baseline model, the interaction is irrelevant in the first stage (its coefficient in param_p is zero), so we face a weak-IV problem: the estimation blows up because the IV carries no identifying variation.
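The weak-IV failure can be isolated in a self-contained sketch (base R, hypothetical toy data): with a single instrument z, 2SLS reduces to the Wald ratio cov(z, y)/cov(z, d), and when z is irrelevant for d this becomes a ratio of two near-zero noise terms.

```r
# 2SLS "by hand" with one instrument z (illustrative toy data only)
set.seed(4)
n <- 5000
z <- rnorm(n)
d_irr <- as.numeric(rnorm(n) > 0)        # z plays no role in selection
d_rel <- as.numeric(z + rnorm(n) > 0)    # z shifts selection
y_irr <- 1 + 0.4 * d_irr + rnorm(n)      # true effect: 0.4 in both designs
y_rel <- 1 + 0.4 * d_rel + rnorm(n)
iv_wald <- function(y, d, z) cov(z, y) / cov(z, d)
est <- c(irrelevant = iv_wald(y_irr, d_irr, z),   # garbage: near-0 / near-0
         relevant   = iv_wald(y_rel, d_rel, z))   # close to 0.4
round(est, 3)
```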

However, if we specify an alternative model where the interaction is indeed relevant, this strategy also estimates the homogenous treatment effect correctly. This is what we do now.

# Model 2
N = 10000; set.seed(1234)
model_type = "homogenous"
param_error = c(1, 1.5, -0.6) # var_u, var_v, cov_uv # if homogenous
param_Z = c(0, 0, 0, 0, 1.5, 1.5, 0.9) # meanW0 state0, meanW1 state0, meanW0 state1, meanW1 state1, varW0, varW1, covW0W1
param_p = c(0, -0.3, 0.4, 0.3, 0, 0) # constant, alphaW0, alphaW1, alphaW0W1, effect of state, effect of parent educ
param_y0 = c(3.2, 0.8, 0, 0) # intercept, effect of Wd, effect of state, effect of parent educ;
param_y1 = c(3.2+0.4, 0.5, 0, 0) # the +0.4 = average treatment effect; effect of W1, effect of state, effect of parent educ;
param_genX = c(0.4, 0, 2) # probability state=1 (instead of 0), mean_parenteduc, sd_parenteduc (parenteduc drawn as continuous)

data2 = simul_data(N, model_type, param_y0, param_y1, param_p, param_Z, param_genX, param_error)

In this model, param_p[4] gives the effect of W_0 × W_1 on D. It is different from 0 here, so the IVQR strategy should work.

semiivqr2 = ivreg(y~d+I(1-d):w0 + I(d):w1|w0+w1+w0:w1, data=data2); summary(semiivqr2)
#> Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
#>      arithmetic operators in their names;
#>   the printed representation of the hypothesis will be omitted
#> 
#> Call:
#> ivreg(formula = y ~ d + I(1 - d):w0 + I(d):w1 | w0 + w1 + w0:w1, 
#>     data = data2)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -3.590688 -0.667083 -0.006262  0.664900  3.860807 
#> 
#> Coefficients:
#>             Estimate Std. Error t value   Pr(>|t|)    
#> (Intercept)  3.18552    0.05274  60.405    < 2e-16 ***
#> d            0.42566    0.09107   4.674 0.00000299 ***
#> I(1 - d):w0  0.80693    0.04120  19.586    < 2e-16 ***
#> I(d):w1      0.49942    0.01849  27.004    < 2e-16 ***
#> 
#> Diagnostic tests:
#>                                 df1  df2 statistic p-value    
#> Weak instruments (d)              3 9996    411.88  <2e-16 ***
#> Weak instruments (I(1 - d):w0)    3 9996   1645.20  <2e-16 ***
#> Weak instruments (I(d):w1)        3 9996   7846.17  <2e-16 ***
#> Wu-Hausman                        3 9993     42.32  <2e-16 ***
#> Sargan                            0   NA        NA      NA    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9964 on 9996 degrees of freedom
#> Multiple R-Squared: 0.4283,  Adjusted R-squared: 0.4281 
#> Wald test:  1385 on 3 and 9996 DF,  p-value: < 2.2e-16

Indeed, it estimates the coefficients well. Notice that the semiivreg() function also works here:

semiiv2 = semiivreg(y~d|w0|w1, data=data2, est_method="homogenous", plotting=FALSE)
summary_coeff2 = semiiv2$estimate$est # extract the coefficients from the homogenous TE specification
summary_coeff2[1:4,] # only print the first 4 coefficients, the other correspond to the control function of P
#>      Variable  Estimate  Std_Error   t_value        p_value
#> 1 (Intercept) 3.1671559 0.05868908 53.964995 0.000000000000
#> 2        I(d) 0.4551946 0.10268723  4.432826 0.000009399856
#> 3 I(1 - d):w0 0.8244073 0.01517866 54.313584 0.000000000000
#> 4     I(d):w1 0.4952332 0.01192557 41.527014 0.000000000000
true_param
#> [1] 3.2 0.4 0.8 0.5

Thus semiivreg() is more flexible and should be preferred in any case. Notice also that, in general, the IVQR estimation has a higher variance because it uses the interaction as an IV, and therefore requires the interaction to have a large, significant effect on the treatment to avoid weak-IV concerns. In contrast, semiivreg() only relies on each semi-IV being relevant by itself, which is easier to satisfy. In particular, if the semi-IVs have a strong direct impact on their respective potential outcomes, they should be relevant as soon as there is selection into treatment based on gains (i.e., on Y_1 - Y_0).

Remark: since we know the true first stage includes the interaction, we might as well include it in semiivreg() as follows. Note that the previous results, which did not include it, were still fine despite the bias in the predicted propensity score from the misspecified first-stage formula.

semiiv3 = semiivreg(y~d|w0|w1, propensity_formula = d~w0+w1+w0:w1, 
                    data=data2, est_method="homogenous", plotting=FALSE)
fstage = semiiv3$estimate$propensity; summary(fstage) # returns the first stage estimation
#> 
#> Call:
#> glm(formula = propensity_formula, family = binomial(link = "probit"), 
#>     data = propensity_data, weights = WEIGHTS)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -0.01376    0.01483  -0.927    0.354    
#> w0          -0.23785    0.01392 -17.089   <2e-16 ***
#> w1           0.32054    0.01424  22.517   <2e-16 ***
#> w0:w1        0.25106    0.01025  24.482   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 13672  on 9999  degrees of freedom
#> Residual deviance: 12385  on 9996  degrees of freedom
#> AIC: 12393
#> 
#> Number of Fisher Scoring iterations: 4
summary_coeff3 = semiiv3$estimate$est # extract the coefficients from the homogenous TE specification
summary_coeff3[1:4,] # only print the first 4 coefficients, the other correspond to the control function of P
#>      Variable  Estimate  Std_Error   t_value      p_value
#> 1 (Intercept) 3.1768312 0.03566453 89.075380 0.000000e+00
#> 2        I(d) 0.4403100 0.06103333  7.214254 5.816986e-13
#> 3 I(1 - d):w0 0.8227038 0.01466641 56.094412 0.000000e+00
#> 4     I(d):w1 0.4985490 0.01020502 48.853314 0.000000e+00
true_param
#> [1] 3.2 0.4 0.8 0.5

Specification 2: semi-IVs are in fact valid IVs

To run Specification 2 of Bruneel-Zupanc (2024), where the semi-IVs are in fact valid IVs, we only need to change the parameters of the model: remove (set to zero) the direct effects of the semi-IVs on their potential outcomes.

# Specification 2
param_y0 = c(3.2, 0, 0, 0) # intercept, effect of Wd, effect of state, effect of parent educ;
param_y1 = c(3.2+0.4, 0, 0, 0) # the +0.4 = average treatment effect; effect of W1 (set to 0 here), effect of state, effect of parent educ;
param = list(param_error, param_Z, param_p, param_y0, param_y1, param_genX, model_type)

Then, simply run the previous code. To replicate the Monte Carlo simulations of the paper, run it several times.
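A Monte Carlo wrapper can follow this base-R skeleton (the toy DGP below is a placeholder; in the actual replication, the body of one_rep() would be the simul_data() and semiivreg() calls above):

```r
# Generic Monte Carlo skeleton: simulate, estimate, repeat, summarize
set.seed(5)
one_rep <- function(n = 1000) {
  w <- rnorm(n)
  d <- as.numeric(w + rnorm(n) > 0)   # exogenous selection in this toy DGP
  y <- 1 + 0.4 * d + rnorm(n)
  unname(coef(lm(y ~ d))["d"])
}
estimates <- replicate(100, one_rep())
c(mean = mean(estimates), sd = sd(estimates))  # centered near the true 0.4
```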

References

Bruneel-Zupanc, Christophe. 2024. “Don’t (Fully) Exclude Me, It’s Not Necessary! Identification with Semi-IVs.” https://arxiv.org/abs/2303.12667.
Chernozhukov, Victor, and Christian Hansen. 2005. “An IV Model of Quantile Treatment Effects.” Econometrica 73 (1): 245–61.