Semi-IV Regression Function — semiivreg • semiIVreg

Semi-IV regression function from Bruneel-Zupanc (2024). Syntax inspired by ivreg. Returns MTE and MTR curves with confidence intervals. The estimation is almost instantaneous (a few seconds at most).

By default, semiivreg returns analytic standard errors that do not account for the fact that the propensity score is estimated in a first stage. Use semiivreg_boot to obtain 'correct' bootstrapped confidence intervals (takes a bit longer).

Usage

semiivreg(formula, data, propensity_formula=NULL, propensity_data = NULL,
                 ref_indiv =NULL, firststage_model = "probit",
                 est_method = "locpoly", # "locpoly", "sieve", or "homogenous".
                 se_type = "HC1",
                 bw0 = NULL, bw1 = NULL, bw_y0 = NULL, bw_y1 = NULL, bw_method = 1/5,
                 kernel="gaussian",
                 pol_degree_locpoly1 = 1, pol_degree_locpoly2 = 2,
                 pol_degree_sieve = 5, conf_level = 0.05,
                 common_supp_trim=c(0,1), trimming_value=NULL, automatic_trim=FALSE,
                 plotting=TRUE, print_progress=FALSE, print_se_message=FALSE)

semiivreg_boot(formula, Nboot=500, data, propensity_formula=NULL, ref_indiv =NULL,
               firststage_model="probit", est_method = "locpoly", se_type="HC1",
               bw0 = NULL, bw1 = NULL, bw_y0 = NULL, bw_y1 = NULL, bw_method = "rule-of-thumb",
               pol_degree_locpoly1 = 1, pol_degree_locpoly2 = 2,
               common_supp_trim=c(0,1), trimming_value = NULL,
               automatic_trim = FALSE, plotting=TRUE, conf_level = 0.05, CI_method = "curve", weight_var)

semiiv_predict(semiiv, newdata, seq_v=NULL)

Arguments

formula: Formula of the regression, of the form outcome ~ treatment | semi-iv0 | semi-iv1 | commoncovariates.
The treatment variable should be binary (0, 1).
Covariates whose effect differs between D=1 and D=0 should be included in both semi-iv0 and semi-iv1.
With est_method = "locpoly", restricting covariates to have common effects is not implemented, so the commoncovariates are estimated as having potentially different effects on Y0 and Y1.
data: Dataframe containing the data.
propensity_formula: Formula for the 1st stage. If nothing specified, just runs a probit of d ~ semi-iv0 + semi-iv1 + covariates (removing the redundant variables).
propensity_data: Data used to estimate the first stage. Default is NULL, in which case data is used. Mainly useful for the internal bootstrap function when the first-stage formula differs from the default one.
ref_indiv: Specify the reference individual (in terms of covariates) at which we will evaluate the function.
By default takes the average value for all the covariates (on the trimmed dataset) to compute the average estimate. Remark: for factors, the average is computed on the dummy variables to get the proper average effect.
firststage_model: By default, the first stage is a probit model. Can specify another model (e.g., "logit").
est_method: Estimation method. Default is "locpoly", the Robinson (1988) double residual regression for partially linear models. Other options are "sieve", which flexibly specifies the control function as a polynomial of degree pol_degree_sieve, and "homogenous", a sieve that additionally imposes homogeneous treatment effects.
bw0, bw1: Bandwidths of the first residual regressions of (Y, Wd and X) on Phat.
Two possibilities: specify one value that is applied to all covariates (and Y), or specify a different bandwidth for the regression on each covariate. In the second case, the bandwidths must be given in the order in which the covariates appear in the model. Be very careful with factors.
Default is NULL, in which case the bandwidths are computed using the specified bw_method. Ideally, for a factor covariate, apply the same bandwidth to all of the dummies created from the factor.
bw_y0, bw_y1: Bandwidth of the second regression of Y (net of the effects of the covariates) on Phat. Default NULL and computed using the specified bw_method.
bw_method: Method to compute the bandwidth of the local polynomial regressions (of the first-order derivative). Default option is 1/5, which arbitrarily sets bw0, bw1, bw_y0 and bw_y1 to 1/5th of the support (rounded to the third digit). Any fraction below 1 can be supplied.
Recommended alternatives are the (global constant) bandwidth computations from the nprobust package (Calonico, Cattaneo and Farrell, 2019): (i) "mse-dpi": direct plug-in MSE-optimal bandwidth from Fan and Gijbels (1996); (ii) "mse-rot": rule-of-thumb implementation of the MSE-optimal bandwidth. Both methods are slow with large samples: use bw_subsamp_size to speed up the computation.
kernel: Kernel to use for the local polynomial regressions. Default is "gaussian" but can be "epanechnikov". Takes longer with Epanechnikov (cannot use fast locpoly implementation from KernSmooth).
bw_subsamp_size: Size of the subsample to use for the bandwidth selection. Default is 10,000. Use bw_subsamp_size = NULL to use the full sample (may take time). Otherwise, we recommend a value of around 20,000 at most for a reasonable computation time (which increases exponentially with sample size).
bw_subsamp_size introduces some randomness into the bandwidth selection procedure: we recommend setting a seed before running semiivreg for reproducibility.
pol_degree_locpoly1: Degree of the local polynomial regression of the covariates on Phat. Default is 1 as recommended by Fan and Gijbels (1996) because we want to estimate the regular function.
pol_degree_locpoly2: Degree of the local polynomial regression of Y (net of the effects of the covariates) on Phat. Default is 2 as recommended by Fan and Gijbels (1996) because we want to estimate the derivative function.
fast_robinson2: If TRUE, use locpoly from KernSmooth during the bootstrap. Default is TRUE to speed things up (because standard errors do not need to be computed in the bootstrap). Set to FALSE to use the Epanechnikov kernel or to use weights.
pol_degree_sieve: Degree of the polynomial transformation for the control function.
se_type: Type of standard errors in main estimation and in each bootstrap replication. Can simplify by setting "nonrobust" which goes (slightly) faster.
conf_level: Confidence level for the confidence intervals.
common_supp_trim: Vector of two values indicating the set of propensity scores at which we will evaluate the function.
Default is the full support [0,1], but it can be trimmed manually.
trimming_value: Either a vector, e.g. c(0.05, 0.95), giving the lower and upper quantiles of the propensity score between which observations are kept, for both D=0 and D=1,
or a single value, in which case the trimming is symmetric at both ends.
Supplying a trimming_value automatically sets automatic_trim = TRUE.
automatic_trim: If TRUE, the estimation of the second stage is done on the common support only.
weight_var: A variable of weights to be applied to the observations. Default is NULL, apply equal weights to all observations.
Implemented completely for est_method = "sieve" for now. For locpoly, the weights are not used when computing the "optimal bandwidth".
plotting: TRUE to plot the results at the end of the function, FALSE otherwise.
print_progress: TRUE to print the progress of the function, FALSE otherwise (default=FALSE).
print_se_message: TRUE to print a message about the standard errors at the end of the function, FALSE otherwise (default=TRUE).
Nboot: Number of bootstrap samples.
block_boot_var: Variable on which to base the block bootstrap. By default, = NULL for standard bootstrap.
CI_method: "delta" for delta method, "curve" for bootstrap the MTE curves directly. With est_method = "locpoly", only "curve" method is possible.
print_progress_main: Print progress of the main estimation or not.
semiiv: Object returned by a semiivreg estimation.
newdata: New data for which to predict the MTE and MTR.
seq_v: Sequence of v at which to predict the MTE and MTR. By default, NULL uses the default interval of the original semiivreg (equally spaced grid of probabilities, with step size of 0.001 on the common support).
fast_robinson1: Default is TRUE to speed up the first stage (in particular with many covariates). If TRUE, the locpoly function from the KernSmooth library is used to speed up the computation of the Robinson double residual first stage. This is only possible if no external weights are used. Fast locpoly enforces a gaussian kernel.
fast_robinson2: Default is FALSE. If TRUE, the locpoly function from the KernSmooth library is used to speed up the computation of the Robinson double residual second stage. This is only possible if no external weights are used. Fast locpoly enforces a gaussian kernel. The default is FALSE for the second stage because fast locpoly returns no standard errors and the time gain is small for the second stage.
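The default first stage described under propensity_formula and firststage_model (a probit of d on the semi-IVs and covariates) can be pictured with base R. The following is a minimal sketch on simulated data, not the package's internal code: the data frame sim, the coefficients used to simulate it, and the use of glm are all assumptions made for illustration.

```r
# Illustrative sketch only (simulated data, base R), not semiIVreg internals.
set.seed(1)
n  <- 2000
w0 <- rnorm(n)                      # hypothetical semi-IV entering Y0
w1 <- rnorm(n)                      # hypothetical semi-IV entering Y1
v  <- runif(n)                      # unobserved resistance to treatment
p_true <- pnorm(0.3 - 0.5 * w0 + 0.8 * w1)
d  <- as.integer(v <= p_true)       # binary treatment
sim <- data.frame(d, w0, w1)

# Probit first stage, analogous to the default d ~ semi-iv0 + semi-iv1
first_stage <- glm(d ~ w0 + w1, family = binomial(link = "probit"), data = sim)
sim$Phat <- fitted(first_stage)     # estimated propensity score

# A richer first stage, analogous to propensity_formula = d ~ w0 + w1 + w0:w1
first_stage_int <- glm(d ~ w0 + w1 + w0:w1,
                       family = binomial(link = "probit"), data = sim)
```

Replacing link = "probit" with link = "logit" in this sketch mirrors the firststage_model = "logit" option.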

Value

A list with the following elements:

$data

Returns the estimation output data used to plot the MTE and MTR. In detail:

$RES: Dataframe with the estimated MTE and MTR values (and their confidence intervals if est_method="sieve" or "homogenous") for a sequence of values of the unobservable resistance to treatment in the identifiable common support.
$data: Original data used for the estimation, with the estimated propensity score added as Phat and any factor variables converted to dummies.
$ref_indiv: Reference individual(s) at which we evaluate the MTE and MTR.
$Xdat: Set of covariates (this output is used for the bootstrap).
$deltaX: Returns the estimated effects of the covariates and semi-IVs (without intercept) for the specified reference individuals.

$estimate

Returns the estimation of:

$est, or $est0 and $est1: If est_method = "locpoly", est0 and est1 return the second stage estimates of the effect of the covariates and semi-IVs on their respective potential outcomes. They come from the double residual regression à la Robinson (1988): a no-intercept OLS of the residuals Yd-E(Yd|P) on the residuals of every semi-IV, Wd-E(Wd|P), and covariate, X-E(X|P).
$mtr0, $mtr1 and $mte: If est_method = "sieve" or "homogenous", returns the functional form estimated for both MTR and MTE.
$kv: Returns the estimated k_d(v) (=E(Ud|V=v)). Includes the constant. Summed with the effect of covariates and semi-IVs (deltadX), it gives the mtr_d.
$propensity: First stage estimate of the propensity score.
$est_kappa: If est_method = "sieve" or "homogenous", this returns the estimated model for E(Y|D=d, X, Wd, P). From this, we extract Kappad(P) = E(Ud | D=d, P=p) from which we compute the kd(v) and mtrd(v, x, wd) functions.
$avg_MTE: Average of the MTE over the identified common support. If full common support, it is an estimate of the ATE(x, w0, w1). If est_method="homogenous", the MTE is constant so it also gives the ATE(x, w0, w1).
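The double residual regression mentioned for $est0 and $est1 can be sketched in a few lines of base R. This is a hedged illustration on simulated data, not the package's implementation: it uses loess as a stand-in smoother (the package documents local polynomial regressions with its own bandwidths), and the variables p, x, y and the true coefficient of 2 are assumptions of the example.

```r
# Illustrative Robinson (1988) double residual regression on simulated data.
# loess is a swapped-in smoother here; it is not what semiivreg uses.
set.seed(42)
n <- 2000
p <- runif(n, 0.05, 0.95)                  # stands in for the propensity score Phat
x <- 1 + p + rnorm(n)                      # covariate correlated with p
y <- 2 * x + p^2 + rnorm(n, sd = 0.5)      # partially linear: linear in x, nonlinear in p

# Step 1: residualize the outcome and the covariate on p
res_y <- y - fitted(loess(y ~ p, degree = 1))
res_x <- x - fitted(loess(x ~ p, degree = 1))

# Step 2: no-intercept OLS of outcome residuals on covariate residuals
robinson_fit <- lm(res_y ~ 0 + res_x)
beta_hat <- coef(robinson_fit)[["res_x"]]  # estimate of the linear effect of x
```

In the setting of this page the same two steps would be applied separately to the D=0 and D=1 subsamples, with Wd and X as the residualized regressors.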

$bw

Returns the bandwidth used (or estimated via bw_method) in the Robinson double residual regression.
bw0 and bw1 are the bandwidths of the first residual regressions of Yd, Wd and X on Phat.
bw_y0 and bw_y1 are the bandwidths of the second regression of Y (net of the effects of the covariates) on Phat. These are the ones that matter for the smoothness of the MTE and MTR estimates.
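To give a sense of scale for the default bw_method = 1/5, the arithmetic can be sketched as follows. This assumes that "support" means the range of the estimated propensity score; that reading, and the simulated phat vector, are assumptions of the example rather than documented behaviour.

```r
# Illustrative arithmetic only; assumes "support" = range of the estimated Phat.
set.seed(123)
phat <- runif(500, min = 0.1, max = 0.9)   # placeholder propensity scores

bw_fraction   <- 1/5                       # the documented default fraction
support_width <- diff(range(phat))         # width of the observed support
bw <- round(bw_fraction * support_width, 3)  # rounded to the third digit
```

Under this reading, a support of width close to 0.8 gives a bandwidth close to 0.16 for bw0, bw1, bw_y0 and bw_y1.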

$plot

Returns separately the following plot objects: supp (support), mtr, mte and mte2. mte reports the estimation from "local IV" approach, with standard errors from the Robinson 2nd stage. mte2 reports the MTE estimated as the difference between the MTR (without standard errors).

$supp

Returns the common support of the propensity score Phat between the two treatment groups.
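As a rough illustration of what a common support is, the sketch below computes the overlap of the propensity score ranges of the two treatment groups using the usual min/max definition. The vectors phat0 and phat1 are simulated, and whether semiivreg computes $supp exactly this way (for instance after trimming) is not stated on this page.

```r
# Illustrative common support from the min/max overlap of two groups.
set.seed(7)
phat0 <- runif(300, 0.02, 0.80)   # placeholder Phat for untreated (D = 0)
phat1 <- runif(300, 0.15, 0.98)   # placeholder Phat for treated   (D = 1)

common_supp <- c(max(min(phat0), min(phat1)),   # largest of the two minima
                 min(max(phat0), max(phat1)))   # smallest of the two maxima

# Flag observations whose propensity score lies in the common support
in_supp0 <- phat0 >= common_supp[1] & phat0 <= common_supp[2]
in_supp1 <- phat1 >= common_supp[1] & phat1 <= common_supp[2]
```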

$call

Returns the call of the function and the covariates and semi-IVs used.

The estimated model

semiivreg estimates the marginal treatment effect (MTE) and marginal treatment response (MTR) of a binary treatment variable using semi-IVs, W0 and W1. As with standard IVs (see Andresen, 2018), we estimate a semi-parametric partially linear model, as described in Bruneel-Zupanc (2024). For more details on the model and estimation procedure, see the vignette vignette("semiIVreg", package = "semiIVreg"), also available online. For more details on the use of the semiivreg function, see also the vignettes vignette("semiIVreg_heterogenousTE", package = "semiIVreg") and vignette("semiIVreg_homogenousTE", package = "semiIVreg"). For more details about causal inference with semi-IVs in general, see Bruneel-Zupanc (2024).
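For orientation, the quantities named on this page can be written out as follows. This is a sketch under assumptions that are standard in the MTE literature but are not spelled out here: potential outcomes linear in the covariates and semi-IVs, a selection rule D = 1{V <= P} with V normalized to be uniform on [0,1], and V independent of (X, W0, W1). The coefficient symbols beta_d and delta_d are notation introduced for this sketch; the vignette and the paper give the exact specification.

```latex
% Potential outcomes (assumed partially linear form), d = 0, 1
Y_d = X\beta_d + W_d\,\delta_d + U_d

% Marginal treatment responses and marginal treatment effect
k_d(v) = E[U_d \mid V = v], \qquad
\mathrm{MTR}_d(v, x, w_d) = x\beta_d + w_d\,\delta_d + k_d(v)

\mathrm{MTE}(v, x, w_0, w_1) = \mathrm{MTR}_1(v, x, w_1) - \mathrm{MTR}_0(v, x, w_0)

% Link between the control functions \kappa_d(p) = E[U_d \mid D = d, P = p] and k_d
\kappa_1(p) = \frac{1}{p}\int_0^p k_1(v)\,dv
  \;\Longrightarrow\; k_1(p) = \kappa_1(p) + p\,\kappa_1'(p)

\kappa_0(p) = \frac{1}{1-p}\int_p^1 k_0(v)\,dv
  \;\Longrightarrow\; k_0(p) = \kappa_0(p) - (1-p)\,\kappa_0'(p)
```

The last two lines follow from differentiating p κ_1(p) and (1-p) κ_0(p) with respect to p, which is how the k_d(v) functions can be recovered from an estimated model for E(Y|D=d, X, Wd, P).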

Caution about the estimated standard errors

By default, est_method="locpoly" returns no standard errors.
If est_method="sieve" or est_method="homogenous", it returns analytic standard errors, but these are wrong because they do not account for the fact that the propensity score is estimated.
In any case, we recommend using semiivreg_boot to obtain 'correct' bootstrapped confidence intervals. It is implemented separately because the bootstrap takes more time, while the baseline semiivreg function is almost instantaneous.
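A bootstrap call might look like the sketch below, which only uses arguments listed in the Usage section. It is guarded so that it runs only when the package is installed; the object name semiiv_b, the seed, and the reduced number of replications are choices made for this example, and the structure of the returned object is not shown because it is not documented on this page.

```r
# Sketch of a bootstrap run using only the documented arguments.
n_boot <- 100  # fewer replications than the default Nboot = 500, for speed

if (requireNamespace("semiIVreg", quietly = TRUE)) {
  library(semiIVreg)
  data(roydata)
  set.seed(1234)  # the bootstrap resamples the data, so fix a seed
  semiiv_b <- semiivreg_boot(y ~ d | w0 | w1, Nboot = n_boot, data = roydata,
                             est_method = "locpoly", CI_method = "curve",
                             plotting = FALSE)
}
```

With est_method = "locpoly", CI_method = "curve" is the only option listed above, so it is set explicitly here.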

References

Bruneel-Zupanc, C. (2024). Don't (fully) exclude me, it's not necessary! Identification with semi-IVs. arXiv preprint arXiv:2303.12667.

For empirical applications of the estimation of Marginal Treatment Effects with standard IVs, see for example:
Carneiro, P., Heckman, J. J., & Vytlacil, E. J. (2011). Estimating marginal returns to education. American Economic Review, 101(6), 2754-2781.

Brinch, C. N., Mogstad, M., & Wiswall, M. (2017). Beyond LATE with a discrete instrument. Journal of Political Economy, 125(4), 985-1039.

In particular, see Andresen, M. E. (2018). Exploring marginal treatment effects: Flexible estimation using Stata. The Stata Journal, 18(1), 118-158.

For double residual estimation of partially linear models, see Robinson, P. M. (1988). Root-N-consistent semiparametric regression. Econometrica, 56(4), 931-954.

For the choice of degree and the bandwidth computation in local polynomial regressions, see:
Fan, J., & Gijbels, I. (1996). Local polynomial modelling and its applications.

Calonico, S., Cattaneo, M. D., & Farrell, M. H. (2019). nprobust: Nonparametric Kernel-Based Estimation and Robust Bias-Corrected Inference. Journal of Statistical Software, 91(8), 1–33. https://doi.org/10.18637/jss.v091.i08

Author

Christophe Bruneel-Zupanc, cbruneel.com

Examples

# Load the package and the example data
library(semiIVreg)
data(roydata)

# Run the semi-IV regression
semiiv = semiivreg(y~d|w0|w1, data=roydata)

# With covariates
semiiv = semiivreg(y~d|w0|w1|Xbinary + Xcontinuous, data=roydata)

# Xbinary has a different effect on Y0 and Y1, Xcontinuous has the same effect
semiiv = semiivreg(y~d|w0+Xbinary|w1+Xbinary|Xcontinuous, data=roydata)

# To specify another first stage
semiiv = semiivreg(y~d|w0|w1, data=roydata, propensity_formula = d~w0+w1+w0:w1)

# Plot the MTR curves
semiiv$plot$mtr