--- title: "Introduction to the robust2sls Package" author: "Jonas Kurle" date: "11 June 2022" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to the robust2sls Package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r, include = FALSE, echo = FALSE} Sys.setenv(LANGUAGE="en") ``` # Overview The *robust2sls* package provides an implementation of several algorithms for estimating outlier robust two-stage least squares (2SLS) models. All currently implemented algorithms are based on trimmed 2SLS, i.e. procedures where certain observations are identified as outliers and subsequently removed from the sample for re-estimation. The algorithms can be customized using a range of arguments. After estimation, corrected standard errors can be computed, which take false outlier detection into account. Moreover, the *robust2sls* package provides statistical tests for whether (a subset of) the parameters between the full and the trimmed sample are statistically significant. Currently, tests for the presence of outliers in the original sample are being developed and will be added in the future. # Model setup The model setting is a standard 2SLS setup. Suppose we have cross-sectional iid data $\{(y_{i}, x_{i}, z_{i})\}$, where $y_{i}$ is the outcome variable, $x_{i}$ is a vector of regressors of which some can be endogenous, and $z_{i}$ is a vector of instruments. The set of regressors can be decomposed into the exogenous regressors, $x_{1i}$, and the endogenous regressors $x_{2i}$. The total dimension of $x_{i}$ is then $d_{x} = d_{x_{1}} + d_{x_{2}}$. The instruments $z_{i}$ consist of both the exogenous regressors, $z_{1i} = x_{1i}$, and a set of excluded instruments $z_{2i}$. The total dimension of $z_{i}$ is then $d_{z} = d_{x_{1}} + d_{z_{2}}$. The structural equation is: $$y_{i} = x_{i}^{\prime}\beta + u_{i} = x_{1i}^{\prime}\beta_{1} + x_{2i}^{\prime}\beta_{2} + u_{i},$$ where $\mathsf{E}(x_{1i}u_{i}) = 0_{d_{x_{1}}}$ for the exogenous regressors and $\mathsf{E}(x_{2i}u_{i}) \neq 0_{d_{x_{2}}}$ for the endogenous regressors. The first stage equation is: $$\begin{pmatrix} x_{1i} \\ x_{2i} \end{pmatrix} = \Pi^{\prime} \begin{pmatrix} x_{1i} \\ z_{2i} \end{pmatrix} + \begin{pmatrix} r_{1i} \\ r_{2i} \end{pmatrix} = \begin{pmatrix} I_{d_{x_{1}}} & 0_{d_{x_{1}} \times d_{z_{2}}} \\ \Pi_{1}^{\prime} & \Pi_{2}^{\prime} \end{pmatrix} \begin{pmatrix} x_{1i} \\ z_{2i} \end{pmatrix} + \begin{pmatrix} 0_{d_{x_{1}}} \\ r_{2i} \end{pmatrix}.$$ The instruments $z_{i}$ are assumed to be valid and informative (rank condition fulfilled). The projection errors for the exogenous regressors $r_{1i}$ are zero because the exogenous regressors are included themselves as regressors and can therefore be fit perfectly. # Outlier detection algorithms The implemented algorithms are all forms of trimmed 2SLS. Their basic principle is as follows: First, an initial estimate of the model is obtained and the standardized residuals are calculated for each observation. These residuals are then compared against a reference distribution that the researchers has chosen. Observations with an absolute standardized residual beyond the cut-off value are classified as outliers and removed from the sample. Next, the model is re-estimated on the trimmed sample of non-outlying observations. The procedure can end after the initial classification or can be iterated. 
The workhorse command for the different types of trimmed 2SLS algorithms in the *robust2sls* package is `outlier_detection()`. The algorithms differ in the following respects:

* which initial estimator to use
* how the sample is trimmed, which is governed by
    * the reference distribution against which the errors are judged to be outliers or not
    * the cut-off value $c$ that determines the size of the standardized errors beyond which observations are labeled as outliers and subsequently removed
* how often the algorithm is iterated, which is represented by the parameter $m$.

The following subsections briefly explain the different choices.

## Initial estimator

The initial estimator can be set in `outlier_detection()` through the `initial_est` argument.

The most common choice for the initial estimator is to simply use the 2SLS estimator based on the full sample. The corresponding argument value is `"robustified"`.

Alternatively, the initial estimation can consist of two estimations based on two sub-samples. For that purpose, the sample is partitioned into two parts and separate models are estimated on each of them. However, the estimates of **one** sub-sample are used to calculate the residuals of observations in the **other** sub-sample. For example, the estimates of sub-sample 2 are used to calculate the residuals in sub-sample 1: $\widehat{u}_{i} = y_{i} - x_{i}^{\prime} \widehat{\beta}_{2}$ for observations $i$ in sub-sample 1, where $\widehat{\beta}_{2}$ denotes the parameter estimate obtained from the second sub-sample. The corresponding argument value is `"saturated"`. The `split` argument governs in what proportions the sample is split.

Lastly, the user can specify their own initial estimator by setting the argument to `"user"`. In this case, a model object of class `ivreg` must be given to the `user_model` argument.

## Reference distribution

Whether an observation is an outlier, that is, unusual in a certain sense, can only be decided relative to a certain distribution - the reference distribution. While it is unusual to obtain a value of 5 under the standard normal distribution, $\mathcal{N}(0,1)$, it is not unusual for a normal distribution with mean 5, $\mathcal{N}(5,1)$, or a uniform distribution over the interval $[2,6]$, $\mathcal{U}_{[2,6]}$. The reference distribution is the distribution of the standardized structural error, i.e. of $u_{i}/\sigma$. Currently, `outlier_detection()` only allows for the normal distribution, which is the distribution that is usually chosen in practice. The theoretical foundation allows for a broader range of distributions as long as they fulfil certain moment conditions, but these are not yet implemented. The argument `ref_dist` therefore currently only accepts the value `"normal"`.

## Cut-off or probability of outliers

In addition to the reference distribution, the user needs to decide on the cut-off value $c$, which is the value beyond which absolute standardized residuals are considered unusual. This is linked to the probability of drawing a standardized residual larger than $c$ and therefore of classifying an observation as an outlier even if there are actually no outliers in the sample (null hypothesis). The probability of falsely classifying an observation as an outlier when there are in fact none is called $\gamma_{c}$. Together with the reference distribution, it determines the cut-off value $c$.

Suppose we assume a standard normal distribution of the standardized residuals. A cut-off value of $c = 1.96$ then corresponds to a probability of false outlier detection of approximately 5%. Remember that we classify observations as outliers if the absolute value of the standardized residual is larger than the cut-off value $c$, i.e. for values that are larger than $1.96$ or smaller than $-1.96$. Instead, we could target a different expected false outlier detection rate, for example 1%. Together with the standard normal reference distribution, this yields a cut-off value of approximately $c = 2.58$. It is usually easier to think about the expected false outlier detection rate than about the cut-off itself, which is why the user actually sets $\gamma_{c}$ in `outlier_detection()` via the argument `sign_level`. Remember that the expected false outlier detection rate refers to the scenario in which there are no actual outliers in the sample; this is our null hypothesis. It just happens that sometimes we have unusual draws of the errors, in accordance with the assumed reference distribution.
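The translation from $\gamma_{c}$ to $c$ under the standard normal reference distribution can be verified with a one-line calculation in base R; this is only an illustrative sketch of the two-sided quantile computation, not code taken from the package.

```{r}
# two-sided cut-off values implied by gamma_c under a standard normal reference distribution
qnorm(1 - 0.05 / 2)  # gamma_c = 5%  -> c is approximately 1.96
qnorm(1 - 0.01 / 2)  # gamma_c = 1%  -> c is approximately 2.58
```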
## Iteration

The procedure of classifying observations as outliers with subsequent re-estimation of the model can be applied iteratively. The argument `iterations` determines whether the algorithm is iterated (value $\geq 1$) or not (value $0$). The algorithm can also be iterated until convergence, that is, until the selected sub-sample of non-outlying observations does not change anymore or until the parameter estimates do not change much anymore. For this, the argument `iterations` must be set to `"convergence"`. The user can also set the `convergence_criterion` argument, which is the stopping criterion: if the L2 norm of the difference between the coefficient estimates of iterations $m$ and $m-1$ is smaller than the criterion, the algorithm is terminated. The stopping criterion can also be used when `iterations` is set to a numeric value $m$, in which case the algorithm is iterated at most $m$ times but stops earlier if the stopping criterion is fulfilled.

# Example without outliers

In this section, we create a toy example using artificial data to showcase some of the different algorithms and statistical tests. We first load the necessary packages.

```{r, echo = FALSE}
library(robust2sls)
library(ivreg) # for 2SLS regression
```

## Data generating process (DGP)

We choose a simple example with an intercept, one endogenous regressor, and one excluded instrument. The function `generate_param()` takes the dimensions of the elements of the 2SLS model and returns a parameter setup that fulfills the assumptions, such as the validity and informativeness of the instruments. Based on the random parameter setup, we also create some random artificial data with `generate_data()`, with which we will work throughout this section. We create a sample of 1,000 data points.

```{r}
param <- generate_param(dx1 = 1, dx2 = 1, dz2 = 1, intercept = TRUE,
                        beta = c(2, -1), sigma = 1, seed = 2)
data_full <- generate_data(parameters = param, n = 1000)$data
# have a look at the data
head(data_full)
```

In reality, the researcher would only observe $y$, $x_{1}$ (intercept), $x_{2}$ (endogenous), and $z_{2}$ (instrument). We have created data from the following structural equation:
$$ y_{i} = x_{i}^{\prime} \beta + u_{i} = \beta_{1} x_{1i} + \beta_{2} x_{2i} + u_{i} = 2 - x_{2i} + u_{i}.$$
We can quickly check whether the assumptions are approximately fulfilled in our finite sample.
```{r}
cor(data_full$u, data_full$x2) # correlation for endogenous regressor
cor(data_full$u, data_full$z2) # close to zero correlation for valid instrument
cor(data_full$x2, data_full$z2) # correlation for informativeness
```

We can also compare 2SLS to OLS and see how close their estimates are to the true parameters $\beta^{\prime} = \begin{pmatrix} 2 & -1 \end{pmatrix}$.

```{r}
fml_ols <- y ~ x2
ols <- lm(formula = fml_ols, data = data_full[, 1:3])
print(ols$coefficients)

fml_tsls <- y ~ x2 | z2
tsls <- ivreg(formula = fml_tsls, data = data_full[, c(1:3, 6)])
print(tsls$coefficients)
```

As expected, the 2SLS estimates are clearly closer to the true DGP values than the OLS estimates.

## Outlier detection

We have generated the data without any outliers in the sense that none of the structural errors has been drawn from a different distribution, such as a fat-tailed distribution. In our setting, the structural error follows a marginal normal distribution with variance $\sigma^{2} = 1$, such that $u_{i} / \sigma \sim \mathcal{N}(0,1)$. So we are working under the null hypothesis of no outliers.

### Robustified 2SLS

In our first example, we use the full sample 2SLS estimator as the initial estimator. Let us use an expected false outlier detection rate, $\gamma_{c}$, of 1%, so that we expect to find $1000 \times 0.01 = 10$ outliers even though there are technically none in the sample.

```{r}
# extract the variables we observe
data <- data.frame(data_full[, c(1:3, 6)])

# not iterating the algorithm
robustified_0 <- outlier_detection(data = data, formula = fml_tsls,
                                   ref_dist = "normal", sign_level = 0.01,
                                   initial_est = "robustified", iterations = 0)
print(robustified_0)

# iterating the algorithm until convergence
robustified_conv <- outlier_detection(data = data, formula = fml_tsls,
                                      ref_dist = "normal", sign_level = 0.01,
                                      initial_est = "robustified",
                                      iterations = "convergence",
                                      convergence_criterion = 0)
print(robustified_conv)
```

`outlier_detection()` returns an object of class `"robust2sls"`, which is a list that stores - inter alia - the settings of the algorithm, the 2SLS models for each iteration, and information on which observations were classified as outliers in which iteration. The `print()` method for this class summarizes the most important results.

Both the non-iterated version that stops after the initial classification of outliers and the iterated version detect slightly more outliers than we would have expected (13 and 15, respectively). This can be due to estimation error, i.e. we make the classification based on the residuals $\widehat{u}_{i}$ instead of the true errors $u_{i}$, or simply because there happened to be more unusual draws of the error in our finite sample than expected. Since we are working with artificial data that we have created ourselves, we can actually check how many true errors were larger than $2.58$ or smaller than $-2.58$.

```{r}
sum(abs(data_full[, "u"]) > 2.58)
```

We find that 15 true errors exceed the cut-off in absolute value, while the two versions of the algorithm classified 13 and 15 observations as outliers, respectively. We can also check whether we identified those observations with large true errors as outliers or whether we classified some observations as outliers that actually had true errors smaller than the cut-off.
The `"robust2sls"` object stores a vector that records for each observation whether it has been classified as an outlier (`0`) or not (`1`) or whether the observation was not used in the estimation for other reasons, such as a missing value in $y$, $x$, or $z$ (`-1`). This information is stored for each iteration. ```{r} # which observations had large true errors? large_true <- which(abs(data_full[, "u"]) > 2.58) # which observations were detected as outliers # both step 0 and iterated version had same starting point, so same initial classification large_detected_0 <- which(robustified_conv$type$m0 == 0) large_detected_3 <- which(robustified_conv$type$m3 == 0) # how much do the sets of true and detected large errors overlap? sum(large_detected_0 %in% large_true) # 12 of the 13 detected outliers really had large errors sum(large_detected_3 %in% large_true) # 13 of the 15 detected outliers really had large errors # which were wrongly detected as having large errors? ind <- large_detected_3[which(!(large_detected_3 %in% large_true))] # what were their true error values? data_full[ind, "u"] # relatively large values even though technically smaller than cut-off # which large true errors were missed? ind2 <- large_true[which(!(large_true %in% large_detected_3))] # what were their true error values? data_full[ind2, "u"] # one of them is close to the cut-off ``` All in all, the algorithm performed as expected. ### Saturated 2SLS As explained above, the Saturated 2SLS estimator partitions the sample into two sub-samples and estimates separate models on each of them. The estimates of one sub-sample are then used to calculate residuals of observations in the other sub-sample. Again, the standardized residuals are compared to the chosen cut-off value. The `split` argument of `outlier_detection()` determines the relative size of each sub-sample. Especially with small samples, the split should be chosen such that none of the sub-samples is *too small*. ```{r} # not iterating the algorithm saturated_0 <- outlier_detection(data = data, formula = fml_tsls, ref_dist = "normal", sign_level = 0.01, initial_est = "saturated", iterations = 0, split = 0.5) print(saturated_0) # iterating algorithm until convergence saturated_conv <- outlier_detection(data = data, formula = fml_tsls, ref_dist = "normal", sign_level = 0.01, initial_est = "saturated", split = 0.5, iterations = "convergence", convergence_criterion = 0) print(saturated_conv) # which did it find in final selection? large_detected_4 <- which(saturated_conv$type$m4 == 0) identical(large_detected_3, large_detected_4) ``` Saturated 2SLS finds 10 outliers initially, which is exactly what we would expect. Once we iterate, it also finds 15 outliers as Robustified 2SLS did. In fact, the converged Saturated 2SLS classified exactly the same observations as outliers as Robustified 2SLS. ## Inference under Null hypothesis of no outliers After using the trimmed 2SLS algorithms, we want to do inference on the structural parameters. Since the procedures exclude observations with large errors, we would underestimate the error variance. [Jiao (2019)](https://drive.google.com/file/d/1qPxDJnLlzLqdk94X9wwVASptf1MPpI2w/view) has derived the correction factor for the estimator of the variance, which is both implemented in the algorithms for the selection and in the command `beta_inf()` function, which provides corrected standard errors. It returns both the original and corrected standard errors, t-statistics, and p-values. 
The corrected values are calculated under the null hypothesis of no outliers in the sample and are indicated by `H0`. We look at the results from the converged Robustified 2SLS algorithm. The algorithm converged after 3 iterations, so we can either use the correction factor for iteration 3 or the one for the fixed point.

```{r}
# final model (iteration 3) without corrected standard errors
summary(robustified_conv$model$m3)$coef

# final model with corrected standard errors, m = 3 (subset of output)
beta_inf(robustified_conv, iteration = 3, exact = TRUE)[, 1:5]

# final model with corrected standard errors, m = "fixed point" / "converged" (subset of output)
beta_inf(robustified_conv, iteration = 3, exact = TRUE, fp = TRUE)[, 1:5]
```

We can see that the standard inference leads to smaller standard errors, potentially overstating statistical significance because it ignores the preceding selection. For the corrected inference, there is almost no difference between using the correction factor for the exact iteration 3 and using the correction factor for the fixed point.

Instead of corrected standard errors, the researcher could also use bootstrapped standard errors for inference. This functionality is still under development.

```{r}
# use case re-sampling (to save time, use low iteration m = 1)
resampling <- case_resampling(robustified_conv, R = 1000, m = 1)
beta_boot <- evaluate_boot(resampling, iterations = 1)
mat <- matrix(beta_boot[, 1:2], ncol = 1, nrow = 2) # show subset, put into column vector
colnames(mat) <- "boot Std. Error"
cbind(beta_inf(robustified_conv, iteration = 1, exact = TRUE)[, 1:3], mat)
```

## Testing for difference in coefficient estimates

[Jiao (2019)](https://drive.google.com/file/d/1qPxDJnLlzLqdk94X9wwVASptf1MPpI2w/view) also devises a Hausman-type test for whether the difference between the full sample and trimmed sample coefficient estimates is statistically significant. The test can be applied to a subset of the coefficient vector or to the whole vector.

```{r}
# t-test on a single coefficient
# testing difference in beta_2 (coef on endogenous regressor x2), m = 3
beta_t(robustified_conv, iteration = 3, element = "x2")
# testing difference in beta_2 (coef on endogenous regressor x2), m = "fixed point"
beta_t(robustified_conv, iteration = 3, element = "x2", fp = TRUE)

# Hausman-type test on whole coefficient vector, m = 3
beta_hausman(robustified_conv, iteration = 3)
```

While the difference between the full sample estimate $\widehat{\beta}_{2}^{(0)}$ and $\widehat{\beta}_{2}^{(3)}$ is statistically significant at the 5% significance level using a t-test, the difference is not significant at the 5% level when testing the whole coefficient vector.

# Example with outliers

## Data

We now create a contaminated data set from the previous data. To ensure that we know where the outliers are, we populate the data with deterministic outliers, i.e. we set their errors to a fixed large value. We populate 3% of the sample with outliers, i.e. 30 outliers. The location of the outliers is random. The size of their errors is either $-3.5$ or $3.5$, which is also determined randomly.
```{r}
data_full_contaminated <- data_full
outlier_location <- sample(1:NROW(data_full), size = 0.03*NROW(data_full), replace = FALSE)
outlier_size <- sample(c(-3.5, 3.5), size = length(outlier_location), replace = TRUE)

# replace errors
data_full_contaminated[outlier_location, "u"] <- outlier_size
# replace value of dependent variable
data_full_contaminated[outlier_location, "y"] <- data_full_contaminated[outlier_location, "y"] -
  data_full[outlier_location, "u"] + data_full_contaminated[outlier_location, "u"]

# extract the data that the researcher would actually collect
data_cont <- data.frame(data_full_contaminated[, c(1:3, 6)])
```

## Outlier detection

### Robustified 2SLS

We use the same algorithm setup as before.

```{r}
# not iterating the algorithm
robustified_0 <- outlier_detection(data = data_cont, formula = fml_tsls,
                                   ref_dist = "normal", sign_level = 0.01,
                                   initial_est = "robustified", iterations = 0)
print(robustified_0)

# iterating the algorithm until convergence
robustified_conv <- outlier_detection(data = data_cont, formula = fml_tsls,
                                      ref_dist = "normal", sign_level = 0.01,
                                      initial_est = "robustified",
                                      iterations = "convergence",
                                      convergence_criterion = 0)
print(robustified_conv)
```

The initial selection finds 31 outliers, close to the 30 deterministic outliers that we put into the model. The algorithm converges after 7 iterations and ultimately classifies 45 observations as outliers. Again, this might well be because even under the normal distribution of the regular errors, we have draws that lie beyond the cut-off.

```{r}
# which observations were classified as outliers?
detected_0 <- which(robustified_conv$type$m0 == 0)
detected_7 <- which(robustified_conv$type$m7 == 0)

# overlap with the deterministic outliers
sum(outlier_location %in% detected_0) # initial step found all 30 (+ 1 in addition)
sum(outlier_location %in% detected_7) # converged algorithm found all 30 (+ 15 in addition)
```

Let us now compare the coefficient estimates between the contaminated full sample model and the Robustified 2SLS estimates.

```{r}
# full sample model
tsls_cont <- ivreg(formula = fml_tsls, data = data_cont)
tsls_cont$coefficients

# updated estimates after initial classification and removal of outliers
robustified_conv$model$m1$coefficients

# updated estimates after convergence
robustified_conv$model$m7$coefficients
```

In the present setting, the contamination made the full sample estimates slightly worse compared to the estimates based on the non-contaminated sample. The model that applies the algorithm once is closer to the true DGP values of $\beta^{\prime} = \begin{pmatrix} 2 & -1 \end{pmatrix}$, while the converged model performs similarly to the full sample estimates.

We now also see a clearer difference between the corrected and the uncorrected standard errors. The corrected standard errors are close to the bootstrap standard errors.

```{r}
# use case re-sampling (to save time, use low iteration m = 1)
resampling <- case_resampling(robustified_conv, R = 1000, m = 1)
beta_boot <- evaluate_boot(resampling, iterations = 1)
mat <- matrix(beta_boot[, 1:2], ncol = 1, nrow = 2) # show subset, put into column vector
colnames(mat) <- "boot Std. Error"
cbind(beta_inf(robustified_conv, iteration = 1, exact = TRUE)[, 1:3], mat)
```

### Saturated 2SLS

Finally, we also check the performance of the Saturated 2SLS algorithm.
```{r}
# not iterating the algorithm
saturated_0 <- outlier_detection(data = data_cont, formula = fml_tsls,
                                 ref_dist = "normal", sign_level = 0.01,
                                 initial_est = "saturated", iterations = 0,
                                 split = 0.5)
print(saturated_0)

# iterating the algorithm until convergence
saturated_conv <- outlier_detection(data = data_cont, formula = fml_tsls,
                                    ref_dist = "normal", sign_level = 0.01,
                                    initial_est = "saturated", split = 0.5,
                                    iterations = "convergence",
                                    convergence_criterion = 0)
print(saturated_conv)

# which observations did it find in the final selection?
detected_7_sat <- which(saturated_conv$type$m7 == 0)
identical(detected_7, detected_7_sat)
```

Saturated 2SLS leads to a similar outcome. In the initial step, one more observation is classified as an outlier compared to Robustified 2SLS, but the converged classification is again the same.
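Since the converged classifications coincide, the two algorithms also end up with the same trimmed sample and hence the same coefficient estimates. The quick check below only reuses the objects created above; it is an illustrative comparison rather than part of the package's output.

```{r}
# converged coefficient estimates of the two algorithms (identical trimmed samples)
robustified_conv$model$m7$coefficients
saturated_conv$model$m7$coefficients
```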