Title: | Outlier Robust Two-Stage Least Squares Inference and Testing |
---|---|
Description: | An implementation of easy tools for outlier robust inference in two-stage least squares (2SLS) models. The user specifies a reference distribution against which observations are classified as outliers or not. After removing the outliers, adjusted standard errors are automatically provided. Furthermore, several statistical tests for the false outlier detection rate can be calculated. The outlier removing algorithm can be iterated a fixed number of times or until the procedure converges. The algorithms and robust inference are described in more detail in Jiao (2019) <https://drive.google.com/file/d/1qPxDJnLlzLqdk94X9wwVASptf1MPpI2w/view>. |
Authors: | Jonas Kurle [aut, cre] |
Maintainer: | Jonas Kurle <[email protected]> |
License: | GPL-3 |
Version: | 0.2.2 |
Built: | 2025-02-26 05:33:19 UTC |
Source: | https://github.com/jkurle/robust2sls |
The robust2sls package provides two main functionalities. First, it implements an algorithm that determines whether an observation is an outlier based on its standardized residual and then re-estimates the model on the sub-sample excluding all outliers. This procedure is often used in empirical research to show that the results are not driven by outliers. The package implements the algorithm in various forms: the user can select between different initial estimators and choose how often the algorithm is iterated. The statistical inference is adapted to account for potential false positives (classifying observations as outliers even though they are not).
Second, the robust2sls package provides easy-to-use statistical tests on whether the difference between the original and the outlier-robust estimates is statistically significant. Furthermore, several different statistical tests are implemented to test whether the sample actually contains outliers.
Calculates a Hausman test on the difference between robust and full sample estimates
beta_hausman(robust2sls_object, iteration, subset = NULL, fp = FALSE)
robust2sls_object |
An object of class |
iteration |
An integer > 0 specifying the iteration step for which parameters to calculate corrected standard errors. |
subset |
A vector of numeric indices or strings indicating which
coefficients to include in the Hausman test. |
fp |
A logical value indicating whether the fixed point asymptotic variance (TRUE) or the exact iteration asymptotic variance (FALSE) should be used. |
Argument fp
determines whether the fixed point asymptotic variance
should be used. This argument is only respected if the specified
iteration
is one of the iterations after the algorithm converged.
beta_hausman
returns a matrix with the value of the Hausman
test statistic and its corresponding p-value. The attribute
"type of avar"
records which asymptotic variance has been used (the
specific iteration or the fixed point). The attribute "coefficients"
stores the names of the coefficients that were included in the Hausman test.
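As a rough illustration of the kind of statistic described above, the following base R sketch computes a generic Hausman-type quadratic form and its chi-squared p-value. It is not the package's implementation: the function name `hausman_sketch`, the example numbers, and the assumption that the supplied matrix is already the finite-sample variance of the difference are all hypothetical.

```r
# Generic Hausman-type statistic: d' V^{-1} d, compared to a chi-squared
# distribution with length(d) degrees of freedom.
# `diff` and `vcov_diff` are hypothetical inputs chosen for illustration.
hausman_sketch <- function(diff, vcov_diff) {
  diff <- as.numeric(diff)
  stat <- sum(diff * solve(vcov_diff, diff))   # quadratic form d' V^{-1} d
  pval <- stats::pchisq(stat, df = length(diff), lower.tail = FALSE)
  matrix(c(stat, pval), nrow = 1,
         dimnames = list(NULL, c("Hausman test", "p value")))
}

# Made-up difference between robust and full sample estimates
d <- c(0.10, -0.05)
V <- matrix(c(0.004, 0.001, 0.001, 0.009), nrow = 2)
res <- hausman_sketch(d, V)
print(res)
```

A large statistic (small p-value) would suggest that the robust and full sample estimates differ by more than sampling variation alone explains.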
Calculates valid se for coefficients under H0 of no outliers
beta_inf(robust2sls_object, iteration = 1, exact = FALSE, fp = FALSE)
robust2sls_object |
An object of class |
iteration |
An integer > 0 specifying the iteration step for which parameters to calculate corrected standard errors. |
exact |
A logical value indicating whether the actually detected share of outliers (TRUE) or the theoretical share (FALSE) should be used. |
fp |
A logical value indicating whether the fixed point standard error correction (TRUE) or the exact iteration correction (FALSE) should be computed. |
Argument iteration
specifies which iteration of the robust structural
parameter estimates should be calculated. Iteration 1
refers to the
first robust estimate. Iteration 0
is not a valid argument since it
is the baseline estimate, which is not robust.
The parameter exact
does not matter much under the null hypothesis of
no outliers since the detected share will converge to the theoretical share.
Under the alternative, this function should not be used.
Argument fp
determines whether the fixed point standard error
correction should be computed. This argument is only respected if the
specified iteration
is one of the iterations after the algorithm
converged.
beta_inf
returns the corrected standard errors for the
structural parameters. These are valid under the null hypothesis of no
outliers in the sample. For comparison, the uncorrected standard errors are
also reported.
Calculates the correction factor for inference under H0 of no outliers
beta_inf_correction( robust2sls_object, iteration = 1, exact = FALSE, fp = FALSE )
robust2sls_object |
An object of class |
iteration |
An integer > 0 specifying the iteration step for which parameters to calculate corrected standard errors. |
exact |
A logical value indicating whether the actually detected share of outliers (TRUE) or the theoretical share (FALSE) should be used. |
fp |
A logical value indicating whether the fixed point standard error correction (TRUE) or the exact iteration correction (FALSE) should be computed. |
Argument iteration
specifies which iteration of the robust structural
parameter estimates should be calculated. Iteration 1
refers to the
first robust estimate. Iteration 0
is not a valid argument since it
is the baseline estimate, which is not robust.
The parameter exact
does not matter much under the null hypothesis of
no outliers since the detected share will converge to the theoretical share.
Under the alternative, this function should not be used.
Argument fp
determines whether the fixed point standard error
correction should be computed. This argument is only respected if the
specified iteration
is one of the iterations after the algorithm
converged.
beta_inf_correction
returns the numeric correction factor.
Conducts a t-test on the difference between robust and full sample estimates
beta_t(robust2sls_object, iteration, element, fp = FALSE)
robust2sls_object |
An object of class |
iteration |
An integer > 0 specifying the iteration step for which parameters to calculate corrected standard errors. |
element |
An index or a string to select the coefficient which is to be
tested. The index should refer to the index of coefficients in the
|
fp |
A logical value indicating whether the fixed point asymptotic variance (TRUE) or the exact iteration asymptotic variance (FALSE) should be used. |
Argument fp
determines whether the fixed point asymptotic variance
should be used. This argument is only respected if the specified
iteration
is one of the iterations after the algorithm converged.
beta_t
returns a matrix with the robust and full sample
estimates of beta, the t statistic on their difference, the standard error of
the difference, and three p-values (two-sided, both one-sided alternatives).
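The quantities listed above can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper `t_sketch` and its inputs are hypothetical, and the p-values are computed against a standard normal reference distribution, which the text above does not specify.

```r
# t-type statistic on the difference between two estimates of one coefficient,
# with a two-sided and both one-sided p-values (standard normal assumed).
t_sketch <- function(beta_robust, beta_full, se_diff) {
  t_stat <- (beta_robust - beta_full) / se_diff
  c(t = t_stat,
    p_two_sided = 2 * stats::pnorm(-abs(t_stat)),
    p_greater   = stats::pnorm(t_stat, lower.tail = FALSE),  # H1: robust > full
    p_less      = stats::pnorm(t_stat))                      # H1: robust < full
}

# Made-up estimates and standard error of their difference
res <- t_sketch(beta_robust = 1.2, beta_full = 1.0, se_diff = 0.1)
print(res)
```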
Calculates the asymptotic variance of the difference between robust and full sample estimators of the structural parameters
beta_test_avar(robust2sls_object, iteration, fp = FALSE)
robust2sls_object |
An object of class |
iteration |
An integer > 0 specifying the iteration step for which parameters to calculate corrected standard errors. |
fp |
A logical value indicating whether the fixed point asymptotic variance (TRUE) or the exact iteration asymptotic variance (FALSE) should be computed. |
Argument fp
determines whether the fixed point asymptotic variance
should be computed. This argument is only respected if the specified
iteration
is one of the iterations after the algorithm converged.
beta_test_avar
returns a dx by dx variance-covariance matrix of
the difference between the robust and full sample structural parameter
estimates of the 2SLS model.
Uses nonparametric case resampling for standard errors of parameters and gauge
case_resampling(robust2sls_object, R, coef = NULL, m = NULL, parallel = FALSE)
robust2sls_object |
An object of class |
R |
An integer specifying the number of resamples. |
coef |
A numeric or character vector specifying which structural
coefficient estimates should be recorded across bootstrap replications.
|
m |
A single numeric or vector of integers specifying for which
iterations the bootstrap statistics should be calculated. |
parallel |
A logical value indicating whether to run the bootstrap sampling in parallel or sequentially. See Details. |
Argument parallel
allows for parallel computing using the
foreach package, so the user has to register a parallel backend before
invoking this command.
Argument coef
is useful if the model includes many controls whose
parameters are not of interest. This can reduce the memory space needed to
store the bootstrap results.
case_resampling
returns an object of class
"r2sls_boot"
. This is a list with three named elements. $boot
stores the bootstrap results as a data frame. The columns record the
different test statistics, the iteration m
, and the number of the
resample, r
. The values corresponding to the original data are stored
as r = 0
. $resamples
is a list of length R
that stores
the indices for each specific resample. $original
stores the original
robust2sls_object
based on which the bootstrapping was done.
count_indices
takes a list of indices for resampling and counts how
often each index was sampled in each resample. The result is returned in two
versions of a matrix where each row corresponds to a different resample and
each column to one index.
count_indices(resamples, indices)
resamples |
A list of resamples, as created by nonparametric. |
indices |
The vector of original indices from which the resamples were drawn. |
count_indices
returns a list with two named elements. Each
element is a matrix that stores how often each observation/index was
resampled (column) for each resample (row). $count_clean
only has
columns for observations that were available in the indices.
$count_all
counts the occurrence of all indices in the range of
indices that were provided, even if the index was actually not available in
the given indices. These are of course zero since they were not available for
resampling. If the given indices do not skip any numbers, the two coincide.
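The difference between the two matrices is easiest to see on a small example. The base R sketch below mimics the described behaviour; `count_sketch` is a hypothetical stand-in written for illustration and not the package's own code.

```r
# Count how often each index appears in each resample (rows = resamples).
count_sketch <- function(resamples, indices) {
  tabulate_over <- function(levels) {
    m <- do.call(rbind, lapply(resamples, function(r) {
      as.integer(table(factor(r, levels = levels)))
    }))
    colnames(m) <- levels
    m
  }
  list(
    count_clean = tabulate_over(indices),                         # only available indices
    count_all   = tabulate_over(seq(min(indices), max(indices)))  # full range of indices
  )
}

indices   <- c(1, 2, 4)                     # index 3 is not available
resamples <- list(c(1, 1, 4), c(2, 4, 4))   # two resamples of size three
counts <- count_sketch(resamples, indices)
print(counts$count_clean)
print(counts$count_all)
```

Here `count_all` has an extra column for index 3, which is zero in every row because that index could never be drawn.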
counttest()
conducts a test whether the number of detected outliers
deviates significantly from the expected number of outliers under the null
hypothesis that there are no outliers in the sample.
counttest( robust2sls_object, alpha, iteration, one_sided = FALSE, tsmethod = c("central", "minlike", "blaker") )
robust2sls_object |
An object of class |
alpha |
A numeric value between 0 and 1 representing the significance level of the test. |
iteration |
An integer >= 0 or the character "convergence" that determines which iteration is used for the test. |
one_sided |
A logical value whether a two-sided test ( |
tsmethod |
A character specifying the method for calculating two-sided p-values. Ignored for one-sided test. |
See outlier_detection()
and
multi_cutoff()
for creating an object of class
"robust2sls"
or a list thereof.
See exactci::poisson.exact()
for the
different methods of calculating two-sided p-values.
counttest()
returns a data frame with the iteration (m) to be
tested, the actual iteration that was tested (generally coincides with the
iteration that was specified to be tested but is the convergent iteration
if the fixed point is tested), the setting of the probability of exceeding
the cut-off (gamma), the number of detected outliers, the expected number
of outliers under the null hypothesis that there are no outliers, the type
of test (one- or two-sided), the p-value, the significance level
alpha
, the decision, and which method was used to calculate
(two-sided) p-values. The number of rows of the data frame corresponds to
the length of the argument robust2sls_object
.
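The idea behind the count test can be illustrated with base R alone. The sketch below compares a made-up number of detected outliers with its expected value under the null using `stats::poisson.test()` as a stand-in for `exactci::poisson.exact()`; the two may compute two-sided p-values differently, so the numbers need not match the output of `counttest()`. All values are hypothetical.

```r
n        <- 1000   # hypothetical sample size
gamma    <- 0.01   # probability of exceeding the cut-off under the null
detected <- 18     # hypothetical number of detected outliers
expected <- n * gamma   # expected number of outliers if there are none

# Exact Poisson tests of H0: rate = expected
two_sided <- stats::poisson.test(detected, T = 1, r = expected,
                                 alternative = "two.sided")
one_sided <- stats::poisson.test(detected, T = 1, r = expected,
                                 alternative = "greater")

alpha <- 0.05
data.frame(
  type   = c("two-sided", "one-sided"),
  pval   = c(two_sided$p.value, one_sided$p.value),
  reject = c(two_sided$p.value, one_sided$p.value) <= alpha
)
```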
NOTE (12 Apr 2022): probably superseded by the estimate_param_null() function; taken out of testing.
estimate_param(robust2SLS_object, iteration)
robust2SLS_object |
An object of class |
iteration |
An integer >= 0 specifying based on which model iteration the moments should be estimated. The model iteration affects which observations are determined to be outliers and these observations will hence be excluded during the estimation of the moments. |
DO NOT USE YET!
estimate_param
can be used to estimate certain moments of the data
that are required for calculating the asymptotic variance of the gauge. Such
moments are the covariance between the standardised first stage errors and
the structural error, the covariance matrix of the first stage errors, the
first stage parameter matrix, and more.
estimate_param
returns a list with a similar structure as the
output of the Monte Carlo functionality generate_param. Hence, the
resulting list can be given to the function gauge_avar as argument
parameters
to return an estimate of the asymptotic variance of the
gauge.
The function is not yet fully developed. The estimators of the moments are at the moment not guaranteed to be consistent for the population moments. DO NOT USE!
estimate_param_null
can be used to estimate certain moments of the
data that are required for calculating the asymptotic variance of the gauge.
Such moments are the covariance between the standardised first stage errors
and the structural error, the covariance matrix of the first stage errors,
the first stage parameter matrix, and more.
estimate_param_null(robust2SLS_object)
robust2SLS_object |
An object of class |
estimate_param_null
returns a list with a similar structure as
the output of the Monte Carlo functionality generate_param. Hence, the
resulting list can be given to the function gauge_avar as argument
parameters
to return an estimate of the asymptotic variance of the
gauge.
The function uses the full sample to estimate the moments. Therefore, they are only consistent under the null hypothesis of no outliers and estimators are likely to be inconsistent under the alternative.
Evaluate bootstrap results
evaluate_boot(r2sls_boot, iterations)
r2sls_boot |
An object of class |
iterations |
An integer or numeric vector with values >= 0 specifying which bootstrap results to evaluate. |
evaluate_boot
returns a data frame with the bootstrap and the
theoretical standard errors. Each row corresponds to a different iteration
step while each column refers to the parameters whose standard errors are
produced.
Extracts bootstrap results for a specific iteration
extract_boot(r2sls_boot, iteration)
r2sls_boot |
An object of class |
iteration |
An integer >= 0 specifying which bootstrap results to extract. |
extract_boot
returns a matrix with the bootstrap results for
a specific iteration.
gauge_avar
calculates the asymptotic variance of the gauge for a
given iteration using a given set of parameters (true or estimated).
gauge_avar( ref_dist = c("normal"), sign_level, initial_est = c("robustified", "saturated", "iis"), iteration, parameters, split )
ref_dist |
A character vector that specifies the reference distribution
against which observations are classified as outliers. |
sign_level |
A numeric value between 0 and 1 that determines the cutoff in the reference distribution against which observations are judged as outliers or not. |
initial_est |
A character vector that specifies the initial estimator
for the outlier detection algorithm. |
iteration |
An integer >= 0 or character |
parameters |
A list created by generate_param or
estimate_param_null that stores the parameters (true or estimated).
|
split |
A numeric value strictly between 0 and 1 that determines
in which proportions the sample will be split. Can be |
Initial estimator "iis"
uses the asymptotic variances of
"robustified"
2SLS because there is no formal theory for the
multi-block search.
gauge_avar
returns a numeric value.
gauge_covar
calculates the asymptotic covariance between two false outlier detection rates (FODRs)
with different cut-off values s and t for a given iteration using a given set
of parameters (true or estimated).
gauge_covar( ref_dist = c("normal"), sign_level1, sign_level2, initial_est = c("robustified", "saturated", "iis"), iteration, parameters, split )
ref_dist |
A character vector that specifies the reference distribution
against which observations are classified as outliers. |
sign_level1 |
A numeric value between 0 and 1 that determines the first cutoff in the reference distribution against which observations are judged as outliers or not. |
sign_level2 |
A numeric value between 0 and 1 that determines the second cutoff in the reference distribution against which observations are judged as outliers or not. |
initial_est |
A character vector that specifies the initial estimator
for the outlier detection algorithm. |
iteration |
An integer >= 0 or character |
parameters |
A list created by generate_param or
estimate_param_null that stores the parameters (true or estimated).
|
split |
A numeric value strictly between 0 and 1 that determines
in which proportions the sample will be split. Can be |
Initial estimator "iis"
uses the asymptotic variances of
"robustified"
2SLS because there is no formal theory for the
multi-block search.
gauge_covar
returns a numeric value.
generate_data
draws random data for a 2SLS model given the parameters.
generate_data(parameters, n)
parameters |
A list with 2SLS model parameters as created by generate_param. |
n |
Sample size to be drawn. |
generate_data
returns a data frame with n
rows
(observations) and the following variables of the 2SLS model: dependent
variable y, exogenous regressors x1, endogenous regressors x2, structural
error u, outside instruments z2, first stage projection errors r1 (identical
to zero) and r2.
By default, generate_param
creates random parameters of a 2SLS model
that satisfy conditions for 2SLS models, such as positive definite
variance-covariance matrices. The user can also specify certain parameters
directly, which are then checked for their validity.
generate_param( dx1, dx2, dz2, intercept = TRUE, beta = NULL, sigma = 1, mean_z = NULL, cov_z = NULL, Sigma2_half = NULL, Omega2 = NULL, Pi = NULL, seed = 42 )
dx1 |
An integer value specifying the number of exogenous regressors.
This should include the intercept if it is present in the model
(see argument |
dx2 |
An integer value specifying the number of endogenous regressors. |
dz2 |
An integer value specifying the number of outside / excluded instruments. |
intercept |
A logical value ( |
beta |
A numeric vector of length |
sigma |
A strictly positive numeric value specifying the standard deviation of the error in the structural model. |
mean_z |
A numeric vector of length |
cov_z |
A numeric positive definite matrix specifying the variance-covariance matrix of the exogenous variables, x1 and z2. |
Sigma2_half |
A numeric positive definite matrix of dimension
|
Omega2 |
A numeric vector of length |
Pi |
A numeric matrix of dimension |
seed |
An integer for setting the seed for the random number generator. |
generate_param
returns a list with the (randomly created or
user-specified) parameters that are required for drawing random data.
The parameters are generated to fulfill the 2SLS model assumptions.
$structural
A list with two components storing the mean
($mean
) and variance-covariance matrix ($cov
) for the
structural error (u), the random first stage errors (r2), and all
instruments (excluding the intercept since it is not random) (z).
$params
A list storing the parameters of the 2SLS model.
$beta
is the coefficient vector (including intercept if present) of
the structural equation, $Pi
the coefficient matrix of the first
stage projections, $Omega2
the covariance between the structural
error and the endogenous first stage errors, $Sigma2_half
the square
root of the variance-covariance matrix of the endogenous first stage
errors, $mean_z
the mean of all instruments (excluding the intercept
since it is not random), $cov_z
the variance-covariance matrix of
the exogenous variables (x1 and z2), $Ezz
the expected value of the
squared instruments.
$settings
A list storing the function call ($call
),
whether an intercept is included in the model ($intercept
), a
regression formula for the model setup ($formula
), and the
dimensions of the regressors and instruments ($dx1
, $dx2
,
$dz2
).
$names
A list storing generic names for the regressors,
instruments, and errors as character vectors ($x1
, $x2
,
$x
, $z2
, $z
, $r
, and $u
).
globaltest()
uses several proportion or count tests with different
cut-offs to test a global hypothesis of no outliers using the Simes (1986)
procedure to account for multiple testing.
globaltest(tests, global_alpha)
tests |
A data frame that contains a column named |
global_alpha |
A numeric value representing the global significance level. |
See Simes (1986).
A list with three entries. The first entry named $reject
contains the global rejection decision. The second entry named
$global_alpha
stores the global significance level. The third entry
named $tests
returns the input data frame tests
, appended
with two columns containing the adjusted significance level and respective
rejection decision.
See also proptest() and counttest().
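For readers unfamiliar with the Simes (1986) procedure, the following base R sketch shows the standard rule: with k p-values sorted in increasing order, the global null is rejected if the i-th smallest p-value is at most i * alpha / k for at least one i. The function `simes_sketch` and the column name `pval` are hypothetical and chosen only for this illustration.

```r
# Simes procedure on a vector of p-values from tests with different cut-offs.
simes_sketch <- function(pvalues, global_alpha) {
  k   <- length(pvalues)
  ord <- order(pvalues)
  alpha_adj <- numeric(k)
  # the i-th smallest p-value is compared against i * alpha / k
  alpha_adj[ord] <- seq_len(k) * global_alpha / k
  reject_adj <- pvalues <= alpha_adj
  list(
    reject       = any(reject_adj),   # global rejection decision
    global_alpha = global_alpha,
    tests        = data.frame(pval = pvalues, alpha_adj = alpha_adj,
                              reject_adj = reject_adj)
  )
}

# Made-up p-values from three individual tests
res <- simes_sketch(c(0.30, 0.012, 0.04), global_alpha = 0.05)
print(res$tests)
print(res$reject)
```

The returned list mirrors the three-entry structure described above, with the adjusted significance level and the individual decision appended to each test.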
Impulse Indicator Saturation (IIS initial estimator)
iis_init( data, formula, gamma, t.pval = gamma, do.pet = FALSE, normality.JarqueB = NULL, turbo = FALSE, overid = NULL, weak = NULL )
data |
A dataframe. |
formula |
A formula in the format |
gamma |
A numeric value between 0 and 1 representing the significance level used for two-sided significance t-test on the impulse indicators. Corresponds to the probability of falsely classifying an observation as an outlier. |
t.pval |
A numeric value between 0 and 1 representing the significance level for the Parsimonious Encompassing Test (PET). |
do.pet |
logical. If |
normality.JarqueB |
|
turbo |
logical. If |
overid |
|
weak |
|
iis_init
returns a list with five elements. The first
four are vectors whose length equals the number of observations in the data
set. Unlike the residuals stored in a model object (usually accessible via
model$residuals
), it does not ignore observations where any of y, x
or z are missing. It instead sets their values to NA
.
The first element is a double vector containing the residuals for each
observation based on the model estimates. The second element contains the
standardised residuals, the third one a logical vector with TRUE
if
the observation is judged as not outlying, FALSE
if it is an outlier,
and NA
if any of y, x, or z are missing. The fourth element of the
list is an integer vector with three values: 0 if the observation is judged
to be an outlier, 1 if not, and -1 if missing. The fifth and last element
stores the ivreg
model object based on which the four
vectors were calculated.
IIS runs multiple models, similar to saturated_init
but with
multiple block search. These intermediate models are not recorded. For
simplicity, the element $model
of the returned list stores the full
sample model result, identical to robustified_init
.
WARNING: not for average user - function not completed yet
mc_grid( M, n, seed, parameters, formula, ref_dist, sign_level, initial_est, iterations, convergence_criterion = NULL, max_iter = NULL, shuffle = FALSE, shuffle_seed = 10, split = 0.5, path = FALSE, verbose = FALSE )
M |
Number of replications. |
n |
Sample size for each replication. |
seed |
Random seed for the iterations. |
parameters |
A list as created by generate_param that specifies the true model. |
formula |
A formula that specifies the 2SLS model to be estimated. The
format has to follow |
ref_dist |
A character vector that specifies the reference distribution
against which observations are classified as outliers. |
sign_level |
A numeric value between 0 and 1 that determines the cutoff in the reference distribution against which observations are judged as outliers or not. |
initial_est |
A character vector that specifies the initial estimator
for the outlier detection algorithm. |
iterations |
An integer >= 0 that specifies how often the outlier
detection algorithm is iterated and for which summary statistics will be
calculated. The value |
convergence_criterion |
A numeric value that determines whether the
algorithm has converged as measured by the L2 norm of the difference in
coefficients between the current and the previous iteration. Only used when
argument |
max_iter |
A numeric value >= 1 or NULL. If
|
shuffle |
A logical value or |
shuffle_seed |
An integer value that will set the seed for shuffling the
sample or |
split |
A numeric value strictly between 0 and 1 that determines in which proportions the sample will be split. |
path |
A character string or |
verbose |
A logical value indicating whether any messages should be printed. |
mc_grid
runs Monte Carlo simulations to assess the performance of
the theory of the gauge, simple proportion tests, and count tests.
mc_grid
returns a data frame with the results of the Monte
Carlo experiments. Each row corresponds to a specific simulation setup. The
columns record the simulation setup and its results. Currently, the average
proportion of detected outliers ("mean_gauge") and their variance
("var_gauge") are being recorded. Moreover, the theoretical asymptotic
variance ("avar") and the ratio of simulated to theoretical variance -
adjusted by the sample size - are calculated ("var_ratio"). Furthermore,
tentative results of size and power for the tests are calculated.
Requires the package doRNG to be installed, which has been orphaned as of 2022-12-09.
The following arguments can also be supplied as a vector of their type:
n
, sign_level
, initial_est
, and split
. This makes
the function estimate all possible combinations of the arguments. Note that
the initial estimator "robustified"
is not affected by the argument
split
and hence is not varied in this case.
For example, specifying n = c(100, 1000)
and
sign_level = c(0.01, 0.05)
estimates four Monte Carlo experiments with
the four possible combinations of the parameters.
The path
argument allows users to store the M
replication
results for all of the individual Monte Carlo simulations that are part of
the grid. The results are saved both as .Rds
and .csv
files.
The file name is indicative of the simulation setting.
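The grid of simulation settings described above corresponds to a full factorial design, which can be pictured with `expand.grid()` in base R. The second part of the sketch is only a guess at how "not varied" could be realised for the "robustified" estimator; it is an assumption made for illustration, not taken from the package source.

```r
# Two sample sizes and two significance levels give four experiments.
grid <- expand.grid(n = c(100, 1000), sign_level = c(0.01, 0.05))
print(grid)

# "robustified" does not depend on split, so duplicated settings collapse
# (hypothetical illustration of that rule).
grid2 <- expand.grid(initial_est = c("robustified", "saturated"),
                     split = c(0.3, 0.5), stringsAsFactors = FALSE)
grid2$split[grid2$initial_est == "robustified"] <- NA
grid2 <- unique(grid2)
print(grid2)
```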
multi_cutoff()
runs several outlier detection algorithms that differ
in the value of the cut-off that determines whether an observation is
classified as an outlier or not.
multi_cutoff(gamma, ...)
gamma |
A numeric vector representing the probability of falsely
classifying an observation as an outlier. One setting of the algorithm per
element of |
... |
Arguments for specifying the other settings of the outlier
detection algorithm, |
multi_cutoff
uses the
foreach
and
future
packages to run several models at the
same time in parallel. This means the user has to register a backend and
thereby determine how the code should be executed. The default is
sequential, i.e. not in parallel. See
future::plan()
for details.
A list containing the robust2sls
objects, one per setting of
gamma
. The length of the list therefore corresponds to the length of
the vector gamma
.
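To see how the settings differ across elements of gamma, the sketch below converts each probability into a cut-off for the absolute standardised residual. It assumes a normal reference distribution and a two-sided cut-off, which is an assumption of this illustration rather than something stated above.

```r
# Hypothetical vector of false classification probabilities
gamma <- c(0.01, 0.05, 0.10)

# Two-sided cut-off under a standard normal reference distribution:
# an observation is flagged if |standardised residual| exceeds the cut-off.
cutoff <- stats::qnorm(1 - gamma / 2)
settings <- data.frame(gamma = gamma, cutoff = cutoff)
print(settings)
```

A smaller gamma gives a larger cut-off, so fewer observations are classified as outliers.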
mvn_sup
simulates the distribution of the supremum of the specified
multivariate normal distribution by drawing repeatedly from the multivariate
normal distribution and calculating the maximum of each vector.
mvn_sup(n, mu, Sigma, seed = NULL)
n |
An integer determining the number of draws from the multivariate normal distribution. |
mu |
A numeric vector representing the mean of the multivariate normal distribution. |
Sigma |
A numeric matrix representing the variance-covariance matrix of the multivariate normal distribution. |
seed |
An integer setting the random seed or |
mvn_sup
returns a vector of suprema of length n
.
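The simulation described above can be sketched in base R with a Cholesky factor. `mvn_sup_sketch` is a hypothetical stand-in: the package may use a different sampler, and this version assumes that Sigma is positive definite so that `chol()` succeeds.

```r
# Simulate the maximum of draws from N(mu, Sigma).
mvn_sup_sketch <- function(n, mu, Sigma, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  k <- length(mu)
  R <- chol(Sigma)                                  # Sigma = t(R) %*% R
  Z <- matrix(stats::rnorm(n * k), nrow = n, ncol = k)
  X <- sweep(Z %*% R, 2, mu, "+")                   # each row is one draw
  apply(X, 1, max)                                  # maximum of each drawn vector
}

Sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
sup <- mvn_sup_sketch(n = 1000, mu = c(0, 0), Sigma = Sigma, seed = 42)

# Simulated critical value, e.g. the 95% quantile of the supremum
quantile(sup, 0.95)
```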
nonparametric
is used for nonparametric resampling, for example
nonparametric case or error/residual resampling. The function takes a vector
of indices that correspond to the indices of observations that should be used
in the resampling procedure.
nonparametric( indices, R, size = length(indices), replacement = TRUE, seed = NULL )
indices |
A vector of indices (integer) from which to sample. |
R |
An integer specifying the number of resamples. |
size |
An integer specifying the size of the resample. The standard bootstrap resamples as many data points as the original sample contains, which is the default. |
replacement |
A logical value indicating whether to sample with (TRUE) or without (FALSE) replacement. The standard bootstrap resamples with replacement, which is the default. |
seed |
|
nonparametric
returns a list of length R
containing
vectors with the resampled indices.
Nonparametric resampling from a data frame
nonparametric_resampling(df, resample)
df |
Data frame containing observations to be sampled from. |
resample |
A vector of indices that extract the observations from the data frame. |
The input to the resample
argument could for example be generated as
one of the elements in the list generated by the command
nonparametric.
The input to the df
argument would be the original data frame for case
resampling. For error/residual resampling, it would be a data frame
containing the residuals from the model.
nonparametric_resampling
returns a data frame containing the
observations of the resample.
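The two resampling steps can be pictured with base R only: first draw index vectors, then use one of them to extract rows from a data frame. The data and object names below are made up, and the code is a sketch of the idea rather than the package's functions.

```r
set.seed(123)

# Hypothetical data frame for case resampling
df <- data.frame(y = c(2.1, 3.4, 1.8, 4.0, 2.9), x = c(1, 2, 3, 4, 5))
indices <- seq_len(nrow(df))

# Step 1: R resamples of the indices, drawn with replacement
# (note: sample() treats a single number as 1:number, so `indices`
# should contain more than one element here)
R <- 3
resamples <- lapply(seq_len(R), function(r) {
  sample(indices, size = length(indices), replace = TRUE)
})

# Step 2: extract the observations belonging to the first resample
boot_df <- df[resamples[[1]], , drop = FALSE]
print(boot_df)
```

For error/residual resampling, `df` would instead hold the model residuals, as noted above.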
outlier
takes a "robust2sls"
object and the index of a specific
observation and returns its history of classification across the different
iterations contained in the "robust2sls"
object.
outlier(robust2sls_object, obs)
robust2sls_object |
An object of class |
obs |
An index (row number) of an observation |
outlier
returns a vector that contains the 'type' value for
the given observations across the different iterations. There are three
possible values: 0 if the observation is judged to be an outlier, 1 if not,
and -1 if any of its x, y, or z values required for estimation is missing.
outlier_detection
provides different types of outlier detection
algorithms depending on the arguments provided. The decision whether to
classify an observation as an outlier or not is based on its standardised
residual in comparison to some user-specified reference distribution.
The algorithms differ mainly in two ways. First, they can differ by the use
of initial estimator, i.e. the estimator based on which the first
classification as outliers is made. Second, the algorithm can either be
iterated a fixed number of times or until the difference in coefficient
estimates between the most recent model and the previous one is smaller than
some user-specified convergence criterion. The difference is measured by
the L2 norm.
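The iteration described above can be sketched in base R for a normal reference distribution with the full sample as initial estimator. This is a simplified illustration, not the package's code: `tsls()` is a hypothetical helper, the data are simulated, and the rescaling of sigma by the truncated-normal variance factor after the first step is an assumption about how the bias correction enters.

```r
# Base-R sketch; tsls() is a hypothetical helper, not a robust2sls function.
tsls <- function(y, X, Z) {
  Xhat <- Z %*% solve(crossprod(Z), crossprod(Z, X))   # first-stage fitted values
  drop(solve(crossprod(Xhat, X), crossprod(Xhat, y)))  # (X'P_Z X)^{-1} X'P_Z y
}

set.seed(1)
n <- 500
z <- rnorm(n); v <- rnorm(n); u <- 0.5 * v + rnorm(n)
x <- 0.8 * z + v                 # endogenous regressor
y <- 1 + 2 * x + u
y[1:5] <- y[1:5] + 15            # five artificial outliers
X <- cbind(1, x); Z <- cbind(1, z)

sign_level <- 0.05
psi <- 1 - sign_level
cutoff <- qnorm(1 - sign_level / 2)
# Inverse variance of a standard normal truncated to [-cutoff, cutoff]
bias_corr <- 1 / (1 - 2 * cutoff * dnorm(cutoff) / psi)

sel <- rep(TRUE, n)              # iteration 0 uses the full sample
beta <- tsls(y, X, Z)
converged <- FALSE
for (m in 1:20) {
  res <- drop(y - X %*% beta)
  sigma <- sqrt(mean(res[sel]^2))                 # no degrees-of-freedom adjustment
  if (m > 1) sigma <- sigma * sqrt(bias_corr)     # assumed correction for trimmed samples
  sel <- abs(res / sigma) <= cutoff               # TRUE = kept, FALSE = outlier
  beta_new <- tsls(y[sel], X[sel, , drop = FALSE], Z[sel, , drop = FALSE])
  l2_diff <- sqrt(sum((beta_new - beta)^2))       # L2 norm of the coefficient change
  beta <- beta_new
  if (l2_diff < 1e-8) { converged <- TRUE; break }
}
```

The loop stops either when the L2 difference falls below the tolerance (here `1e-8`, an arbitrary choice) or after 20 iterations, mirroring the roles of a convergence criterion and a maximum number of iterations.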
outlier_detection( data, formula, ref_dist = c("normal"), sign_level, initial_est = c("robustified", "saturated", "user", "iis"), user_model = NULL, iterations = 1, convergence_criterion = NULL, max_iter = NULL, shuffle = FALSE, shuffle_seed = NULL, split = 0.5, verbose = FALSE, iis_args = NULL )
data |
A dataframe. |
formula |
A formula for the |
ref_dist |
A character vector that specifies the reference distribution
against which observations are classified as outliers. |
sign_level |
A numeric value between 0 and 1 that determines the cutoff in the reference distribution against which observations are judged as outliers or not. |
initial_est |
A character vector that specifies the initial estimator
for the outlier detection algorithm. |
user_model |
A model object of class ivreg. Only
required if argument |
iterations |
Either an integer >= 0 that specifies how often the outlier
detection algorithm is iterated, or the character vector
|
convergence_criterion |
A numeric value or NULL. The algorithm stops as
soon as the difference in coefficient estimates between the most recent model
and the previous one is smaller than |
max_iter |
A numeric value >= 1 or NULL. If
|
shuffle |
A logical value or |
shuffle_seed |
An integer value that will set the seed for shuffling the
sample or |
split |
A numeric value strictly between 0 and 1 that determines in which proportions the sample will be split. |
verbose |
A logical value whether progress during estimation should be reported. |
iis_args |
A list with named entries corresponding to the arguments for
|
outlier_detection
returns an object of class
"robust2sls"
, which is a list with the following components:
$cons
A list which stores high-level information about the
function call and some results. $call
is the captured function call,
$formula
the formula argument, $data
the original data set,
$reference
the chosen reference distribution to classify outliers,
$sign_level
the significance level, $psi
the probability that
an observation is not classified as an outlier under the null hypothesis
of no outliers, $cutoff
the cutoff used to classify outliers if
their standardised residuals are larger than that value, $bias_corr
a bias correction factor to account for potential false positives
(observations classified as outliers even though they are not). There are
three further elements that are lists themselves. $initial
stores settings about the initial estimator:
$estimator
is the type of the initial estimator (e.g. robustified or
saturated), $split
how the sample is split (NULL
if argument
not used), $shuffle
whether the sample is shuffled before splitting
(NULL
if argument not used), $shuffle_seed
the value of the
random seed (NULL
if argument not used). $convergence
stores information about the convergence of the
outlier-detection algorithm:
$criterion
is the user-specified convergence criterion (NULL
if argument not used), $difference
is the L2 norm between the last
coefficient estimates and the previous ones (NULL
if argument not
used or only initial estimator calculated). $converged
is a logical
value indicating whether the algorithm has converged, i.e. whether the
difference is smaller than the convergence criterion (NULL
if
argument not used). $max_iter
is the maximum iteration set by the
user (NULL
if argument not used or not set). $iterations
contains information about the user-specified iterations
argument ($setting
) and the actual number of iterations that were
done ($actual
). The actual number can be lower if the algorithm
converged before the user-specified number of iterations was
reached.
$model
A list storing the model objects of class
ivreg for each iteration. Each model is stored under
$m0
, $m1
, ...
$res
A list storing the residuals of all observations for
each iteration. Residuals of observations where any of the y, x, or z
variables used in the 2SLS model are missing are set to NA. Each vector is
stored under $m0
, $m1
, ...
$stdres
A list storing the standardised residuals of all
observations for each iteration. Standardised residuals of observations
where any of the y, x, or z variables used in the 2SLS model are missing
are set to NA. Standardisation is done by dividing by sigma, which is not
adjusted for degrees of freedom. Each vector is stored under $m0
,
$m1
, ...
$sel
A list of logical vectors storing whether an observation
is included in the estimation or not. Observations are excluded (FALSE) if
they either have missing values in any of the x, y, or z variables needed
in the model or when they are classified as outliers based on the model.
Each vector is stored under $m0
, $m1
, ...
$type
A list of integer vectors indicating whether an
observation has any missing values in x, y, or z (-1
), whether it is
classified as an outlier (0
) or not (1
). Each vector is
stored under $m0
, $m1
, ...
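For a normal reference distribution, the quantities stored under `$psi`, `$cutoff` and `$bias_corr` can be illustrated with a few lines of base R. The sketch assumes a two-sided cutoff and that the bias correction is the inverse variance of a standard normal truncated at the cutoff; the package's exact definitions may differ.

```r
sign_level <- 0.05

# Probability of not being classified as an outlier under the null
psi <- 1 - sign_level

# Two-sided cutoff for the absolute standardised residual (assumption)
cutoff <- qnorm(1 - sign_level / 2)

# Variance of a standard normal truncated to [-cutoff, cutoff]
trunc_var <- 1 - 2 * cutoff * dnorm(cutoff) / psi

# Factor that scales the variance estimate from the trimmed sample back up
bias_corr <- 1 / trunc_var

# Numerical cross-check of the closed-form truncated variance
trunc_var_num <- integrate(function(e) e^2 * dnorm(e), -cutoff, cutoff)$value / psi
```

The factor exceeds one because residuals of observations that are wrongly removed would otherwise be missing from the variance estimate.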
Check Jiao (2019)
(as well as a forthcoming working paper) for the conditions that the
initial estimator should satisfy when
using initial_est == "user"
(e.g. it has to be Op(1)).
IIS is a generalisation of Saturated 2SLS
with
multiple block search but no asymptotic theory exists for IIS.
outliers
calculates the number of outliers from a "robust2sls"
object for a given iteration.
outliers(robust2sls_object, iteration)
robust2sls_object |
An object of class |
iteration |
An integer >= 0 representing the iteration for which the outliers are calculated. |
outliers
returns the number of outliers for a given iteration
as determined by the outlier-detection algorithm.
outliers_prop
calculates the proportion of outliers relative to all
non-missing observations in the full sample from a "robust2sls"
object
for a given iteration.
outliers_prop(robust2sls_object, iteration)
robust2sls_object |
An object of class |
iteration |
An integer >= 0 representing the iteration for which the outliers are calculated. |
outliers_prop
returns the proportion of outliers for a given
iteration as determined by the outlier-detection algorithm.
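Given the coding of the type vectors (-1 missing, 0 outlier, 1 not an outlier), the count, the proportion and the classification history of a single observation can be sketched in base R. The list below is a made-up stand-in for the `$type` component of a "robust2sls" object, not output from the package.

```r
# Illustrative type vectors for iterations 0 and 1 (made-up values)
type <- list(
  m0 = c(1L, 0L, 1L, -1L, 1L, 1L, 1L, 1L),
  m1 = c(1L, 0L, 1L, -1L, 1L, 0L, 1L, 1L)
)

# Number of outliers per iteration
n_outliers <- vapply(type, function(t) sum(t == 0L), integer(1))

# Proportion relative to all non-missing observations
prop_outliers <- vapply(type, function(t) sum(t == 0L) / sum(t != -1L), numeric(1))

# Classification history of observation 6 across iterations
history_obs6 <- vapply(type, function(t) t[6], integer(1))
```

Observations coded -1 are excluded from the denominator, so the proportion refers to the non-missing part of the sample only.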
Plot method for objects of class "robust2sls"
. Plots the
standardised residuals of non-missing observations for a given iteration of
the outlier-detection algorithm and distinguishes whether an observation is
classified as an outlier by colour.
## S3 method for class 'robust2sls'
plot(x, iteration = NULL, ...)
x |
An object of class |
iteration |
Either |
... |
Arguments to be passed to methods, see plot. |
plot.robust2sls
returns a graph of class
ggplot.
robust2sls
allows the user to create an object of class
"robust2sls"
by specifying the different components of the list. The
validator function validate_robust2sls
is called at the end to ensure
that the resulting object is a valid object of class
"robust2sls"
.
## S3 method for class 'robust2sls'
print(x, verbose = FALSE, ...)
x |
An object of class |
verbose |
A logical value, |
... |
Further arguments passed to or from other methods, see print. |
Printing summary output
Print method for objects of class "robust2sls"
. Prints a
high-level summary of the settings and results of the outlier-detection
algorithm.
No return value, prints model summary.
proptest()
conducts a test whether the false outlier detection rate
(FODR) in the sample deviates significantly from its expected value
(population FODR) under the null hypothesis that there are no outliers in the
sample.
proptest(robust2sls_object, alpha, iteration, one_sided = FALSE)
robust2sls_object |
An object of class |
alpha |
A numeric value between 0 and 1 representing the significance level of the test. |
iteration |
An integer >= 0 or the character "convergence" that determines which iteration is used for the test. |
one_sided |
A logical value whether a two-sided test ( |
See outlier_detection()
and
multi_cutoff()
for creating an object of class
"robust2sls"
or a list thereof.
proptest()
returns a data frame with the iteration (m) to be
tested, the actual iteration that was tested (generally coincides with the
iteration that was specified to be tested but is the convergent iteration if
the fixed point is tested), the setting of the probability of exceeding the
cut-off (gamma), the type of t-test (one- or two-sided), the value of the
test statistic, its p-value, the significance level alpha
, and the
decision. The number of rows of the data frame corresponds to the length of
the argument robust2sls_object
.
robustified_init
estimates the full sample 2SLS model, which is used
as the initial estimator for the iterative procedure.
robustified_init(data, formula, cutoff)
data |
A dataframe. |
formula |
A formula in the format |
cutoff |
A numeric cutoff value used to judge whether an observation is an outlier or not. If the absolute value of its standardised residual is larger than the cutoff value, the observation is classified as an outlier. |
robustified_init
returns a list with five elements. The first
four are vectors whose length equals the number of observations in the data
set. Unlike the residuals stored in a model object (usually accessible via
model$residuals
), it does not ignore observations where any of y, x
or z are missing. It instead sets their values to NA
.
The first element is a double vector containing the residuals for each
observation based on the model estimates. The second element contains the
standardised residuals, the third one a logical vector with TRUE
if
the observation is judged as not outlying, FALSE
if it is an outlier,
and NA
if any of y, x, or z are missing. The fourth element of the
list is an integer vector with three values: 0 if the observation is judged
to be an outlier, 1 if not, and -1 if missing. The fifth and last element
stores the ivreg
model object based on which the four
vectors were calculated.
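The structure of the first four elements, in particular the handling of missing values, can be sketched in base R. The function below is a hypothetical stand-in: it uses a hand-written 2SLS helper instead of an ivreg model, returns the coefficient vector rather than a model object, and runs on simulated data.

```r
# Hypothetical helpers; not the package's implementation.
tsls <- function(y, X, Z) {
  Xhat <- Z %*% solve(crossprod(Z), crossprod(Z, X))
  drop(solve(crossprod(Xhat, X), crossprod(Xhat, y)))
}

full_sample_init <- function(y, X, Z, cutoff) {
  complete <- complete.cases(y, X, Z)
  beta <- tsls(y[complete], X[complete, , drop = FALSE], Z[complete, , drop = FALSE])
  res <- drop(y - X %*% beta)            # NA where y or x is missing
  res[!complete] <- NA                   # also NA where only z is missing
  sigma <- sqrt(mean(res[complete]^2))   # not adjusted for degrees of freedom
  stdres <- res / sigma
  sel <- abs(stdres) <= cutoff           # TRUE = not outlying, NA = missing
  type <- ifelse(is.na(sel), -1L, ifelse(sel, 1L, 0L))
  list(res = res, stdres = stdres, sel = sel, type = type, beta = beta)
}

set.seed(4)
n <- 60
z <- rnorm(n); v <- rnorm(n)
x <- 0.8 * z + v
y <- 1 + 2 * x + 0.5 * v + rnorm(n)
y[3] <- NA; z[7] <- NA                   # two incomplete observations
out <- full_sample_init(y, cbind(1, x), cbind(1, z), cutoff = qnorm(0.975))
```

All four vectors keep the length of the original data, so they can be indexed by the original row numbers.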
saturated_init
splits the sample into two sub-samples. The 2SLS model
is estimated on both sub-samples and the estimates of one sub-sample are
used to calculate the residuals and hence outliers from the other sub-sample.
saturated_init(data, formula, cutoff, shuffle, shuffle_seed, split = 0.5)
data |
A dataframe. |
formula |
A formula in the format |
cutoff |
A numeric cutoff value used to judge whether an observation is an outlier or not. If the absolute value of its standardised residual is larger than the cutoff value, the observation is classified as an outlier. |
shuffle |
A logical value ( |
shuffle_seed |
A numeric value that sets the seed for shuffling the
data set before splitting it. Only used if |
split |
A numeric value strictly between 0 and 1 that determines in which proportions the sample will be split. |
saturated_init
returns a list with five elements. The first
four are vectors whose length equals the number of observations in the data
set. Unlike the residuals stored in a model object (usually accessible via
model$residuals
), it does not ignore observations where any of y, x
or z are missing. It instead sets their values to NA
.
The first element is a double vector containing the residuals for each
observation based on the model estimates. The second element contains the
standardised residuals, the third one a logical vector with TRUE
if
the observation is judged as not outlying, FALSE
if it is an outlier,
and NA
if any of y, x, or z are missing. The fourth element of the
list is an integer vector with three values: 0 if the observation is judged
to be an outlier, 1 if not, and -1 if missing. The fifth and last element
is a list with the two initial ivreg
model objects based
on the two different sub-samples.
The estimator may have bad properties if the split
is too unequal and
the sample size is not large enough.
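The split-sample idea can be sketched in base R: each sub-sample is judged using the estimates from the other one. This is an illustration on simulated data with a hypothetical `tsls()` helper; standardising by the scale estimate of the sub-sample that supplied the coefficients is an assumption, not a statement about the package internals.

```r
# Hypothetical helper; not a robust2sls function.
tsls <- function(y, X, Z) {
  Xhat <- Z %*% solve(crossprod(Z), crossprod(Z, X))
  drop(solve(crossprod(Xhat, X), crossprod(Xhat, y)))
}

set.seed(2)
n <- 400
z <- rnorm(n); v <- rnorm(n)
x <- 0.8 * z + v
y <- 1 + 2 * x + 0.5 * v + rnorm(n)
y[1:4] <- y[1:4] + 12                   # four artificial outliers
X <- cbind(1, x); Z <- cbind(1, z)

split <- 0.5
idx <- sample.int(n)                    # shuffle before splitting
n1 <- floor(n * split)
s1 <- idx[seq_len(n1)]
s2 <- idx[-seq_len(n1)]

b1 <- tsls(y[s1], X[s1, ], Z[s1, ])
b2 <- tsls(y[s2], X[s2, ], Z[s2, ])
sigma1 <- sqrt(mean(drop(y[s1] - X[s1, ] %*% b1)^2))
sigma2 <- sqrt(mean(drop(y[s2] - X[s2, ] %*% b2)^2))

res <- numeric(n)
stdres <- numeric(n)
res[s1] <- drop(y[s1] - X[s1, ] %*% b2)  # sub-sample 1 judged by estimates from 2
res[s2] <- drop(y[s2] - X[s2, ] %*% b1)  # sub-sample 2 judged by estimates from 1
stdres[s1] <- res[s1] / sigma2
stdres[s2] <- res[s2] / sigma1

sel <- abs(stdres) <= qnorm(0.975)       # TRUE = not outlying
```

With a very unequal `split`, one of the two models is estimated on few observations, which illustrates why the estimator can behave poorly in small samples.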
selection_iis
uses the data and isat model object to create a list
with five elements that are used to determine whether the observations are
judged as outliers or not.
selection_iis(x, data, yvar, complete, rownames_orig, refmodel)
x |
An object of class |
data |
A dataframe. |
yvar |
A character vector of length 1 that refers to the name of the dependent variable in the data set. |
complete |
A logical vector with the same length as the number of observations in the data set that specifies whether an observation has any missing values in any of y, x, or z variables. |
rownames_orig |
A character vector storing the original rownames of the dataframe. |
refmodel |
A model object that will be stored in |
A list with five elements. The first four are vectors whose length
equals the number of observations in the data set. Unlike the residuals
stored in a model object (usually accessible via model$residuals
), it
does not ignore observations where any of y, x or z are missing. It instead
sets their values to NA
.
The first element is a double vector containing the residuals for each
observation based on the model estimates. The second element contains the
standardised residuals, the third one a logical vector with TRUE
if
the observation is judged as not outlying, FALSE
if it is an outlier,
and NA
if any of y, x, or z are missing. The fourth element of the
list is an integer vector with three values: 0 if the observation is judged
to be an outlier, 1 if not, and -1 if missing. The fifth and last element
stores the ivreg
model object based on which the four
vectors were calculated.
IIS runs multiple models, similar to saturated_init
but with
multiple block search. These intermediate models are not recorded. For
simplicity, the element $model
of the returned list stores the full
sample model result, identical to robustified_init
.
Unlike the residuals stored in a model object (usually accessible via
model$residuals
), this function returns vectors of the same length as
the original data set even if any of the y, x, or z variables are missing.
The residuals for those observations are set to NA
.
sumtest()
uses the estimations across several cut-offs to test whether
the sum of the deviations between sample and population FODR differs
significantly from its expected value.
\[\sum_{k = 1}^{K} \sqrt{n}(\widehat{\gamma}_{c_{k}} - \gamma_{c_{k}}) \]
sumtest(robust2sls_object, alpha, iteration, one_sided = FALSE)
robust2sls_object |
A list of |
alpha |
A numeric value between 0 and 1 representing the significance level of the test. |
iteration |
An integer >= 0 or the character "convergence" that determines which iteration is used for the test. |
one_sided |
A logical value whether a two-sided test ( |
sumtest()
returns a data frame with one row storing the
iteration that was tested, the value of the test statistic (t-test), the
type of the test (one- or two-sided), the corresponding p-value, the
significance level, and whether the null hypothesis is rejected. The data
frame also contains an attribute named "gammas"
that records which
gammas determining the different cut-offs were used in the scaling sum test.
suptest()
uses the estimations across several cut-offs to test whether
the supremum/maximum of the deviations between sample and population FODR
differs significantly from its expected value.
\[ \sup_{c} |\sqrt{n}(\widehat{\gamma}_{c} - \gamma_{c})| \]
suptest(robust2sls_object, alpha, iteration, p = c(0.9, 0.95, 0.99), R = 50000)
robust2sls_object |
A list of |
alpha |
A numeric value between 0 and 1 representing the significance level of the test. |
iteration |
An integer >= 0 or the character "convergence" that determines which iteration is used for the test. |
p |
A numeric vector of probabilities with values in [0,1] for which the corresponding quantiles are calculated. |
R |
An integer specifying the number of replications for simulating the distribution of the test statistic. |
suptest()
returns a data frame with one row storing the
iteration that was tested, the value of the test statistic, the corresponding
p-value, the significance level, and whether the null hypothesis is rejected.
The data frame also contains two named attributes. The first attribute is
named "gammas"
and records which gammas determining the different
cut-offs were used in the scaling sup test. The second attribute is named
"critical"
and records the critical values corresponding to the
different quantiles in the limiting distribution that were specified in
p
.
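The raw quantities behind the sum and sup statistics displayed for `sumtest()` and `suptest()` can be sketched in base R. The example uses simulated standard normal draws as a stand-in for standardised residuals from an outlier-free sample, assumes two-sided normal cutoffs, and takes the maximum over a finite grid of cutoffs. It does not compute the asymptotic variance, critical values or p-values that the package functions report.

```r
set.seed(3)
n <- 2000
stdres <- rnorm(n)                          # stand-in for standardised residuals

gammas <- c(0.01, 0.025, 0.05, 0.10)        # population FODR per cutoff
cutoffs <- qnorm(1 - gammas / 2)            # two-sided normal cutoffs (assumption)

# Sample FODR: share of observations beyond each cutoff
gamma_hat <- vapply(cutoffs, function(c_k) mean(abs(stdres) > c_k), numeric(1))

# Scaled deviations between sample and population FODR
dev <- sqrt(n) * (gamma_hat - gammas)

sum_stat <- sum(dev)                        # numerator of the sum statistic
sup_stat <- max(abs(dev))                   # sup statistic over the cutoff grid
```

Because the deviations at different cutoffs are computed from the same residuals, they are correlated, which is why the package simulates the distribution of the sup statistic instead of relying on a standard table.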
user_init
uses a model supplied by the user as the initial estimator.
Based on this estimator, observations are classified as outliers or not.
user_init(data, formula, cutoff, user_model)
data |
A dataframe. |
formula |
A formula in the format |
cutoff |
A numeric cutoff value used to judge whether an observation is an outlier or not. If the absolute value of its standardised residual is larger than the cutoff value, the observation is classified as an outlier. |
user_model |
A model object of class ivreg whose parameters are used to calculate the residuals. |
user_init
returns a list with five elements. The first
four are vectors whose length equals the number of observations in the data
set. Unlike the residuals stored in a model object (usually accessible via
model$residuals
), it does not ignore observations where any of y, x
or z are missing. It instead sets their values to NA
.
The first element is a double vector containing the residuals for each
observation based on the model estimates. The second element contains the
standardised residuals, the third one a logical vector with TRUE
if
the observation is judged as not outlying, FALSE
if it is an outlier,
and NA
if any of y, x, or z are missing. The fourth element of the
list is an integer vector with three values: 0 if the observation is judged
to be an outlier, 1 if not, and -1 if missing. The fifth and last element
stores the ivreg
user-specified model object based on
which the four vectors were calculated.
Check Jiao (2019) for the conditions that the initial estimator should satisfy (e.g. it has to be Op(1)).
validate_robust2sls
checks that the input is a valid object of
class "robust2sls"
.
validate_robust2sls(x)
x |
An object whose validity of class |
If the object is a valid "robust2sls"
object then the function
returns the object. No return value otherwise.
varrho
calculates the coefficients for the asymptotic variance of the
gauge (false outlier detection rate) for a specific iteration m >= 1.
varrho(sign_level, ref_dist = c("normal"), iteration)
sign_level |
A numeric value between 0 and 1 that determines the cutoff in the reference distribution against which observations are judged as outliers or not. |
ref_dist |
A character vector that specifies the reference distribution
against which observations are classified as outliers. |
iteration |
An integer >= 1 that specifies the iteration of the outlier detection algorithm. |
varrho
returns a list with four components, all of which are
lists themselves. $setting
stores the arguments with which the
function was called. $c
stores the values of the six different
coefficients for the specified iteration. $fp
contains the fixed point
versions of the six coefficients. $aux
stores intermediate values
required for calculating the coefficients.