Resampling Methods

Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.

Resampling can be used for doing one of the following:

  1. Estimating the precision of sample statistics (median, variance, percentiles).
  2. Validating models by using random subsets.
  3. Exchanging labels on data points when performing significance tests.
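
The third use, exchanging labels for significance testing, is the idea behind the permutation test. As a minimal sketch (the data and the choice of the mean difference as test statistic are hypothetical): shuffle the pooled group labels many times, recompute the statistic under each shuffle, and see how often a difference as extreme as the observed one occurs by chance.

```python
import numpy as np

def perm_test_pvalue(a, b, n_perm=2000, seed=0):
    """Two-sample permutation test on the difference in group means.

    Labels are exchanged by shuffling the pooled sample and re-splitting it
    into two groups of the original sizes; the p-value is the fraction of
    shuffles whose mean difference is at least as extreme as the observed one.
    """
    rng = np.random.default_rng(seed)
    observed = np.mean(a) - np.mean(b)
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                       # exchange the labels
        diff = np.mean(pooled[:len(a)]) - np.mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            count += 1
    return count / n_perm
```

With two clearly separated groups this returns a p-value near zero; with two samples from the same distribution it tends to be large.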

I would like to shed some light on two of the most commonly used resampling methods: cross-validation and the bootstrap.

Given a data set, the use of a particular statistical learning method is warranted if it results in a low test error. The test error can be easily calculated if a designated test set is available. Unfortunately, this is usually not the case. In the absence of a very large designated test set that can be used to directly estimate the test error rate, a number of techniques can be used to estimate this quantity using the available training data.

  • Validation set approach 

It involves randomly dividing the available set of observations into two parts, a training set, and a validation set or hold-out set. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation set error rate—typically assessed using mean squared error in the case of a quantitative response—provides an estimate of the test error rate.
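
The approach above can be sketched in a few lines of NumPy. The data, the 50/50 split, and the model (a straight-line fit via `np.polyfit`) are all hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a linear signal plus noise (hypothetical example)
x = rng.uniform(0, 10, size=100)
y = 2 * x + rng.normal(scale=1.0, size=100)

# Randomly divide the observations into a training set and a validation set
idx = rng.permutation(len(x))
train_idx, val_idx = idx[:50], idx[50:]

# Fit the model on the training half only
slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)

# The validation-set MSE serves as an estimate of the test error rate
pred = slope * x[val_idx] + intercept
val_mse = np.mean((y[val_idx] - pred) ** 2)
```

Rerunning this with a different random seed changes which observations land in each half, and so changes `val_mse`, which is exactly the variability drawback discussed next.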

The validation set approach is conceptually simple and is easy to implement. But it has two potential drawbacks:

1.  The validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.

2.  In the validation approach, only a subset of the observations—those that are included in the training set rather than in the validation set—are used to fit the model. Since statistical methods tend to perform worse when trained on fewer observations, this suggests that the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.

  • Leave-One-Out Cross-Validation 

Leave-one-out cross-validation (LOOCV) is closely related to the validation set approach. LOOCV involves splitting the set of observations into two parts. However, instead of creating two subsets of comparable size, a single observation is used for the validation set, and the remaining observations make up the training set. The statistical learning method is fit on the training observations, and a prediction is made for the excluded observation. The process is repeated so that each observation serves once as the validation set, and the LOOCV estimate of the test mean squared error is the average of the resulting squared errors.
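
A minimal sketch of this loop, again using a hypothetical straight-line fit as the statistical learning method:

```python
import numpy as np

def loocv_mse(x, y):
    """LOOCV estimate of the test MSE for a simple linear fit.

    Each observation serves once as the validation set; the model is refit
    n times on the remaining n - 1 observations.
    """
    n = len(x)
    errors = []
    for i in range(n):
        mask = np.arange(n) != i                  # leave observation i out
        slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
        pred = slope * x[i] + intercept           # predict the held-out point
        errors.append((y[i] - pred) ** 2)
    return np.mean(errors)                        # average of the n squared errors
```

Note that there is no random splitting anywhere in the function, so calling it twice on the same data gives identical results.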

LOOCV has a couple of major advantages over the validation set approach. First, it has far less bias, since each training set contains n − 1 observations, almost as many as the entire data set. Consequently, the LOOCV approach tends not to overestimate the test error rate as much as the validation set approach does. Second, in contrast to the validation approach, which will yield different results when applied repeatedly due to randomness in the training set splits, performing LOOCV multiple times will always yield the same results: there is no randomness in the training set splits.

  • K-fold Cross Validation

This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. Cross-validation is a very general approach that can be applied to almost any statistical learning method.
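
The k-fold procedure can be sketched as follows, reusing the hypothetical straight-line fit from before (the default k = 5 is a common but arbitrary choice):

```python
import numpy as np

def kfold_mse(x, y, k=5, seed=0):
    """k-fold CV estimate of the test MSE for a simple linear fit.

    The observations are randomly partitioned into k folds of roughly equal
    size; each fold serves once as the validation set.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)                # k groups of ~equal size
    fold_errors = []
    for i in range(k):
        val = folds[i]                            # the held-out fold
        train = np.concatenate(folds[:i] + folds[i + 1:])
        slope, intercept = np.polyfit(x[train], y[train], deg=1)
        pred = slope * x[val] + intercept
        fold_errors.append(np.mean((y[val] - pred) ** 2))
    return np.mean(fold_errors)                   # average over the k folds
```

Setting k = n recovers LOOCV; smaller k means fewer model refits and hence less computation.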

K-fold cross-validation is computationally faster than LOOCV, since the model is refit only k times rather than n times. It also offers a favorable bias-variance trade-off: with k = 5 or 10, its estimates of the test error rate have somewhat more bias than LOOCV's but lower variance, and less bias than the validation set approach.

Stay tuned! The bootstrap is coming in the next blog post.

P.S.: Feedback is welcome 🙂
