Stratified K Fold cross validation, Shuffle split cross validation, Nested cross validation

  • Stratified K Fold cross-validation

Stratification is the process of rearranging the data so that each fold is a good representative of the whole. For example, in a binary classification problem where each class comprises 50% of the data, it is best to arrange the data so that in every fold each class comprises around half the instances.

Stratification is generally a better scheme, both in terms of bias and variance, when compared to regular cross-validation.
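A minimal sketch of this behaviour with scikit-learn's StratifiedKFold (the toy 90/10 class ratio below is an assumption for illustration): each test fold retains roughly the overall class proportions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 90 samples of class 0 and 10 of class 1 (illustrative only).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # feature values do not affect the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each test fold of 20 samples contains about 18 zeros and 2 ones,
    # preserving the 90/10 ratio of the full dataset.
    print(fold, np.bincount(y[test_idx]))
```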

  • Difference in KFold and ShuffleSplit output

KFold will divide your dataset into a prespecified number of folds, and every sample must be in one and only one fold. A fold is a subset of your dataset.

ShuffleSplit will randomly sample your entire dataset during each iteration to generate a training set and a test set. The test_size and train_size parameters control how large the test and training sets should be for each iteration. Since you are sampling from the entire dataset during each iteration, a sample selected during one iteration can be selected again during another iteration.

Summary: ShuffleSplit draws a new random split on each iteration, while KFold just divides the dataset once into k disjoint folds.
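A small sketch of the difference, using scikit-learn's KFold and ShuffleSplit on ten dummy samples (the split sizes below are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(10).reshape(-1, 1)  # ten dummy samples

# KFold: every sample appears in exactly one test fold across the 5 rounds.
kf = KFold(n_splits=5)
print([test.tolist() for _, test in kf.split(X)])

# ShuffleSplit: each round is an independent random train/test draw, so the
# same sample may appear in the test set of several rounds.
ss = ShuffleSplit(n_splits=5, train_size=0.5, test_size=0.3, random_state=0)
print([test.tolist() for _, test in ss.split(X)])
```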

Difference when doing validation

In KFold, each round uses one fold as the test set and all the remaining folds as the training set. In ShuffleSplit, each round n uses only the training and test sets drawn in iteration n. As your dataset grows, cross-validation time increases, making ShuffleSplit a more attractive alternative: if you can train your algorithm on a certain percentage of your data, as opposed to using all k-1 folds, ShuffleSplit is an attractive option.
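For instance, a ShuffleSplit that trains on only half of the data per round could be plugged into cross_val_score like this (the dataset and model below are placeholders for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)

# Train on only 50% of the data per round instead of the (k-1)/k share
# that KFold would use; this keeps each round cheap on large datasets.
cv = ShuffleSplit(n_splits=5, train_size=0.5, test_size=0.2, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())
```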

  • Nested Cross-validation

The purpose of cross-validation is, in some sense, to obtain an unbiased estimate of generalization performance. When the same data are also used to tune the model, this calls for nested cross-validation.

Nested k-fold cross-validation has two loops: for every outer fold i, a k-fold cross-validation is nested inside its training portion. Nested cross-validation is used to avoid the optimistically biased estimates of performance that result from using the same cross-validation both to set the hyper-parameters of the model (e.g. the regularisation parameter C and the kernel parameters of an SVM) and to estimate performance.

The reasons for the bias, with illustrative examples and experimental evaluation, can be found in the paper, but essentially the point is that if the performance evaluation criterion is used in any way to make choices about the model, then those choices are based partly on (i) genuine improvements in generalisation performance and partly on (ii) the statistical peculiarities of the particular sample of data on which the criterion is evaluated. In other words, the bias arises because it is possible (all too easy) to over-fit the cross-validation error when tuning the hyper-parameters.

If you split the data into 2 folds and perform two different tasks on each fold, you are not doing cross-validation. The key thing to remember is that whatever you want to do, you have to perform it independently for every fold i ∈ {1, …, k}.

Thus, the correct procedure is:

In the inner loop, you select the best hyper-parameters by training models with different hyper-parameter settings on the inner training folds and testing them on the inner test fold. Then use the best model selected in the inner loop to evaluate on the corresponding outer test fold. The average of the outer-loop results is the estimated performance of your model.
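One way to sketch this in scikit-learn is to wrap a GridSearchCV (the inner loop) inside cross_val_score (the outer loop); the SVM, parameter grid, and fold counts below are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: GridSearchCV selects C and gamma on the inner training folds.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
clf = GridSearchCV(SVC(), param_grid, cv=inner_cv)

# Outer loop: each round reruns the whole inner search on the outer training
# fold and scores the refitted best model on the held-out outer test fold.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
nested_scores = cross_val_score(clf, X, y, cv=outer_cv)
print(nested_scores.mean())  # estimate of generalization performance
```

Note that the hyper-parameters chosen in each outer round may differ; the nested score estimates the performance of the whole tuning procedure rather than of one fixed model.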

  • Nested cross validation for model selection

  • Detailed implementation of nested CV