Splitting Strategies Gallery

Choosing the right cross-validation object is a crucial part of benchmarking a model properly. Data can be split into training, validation, and test sets in many ways, for example to avoid overfitting or to standardize the number of groups in each test set.

This example visualizes the behavior of several common splitting strategies for comparison.

Visualize our data

First, we must understand the structure of our data. It has 100 randomly generated input datapoints, each assigned to one of 3 classes of unequal size and to one of 10 “groups” of equal size.
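A dataset with this shape could be generated roughly as follows. This is only a sketch: the seed, the number of features, the 10/30/60 class sizes, and the contiguous layout of the groups are assumptions made for illustration, not values taken from this page.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

n_points = 100
# 10 features per datapoint is an assumption; the splitters only need the length.
X = rng.standard_normal((n_points, 10))

# Three classes of unequal size (assumed proportions: 10 / 30 / 60).
class_sizes = [10, 30, 60]
y = np.repeat(np.arange(len(class_sizes)), class_sizes)

# Ten groups of ten consecutive datapoints each.
n_groups = 10
groups = np.repeat(np.arange(n_groups), n_points // n_groups)

print("class sizes:", np.bincount(y))
print("group sizes:", np.bincount(groups))
```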

As we’ll see, some cross-validation objects make use of the class labels, others make use of the groups, and others ignore both.

To begin, we’ll visualize our data:

../../_images/a92bd9fc13ff0790af1116a91393ff9d426f602501f12124bc9230e9edc17fcd.png

Define a function to visualize splitting behavior

We’ll define a function that lets us visualize the behavior of each splitting strategy. We’ll perform 4 splits of the data. On each split, we’ll visualize the indices chosen for the training set (in blue), the validation set (in grey), and the test set (in red).
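The bookkeeping behind such a plot can be sketched without any plotting library. The helper below is hypothetical (the names `split_membership`, `COLOR_CODE`, and `toy_folds` are not part of any library). It assumes each split arrives as a `(train, validation, test)` triple of index collections and returns one row of colour codes per split, which a plotting routine could then draw.

```python
# One-letter codes matching the colours described above.
COLOR_CODE = {"train": "B", "validation": "G", "test": "R", "unused": "."}


def split_membership(folds, n_samples):
    """Return one string per split, with one colour code per sample index.

    `folds` is any iterable of (train, validation, test) index collections.
    Indices that appear in none of the three sets are marked as unused.
    """
    rows = []
    for train_idx, validation_idx, test_idx in folds:
        row = [COLOR_CODE["unused"]] * n_samples
        for name, indices in (
            ("train", train_idx),
            ("validation", validation_idx),
            ("test", test_idx),
        ):
            for i in indices:
                row[i] = COLOR_CODE[name]
        rows.append("".join(row))
    return rows


# Hand-written toy folds over 8 samples, purely for illustration.
toy_folds = [
    (range(0, 4), range(4, 6), range(6, 8)),
    (range(4, 8), range(2, 4), range(0, 2)),
]
for row in split_membership(toy_folds, n_samples=8):
    print(row)
# prints:
# BBBBGGRR
# RRGGBBBB
```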

Let’s see how it looks for the k_fold cross-validation object:

UserWarning: The groups parameter is ignored by KFold

../../_images/78c37f91d46c2d26b9e91484b23a5786ccf3f6361602c6a6d2f6635b57bd36a3.png

As you can see, by default the k_fold cross-validation iterator takes neither the datapoint class nor the group into consideration. We can change this by using either:

stratified_k_fold to preserve the percentage of samples for each class.
group_k_fold to ensure that the same group will not appear in two different folds.
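The warnings on this page come from scikit-learn's `KFold` and `StratifiedKFold`, so the difference between the three strategies can be sketched with scikit-learn directly. The sketch assumes the strategies named here behave like the corresponding scikit-learn classes; the class sizes and group layout are illustrative assumptions. Note that scikit-learn's splitters return only train and test indices, with no validation set.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, StratifiedKFold

X = np.zeros((100, 1))                    # feature values do not affect these splitters
y = np.repeat([0, 1, 2], [10, 30, 60])    # assumed class sizes
groups = np.repeat(np.arange(10), 10)     # ten groups of ten consecutive samples

# Plain k-fold cuts contiguous blocks, so the class mix of each test fold is arbitrary.
for _, test_idx in KFold(n_splits=4).split(X):
    print("k-fold test class counts:    ", np.bincount(y[test_idx], minlength=3))

# Stratified k-fold keeps each class at roughly its overall share in every test fold.
for _, test_idx in StratifiedKFold(n_splits=4).split(X, y):
    print("stratified test class counts:", np.bincount(y[test_idx], minlength=3))

# Group k-fold never places samples from one group on both sides of a split.
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    shared = set(groups[train_idx]) & set(groups[test_idx])
    print("groups in test fold:", np.unique(groups[test_idx]), "shared with train:", len(shared))
```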

UserWarning: The groups parameter is ignored by StratifiedKFold

../../_images/bf51146675a16b71425e8ad93d92a9ff850937a4d4b80f45b7dd55d0a7b45d29.png

../../_images/81c58f04496cd1c092ce4d3e4f99ccb63afc1b77b556d1e1d53ad2ad6a74074a.png

Next we’ll visualize this behavior for a number of splitting iterators.

Visualize splitting behavior for many splitting strategies

Let’s visually compare the splitting and cross-validation behavior of several splitting strategies. Below we loop through a number of common strategies and visualize the behavior of each.

Note how some use the group/class information while others do not:

UserWarning: The groups parameter is ignored by KFold
UserWarning: The groups parameter is ignored by StratifiedKFold
UserWarning: The groups parameter is ignored by TimeSeriesSplit

../../_images/d2520128002ae239312fbb0170826484bceb88435ee63b39030d813d01c03c69.png

../../_images/e2215537d8730ec946d1e110714ad939adff719d3819746889205886d5c14b0a.png

../../_images/e8640af035c1f32777cb1c4cb459673696cb4bacc5b68a7589bb55e2817443ec.png

../../_images/2b79dacc60dcec0718404b9d902a7fc2debabe84b21121eb0432398bed01fea3.png

../../_images/2a68e32f9783a10d980a425e3b906f95789796dfbb33860c31aebc1aa8ea721f.png
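One way to check numerically which strategies use class or group information is to change the labels and see whether the splits change. The sketch below does this with scikit-learn splitters. The choice of splitters, the dictionary keys other than k_fold, stratified_k_fold, and group_k_fold, the helper `collect_test_sets`, and the alternative labellings are all assumptions made for illustration.

```python
import warnings

import numpy as np
from sklearn.model_selection import (
    GroupKFold,
    KFold,
    ShuffleSplit,
    StratifiedKFold,
    TimeSeriesSplit,
)

n = 100
X = np.zeros((n, 1))
y = np.repeat([0, 1, 2], [10, 30, 60])   # assumed class sizes
groups = np.repeat(np.arange(10), 10)    # contiguous groups of ten

# Alternative labellings, used only to probe whether a splitter reacts to them.
y_alt = y[::-1].copy()                   # same classes, reversed order
groups_alt = np.arange(n) % 10           # same group sizes, interleaved

splitters = {
    "k_fold": KFold(n_splits=4),
    "stratified_k_fold": StratifiedKFold(n_splits=4),
    "group_k_fold": GroupKFold(n_splits=4),
    "shuffle_split": ShuffleSplit(n_splits=4, test_size=0.25, random_state=0),
    "time_series_split": TimeSeriesSplit(n_splits=4),
}


def collect_test_sets(splitter, labels, group_labels):
    """Return the test indices of every split as plain lists."""
    with warnings.catch_warnings():
        # Silence "groups parameter is ignored" notices from newer scikit-learn.
        warnings.simplefilter("ignore", UserWarning)
        return [test.tolist() for _, test in splitter.split(X, labels, group_labels)]


report = {}
for name, splitter in splitters.items():
    baseline = collect_test_sets(splitter, y, groups)
    report[name] = {
        # The splits change only if the splitter actually reads the labels.
        "uses_classes": collect_test_sets(splitter, y_alt, groups) != baseline,
        "uses_groups": collect_test_sets(splitter, y, groups_alt) != baseline,
    }
    print(name, report[name])
```

Comparing splits under changed labels is more reliable than checking whether groups happen to stay together: with contiguous groups, a splitter that cuts at regular intervals can keep groups intact by coincidence.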