Basic usage

In this section, we will illustrate how to use molflux.splits. These examples will provide you with a starting point.

Browsing

First, let’s look at which splitting strategies are available. They are grouped into categories (for example, core and rdkit). To view the catalogue, run

from molflux.splits import list_splitting_strategies

catalogue = list_splitting_strategies()

print(catalogue)
{'core': ['group_k_fold', 'group_shuffle_split', 'k_fold', 'leave_one_group_out', 'leave_p_groups_out', 'linear_split', 'linear_split_with_rotation', 'ordered_split', 'shuffle_split', 'stratified_k_fold', 'stratified_ordered_split', 'stratified_shuffle_split', 'time_series_split'], 'openeye': ['scaffold'], 'rdkit': ['scaffold_rdkit', 'tanimoto_rdkit']}

This returns a dictionary of the available splitting strategies, organised by category and name. By default, molflux.splits comes with the core splitters (such as shuffle_split and k_fold). You can get more splitting strategies by pip installing extra packages (such as rdkit). To see how you can add your own splitting strategy, see How to add your own splitting strategy.
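Because the catalogue is a plain dictionary of lists, it can be searched with ordinary Python. The sketch below copies the catalogue from the printed output above instead of calling list_splitting_strategies(), so it runs without molflux installed. The helper find_category is a hypothetical name used for illustration and is not part of molflux.

```python
# Catalogue copied from the printed output above; in practice it would come
# from molflux.splits.list_splitting_strategies().
catalogue = {
    'core': ['group_k_fold', 'group_shuffle_split', 'k_fold',
             'leave_one_group_out', 'leave_p_groups_out', 'linear_split',
             'linear_split_with_rotation', 'ordered_split', 'shuffle_split',
             'stratified_k_fold', 'stratified_ordered_split',
             'stratified_shuffle_split', 'time_series_split'],
    'openeye': ['scaffold'],
    'rdkit': ['scaffold_rdkit', 'tanimoto_rdkit'],
}


def find_category(catalogue, name):
    """Return the category that lists `name`, or None if it is absent.

    Hypothetical helper for illustration; not a molflux function.
    """
    for category, names in catalogue.items():
        if name in names:
            return category
    return None


print(find_category(catalogue, 'k_fold'))          # core
print(find_category(catalogue, 'scaffold_rdkit'))  # rdkit
print(find_category(catalogue, 'not_a_strategy'))  # None
```

A check like this can be handy before loading a strategy that depends on an optional package.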

Loading splitting strategies

Loading a molflux.splits strategy is straightforward:

from molflux.splits import load_splitting_strategy

strategy = load_splitting_strategy('shuffle_split')

print(strategy)
SplittingStrategy(
	name: "shuffle_split",
	tag: "shuffle_split",
	signature: self.split(dataset: collections.abc.Sized, y: collections.abc.Iterable | None = None, groups: collections.abc.Iterable | None = None, *, n_splits: int = 1, train_fraction: float = 0.8, validation_fraction: float = 0.1, test_fraction: float = 0.1, seed: int | None = None, **kwargs: Any) -> collections.abc.Iterator[tuple[collections.abc.Iterable[int], collections.abc.Iterable[int], collections.abc.Iterable[int]]],
	description: """
Random permutation cross-validator.
""",
	usage: """
        Args:
            dataset: The data to be split.
            y (optional): The target variable for supervised learning problems.
            groups (optional): Group labels for the samples used while splitting the dataset.
            n_splits (optional): The number of splits to generate. Defaults to 1.
            train_fraction (optional): The proportion of the dataset to include in the train split.
                Defaults to 0.8.
            validation_fraction: The proportion of the dataset to include in the validation split.
                Defaults to 0.1.
            test_fraction: The proportion of the dataset to include in the test split.
                Defaults to 0.1.
            seed (optional): Controls the shuffling applied to the data before applying
                the split. Pass an int for reproducible output across multiple function calls.
                Defaults to None.
        """
	state: {}
)

Printing the loaded strategy gives you more information about it. Each splitting strategy has a name and a tag (to identify it). You can also see the optional splitting arguments (and their default values) in the signature, along with a short description of the strategy.

You can also load a splitting strategy from a config. A splitting strategy config is a dictionary specifying the strategy to be loaded. A config dictionary must have the following format

splitting_strategy_dict = {
    'name': '<name of the strategy>',
    'config': '<kwargs for instantiating strategy>',
    'presets': '<kwarg presets for splitting>'
}

The name key specifies the name of the splitting strategy to load from the catalogue. The config key specifies the arguments needed for instantiating the splitting strategy, and the presets key specifies preset keyword arguments to apply when splitting (for example, the train and test fractions). If neither config nor presets is specified, the splitting strategy will use default values. Loading from a config is done using the load_from_dict function.

from molflux.splits import load_from_dict

config = {
    'name': 'shuffle_split',
    'presets':
        {
            'train_fraction': 0.8,
            'validation_fraction': 0.0,
            'test_fraction': 0.2,
        }
}

strategy = load_from_dict(config)

print(strategy.state)
{'train_fraction': 0.8, 'validation_fraction': 0.0, 'test_fraction': 0.2}
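A config dictionary can be sanity-checked against the format described above before it is handed to load_from_dict. The following is a minimal sketch under stated assumptions: the helper check_strategy_config is hypothetical and not part of molflux, and rejecting keys other than name, config and presets is an assumption of this sketch. molflux may apply its own, different validation.

```python
# Keys named in the config format above. Treating any other key as an error
# is an assumption of this sketch, not documented molflux behaviour.
ALLOWED_KEYS = {'name', 'config', 'presets'}


def check_strategy_config(config):
    """Raise ValueError if `config` does not follow the format above.

    Hypothetical helper for illustration; not a molflux function.
    """
    if 'name' not in config:
        raise ValueError("missing required key 'name'")
    unknown = set(config) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    # 'config' and 'presets' are optional, but hold keyword arguments if given.
    for key in ('config', 'presets'):
        if key in config and not isinstance(config[key], dict):
            raise ValueError(f"'{key}' must be a dictionary of keyword arguments")
    return config


config = {
    'name': 'shuffle_split',
    'presets': {
        'train_fraction': 0.8,
        'validation_fraction': 0.0,
        'test_fraction': 0.2,
    },
}

check_strategy_config(config)  # passes silently for a well-formed config
```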

For convenience, you can also load a group of strategies all at once by specifying a list of configs.

from molflux.splits import load_from_dicts

config = [
    {
        'name': 'shuffle_split',
        'config':
            {
                'tag': 'train_test_shuffle',
            },
        'presets':
            {
                'train_fraction': 0.8,
                'validation_fraction': 0.0,
                'test_fraction': 0.2,
            }
    },
    {
        'name': 'shuffle_split',
        'config':
            {
                'tag': 'train_val_test_shuffle',
            },
        'presets':
            {
                'train_fraction': 0.7,
                'validation_fraction': 0.2,
                'test_fraction': 0.1,
            }
    }
]

strategies = load_from_dicts(config)

print(strategies)
{'train_test_shuffle': <molflux.splits.strategies.core.shuffle_split.ShuffleSplit object at 0x7fd3ac9de310>, 'train_val_test_shuffle': <molflux.splits.strategies.core.shuffle_split.ShuffleSplit object at 0x7fd3ac9deb10>}
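When several configs differ only in their tag and fractions, the list can be built programmatically. The sketch below rebuilds the two configs shown above. The helper make_shuffle_config is a hypothetical name, not a molflux function. Since the returned dictionary is keyed by tag in the output above, the sketch also checks that the tags are unique.

```python
def make_shuffle_config(tag, train_fraction, validation_fraction, test_fraction):
    """Build a shuffle_split config dictionary in the format described above.

    Hypothetical helper for illustration; not a molflux function.
    """
    return {
        'name': 'shuffle_split',
        'config': {'tag': tag},
        'presets': {
            'train_fraction': train_fraction,
            'validation_fraction': validation_fraction,
            'test_fraction': test_fraction,
        },
    }


configs = [
    make_shuffle_config('train_test_shuffle', 0.8, 0.0, 0.2),
    make_shuffle_config('train_val_test_shuffle', 0.7, 0.2, 0.1),
]

# The loaded strategies are keyed by tag in the output above, so duplicate
# tags would presumably collide; check for uniqueness up front.
tags = [c['config']['tag'] for c in configs]
if len(set(tags)) != len(tags):
    raise ValueError(f"duplicate tags: {tags}")

print(tags)

# With molflux installed, the list could then be passed on as shown above:
# strategies = load_from_dicts(configs)
```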

Finally, you can load strategies from a yaml file. You can use a single yaml file which includes configs for all the molflux tools and molflux.splits will know how to extract the relevant document it needs. To do so, you need to define a yaml file with the following example document

---
version: v1
kind: splits
specs:
    - name: k_fold
      presets:
          n_splits: 5

...

It consists of a version (this is the version of the config format, for now just v1), kind of config (in this case splits), and specs. specs is where the configs are defined. The yaml file can include configs for other molflux modules as well. To load this yaml file, you can simply do

from molflux.splits import load_from_yaml

path_to_yaml_file = 'path/to/config.yaml'  # placeholder: location of the yaml file shown above
strategies = load_from_yaml(path_to_yaml_file)

print(strategies)
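To try the loader without creating a file by hand, the example document can be written to a temporary location first. The sketch below uses only the standard library. The file name splits_config.yaml is an arbitrary choice, and the call to load_from_yaml is left as a comment so that the snippet runs without molflux installed.

```python
import tempfile
from pathlib import Path

# The example document from above, copied verbatim.
yaml_document = """\
---
version: v1
kind: splits
specs:
    - name: k_fold
      presets:
          n_splits: 5

...
"""

# Write the document into a fresh temporary directory.
tmp_dir = Path(tempfile.mkdtemp())
path_to_yaml_file = str(tmp_dir / 'splits_config.yaml')  # arbitrary file name
Path(path_to_yaml_file).write_text(yaml_document)

print(path_to_yaml_file)

# With molflux installed, the file could then be loaded as shown above:
# from molflux.splits import load_from_yaml
# strategies = load_from_yaml(path_to_yaml_file)
```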

Splitting

After loading a splitting strategy, you can apply it to any array-like object to get the split indices.

from molflux.splits import load_splitting_strategy

strategy = load_splitting_strategy('shuffle_split')

folds = strategy.split(range(100))

for train_indices, validation_indices, test_indices in folds:
    print("TRAIN: ", train_indices)
    print("VALIDATION: ", validation_indices)
    print("TEST: ", test_indices)
TRAIN:  [ 7 76 59 89  1 50 60 17 27 95 87 20 21 79  3 66 39 41  2 73 94 61 72 54
 58 13 30 90 82 15 25 43 12 18 19 65  4 70 56 81 67 78 44 34 32 98 80 84
  8 51 97 92 14 53 74 16 22 91 29 68 48 38 52 35 46 31 57 23 37 71 28 85
 93 36 99 26 47 33 10 69]
VALIDATION:  [83 88 77 11 75 45 63 49  6  0]
TEST:  [64 24 62  5 86  9 42 40 96 55]

The .split() method will return a generator of split indices. Every time you iterate the generator, you get a tuple of split indices (in the case of shuffle_split, there is only one tuple, but other strategies such as k_fold will yield k tuples).
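To make the generator behaviour concrete, here is a minimal pure-Python sketch of a shuffle-style splitter that yields one (train, validation, test) tuple per split, followed by the indices being used to select items from a dataset. This illustrates the idea only and is not the molflux implementation. The function shuffle_split_sketch is a hypothetical name, and letting the test set take whatever remains after the train and validation sets is a simplification of this sketch.

```python
import random


def shuffle_split_sketch(n_samples, n_splits=1, train_fraction=0.8,
                         validation_fraction=0.1, seed=None):
    """Yield (train, validation, test) index lists from random permutations.

    Illustrative only; not the molflux implementation. The test set takes
    whatever remains after the train and validation sets.
    """
    rng = random.Random(seed)
    n_train = round(train_fraction * n_samples)
    n_validation = round(validation_fraction * n_samples)
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        train = indices[:n_train]
        validation = indices[n_train:n_train + n_validation]
        test = indices[n_train + n_validation:]
        yield train, validation, test


# The yielded indices select items from the original dataset.
data = [f"sample_{i}" for i in range(100)]

for train_idx, validation_idx, test_idx in shuffle_split_sketch(len(data), seed=0):
    train_data = [data[i] for i in train_idx]
    validation_data = [data[i] for i in validation_idx]
    test_data = [data[i] for i in test_idx]
    print(len(train_data), len(validation_data), len(test_data))  # 80 10 10
```

Passing n_splits=3 to the sketch yields three such tuples, mirroring how strategies like k_fold yield several folds from a single call.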

Integration with molflux.datasets

You can split your datasets from molflux.datasets using molflux.splits splitting strategies. To learn more, see the molflux.datasets documentation.