Basic usage#
In this section, we will illustrate how to use molflux.splits
. These examples will provide you with a starting
point.
Browsing#
First, let’s have a look at what splitting strategies are available for use. These are conveniently categorised (for example,
into core
, rdkit
, etc.). To view what’s available you can do
from molflux.splits import list_splitting_strategies
catalogue = list_splitting_strategies()
print(catalogue)
{'core': ['group_k_fold', 'group_shuffle_split', 'k_fold', 'leave_one_group_out', 'leave_p_groups_out', 'linear_split', 'linear_split_with_rotation', 'ordered_split', 'shuffle_split', 'stratified_k_fold', 'stratified_ordered_split', 'stratified_shuffle_split', 'time_series_split'], 'openeye': ['scaffold'], 'rdkit': ['scaffold_rdkit', 'tanimoto_rdkit']}
This returns a dictionary of available splitting strategies (organised by category and name
). There are a few to choose from.
By default molflux.splits
will come with core
splitters (such as shuffle_split
and k_fold
). You can get more
splitting strategies by pip installing extra packages (such as rdkit
). To see how you can add your own splitting strategy, see
How to add your own splitting strategy.
Loading splitting strategies#
Loading a molflux.splits
strategy is very easy, simply do
from molflux.splits import load_splitting_strategy
strategy = load_splitting_strategy('shuffle_split')
print(strategy)
SplittingStrategy(
name: "shuffle_split",
tag: "shuffle_split",
signature: self.split(dataset: collections.abc.Sized, y: collections.abc.Iterable | None = None, groups: collections.abc.Iterable | None = None, *, n_splits: int = 1, train_fraction: float = 0.8, validation_fraction: float = 0.1, test_fraction: float = 0.1, seed: int | None = None, **kwargs: Any) -> collections.abc.Iterator[tuple[collections.abc.Iterable[int], collections.abc.Iterable[int], collections.abc.Iterable[int]]],
description: """
Random permutation cross-validator.
""",
usage: """
Args:
dataset: The data to be split.
y (optional): The target variable for supervised learning problems.
groups (optional): Group labels for the samples used while splitting the dataset.
n_splits (optional): The number of splits to generate. Defaults to 1.
train_fraction (optional): The proportion of the dataset to include in the train split.
Defaults to 0.8.
validation_fraction: The proportion of the dataset to include in the validation split.
Defaults to 0.1.
test_fraction: The proportion of the dataset to include in the test split.
Defaults to 0.1.
seed (optional): Controls the shuffling applied to the data before applying
the split. Pass an int for reproducible output across multiple function calls.
Defaults to None.
"""
state: {}
)
By printing the loaded strategy, you get more information about it. Each splitting strategy has a name
, and tag
(to identify it). You can also see the optional splitting arguments (and their default values) in the signature.
There is also a short description of the strategy.
You can also load a splitting strategy from a config. A splitting strategy config is a dictionary specifying the strategy to be loaded. A config dictionary must have the following format
splitting_strategy_dict = {
'name': '<name of the strategy>',
'config': '<kwargs for instantiating strategy>'
'presets': '<kwarg presets for splitting>'
}
The name
keys specify the name
of the splitting strategy to load from the catalogue.
The config
key specifies the arguments that are needed for instantiating the splitting strategy and
the presets
key specifies preset keyword arguments to apply when splitting (for example, the train and test fractions). If neither is
specified, the splitting strategy will use default values. Loading from a config is done using the load_from_dict
function.
from molflux.splits import load_from_dict
config = {
'name': 'shuffle_split',
'presets':
{
'train_fraction': 0.8,
'validation_fraction': 0.0,
'test_fraction': 0.2,
}
}
strategy = load_from_dict(config)
print(strategy.state)
{'train_fraction': 0.8, 'validation_fraction': 0.0, 'test_fraction': 0.2}
For convenience, you can also load a group of strategies all at once by specifying a list of configs.
from molflux.splits import load_from_dicts
config = [
{
'name': 'shuffle_split',
'config':
{
'tag': 'train_test_shuffle',
},
'presets':
{
'train_fraction': 0.8,
'validation_fraction': 0.0,
'test_fraction': 0.2,
}
},
{
'name': 'shuffle_split',
'config':
{
'tag': 'train_val_test_shuffle',
},
'presets':
{
'train_fraction': 0.7,
'validation_fraction': 0.2,
'test_fraction': 0.1,
}
}
]
strategies = load_from_dicts(config)
print(strategies)
{'train_test_shuffle': <molflux.splits.strategies.core.shuffle_split.ShuffleSplit object at 0x7fd3ac9de310>, 'train_val_test_shuffle': <molflux.splits.strategies.core.shuffle_split.ShuffleSplit object at 0x7fd3ac9deb10>}
Finally, you can load strategies from a yaml file. You can use a single yaml file which includes configs for all the molflux
tools
and molflux.splits
will know how to extract the relevant document it needs. To do so, you need to define a yaml file with the
following example document
---
version: v1
kind: splits
specs:
- name: k_fold
presets:
n_splits: 5
...
It consists of a version (this is the version of the config format, for now just v1
), kind
of config (in this case
splits
), and specs
. specs
is where the configs are defined. The yaml file can include
configs for other molflux
modules as well. To load this yaml file, you can simply do
from molflux.splits import load_from_yaml
strategies = load_from_yaml(path_to_yaml_file)
print(strategies)
Splitting#
After loading a splitting strategy, you can apply it to any array-like object to get the split indices.
from molflux.splits import load_splitting_strategy
strategy = load_splitting_strategy('shuffle_split')
folds = strategy.split(range(100))
for train_indices, validation_indices, test_indices in folds:
print(f"TRAIN: ", train_indices)
print(f"VALIDATION: ", validation_indices)
print(f"TEST: ", test_indices)
TRAIN: [ 7 76 59 89 1 50 60 17 27 95 87 20 21 79 3 66 39 41 2 73 94 61 72 54
58 13 30 90 82 15 25 43 12 18 19 65 4 70 56 81 67 78 44 34 32 98 80 84
8 51 97 92 14 53 74 16 22 91 29 68 48 38 52 35 46 31 57 23 37 71 28 85
93 36 99 26 47 33 10 69]
VALIDATION: [83 88 77 11 75 45 63 49 6 0]
TEST: [64 24 62 5 86 9 42 40 96 55]
The .split()
method will return a generator of split indices. Every time you iterate the generator, you get a tuple of
split indices (in the case of shuffle_split
, there is only one tuple, but other strategies such as k_fold
will yield
k
tuples).
Integration with molflux.datasets
#
You can easily split your datasets from molflux.datasets
using molflux.splits
splitting strategies.
To learn more, see here.