Splitting

The molflux modules are built to plug into each other seamlessly. If you would like to split your datasets with splitting strategies from molflux.splits (or any other splitting strategy following the same API), you can do so as follows:

from molflux.datasets import load_dataset, split_dataset

dataset = load_dataset("esol")

# splitting_strategy = <your splitting strategy>

folds = split_dataset(dataset, strategy=splitting_strategy)

This returns a generator of folds. A fold is a datasets.DatasetDict mapping split names (as keys) to datasets.Dataset objects (as values). To generate each fold, iterate through the generator, or pull folds out one at a time with next.
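For example, if you only need a single fold rather than all of them, you can advance the generator manually. A minimal sketch, using the same k_fold strategy as the complete example below:

from molflux.datasets import load_dataset, split_dataset
from molflux.splits import load_splitting_strategy

dataset = load_dataset("esol")
splitting_strategy = load_splitting_strategy("k_fold")

folds = split_dataset(dataset, strategy=splitting_strategy)

# advance the generator to get just the first fold
first_fold = next(folds)
print(first_fold)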

In practice, the following example should get you started:

from molflux.datasets import load_dataset, split_dataset
from molflux.splits import load_splitting_strategy

dataset = load_dataset("esol")
splitting_strategy = load_splitting_strategy("k_fold")

folds = split_dataset(dataset, strategy=splitting_strategy)

for fold in folds:
    # do anything you want!
    print(fold)
DatasetDict({
    train: Dataset({
        features: ['smiles', 'log_solubility'],
        num_rows: 563
    })
    validation: Dataset({
        features: ['smiles', 'log_solubility'],
        num_rows: 563
    })
    test: Dataset({
        features: ['smiles', 'log_solubility'],
        num_rows: 0
    })
})
DatasetDict({
    train: Dataset({
        features: ['smiles', 'log_solubility'],
        num_rows: 563
    })
    validation: Dataset({
        features: ['smiles', 'log_solubility'],
        num_rows: 563
    })
    test: Dataset({
        features: ['smiles', 'log_solubility'],
        num_rows: 0
    })
})
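Each fold behaves like a regular dictionary, so the individual splits can be accessed by name and used directly downstream. A rough sketch, reusing the dataset and splitting_strategy objects from the example above and assuming the standard datasets.Dataset interface for the per-split objects:

for fold in split_dataset(dataset, strategy=splitting_strategy):
    # each split is a datasets.Dataset, accessible by its split name
    train_dataset = fold["train"]
    validation_dataset = fold["validation"]
    print(train_dataset.num_rows, validation_dataset.num_rows)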

See also

There is also a complete workflow example covering how datasets and splitting are integrated: ESOL Training.