Basic usage#

In this section, we will quickly illustrate how to use molflux.datasets. These examples will give you a starting point. Much of the low-level functionality is already documented in the HuggingFace datasets docs. Here, we go through the basics and the functionality added by molflux.

Browsing#

First, we use the list_datasets function to browse what datasets are available.

from molflux.datasets import list_datasets

catalogue = list_datasets()

print(catalogue)
{'core': ['ani1x', 'ani2x', 'esol', 'gdb9', 'pcqm4m_v2', 'spice'], 'tdc': ['tdc_admet_benchmarks']}

This returns a dictionary of available dataset names, grouped by catalogue.
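Since the catalogue is a plain dictionary of lists, you can work with it directly. A minimal sketch, with the values copied from the sample output above:

```python
# The catalogue maps sub-catalogue names to lists of dataset names
# (values copied from the sample output above).
catalogue = {
    "core": ["ani1x", "ani2x", "esol", "gdb9", "pcqm4m_v2", "spice"],
    "tdc": ["tdc_admet_benchmarks"],
}

# Flatten into a single sorted list of all dataset names.
all_names = sorted(name for names in catalogue.values() for name in names)
print(all_names)
```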

Tip

On top of these drug discovery datasets, you can also access all datasets from the HuggingFace registry (for example, the MNIST dataset). Follow along with the rest of this page with your favourite dataset from there!

Loading datasets#

Loading using load_dataset#

Loading a dataset is simple. You just need to run load_dataset with a given dataset name:

from molflux.datasets import load_dataset

dataset = load_dataset('esol')
print(dataset)
Dataset({
    features: ['smiles', 'log_solubility'],
    num_rows: 1126
})

By printing the loaded dataset, you can see minimal information about it, such as the column names and the number of datapoints.

Tip

You can also see more information about the dataset via its dataset.info attribute.

Loading using load_from_dict#

Datasets can also be loaded by specifying a config dictionary. A config dictionary must have the following format:

dataset_config_dict = {
    'name': '<name of the dataset>',
    'config': '<kwargs for instantiating dataset>'
}

The name key specifies the name of the dataset to load from the catalogue, and the config key specifies the keyword arguments needed to instantiate it. The dataset can then be loaded as follows:

from molflux.datasets import load_from_dict

config = {
    'name': 'esol',
}

dataset = load_from_dict(config)
dataset
Dataset({
    features: ['smiles', 'log_solubility'],
    num_rows: 1126
})
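Conceptually, a config dict of this shape maps onto a load_dataset call. The dispatch can be sketched in pure Python (a hypothetical illustration, not molflux internals):

```python
# Hypothetical sketch of how a config dict maps onto a loader call;
# not the actual molflux implementation.
def load_from_dict_sketch(config, loader):
    name = config["name"]              # required: dataset name in the catalogue
    kwargs = config.get("config", {})  # optional: instantiation kwargs
    return loader(name, **kwargs)

# Usage with a stand-in loader that just echoes its arguments:
result = load_from_dict_sketch({"name": "esol"}, loader=lambda name, **kw: (name, kw))
print(result)  # ('esol', {})
```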

Loading using load_from_dicts#

For convenience, you can also load a group of datasets all at once by specifying a list of configs.

from molflux.datasets import load_from_dicts

config = [
    {
        'name': 'esol',
    },
    {
        'name': 'esol',
    }
]

datasets = load_from_dicts(config)
print(datasets)
{'dataset-0': Dataset({
    features: ['smiles', 'log_solubility'],
    num_rows: 1126
}), 'dataset-1': Dataset({
    features: ['smiles', 'log_solubility'],
    num_rows: 1126
})}
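Note that the returned dictionary keys each dataset as dataset-0, dataset-1, and so on, in config order. That keying convention can be sketched as follows (a hypothetical illustration, not molflux internals):

```python
# Hypothetical sketch of the 'dataset-<index>' keying seen in the output above.
configs = [{"name": "esol"}, {"name": "esol"}]
named = {f"dataset-{i}": cfg["name"] for i, cfg in enumerate(configs)}
print(named)  # {'dataset-0': 'esol', 'dataset-1': 'esol'}
```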

Loading using load_from_yaml#

Finally, you can load datasets from a YAML file. A single YAML file can contain configs for all of the molflux submodules, and molflux.datasets.load_from_yaml will extract the part relevant to datasets. To do so, define a YAML file with the following example format:

---
version: v1
kind: datasets
specs:
    - name: esol
...

It consists of a version (the version of the config format, for now just v1), a kind of config (in this case datasets), and specs, where the dataset initialisation keyword arguments are defined. The YAML file can include configs for other molflux packages as well (see Standard API for more info). To load this YAML file, you can simply do:

from molflux.datasets import load_from_yaml

datasets = load_from_yaml(path_to_yaml_file)
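Conceptually, the relevant filtering step picks out documents of kind datasets and collects their specs. Here is a pure-Python sketch operating on already-parsed documents (hypothetical logic, not molflux internals; the models document is made up for illustration):

```python
# Each parsed YAML document is a dict with 'version', 'kind', and 'specs'.
documents = [
    {"version": "v1", "kind": "datasets", "specs": [{"name": "esol"}]},
    {"version": "v1", "kind": "models", "specs": [{"name": "some_model"}]},  # hypothetical
]

# Keep only the specs from documents of kind 'datasets'.
dataset_specs = [
    spec
    for doc in documents
    if doc.get("kind") == "datasets"
    for spec in doc.get("specs", [])
]
print(dataset_specs)  # [{'name': 'esol'}]
```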

Working with datasets#

molflux.datasets was designed to supplement the HuggingFace datasets library, giving you access to our additional catalogue of datasets and to a number of convenient utility functions. The datasets returned by e.g. molflux.datasets.load_dataset() are actually native HuggingFace datasets, with all of the associated functionality.

You can find complete documentation on how to work with HuggingFace datasets online, or check out their official training course! The rest of this tutorial will show a couple of examples of some of the most basic functionalities available.

You can inspect individual datapoints and get the column names:

from molflux.datasets import load_dataset

dataset = load_dataset('esol')

print(dataset[123])
print(dataset.column_names)
{'smiles': 'Oc1cccc(c1)N(=O)=O', 'log_solubility': -1.01}
['smiles', 'log_solubility']

Adding a column is straightforward:

from molflux.datasets import load_dataset

dataset = load_dataset('esol')

dataset = dataset.add_column("my_new_column", list(range(len(dataset))))

print(dataset)
Dataset({
    features: ['smiles', 'log_solubility', 'my_new_column'],
    num_rows: 1126
})

You can also transform the dataset into a pandas DataFrame:

from molflux.datasets import load_dataset

dataset = load_dataset('esol')

dataset.to_pandas()
smiles log_solubility
0 OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)... -0.770
1 Cc1occc1C(=O)Nc2ccccc2 -3.300
2 CC(C)=CCCC(C)=CC(=O) -2.060
3 c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43 -7.870
4 c1ccsc1 -1.330
... ... ...
1121 FC(F)(F)C(Cl)Br -1.710
1122 CNC(=O)ON=C(SC)C(=O)N(C)C 0.106
1123 CCSCCSP(=S)(OC)OC -3.091
1124 CCC(C)C -3.180
1125 COP(=O)(OC)OC(=CCl)c1cc(Cl)c(Cl)cc1Cl -4.522

1126 rows × 2 columns

You can also save datasets to disk or to the cloud (S3), and load them back:

from molflux.datasets import load_dataset, load_dataset_from_store, save_dataset_to_store

dataset = load_dataset('esol')

save_dataset_to_store(dataset, "my/data/dataset.parquet")

dataset = load_dataset_from_store("my/data/dataset.parquet")

See also

For more information on how to save, load (from disk), featurise, and split datasets, see these guides: