# Basic usage
In this section, we will quickly illustrate how to use `molflux.datasets`. These examples will provide you with a starting point. Much of the low-level functionality is already documented in the HuggingFace datasets docs. Here, we will go through the basics and the added functionality from `molflux`.
## Browsing
First, we use the `list_datasets` function to browse which datasets are available:
```python
from molflux.datasets import list_datasets

catalogue = list_datasets()
print(catalogue)
```

```
{'core': ['ani1x', 'ani2x', 'esol', 'gdb9', 'pcqm4m_v2', 'spice'], 'tdc': ['tdc_admet_benchmarks']}
```
This returns a dictionary of the available dataset names, grouped by category.
Tip
On top of these drug discovery datasets, you can also access all datasets from the HuggingFace registry (for example, the MNIST dataset). Follow along with the rest of this page with your favourite dataset from there!
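For example, a minimal sketch of loading a dataset from the HuggingFace registry rather than the `molflux` catalogue (assuming the name resolves on the HuggingFace Hub; `mnist` is used purely for illustration):

```python
from molflux.datasets import load_dataset

# 'mnist' is not in the molflux catalogue; it is assumed to be
# resolved from the HuggingFace Hub instead
dataset = load_dataset("mnist")
```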
## Loading datasets
### Loading using `load_dataset`
Loading a dataset is simple. You just need to run `load_dataset` with a given dataset name:
```python
from molflux.datasets import load_dataset

dataset = load_dataset('esol')
print(dataset)
```
```
Dataset({
    features: ['smiles', 'log_solubility'],
    num_rows: 1126
})
```
By printing the loaded dataset, you can see minimal information about it, such as the column names and the number of datapoints.
Tip
You can also see more information about the dataset via its `dataset.info` attribute.
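Since the loaded dataset is a native HuggingFace dataset, `dataset.info` is the standard HuggingFace `DatasetInfo` object. A minimal sketch:

```python
from molflux.datasets import load_dataset

dataset = load_dataset('esol')

# dataset.info is a standard HuggingFace DatasetInfo object
print(dataset.info.description)
print(dataset.info.features)
```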
### Loading using `load_from_dict`
Datasets can also be loaded by specifying a config dictionary. A config dictionary must have the following format:

```python
dataset_config_dict = {
    'name': '<name of the dataset>',
    'config': '<kwargs for instantiating the dataset>'
}
```
The `name` key specifies the name of the dataset to load from the catalogue. The `config` key specifies the keyword arguments needed to instantiate the dataset. The dataset can then be loaded as follows:
```python
from molflux.datasets import load_from_dict

config = {
    'name': 'esol',
}

dataset = load_from_dict(config)
print(dataset)
```
```
Dataset({
    features: ['smiles', 'log_solubility'],
    num_rows: 1126
})
```
### Loading using `load_from_dicts`
For convenience, you can also load a group of datasets all at once by specifying a list of configs.
```python
from molflux.datasets import load_from_dicts

config = [
    {
        'name': 'esol',
    },
    {
        'name': 'esol',
    },
]

datasets = load_from_dicts(config)
print(datasets)
```
```
{'dataset-0': Dataset({
    features: ['smiles', 'log_solubility'],
    num_rows: 1126
}), 'dataset-1': Dataset({
    features: ['smiles', 'log_solubility'],
    num_rows: 1126
})}
```
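As the output shows, the returned object is a plain mapping from auto-generated names to datasets, so you can iterate over it directly. A minimal sketch:

```python
from molflux.datasets import load_from_dicts

datasets = load_from_dicts([{'name': 'esol'}, {'name': 'esol'}])

# iterate over the automatically named datasets
for name, ds in datasets.items():
    print(name, ds.num_rows)
```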
### Loading using `load_from_yaml`
Finally, you can load datasets from a yaml file. You can use a single yaml file which includes configs for all the `molflux` submodules, and `molflux.datasets.load_from_yaml` will know how to extract the relevant part it needs for the datasets.
To do so, you need to define a yaml file with the following example format:

```yaml
---
version: v1
kind: datasets
specs:
  - name: esol
...
```
It consists of a `version` (the version of the config format, for now just `v1`), the `kind` of config (in this case `datasets`), and `specs`. The `specs` section is where the dataset initialisation keyword arguments are defined. The yaml file can include configs for other `molflux` packages as well (see Standard API for more info).
To load this yaml file, you can simply do:

```python
from molflux.datasets import load_from_yaml

datasets = load_from_yaml(path_to_yaml_file)
```
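Putting it together, a minimal sketch that writes the example config above to a temporary file and loads it (assuming `load_from_yaml` accepts a filesystem path, as shown above):

```python
import tempfile

from molflux.datasets import load_from_yaml

yaml_contents = """---
version: v1
kind: datasets
specs:
  - name: esol
...
"""

# write the example config to a temporary file, then load it
with tempfile.NamedTemporaryFile("w", suffix=".yml", delete=False) as f:
    f.write(yaml_contents)
    path_to_yaml_file = f.name

datasets = load_from_yaml(path_to_yaml_file)
print(datasets)
```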
## Working with datasets
`molflux.datasets` was designed to supplement the HuggingFace datasets library, giving you access to our additional catalogue of datasets and to a number of convenient utility functions. The datasets returned by e.g. `molflux.datasets.load_dataset()` are actually native HuggingFace datasets, with all of the associated functionality.
You can find complete documentation on how to work with HuggingFace datasets online, or check out their official training course! The rest of this tutorial shows a few examples of the most basic functionality available.
You can inspect individual datapoints and get the column names:
```python
from molflux.datasets import load_dataset

dataset = load_dataset('esol')

print(dataset[123])
print(dataset.column_names)
```
```
{'smiles': 'Oc1cccc(c1)N(=O)=O', 'log_solubility': -1.01}
['smiles', 'log_solubility']
```
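Because these are native HuggingFace datasets, the standard methods also work. For example, a minimal sketch of selecting rows with the built-in `filter` method:

```python
from molflux.datasets import load_dataset

dataset = load_dataset('esol')

# keep only the datapoints with positive log solubility
soluble = dataset.filter(lambda example: example["log_solubility"] > 0)
print(soluble.num_rows)
```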
Adding columns is easily done:
```python
from molflux.datasets import load_dataset

dataset = load_dataset('esol')

dataset = dataset.add_column("my_new_column", list(range(len(dataset))))
print(dataset)
```
```
Dataset({
    features: ['smiles', 'log_solubility', 'my_new_column'],
    num_rows: 1126
})
```
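You can also derive new columns from existing ones with the standard HuggingFace `map` method. A minimal sketch (the `smiles_length` column is purely illustrative):

```python
from molflux.datasets import load_dataset

dataset = load_dataset('esol')

# compute a new column from the existing smiles column
dataset = dataset.map(lambda example: {"smiles_length": len(example["smiles"])})
print(dataset.column_names)
```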
You can also transform the dataset into a pandas DataFrame:
```python
from molflux.datasets import load_dataset

dataset = load_dataset('esol')
dataset.to_pandas()
```
|      | smiles | log_solubility |
|------|--------|----------------|
| 0    | OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)... | -0.770 |
| 1    | Cc1occc1C(=O)Nc2ccccc2 | -3.300 |
| 2    | CC(C)=CCCC(C)=CC(=O) | -2.060 |
| 3    | c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43 | -7.870 |
| 4    | c1ccsc1 | -1.330 |
| ...  | ... | ... |
| 1121 | FC(F)(F)C(Cl)Br | -1.710 |
| 1122 | CNC(=O)ON=C(SC)C(=O)N(C)C | 0.106 |
| 1123 | CCSCCSP(=S)(OC)OC | -3.091 |
| 1124 | CCC(C)C | -3.180 |
| 1125 | COP(=O)(OC)OC(=CCl)c1cc(Cl)c(Cl)cc1Cl | -4.522 |

1126 rows × 2 columns
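The conversion also works in the other direction, using the standard HuggingFace constructor. A minimal sketch:

```python
import datasets

from molflux.datasets import load_dataset

dataset = load_dataset('esol')
df = dataset.to_pandas()

# round-trip: rebuild a HuggingFace Dataset from the DataFrame
dataset_again = datasets.Dataset.from_pandas(df)
print(dataset_again)
```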
You can also save datasets to, and load them from, disk or the cloud (s3):
```python
from molflux.datasets import load_dataset, load_dataset_from_store, save_dataset_to_store

dataset = load_dataset('esol')

save_dataset_to_store(dataset, "my/data/dataset.parquet")

dataset = load_dataset_from_store("my/data/dataset.parquet")
```
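For cloud storage, the same functions are used with an s3 location instead of a local path. A minimal sketch (the bucket name is hypothetical, and your environment is assumed to be configured with the appropriate AWS credentials):

```python
from molflux.datasets import load_dataset, load_dataset_from_store, save_dataset_to_store

dataset = load_dataset('esol')

# hypothetical s3 bucket; assumes AWS credentials are configured
save_dataset_to_store(dataset, "s3://my-bucket/my/data/dataset.parquet")
dataset = load_dataset_from_store("s3://my-bucket/my/data/dataset.parquet")
```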
See also
For more information on how to save, load (from disk), featurise, and split datasets, see these guides: