Featurising#
The molflux
modules are built to plug into each other seamlessly. If you would like to featurise your datasets
with molflux.features
representations (or any other representation following the same API), you can easily do so as follows:
from molflux.datasets import load_dataset, featurise_dataset
dataset = load_dataset('esol')
# representations = <your molflux.features representations>
featurised_dataset = featurise_dataset.featurise_dataset(
dataset=dataset,
column="<column to be featurised>",
representations=representations
)
This returns a new datasets with the required features as new columns (if you use multiple representations all at once using
the load_from_dicts
method of molflux.features
, then they will each create a new column with their computed features).
Under the hood, this is done using the map
functionality of HuggingFace datasets. You can pass some kwargs
to control
the featurisation. The full set of kwargs
can be found here
but the most useful ones are
batch_size Optional[int] = 1000
: the size of the batches.num_proc: Optional[int] = None
: maximum number of processes for featurisation.
See also
There is also a complete workflow example that also covers how datasets and featurization are integrated: ESOL Training.
Tweaking the featurised columns’ names#
By default, the featurised columns names will encode information both about the feature name and the name of the source column that was featurised. This allows you to keep track of how your dataset columns have been featurised over time, and provides uniquely identifiable column names even for columns featurised by the same representations.
If needed, you can also assign custom display names through the display_names
argument, which should be a nested list
of display names for each representation that you are applying:
from molflux.datasets import load_dataset, featurise_dataset
from molflux.features import load_from_dicts as load_reps_from_dicts
dataset = load_dataset('esol')
representations = load_reps_from_dicts(
[
{"name": "morgan"},
{"name": "character_count"},
{"name": "maccs_rdkit"},
]
)
display_names = [["my_morgan_fingerprint"], ["my_character_count"], [None]]
featurised_dataset = featurise_dataset(
dataset=dataset,
column="smiles",
representations=representations,
display_names=display_names
)
print(featurised_dataset.column_names)
Repo card metadata block was not found. Setting CardData to empty.
['smiles', 'log_solubility', 'my_morgan_fingerprint', 'my_character_count', 'smiles::maccs_rdkit']
where None
can be used as a placeholder for features for which you don’t need to set a custom display name (a
custom naming template will be applied).
The display_names
argument can also accept a templated string that will be dynamically injected with context available
at runtime. This is useful if you would like the datasets to be featurised according to a specific formatting convention:
display_names = "{source_column}>>{feature_name}"
featurised_dataset = featurise_dataset(
dataset=dataset,
column="smiles",
representations=representations,
display_names=display_names
)
print(featurised_dataset.column_names)
['smiles', 'log_solubility', 'smiles>>morgan', 'smiles>>character_count', 'smiles>>maccs_rdkit']
For the time being, source_column
and feature_name
are the only keys that can be requested from the context.
You can also mix and match the features shown above:
display_names = [["my_circular_fingerprint"], ["{source_column}>>{feature_name}"], [None]]
featurised_dataset = featurise_dataset(
dataset=dataset,
column="smiles",
representations=representations,
display_names=display_names
)
print(featurised_dataset.column_names)
['smiles', 'log_solubility', 'my_circular_fingerprint', 'smiles>>character_count', 'smiles::maccs_rdkit']