Featurisation#
The production featurisation metadata is built on top of the molflux.features module's standard config format. This ensures that the featurisation process performed at model training time can be replicated across all downstream environments, for example when performing real-time model serving through a REST API or when querying models for local batch inference.
Featurisation Metadata#
The production featurisation metadata schema is a higher-level wrapper around the molflux.features featurisation configs. Crucially, it defines a schema version (the only valid value at present is 1, but having this key allows the schema to evolve while still preserving backwards compatibility), and supplements the familiar molflux.features configs listed under the config key with the as field, which lets you assign custom output names to the features generated by a given representation:
{
    "version": 1,
    "config": [
        {
            "column": "smiles",
            "representations": [
                {
                    "name": "morgan",
                    "as": "ecfp6"
                },
                {
                    "name": "character_count",
                    "as": "{feature_name}",
                    "config": {},
                    "presets": {}
                }
            ]
        }
    ]
}
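As a rough illustration, the same metadata can be assembled as a plain Python dictionary and sanity-checked before it is used. The check_featurisation_metadata helper below is hypothetical (it is not part of molflux) and only inspects the keys described in this section; any validation that molflux itself performs may well be stricter.

```python
import json
from typing import Any

SUPPORTED_VERSIONS = {1}  # the only valid schema version at present


def check_featurisation_metadata(metadata: dict[str, Any]) -> None:
    """Hypothetical helper: raise ValueError if the metadata looks malformed."""
    if metadata.get("version") not in SUPPORTED_VERSIONS:
        raise ValueError(f"Unsupported schema version: {metadata.get('version')!r}")
    config = metadata.get("config")
    if not isinstance(config, list):
        raise ValueError("'config' must be a list of column entries")
    for entry in config:
        if "column" not in entry:
            raise ValueError("each config entry needs a 'column' key")
        for representation in entry.get("representations", []):
            if "name" not in representation:
                raise ValueError("each representation needs a 'name' key")


featurisation_metadata = {
    "version": 1,
    "config": [
        {
            "column": "smiles",
            "representations": [
                # "as" assigns a custom output name to the generated features
                {"name": "morgan", "as": "ecfp6"},
                {
                    "name": "character_count",
                    "as": "{feature_name}",
                    "config": {},
                    "presets": {},
                },
            ],
        }
    ],
}

check_featurisation_metadata(featurisation_metadata)

# The dictionary is plain JSON-compatible data, so it serialises directly.
serialised = json.dumps(featurisation_metadata, indent=2)
```

Keeping the metadata as plain JSON-compatible data means it can be stored next to a model and read back unchanged in any downstream environment.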
Featurisation API#
The productionising featurisation API helps you handle dataset featurisation when training your machine learning models.
Featurisation happens on your local machine. This means that you are in charge of making sure that your environment
is able to support all of the featurisation methods required by your model.
To featurise your dataset from featurisation metadata:
from molflux.core import featurise_dataset
# featurisation_metadata = <your-featurisation-metadata>
# dataset = <dataset-to-featurise>
featurised_dataset = featurise_dataset(dataset, featurisation_metadata=featurisation_metadata)
You can also pass featurise_dataset any of the map_kwargs accepted by the HuggingFace map method, for example:
featurised_dataset = featurise_dataset(dataset, featurisation_metadata=featurisation_metadata, num_proc=4, batch_size=100)
Tip
When saving a model, make sure to pass along your featurisation metadata to ensure that the featurisation process can be replayed downstream.
See model saving for more info.
As a user, you can then automatically replay the featurisation process for any given model that has been saved:
from molflux.core import replay_dataset_featurisation
# samples = <my-arrow-like-dataset>
# model_path = <path-to-the-model-of-interest>
featurised_samples = replay_dataset_featurisation(samples, model_path=model_path)
You can pass any map_kwargs here as well!
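Replay is possible because the featurisation metadata travels with the saved model. The sketch below illustrates that idea using only the standard library. The file name featurisation_metadata.json and both helper functions are assumptions made for this example; they do not describe how molflux actually lays out a saved model on disk.

```python
import json
import tempfile
from pathlib import Path

METADATA_FILENAME = "featurisation_metadata.json"  # hypothetical file name


def save_metadata(model_dir: Path, metadata: dict) -> Path:
    """Hypothetical helper: write the metadata next to the model artefacts."""
    model_dir.mkdir(parents=True, exist_ok=True)
    path = model_dir / METADATA_FILENAME
    path.write_text(json.dumps(metadata, indent=2))
    return path


def load_metadata(model_dir: Path) -> dict:
    """Hypothetical helper: read the metadata back from a model directory."""
    return json.loads((model_dir / METADATA_FILENAME).read_text())


metadata = {
    "version": 1,
    "config": [
        {
            "column": "smiles",
            "representations": [{"name": "morgan", "as": "ecfp6"}],
        }
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "my_model"
    save_metadata(model_dir, metadata)
    restored = load_metadata(model_dir)

# The restored metadata matches what was used at training time, so the same
# featurisation steps can be applied to new samples downstream.
round_trip_ok = restored == metadata
```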