Featurisation

The production featurisation metadata is built on top of the standard molflux.features config format. It ensures that the featurisation performed at model training time can be replicated in all downstream environments, for example when serving a model in real time through a REST API or when querying a model for local batch inference.

Featurisation Metadata

The production featurisation metadata schema is a higher-level wrapper around the molflux.features featurisation configs.

Crucially, it defines a schema version. The only valid value at present is 1, but having this key allows the schema to evolve while preserving backwards compatibility. It also supplements the familiar molflux.features configs, listed under the config key, with an as field, which lets you assign custom output names to the features generated by a given representation:

{
    "version": 1,
    "config": [
        {
            "column": "smiles",
            "representations": [
                {
                    "name": "morgan",
                    "as": "ecfp6"
                },
                {
                    "name": "character_count",
                    "as": "{feature_name}",
                    "config": {},
                    "presets": {}
                }
            ]
        }
    ]
}
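To make the structure above concrete, the following minimal sketch parses the example metadata with the standard library, checks the schema version, and lists which output name each representation is assigned. The helper summarise_metadata and the SUPPORTED_VERSIONS constant are hypothetical names introduced for illustration and are not part of molflux; the sketch also makes no claim about how molflux resolves a missing as field or the {feature_name} template.

```python
import json

# Per the text above, 1 is the only valid schema version at present.
SUPPORTED_VERSIONS = {1}

METADATA_JSON = """
{
    "version": 1,
    "config": [
        {
            "column": "smiles",
            "representations": [
                {"name": "morgan", "as": "ecfp6"},
                {
                    "name": "character_count",
                    "as": "{feature_name}",
                    "config": {},
                    "presets": {}
                }
            ]
        }
    ]
}
"""


def summarise_metadata(metadata):
    """Return (column, representation name, 'as' value) triples.

    Hypothetical helper for illustration only, not a molflux function.
    """
    version = metadata.get("version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"Unsupported featurisation metadata version: {version!r}")

    rows = []
    for entry in metadata["config"]:
        for representation in entry["representations"]:
            # Assumption: a missing "as" key is reported as None here; how
            # molflux itself defaults it is not covered by this sketch.
            rows.append(
                (entry["column"], representation["name"], representation.get("as"))
            )
    return rows


metadata = json.loads(METADATA_JSON)
summary = summarise_metadata(metadata)

for column, name, output_name in summary:
    print(f"{column}: {name} -> {output_name}")
```

Checking the version before reading the rest of the document is one way to fail early if a later schema version is encountered by code written against version 1.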

Featurisation API

The productionising featurisation API helps you featurise datasets when training your machine learning models. Featurisation happens on your local machine, so you are responsible for making sure that your environment supports all of the featurisation methods required by your model.
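One way to check this before training is to collect the representation names that a piece of featurisation metadata refers to and compare them with the names you know to be usable in your environment. The sketch below does this with plain Python. The helpers required_representations and missing_representations are hypothetical and not part of molflux, and the list of available names is written by hand here; how you obtain it for your own environment is outside the scope of this sketch.

```python
def required_representations(metadata):
    """Collect the set of representation names referenced by the metadata.

    Hypothetical helper for illustration only, not a molflux function.
    """
    names = set()
    for entry in metadata["config"]:
        for representation in entry["representations"]:
            names.add(representation["name"])
    return names


def missing_representations(metadata, available):
    """Return the sorted names required by the metadata but not in `available`."""
    return sorted(required_representations(metadata) - set(available))


featurisation_metadata = {
    "version": 1,
    "config": [
        {
            "column": "smiles",
            "representations": [
                {"name": "morgan", "as": "ecfp6"},
                {"name": "character_count", "as": "{feature_name}"},
            ],
        }
    ],
}

# Assumption: a hand-written list standing in for whatever your environment supports.
available_in_environment = ["morgan"]

missing = missing_representations(featurisation_metadata, available_in_environment)
if missing:
    print("Not available in this environment:", ", ".join(missing))
```

Running a check like this up front surfaces a missing featurisation method before any expensive dataset processing starts.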

To featurise your dataset from featurisation metadata:

from molflux.core import featurise_dataset

# featurisation_metadata = <your-featurisation-metadata>
# dataset = <dataset-to-featurise>

featurised_dataset = featurise_dataset(dataset, featurisation_metadata=featurisation_metadata)

You can also pass featurise_dataset any of the keyword arguments (map_kwargs) accepted by the HuggingFace map method, for example:

featurised_dataset = featurise_dataset(dataset, featurisation_metadata=featurisation_metadata, num_proc=4, batch_size=100)

Tip

When saving a model, make sure to pass along your featurisation metadata to ensure that the featurisation process can be replayed downstream.

See model saving for more info.

As a user, you can then automatically replay the featurisation process for any given model that has been saved:

from molflux.core import replay_dataset_featurisation

# samples = <my-arrow-like-dataset>
# model_path = <path-to-the-model-of-interest>

featurised_samples = replay_dataset_featurisation(samples, model_path=model_path)

You can pass any map_kwargs here as well!