Basic usage#

In this section, we will illustrate how to use molflux.modelzoo. These examples will provide you with a starting point.

Browsing#

First, we review which model architectures are available for use. To view what’s available, you can do:

from molflux.modelzoo import list_models

catalogue = list_models()

print(catalogue)
{'catboost': ['cat_boost_classifier', 'cat_boost_regressor'], 'core': ['average_features_regressor', 'average_regressor'], 'ensemble': ['ensemble_classifier', 'ensemble_regressor'], 'fortuna': ['fortuna_mlp_regressor'], 'lightning': ['lightning_mlp_regressor'], 'lightning_gp': ['lightning_gp_regressor'], 'mapie': ['mapie_regressor'], 'pyod': ['abod_detector', 'cblof_detector', 'hbos_detector', 'isolation_forest_detector', 'knn_detector', 'mcd_detector', 'ocsvm_detector', 'pca_detector'], 'pystan': ['ordinal_classifier', 'sparse_linear_regressor'], 'sklearn': ['bernoulli_nb_classifier', 'corrected_nb_classifier', 'coverage_nb_classifier', 'dummy_classifier', 'extra_trees_classifier', 'extra_trees_regressor', 'gradient_boosting_classifier', 'gradient_boosting_regressor', 'kernel_ridge_regressor', 'knn_classifier', 'knn_regressor', 'linear_discriminant_analysis_classifier', 'linear_regressor', 'logistic_regressor', 'mlp_classifier', 'mlp_regressor', 'pipeline_pilot_nb_classifier', 'pls_regressor', 'random_forest_classifier', 'random_forest_regressor', 'ridge_regressor', 'sklearn_pipeline_classifier', 'sklearn_pipeline_regressor', 'support_vector_classifier', 'support_vector_regressor'], 'xgboost': ['xg_boost_classifier', 'xg_boost_regressor']}

This returns our catalogue of available model architectures (organised by the dependencies they rely on). There are a few to choose from.

See also

How to add your own model if you would like to add your own model to the catalogue

For instance, the dictionary returned by molflux.modelzoo.list_models() contains the item 'xgboost': ['xg_boost_classifier', 'xg_boost_regressor']. To use the two models xg_boost_classifier and xg_boost_regressor, install the corresponding optional dependencies: pip install molflux[xgboost].
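If you know the name of a model but not which dependency group it belongs to, you can search the catalogue dictionary yourself. The following is a minimal plain-Python sketch: the helper find_group is hypothetical (it is not part of molflux), and the catalogue is a hand-copied excerpt of the output above so that the snippet runs without molflux installed.

```python
from typing import Optional

# Excerpt of the catalogue printed above. In a real session you would
# obtain it with `from molflux.modelzoo import list_models; list_models()`.
catalogue = {
    "core": ["average_features_regressor", "average_regressor"],
    "sklearn": ["random_forest_classifier", "random_forest_regressor"],
    "xgboost": ["xg_boost_classifier", "xg_boost_regressor"],
}


def find_group(catalogue: dict, model_name: str) -> Optional[str]:
    """Return the catalogue key whose list contains `model_name` (hypothetical helper)."""
    for group, model_names in catalogue.items():
        if model_name in model_names:
            return group
    return None


group = find_group(catalogue, "xg_boost_regressor")
print(group)  # xgboost

# Assumption: the extra to install is named after the catalogue key,
# as in the xgboost case described above.
print(f"pip install molflux[{group}]")
```

Whether every catalogue key maps to an installable extra of the same name is an assumption here; only the xgboost case is confirmed above.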

Loading a model architecture#

Loading a model architecture of your choice is simple. For example, to load a random_forest_regressor from the catalogue:

from molflux.modelzoo import load_model

model = load_model(name="random_forest_regressor")

print(model)
Model(
	name: "random_forest_regressor",
	tag: "random_forest_regressor",
	description: """
This is an sklearn random forest regressor model.

A random forest is a meta estimator that fits a number of classifying
decision trees on various sub-samples of the dataset and uses averaging
to improve the predictive accuracy and control over-fitting.
The sub-sample size is controlled with the `max_samples` parameter if
`bootstrap=True` (default), otherwise the whole dataset is used to build
each tree.
""",
	config signature: __init__(self, x_features: list[str] = <factory>, y_features: list[str] = <factory>, train_features: list[str] | dict[str, list[str]] | None = None, n_estimators: int = 100, criterion: Literal['squared_error', 'absolute_error', 'friedman_mse', 'poisson'] = 'squared_error', max_depth: int | None = None, min_samples_split: int | float = 2, min_samples_leaf: int | float = 1, min_weight_fraction_leaf: float = 0.0, max_features: Union[float, Literal['sqrt', 'log2'], NoneType] = 1.0, max_leaf_nodes: int | None = None, min_impurity_decrease: float = 0.0, bootstrap: bool = True, oob_score: bool = False, n_jobs: int | None = None, random_state: int | numpy.random.mtrand.RandomState | None = None, verbose: int = 0, warm_start: bool = False, ccp_alpha: float = 0.0, max_samples: int | float | None = None) -> None,
	config: """
Parameters
----------
n_estimators : int, default=100
    The number of trees in the forest.
criterion : {"squared_error", "absolute_error", "friedman_mse", "poisson"},         default="squared_error"
    The function to measure the quality of a split. Supported criteria
    are "squared_error" for the mean squared error, which is equal to
    variance reduction as feature selection criterion and minimizes the L2
    loss using the mean of each terminal node, "friedman_mse", which uses
    mean squared error with Friedman's improvement score for potential
    splits, "absolute_error" for the mean absolute error, which minimizes
    the L1 loss using the median of each terminal node, and "poisson" which
    uses reduction in Poisson deviance to find splits.
    Training using "absolute_error" is significantly slower
    than when using "squared_error".
max_depth : int, default=None
    The maximum depth of the tree. If None, then nodes are expanded until
    all leaves are pure or until all leaves contain less than
    min_samples_split samples.
min_samples_split : int or float, default=2
    The minimum number of samples required to split an internal node:
    - If int, then consider `min_samples_split` as the minimum number.
    - If float, then `min_samples_split` is a fraction and
      `ceil(min_samples_split * n_samples)` are the minimum
      number of samples for each split.
min_samples_leaf : int or float, default=1
    The minimum number of samples required to be at a leaf node.
    A split point at any depth will only be considered if it leaves at
    least ``min_samples_leaf`` training samples in each of the left and
    right branches.  This may have the effect of smoothing the model,
    especially in regression.
    - If int, then consider `min_samples_leaf` as the minimum number.
    - If float, then `min_samples_leaf` is a fraction and
      `ceil(min_samples_leaf * n_samples)` are the minimum
      number of samples for each node.
min_weight_fraction_leaf : float, default=0.0
    The minimum weighted fraction of the sum total of weights (of all
    the input samples) required to be at a leaf node. Samples have
    equal weight when sample_weight is not provided.
max_features : {"sqrt", "log2", None}, int or float, default=1.0
    The number of features to consider when looking for the best split:
    - If int, then consider `max_features` features at each split.
    - If float, then `max_features` is a fraction and
      `round(max_features * n_features)` features are considered at each
      split.
    - If "auto", then `max_features=n_features`.
    - If "sqrt", then `max_features=sqrt(n_features)`.
    - If "log2", then `max_features=log2(n_features)`.
    - If None or 1.0, then `max_features=n_features`.
    Note: the search for a split does not stop until at least one
    valid partition of the node samples is found, even if it requires to
    effectively inspect more than ``max_features`` features.
max_leaf_nodes : int, default=None
    Grow trees with ``max_leaf_nodes`` in best-first fashion.
    Best nodes are defined as relative reduction in impurity.
    If None then unlimited number of leaf nodes.
min_impurity_decrease : float, default=0.0
    A node will be split if this split induces a decrease of the impurity
    greater than or equal to this value.
    The weighted impurity decrease equation is the following::
        N_t / N * (impurity - N_t_R / N_t * right_impurity
                            - N_t_L / N_t * left_impurity)
    where ``N`` is the total number of samples, ``N_t`` is the number of
    samples at the current node, ``N_t_L`` is the number of samples in the
    left child, and ``N_t_R`` is the number of samples in the right child.
    ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum,
    if ``sample_weight`` is passed.
bootstrap : bool, default=True
    Whether bootstrap samples are used when building trees. If False, the
    whole dataset is used to build each tree.
oob_score : bool, default=False
    Whether to use out-of-bag samples to estimate the generalization score.
    Only available if bootstrap=True.
n_jobs : int, default=None
    The number of jobs to run in parallel. :meth:`fit`, :meth:`predict`,
    :meth:`decision_path` and :meth:`apply` are all parallelized over the
    trees. ``None`` means 1 unless in a :obj:`joblib.parallel_backend`
    context. ``-1`` means using all processors.
random_state : int, RandomState instance or None, default=None
    Controls both the randomness of the bootstrapping of the samples used
    when building trees (if ``bootstrap=True``) and the sampling of the
    features to consider when looking for the best split at each node
    (if ``max_features < n_features``).
verbose : int, default=0
    Controls the verbosity when fitting and predicting.
warm_start : bool, default=False
    When set to ``True``, reuse the solution of the previous call to fit
    and add more estimators to the ensemble, otherwise, just fit a whole
    new forest.
ccp_alpha : non-negative float, default=0.0
    Complexity parameter used for Minimal Cost-Complexity Pruning. The
    subtree with the largest cost complexity that is smaller than
    ``ccp_alpha`` will be chosen. By default, no pruning is performed.
max_samples : int or float, default=None
    If bootstrap is True, the number of samples to draw from X
    to train each base estimator.
    - If None (default), then draw `X.shape[0]` samples.
    - If int, then draw `max_samples` samples.
    - If float, then draw `max_samples * X.shape[0]` samples. Thus,
      `max_samples` should be in the interval `(0.0, 1.0]`.
""",
	train signature: self.train(train_data: datasets.arrow_dataset.Dataset, **kwargs: Any) -> Any,
	predict signature: self.predict(data: datasets.arrow_dataset.Dataset, **kwargs: Any) -> dict[str, list[typing.Any]],
)

By printing the loaded model architecture, you get more information about it. Each model has a name, a tag (to uniquely identify it in case you would like to generate multiple copies of the same model but with different configurations), and a set of architecture-specific configuration parameters. You should also be able to view a short description of the model, and get some extra information about the model’s method signatures.

To load a model with non-default configuration parameters, you can simply supply them at load time:

from molflux.modelzoo import load_model

model = load_model(
  name="random_forest_regressor",
  tag="my_rf",
  x_features=["x1", "x2"],
  y_features=["y"],
  n_estimators=50
)

# double check your model's architecture configuration
print(model.config)
{'x_features': ['x1', 'x2'], 'y_features': ['y'], 'train_features': None, 'n_estimators': 50, 'criterion': 'squared_error', 'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 1, 'min_weight_fraction_leaf': 0.0, 'max_features': 1.0, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'bootstrap': True, 'oob_score': False, 'n_jobs': None, 'random_state': None, 'verbose': 0, 'warm_start': False, 'ccp_alpha': 0.0, 'max_samples': None}

With time, you may want to load model architectures using a config-driven approach. To do this, molflux.modelzoo supports loading model architectures from dictionaries specifying the model architecture to be loaded and its configuration parameters:


from molflux.modelzoo import load_from_dict

config = {
    "name": "random_forest_regressor",
    "config": {
        "tag": "my_rf",
        "x_features": ["x1", "x2"],
        "y_features": ["y"],
        "n_estimators": 50,
    },
}

model = load_from_dict(config)

The name key specifies the name of the model architecture to load from the catalogue. The config key should hold the dictionary of configuration arguments to initialise the model with (if not specified, the model will use default values).
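One way to take the config-driven approach a step further is to keep the dictionary in a file rather than in code. The sketch below uses only the standard library to write such a dictionary to JSON and read it back; the file name model_config.json is an arbitrary choice for illustration, and the final call to load_from_dict is left as a comment so that the snippet runs without molflux installed.

```python
import json
import tempfile
from pathlib import Path

config = {
    "name": "random_forest_regressor",
    "config": {
        "tag": "my_rf",
        "x_features": ["x1", "x2"],
        "y_features": ["y"],
        "n_estimators": 50,
    },
}

with tempfile.TemporaryDirectory() as tmp_dir:
    # model_config.json is a hypothetical file name used for illustration
    config_path = Path(tmp_dir) / "model_config.json"
    config_path.write_text(json.dumps(config, indent=2))

    # Later (or in another script): read the dictionary back from disk
    loaded_config = json.loads(config_path.read_text())

# The round trip preserves the dictionary exactly
print(loaded_config == config)  # True

# The loaded dictionary can then be passed on unchanged:
# from molflux.modelzoo import load_from_dict
# model = load_from_dict(loaded_config)
```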

You can also load multiple models at once using a list of config dictionaries. This is done as follows:

from molflux.modelzoo import load_from_dicts

list_of_configs = [
    {
        "name": "random_forest_regressor",
        "config": {
            "tag": "my_rf_1",
            "x_features": ["x1", "x2"],
            "y_features": ["y"],
            "n_estimators": 500,
            "max_depth": 10,
        },
    },
    {
        "name": "random_forest_classifier",
        "config": {
            "tag": "my_rf_2",
            "x_features": ["x1", "x2"],
            "y_features": ["y"],
            "n_estimators": 500,
            "max_depth": 10,
        },
    },
]

models = load_from_dicts(list_of_configs)

print(models)
{'my_rf_1': <molflux.modelzoo.models.sklearn.random_forest_regressor.RandomForestRegressor object at 0x7f65eb98edd0>, 'my_rf_2': <molflux.modelzoo.models.sklearn.random_forest_classifier.RandomForestClassifier object at 0x7f65eb920590>}
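Because the models come back keyed by their tag, a list of configs is a convenient way to describe several variants of the same architecture. The following sketch builds such a list programmatically in plain Python; make_rf_configs and the rf_<n> tag pattern are hypothetical choices for illustration, not part of molflux.

```python
def make_rf_configs(n_estimators_values, x_features, y_features):
    """Build one config dictionary per value, each with a unique tag (hypothetical helper)."""
    configs = []
    for n_estimators in n_estimators_values:
        configs.append(
            {
                "name": "random_forest_regressor",
                "config": {
                    # the tag doubles as the key in the returned dictionary of models
                    "tag": f"rf_{n_estimators}",
                    "x_features": list(x_features),
                    "y_features": list(y_features),
                    "n_estimators": n_estimators,
                },
            }
        )
    return configs


list_of_configs = make_rf_configs([50, 100, 500], ["x1", "x2"], ["y"])
tags = [spec["config"]["tag"] for spec in list_of_configs]
print(tags)  # ['rf_50', 'rf_100', 'rf_500']

# Tags should be unique, since the loaded models are keyed by tag
assert len(tags) == len(set(tags))

# The list can then be handed to molflux:
# from molflux.modelzoo import load_from_dicts
# models = load_from_dicts(list_of_configs)
```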

Finally, you can load models from a yaml file. You can use a single yaml file which includes configs for all the molflux tools, and molflux.modelzoo will know how to extract the relevant part it needs. To do so, you need to define a yaml file with the following example format:

---
version: v1
kind: models
specs:
    - name: random_forest_regressor
      config:
        tag: my_rf_1
        x_features:
            - x1
            - x2
        y_features:
            - y1
        n_estimators: 500
    - name: random_forest_classifier
      config:
        tag: my_rf_2
        x_features:
            - x1
            - x2
        y_features:
            - y1
        n_estimators: 300
...

It consists of a version (the version of the config format, for now just v1), a kind (the kind of config, in this case models), and specs, which is where the configs are defined. The yaml file can include configs for other molflux modules as well. To load the models from the yaml file, you can simply do:

from molflux.modelzoo import load_from_yaml

# path_to_yaml_file is a placeholder for the location of your yaml file
models = load_from_yaml(path_to_yaml_file)

print(models)
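As a mental model for the yaml format, note that each entry under specs has the same name/config shape as the dictionaries used with load_from_dicts. The sketch below writes out by hand what a YAML parser would return for the example file and filters for the models kind. It is only an illustration under stated assumptions: model_specs is a hypothetical helper, some_other_kind is a made-up kind, and how molflux performs the extraction internally is not shown here.

```python
# Hand-written equivalent of what a YAML parser would return for the example
# file above, so that this sketch has no third-party dependencies.
models_document = {
    "version": "v1",
    "kind": "models",
    "specs": [
        {
            "name": "random_forest_regressor",
            "config": {
                "tag": "my_rf_1",
                "x_features": ["x1", "x2"],
                "y_features": ["y1"],
                "n_estimators": 500,
            },
        },
        {
            "name": "random_forest_classifier",
            "config": {
                "tag": "my_rf_2",
                "x_features": ["x1", "x2"],
                "y_features": ["y1"],
                "n_estimators": 300,
            },
        },
    ],
}

# Hypothetical document of another kind, included only to show the filtering
other_document = {"version": "v1", "kind": "some_other_kind", "specs": []}


def model_specs(documents):
    """Collect the specs of every document whose kind is `models` (hypothetical helper)."""
    specs = []
    for document in documents:
        if document.get("kind") == "models":
            specs.extend(document.get("specs", []))
    return specs


specs = model_specs([other_document, models_document])
print([spec["config"]["tag"] for spec in specs])  # ['my_rf_1', 'my_rf_2']
```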

Training/Inferencing a model#

All models in molflux.modelzoo have train and predict methods. These are the main two methods you need to interact with.

Training#

After loading a model architecture, you can train it on a dataset using the model’s train() method, to which you should feed your training dataset and optional training arguments (if any are specified by the model architecture of your choice).

Note

Our models’ interfaces accept dataframe-like objects that implement the Dataframe Interchange Protocol as input data: these include pandas dataframes, pyarrow tables, vaex dataframes, cudf dataframes, and many other popular dataframe libraries. We also support HuggingFace datasets as inputs for seamless integration with our datasets users. If you are used to working with other in-memory data representations, you will need to convert them before feeding them to our models. Please contact us if you need support with your workflows.

For example, we can train our random_forest_regressor as follows:

import datasets
from molflux.modelzoo import load_model

model = load_model(
  name="random_forest_regressor",
  x_features=["x1", "x2"],
  y_features=["y"],
  n_estimators=50
)

train_data = datasets.Dataset.from_dict(
    {
        "x1": [0, 1, 2, 3, 4, 5],
        "x2": [0, -1, -2, -3, -4, -5],
        "y": [2, 4, 6, 8, 10, 12],
    }
)

model.train(train_data)

And the model is trained!

A pandas dataframe would have also worked in this case, although we recommend switching to dataframe libraries backed by Apache Arrow (like pyarrow, or datasets shown above), as not all pandas column dtypes can be cast to Arrow:

import pandas as pd
from molflux.modelzoo import load_model

model = load_model(
  name="random_forest_regressor",
  x_features=["x1", "x2"],
  y_features=["y"],
  n_estimators=50
)

train_data = pd.DataFrame(
    {
        "x1": [0, 1, 2, 3, 4, 5],
        "x2": [0, -1, -2, -3, -4, -5],
        "y": [2, 4, 6, 8, 10, 12],
    }
)

model.train(train_data)

Tip

To disable progress bars you can call datasets.disable_progress_bar() anywhere in your script.
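In the training examples above, the names given in x_features and y_features match column names of the training data. Before calling train(), it can be worth checking that every configured feature is actually present and that all columns have the same length. The following is a plain-Python sketch of such a check; missing_columns is a hypothetical helper, not part of molflux.

```python
def missing_columns(columns, x_features, y_features):
    """Return the configured feature names that are absent from `columns` (hypothetical helper)."""
    available = set(columns)
    return [name for name in [*x_features, *y_features] if name not in available]


# Same toy data as in the training examples, held as a plain dict of columns
train_columns = {
    "x1": [0, 1, 2, 3, 4, 5],
    "x2": [0, -1, -2, -3, -4, -5],
    "y": [2, 4, 6, 8, 10, 12],
}

print(missing_columns(train_columns, ["x1", "x2"], ["y"]))  # []
print(missing_columns(train_columns, ["x1", "x3"], ["y"]))  # ['x3']

# All columns should also have the same number of rows
lengths = {len(values) for values in train_columns.values()}
print(len(lengths) == 1)  # True
```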

Inferencing#

Once a model is trained, you can use it for inference using the model’s predict() method, to which you should feed the dataset you would like to get predictions for:

import datasets

test_data = datasets.Dataset.from_dict(
    {
        "x1": [10, 12],
        "x2": [-2.5, -1]
    }
)

predictions = model.predict(test_data)
print(predictions)

This returns a dictionary of your model’s predictions! Models can also support different inference methods. For example, some classification models support the predict_proba method, which returns the probabilities of the classes:

# assumes `model` is a classification model that supports predict_proba
probabilities = model.predict_proba(test_data)
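Since predict() returns a dictionary that maps names to lists of values, it is straightforward to line the predictions up with the inputs they were computed from. The sketch below does this in plain Python. The predictions dictionary is a placeholder: both the key predicted_y and the values are made up for illustration, so inspect the output of your own model for the actual key names.

```python
# Inputs from the inference example above
test_inputs = {
    "x1": [10, 12],
    "x2": [-2.5, -1],
}

# Placeholder standing in for the output of model.predict(test_data).
# Both the key name and the values are made up for illustration.
predictions = {"predicted_y": [11.5, 11.8]}


def to_rows(columns):
    """Turn a dict of equal-length lists into a list of per-row dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# Merge inputs and predictions column-wise, then view them row by row
rows = to_rows({**test_inputs, **predictions})
print(rows[0])  # {'x1': 10, 'x2': -2.5, 'predicted_y': 11.5}
```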

Saving/Loading a model#

Once you have trained your model, you can save it and load it for later use.

Saving#

To save a model, all you have to do is:

from molflux.modelzoo import save_to_store

save_to_store("path_to_my_model/", model)

The save_to_store function takes the path and the model to save. It can save to local disk or to an S3 location.

Note

For models intended for production level usage, we recommend that they are saved as described in the productionising section. Along with the model, this also saves the featurisation metadata and a snapshot of the environment the model was built in.

Loading#

To load, you simply need to do:

from molflux.modelzoo import load_from_store

model = load_from_store("path_to_my_model/")

This can load from local disk and from S3.