Basic usage#
In this section, we will illustrate how to use molflux.modelzoo. These examples will provide you with a starting point.
Browsing#
First, we review which model architectures are available for use. To view what's available, you can do
from molflux.modelzoo import list_models
catalogue = list_models()
print(catalogue)
{'catboost': ['cat_boost_classifier', 'cat_boost_regressor'], 'core': ['average_features_regressor', 'average_regressor'], 'ensemble': ['ensemble_classifier', 'ensemble_regressor'], 'fortuna': ['fortuna_mlp_regressor'], 'lightning': ['lightning_mlp_regressor'], 'lightning_gp': ['lightning_gp_regressor'], 'mapie': ['mapie_regressor'], 'pyod': ['abod_detector', 'cblof_detector', 'hbos_detector', 'isolation_forest_detector', 'knn_detector', 'mcd_detector', 'ocsvm_detector', 'pca_detector'], 'pystan': ['ordinal_classifier', 'sparse_linear_regressor'], 'sklearn': ['bernoulli_nb_classifier', 'corrected_nb_classifier', 'coverage_nb_classifier', 'dummy_classifier', 'extra_trees_classifier', 'extra_trees_regressor', 'gradient_boosting_classifier', 'gradient_boosting_regressor', 'kernel_ridge_regressor', 'knn_classifier', 'knn_regressor', 'linear_discriminant_analysis_classifier', 'linear_regressor', 'logistic_regressor', 'mlp_classifier', 'mlp_regressor', 'pipeline_pilot_nb_classifier', 'pls_regressor', 'random_forest_classifier', 'random_forest_regressor', 'ridge_regressor', 'sklearn_pipeline_classifier', 'sklearn_pipeline_regressor', 'support_vector_classifier', 'support_vector_regressor'], 'xgboost': ['xg_boost_classifier', 'xg_boost_regressor']}
This returns our catalogue of available model architectures (organised by the dependencies they rely on). There are a few to choose from.
See also
How to add your own model if you would like to contribute a new model to the catalogue
For instance, molflux.modelzoo.list_models() returns as one item in the dictionary: 'xgboost': ['xg_boost_classifier', 'xg_boost_regressor']. To be able to use the two models xg_boost_classifier and xg_boost_regressor, you would do pip install molflux[xgboost].
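If you want to check programmatically whether a particular architecture is available before loading it, you can flatten the catalogue. A minimal sketch, using only list_models():
import itertools

from molflux.modelzoo import list_models

# flatten the dependency-grouped catalogue into a single list of model names
catalogue = list_models()
all_model_names = list(itertools.chain.from_iterable(catalogue.values()))

print("xg_boost_regressor" in all_model_names)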
Loading a model architecture#
Loading a model architecture of your choice is simple. For example, to load a random_forest_regressor from the catalogue:
from molflux.modelzoo import load_model
model = load_model(name="random_forest_regressor")
print(model)
Model(
name: "random_forest_regressor",
tag: "random_forest_regressor",
description: """
This is an sklearn random forest regressor model.
A random forest is a meta estimator that fits a number of classifying
decision trees on various sub-samples of the dataset and uses averaging
to improve the predictive accuracy and control over-fitting.
The sub-sample size is controlled with the `max_samples` parameter if
`bootstrap=True` (default), otherwise the whole dataset is used to build
each tree.
""",
config signature: __init__(self, x_features: list[str] = <factory>, y_features: list[str] = <factory>, train_features: list[str] | dict[str, list[str]] | None = None, n_estimators: int = 100, criterion: Literal['squared_error', 'absolute_error', 'friedman_mse', 'poisson'] = 'squared_error', max_depth: int | None = None, min_samples_split: int | float = 2, min_samples_leaf: int | float = 1, min_weight_fraction_leaf: float = 0.0, max_features: Union[float, Literal['sqrt', 'log2'], NoneType] = 1.0, max_leaf_nodes: int | None = None, min_impurity_decrease: float = 0.0, bootstrap: bool = True, oob_score: bool = False, n_jobs: int | None = None, random_state: int | numpy.random.mtrand.RandomState | None = None, verbose: int = 0, warm_start: bool = False, ccp_alpha: float = 0.0, max_samples: int | float | None = None) -> None,
config: """
Parameters
----------
n_estimators : int, default=100
The number of trees in the forest.
criterion : {"squared_error", "absolute_error", "friedman_mse", "poisson"}, default="squared_error"
The function to measure the quality of a split. Supported criteria
are "squared_error" for the mean squared error, which is equal to
variance reduction as feature selection criterion and minimizes the L2
loss using the mean of each terminal node, "friedman_mse", which uses
mean squared error with Friedman's improvement score for potential
splits, "absolute_error" for the mean absolute error, which minimizes
the L1 loss using the median of each terminal node, and "poisson" which
uses reduction in Poisson deviance to find splits.
Training using "absolute_error" is significantly slower
than when using "squared_error".
max_depth : int, default=None
The maximum depth of the tree. If None, then nodes are expanded until
all leaves are pure or until all leaves contain less than
min_samples_split samples.
min_samples_split : int or float, default=2
The minimum number of samples required to split an internal node:
- If int, then consider `min_samples_split` as the minimum number.
- If float, then `min_samples_split` is a fraction and
`ceil(min_samples_split * n_samples)` are the minimum
number of samples for each split.
min_samples_leaf : int or float, default=1
The minimum number of samples required to be at a leaf node.
A split point at any depth will only be considered if it leaves at
least ``min_samples_leaf`` training samples in each of the left and
right branches. This may have the effect of smoothing the model,
especially in regression.
- If int, then consider `min_samples_leaf` as the minimum number.
- If float, then `min_samples_leaf` is a fraction and
`ceil(min_samples_leaf * n_samples)` are the minimum
number of samples for each node.
min_weight_fraction_leaf : float, default=0.0
The minimum weighted fraction of the sum total of weights (of all
the input samples) required to be at a leaf node. Samples have
equal weight when sample_weight is not provided.
max_features : {"sqrt", "log2", None}, int or float, default=1.0
The number of features to consider when looking for the best split:
- If int, then consider `max_features` features at each split.
- If float, then `max_features` is a fraction and
`round(max_features * n_features)` features are considered at each
split.
- If "auto", then `max_features=n_features`.
- If "sqrt", then `max_features=sqrt(n_features)`.
- If "log2", then `max_features=log2(n_features)`.
- If None or 1.0, then `max_features=n_features`.
Note: the search for a split does not stop until at least one
valid partition of the node samples is found, even if it requires to
effectively inspect more than ``max_features`` features.
max_leaf_nodes : int, default=None
Grow trees with ``max_leaf_nodes`` in best-first fashion.
Best nodes are defined as relative reduction in impurity.
If None then unlimited number of leaf nodes.
min_impurity_decrease : float, default=0.0
A node will be split if this split induces a decrease of the impurity
greater than or equal to this value.
The weighted impurity decrease equation is the following::
N_t / N * (impurity - N_t_R / N_t * right_impurity
- N_t_L / N_t * left_impurity)
where ``N`` is the total number of samples, ``N_t`` is the number of
samples at the current node, ``N_t_L`` is the number of samples in the
left child, and ``N_t_R`` is the number of samples in the right child.
``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum,
if ``sample_weight`` is passed.
bootstrap : bool, default=True
Whether bootstrap samples are used when building trees. If False, the
whole dataset is used to build each tree.
oob_score : bool, default=False
Whether to use out-of-bag samples to estimate the generalization score.
Only available if bootstrap=True.
n_jobs : int, default=None
The number of jobs to run in parallel. :meth:`fit`, :meth:`predict`,
:meth:`decision_path` and :meth:`apply` are all parallelized over the
trees. ``None`` means 1 unless in a :obj:`joblib.parallel_backend`
context. ``-1`` means using all processors.
random_state : int, RandomState instance or None, default=None
Controls both the randomness of the bootstrapping of the samples used
when building trees (if ``bootstrap=True``) and the sampling of the
features to consider when looking for the best split at each node
(if ``max_features < n_features``).
verbose : int, default=0
Controls the verbosity when fitting and predicting.
warm_start : bool, default=False
When set to ``True``, reuse the solution of the previous call to fit
and add more estimators to the ensemble, otherwise, just fit a whole
new forest.
ccp_alpha : non-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The
subtree with the largest cost complexity that is smaller than
``ccp_alpha`` will be chosen. By default, no pruning is performed.
max_samples : int or float, default=None
If bootstrap is True, the number of samples to draw from X
to train each base estimator.
- If None (default), then draw `X.shape[0]` samples.
- If int, then draw `max_samples` samples.
- If float, then draw `max_samples * X.shape[0]` samples. Thus,
`max_samples` should be in the interval `(0.0, 1.0]`.
""",
train signature: self.train(train_data: datasets.arrow_dataset.Dataset, **kwargs: Any) -> Any,
predict signature: self.predict(data: datasets.arrow_dataset.Dataset, **kwargs: Any) -> dict[str, list[typing.Any]],
)
By printing the loaded model architecture, you get more information about it. Each model has a name, a tag (to uniquely identify it in case you would like to generate multiple copies of the same model but with different configurations), and a set of architecture-specific configuration parameters. You should also be able to view a short description of the model, and get some extra information about the model's method signatures.
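This metadata should also be available directly on the loaded model; for instance, assuming name and tag attributes that mirror the printout above (a sketch, not the full API):
# assumed attributes, mirroring the fields shown in the printout above
print(model.name)
print(model.tag)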
To load a model with non-default configuration parameters, you can simply supply them at load time:
from molflux.modelzoo import load_model
model = load_model(
name="random_forest_regressor",
tag="my_rf",
x_features=["x1", "x2"],
y_features=["y"],
n_estimators=50
)
# double check your model's architecture configuration
print(model.config)
{'x_features': ['x1', 'x2'], 'y_features': ['y'], 'train_features': None, 'n_estimators': 50, 'criterion': 'squared_error', 'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 1, 'min_weight_fraction_leaf': 0.0, 'max_features': 1.0, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'bootstrap': True, 'oob_score': False, 'n_jobs': None, 'random_state': None, 'verbose': 0, 'warm_start': False, 'ccp_alpha': 0.0, 'max_samples': None}
With time, you may want to load model architectures using a config-driven approach. To do this, molflux.modelzoo supports loading model architectures from dictionaries specifying the model architecture to be loaded and its configuration parameters:
from molflux.modelzoo import load_from_dict
config = {
'name': 'random_forest_regressor',
'config':
{
'tag': "my_rf",
'x_features': ['x1', 'x2'],
'y_features': ['y'],
'n_estimators': 50,
},
}
model = load_from_dict(config)
The name key specifies the name of the model architecture to load from the catalogue. The config key should hold the dictionary of configuration arguments to initialise the model with (if not specified, the model will use default values).
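For instance, since config is optional, a dictionary containing only a name loads the architecture with its default configuration:
from molflux.modelzoo import load_from_dict

# no 'config' key: the model is initialised with default configuration values
model = load_from_dict({"name": "random_forest_regressor"})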
You can also load multiple models all at once using a list of config dictionaries. This is done as follows:
from molflux.modelzoo import load_from_dicts
list_of_configs = [
{
'name': 'random_forest_regressor',
'config':
{
'tag': "my_rf_1",
'x_features': ['x1', 'x2'],
'y_features': ['y'],
'n_estimators': 500,
'max_depth': 10,
},
},
{
'name': 'random_forest_classifier',
'config':
{
'tag': "my_rf_2",
'x_features': ['x1', 'x2'],
'y_features': ['y'],
'n_estimators': 500,
'max_depth': 10,
},
}
]
models = load_from_dicts(list_of_configs)
print(models)
{'my_rf_1': <molflux.modelzoo.models.sklearn.random_forest_regressor.RandomForestRegressor object at 0x7f65eb98edd0>, 'my_rf_2': <molflux.modelzoo.models.sklearn.random_forest_classifier.RandomForestClassifier object at 0x7f65eb920590>}
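As the printout shows, the result is a dictionary keyed by each model's tag, so individual models can be picked out by name:
# retrieve one of the models by its tag
my_rf_1 = models["my_rf_1"]
print(my_rf_1.config)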
Finally, you can load models from a yaml file. You can use a single yaml file which includes configs for all the molflux tools, and molflux.modelzoo will know how to extract the relevant part it needs. To do so, you need to define a yaml file with the following example format
---
version: v1
kind: models
specs:
- name: random_forest_regressor
config:
tag: my_rf_1
x_features:
- x1
- x2
y_features:
- y1
n_estimators: 500
- name: random_forest_classifier
config:
tag: my_rf_2
x_features:
- x1
- x2
y_features:
- y1
n_estimators: 300
...
It consists of a version (this is the version of the config format, for now just v1), a kind of config (in this case models), and specs. specs is where the configs are defined. The yaml file can include configs for other molflux modules as well. To load the models from the yaml file, you can simply do
from molflux.modelzoo import load_from_yaml
models = load_from_yaml(path_to_yaml_file)
print(models)
Training/Inferencing a model#
All models in molflux.modelzoo have train and predict methods. These are the two main methods you need to interact with.
Training#
After loading a model architecture, you can train it on a dataset using the model's train() method, to which you should feed your training dataset and optional training arguments (if any are specified by the model architecture of your choice).
Note
Our models' interfaces accept dataframe-like objects that implement the Dataframe Interchange Protocol as input data: these include pandas dataframes, pyarrow tables, vaex dataframes, cudf dataframes, and many other popular dataframe libraries. We also support HuggingFace datasets as inputs for seamless integration with our datasets users. If you are used to working with other in-memory data representations, you will need to convert them before feeding them to our models. Please contact us if you need support with your workflows.
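For example, if your features live in plain numpy arrays (which do not implement the protocol), one option is to wrap them in a HuggingFace dataset first. A minimal sketch:
import datasets
import numpy as np

x1 = np.array([0, 1, 2])
x2 = np.array([0, -1, -2])
y = np.array([2, 4, 6])

# numpy arrays are not dataframe-like: convert them to a supported
# representation (here, a HuggingFace dataset) before training
train_data = datasets.Dataset.from_dict(
    {"x1": x1.tolist(), "x2": x2.tolist(), "y": y.tolist()}
)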
For example, we can train our random_forest_regressor as follows:
import datasets
from molflux.modelzoo import load_model
model = load_model(
name="random_forest_regressor",
x_features=["x1", "x2"],
y_features=["y"],
n_estimators=50
)
train_data = datasets.Dataset.from_dict(
{
"x1": [0, 1, 2, 3, 4, 5],
"x2": [0, -1, -2, -3, -4, -5],
"y": [2, 4, 6, 8, 10, 12],
}
)
model.train(train_data)
And the model is trained!
A pandas dataframe would also have worked in this case - although we recommend switching to dataframe libraries backed by apache arrow (like pyarrow, or datasets shown above), as not all pandas column dtypes can be cast to arrow:
import pandas as pd
from molflux.modelzoo import load_model
model = load_model(
name="random_forest_regressor",
x_features=["x1", "x2"],
y_features=["y"],
n_estimators=50
)
train_data = pd.DataFrame(
{
"x1": [0, 1, 2, 3, 4, 5],
"x2": [0, -1, -2, -3, -4, -5],
"y": [2, 4, 6, 8, 10, 12],
}
)
model.train(train_data)
Tip
To disable progress bars, you can call datasets.disable_progress_bar() anywhere in your script.
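For example:
import datasets

# silence the progress bars shown during dataset operations
datasets.disable_progress_bar()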
Inferencing#
Once a model is trained, you can use it for inference using the model's predict() method, to which you should feed the dataset you would like to get predictions for:
import datasets
test_data = datasets.Dataset.from_dict(
{
"x1": [10, 12],
"x2": [-2.5, -1]
}
)
predictions = model.predict(test_data)
print(predictions)
This returns a dictionary of your model's predictions! Models can also support different inference methods. For example, some classification models support the predict_proba method, which returns the probabilities of the classes:
probabilities = model.predict_proba(test_data)
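Putting this together: a minimal sketch with the random_forest_classifier from the catalogue, assuming it exposes predict_proba (with integer class labels in the y column):
import datasets
from molflux.modelzoo import load_model

classifier = load_model(
    name="random_forest_classifier",
    x_features=["x1", "x2"],
    y_features=["y"],
)

# toy binary classification dataset
train_data = datasets.Dataset.from_dict(
    {
        "x1": [0, 1, 2, 3, 4, 5],
        "x2": [0, -1, -2, -3, -4, -5],
        "y": [0, 0, 0, 1, 1, 1],
    }
)
classifier.train(train_data)

test_data = datasets.Dataset.from_dict({"x1": [10, 12], "x2": [-2.5, -1]})
probabilities = classifier.predict_proba(test_data)
print(probabilities)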
Saving/Loading a model#
Once you have trained your model, you can save it and load it for later use.
Saving#
To save a model, all you have to do is
from molflux.modelzoo import save_to_store
save_to_store("path_to_my_model/", model)
The save_to_store function takes the path and the model to save. It can save to local disk or to an s3 location.
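For example, to save to an s3 location instead of local disk (hypothetical bucket path):
from molflux.modelzoo import save_to_store

# hypothetical s3 destination: an "s3://..." path is handled the same way
save_to_store("s3://my-bucket/models/my_rf/", model)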
Note
For models intended for production level usage, we recommend that they are saved as described in the productionising section. Along with the model, this also saves the featurisation metadata and a snapshot of the environment the model was built in.
Loading#
To load, you simply need to do
from molflux.modelzoo import load_from_store
model = load_from_store("path_to_my_model/")
This can load from local disk and s3.
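The loaded model comes back ready for inference; for example, reusing the test_data from above:
from molflux.modelzoo import load_from_store

model = load_from_store("path_to_my_model/")
predictions = model.predict(test_data)
print(predictions)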