# Uncertainty for models
Hopefully you have already read the basic usage and ESOL tutorials and are now ready to learn how to use `molflux.modelzoo` models that provide uncertainty measures.
## Uncertainty
Uncertainty quantification is critical for building trust in machine learning models, and it enables techniques such as active learning by identifying the inputs about which a model is most uncertain.
While every model in `molflux.modelzoo` acts as a basic estimator by defining common functions such as `train(data, **kwargs)` and `predict(data, **kwargs)`, some regression models implement additional functionality:

- `predict_with_prediction_interval(data, confidence, **kwargs)` - returns prediction intervals along with the predictions, such that each prediction lies within its interval with probability `confidence`
- `predict_with_std(data, **kwargs)` - returns standard deviation values along with the predictions, as a measure of how uncertain the model is at each point
- `sample(data, n_samples, **kwargs)` - returns `n_samples` values for each input, drawn from the distribution modelled for that input. For a given input, the average of the samples should be close to the prediction, while their spread indicates how uncertain the model is about that input.
- `calibrate_uncertainty(data, **kwargs)` - calibrates the uncertainty of the model against an external/validation dataset
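To make these method shapes concrete, here is a toy regressor in plain Python. It is not a molflux model: the class, its single-Gaussian target model, and all internals are illustrative assumptions; only the method signatures mirror those listed above (`calibrate_uncertainty` is omitted for brevity).

```python
import random
import statistics
from statistics import NormalDist


class ToyGaussianRegressor:
    """Toy estimator (not a molflux model) mimicking the uncertainty method shapes."""

    def train(self, data):
        # model the target as a single Gaussian fitted to the training labels
        self.mean = statistics.fmean(data["y"])
        self.std = statistics.stdev(data["y"])

    def predict(self, data):
        return [self.mean for _ in data["x1"]]

    def predict_with_std(self, data):
        predictions = self.predict(data)
        return predictions, [self.std for _ in predictions]

    def predict_with_prediction_interval(self, data, confidence):
        # symmetric Gaussian interval: mean +/- z * std
        z = NormalDist().inv_cdf(0.5 + confidence / 2)
        predictions = self.predict(data)
        return predictions, [(p - z * self.std, p + z * self.std) for p in predictions]

    def sample(self, data, n_samples):
        rng = random.Random(0)
        return [
            [rng.gauss(p, self.std) for _ in range(n_samples)]
            for p in self.predict(data)
        ]


model = ToyGaussianRegressor()
model.train({"x1": [0, 1, 2], "y": [2.0, 4.0, 6.0]})
predictions, intervals = model.predict_with_prediction_interval({"x1": [0, 1]}, confidence=0.9)
```

The toy model always predicts the training mean; the point is only the shape of the return values: predictions alone, predictions with standard deviations, predictions with `(lower, upper)` interval tuples, or per-input lists of samples.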
You can check whether a model implements any of these methods by using the appropriate `supports_*` utility function:
```python
from molflux.modelzoo import load_model, supports_prediction_interval

model = load_model(
    name="cat_boost_regressor",
    x_features=["x1", "x2"],
    y_features=["y"],
)

assert supports_prediction_interval(model)
```
Similarly, `supports_std`, `supports_sampling`, and `supports_uncertainty_calibration` are also available.
## Quick example - CatBoost Models
A typical example of a model with implemented uncertainty methods is the CatBoost model. This architecture can return both a mean and a standard deviation for each prediction. In the example below, we train a CatBoost model, predict with it, and then use some of the functions defined above to get a measure of the model's uncertainty.
```python
import datasets

from molflux.modelzoo import load_model

model = load_model(
    name="cat_boost_regressor",
    x_features=["x1", "x2"],
    y_features=["y"],
)

train_dataset = datasets.Dataset.from_dict(
    {
        "x1": [0, 1, 2, 3, 4, 5],
        "x2": [0, -1, -2, -3, -4, -5],
        "y": [2, 4, 6, 8, 10, 12],
    }
)

model.train(train_dataset)

# Return just the predictions
print(model.predict(train_dataset))

# Return the predictions along with the lower and upper bounds of the 90% prediction interval
print(model.predict_with_prediction_interval(train_dataset, confidence=0.9))

# Return the predictions along with the standard deviations
print(model.predict_with_std(train_dataset))
```
```
{'cat_boost_regressor::y': [2.0485728335586764, 4.016775906939269, 6.004134017778156, 7.99636438644607, 9.983561893437876, 11.95137955029383]}
({'cat_boost_regressor::y': [2.0485728335586764, 4.016775906939269, 6.004134017778156, 7.99636438644607, 9.983561893437876, 11.95137955029383]}, {'cat_boost_regressor::y::prediction_interval': [(2.030412640886609, 2.0667330262307435), (4.014609671832753, 4.018942142045784), (6.0040024715944895, 6.004265563961823), (7.99626264708617, 7.996466125805969), (9.981482018814088, 9.985641768061665), (11.933183735147534, 11.969575365440125)]})
({'cat_boost_regressor::y': [2.0485728335586764, 4.016775906939269, 6.004134017778156, 7.99636438644607, 9.983561893437876, 11.95137955029383]}, {'cat_boost_regressor::y::std': [0.011040613203817379, 0.0013169774325333308, 7.997440107223924e-05, 6.185313892498188e-05, 0.0012644739870527513, 0.011062270130394281]})
```
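Standard deviations and prediction intervals are related: assuming the predictive distribution is Gaussian (an assumption on our part, not something the model API promises), a symmetric interval can be recovered from a mean and standard deviation:

```python
from statistics import NormalDist


def gaussian_interval(mean, std, confidence):
    """Symmetric interval covering `confidence` of a Normal(mean, std) distribution."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std, mean + z * std


# the first CatBoost mean/std pair from the output above, at 90% confidence
low, high = gaussian_interval(2.0485728335586764, 0.011040613203817379, 0.9)
```

For this particular output the result closely matches the first interval printed above, though agreement like this is model-specific and not guaranteed in general.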
## Models calibrated with model-agnostic uncertainty
For models that do not have built-in uncertainty, we can use methods such as conformal prediction, which provides a simple and effective way to create prediction intervals with guaranteed coverage probability from any predictive model, without making assumptions about the data-generating process or the model. Links to further resources on conformal prediction are available here.
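The core of (split) conformal prediction fits in a few lines of plain Python. The sketch below is illustrative only and is not the Mapie implementation: it widens point predictions by a quantile of the absolute residuals measured on a held-out calibration set:

```python
def split_conformal_intervals(predict, calib_x, calib_y, new_x, confidence):
    """Widen point predictions by a conformal quantile of calibration residuals."""
    # absolute residuals on the held-out calibration set
    residuals = sorted(abs(y - predict(x)) for x, y in zip(calib_x, calib_y))
    # conformal quantile with the usual (n + 1) finite-sample correction
    n = len(residuals)
    k = min(n - 1, int(confidence * (n + 1)))
    q = residuals[k]
    return [(predict(x) - q, predict(x) + q) for x in new_x]


# toy point predictor: y = 2 * x, with noisy calibration labels
intervals = split_conformal_intervals(
    lambda x: 2 * x,
    calib_x=[0, 1, 2, 3, 4],
    calib_y=[0.5, 2.2, 3.9, 6.1, 8.4],
    new_x=[10],
    confidence=0.8,
)
```

No distributional assumption is made: the coverage guarantee only requires that the calibration and test points are exchangeable.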
There are two common patterns for generating and calibrating these prediction intervals:

1. In one go, during training - this is typically done via cross-validation: under the hood, the training data is split into `k` folds and a model is fitted `k` times.
2. In two steps - first train an underlying model on training data, then calibrate its uncertainty on a validation dataset.

Both of these are possible with our Mapie implementation.
> **Note**: This functionality is still a work in progress.
### 1) Mapie example - in one go
The main steps to get a model with calibrated uncertainty in this case are:

1. Instantiate a base modelzoo model object
2. Instantiate a mapie model
   - use the base estimator object as the `estimator` argument
   - optionally, specify a value for `cv` (any value except `prefit`; see the next example for why)
3. Train the mapie model on any data
4. Use the model to generate calibrated prediction intervals on new data
```python
import datasets

from molflux.modelzoo import load_model

# create a normal modelzoo model
original_model = load_model(
    name="random_forest_regressor",
    x_features=["x1", "x2"],
    y_features=["y"],
)

train_dataset = datasets.Dataset.from_dict(
    {
        "x1": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "x2": [0, -1, -2, -3, -4, -5, -6, -7, -8, -9, -10],
        "y": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22],
    }
)

# plug a mapie regressor on top
model = load_model(
    "mapie_regressor",
    estimator=original_model,
    cv=5,
    x_features=original_model.x_features,
    y_features=original_model.y_features,
)

# train the mapie model on the training data (the intervals are calibrated during training)
model.train(train_dataset)

model.predict_with_prediction_interval(train_dataset, confidence=0.9)
```
```
({'mapie_regressor[random_forest_regressor]::y': [3.556363636363636,
  4.34909090909091,
  6.0672727272727265,
  7.912727272727272,
  9.598181818181818,
  11.956363636363637,
  14.314545454545454,
  16.078181818181818,
  17.996363636363636,
  19.66181818181818,
  20.52]},
 {'mapie_regressor[random_forest_regressor]::y::prediction_interval': [(0.22000000000000153,
   7.960000000000001),
  (1.0400000000000014, 7.960000000000001),
  (3.080000000000002, 9.08),
  (5.199999999999999, 11.219999999999999),
  (7.08, 13.040000000000001),
  (8.86, 14.82),
  (10.92, 16.880000000000003),
  (12.42, 18.979999999999997),
  (15.220000000000002, 21.400000000000002),
  (16.200000000000003, 22.8),
  (16.200000000000003, 23.7)]})
```
### 2) Mapie example - in two steps
The main steps to get a model with calibrated uncertainty in this case are:

1. Instantiate a base modelzoo model object
2. Train the base model on some training data
3. Instantiate a mapie model
   - use the base, already trained object as the `estimator` argument
   - set the `cv` argument to `prefit`
4. Calibrate the mapie model on some validation data
5. Use the model to generate calibrated prediction intervals on new data
```python
import datasets

from molflux.modelzoo import load_model

# create a normal modelzoo model
original_model = load_model(
    name="random_forest_regressor",
    x_features=["x1", "x2"],
    y_features=["y"],
)

train_dataset = datasets.Dataset.from_dict(
    {
        "x1": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "x2": [0, -1, -2, -3, -4, -5, -6, -7, -8, -9, -10],
        "y": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22],
    }
)

validation_dataset = datasets.Dataset.from_dict(
    {
        "x1": [-10, -5, 0, 5, 10, 15],
        "x2": [10, 0, 0, 0, -10, -15],
        "y": [-18, -13, 2, 17, 22, 32],
    }
)

# train the original model on the training data
original_model.train(train_dataset)

# plug a mapie regressor on top, with the "prefit" option for "cv"
model = load_model(
    "mapie_regressor",
    estimator=original_model,
    cv="prefit",
    x_features=original_model.x_features,
    y_features=original_model.y_features,
)

# calibrate the model on the validation data
model.calibrate_uncertainty(data=validation_dataset)

# predict on some new data (here, for simplicity, on the validation data)
model.predict_with_prediction_interval(data=validation_dataset, confidence=0.6)
```
```
({'mapie_regressor[random_forest_regressor]::y': [3.08,
  3.08,
  3.08,
  7.26,
  21.14,
  21.14]},
 {'mapie_regressor[random_forest_regressor]::y::prediction_interval': [(-12.999999999999998,
   19.159999999999997),
  (-12.999999999999998, 19.159999999999997),
  (-12.999999999999998, 19.159999999999997),
  (-8.819999999999999, 23.339999999999996),
  (5.060000000000002, 37.22),
  (5.060000000000002, 37.22)]})
```
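A simple sanity check for any calibrated intervals is empirical coverage: the fraction of held-out labels falling inside their intervals should be at least the requested confidence, up to finite-sample noise. A minimal sketch using the intervals and labels from the example above (noting that, for simplicity, these intervals were generated on the same data used for calibration):

```python
def empirical_coverage(intervals, y_true):
    """Fraction of true values falling inside their prediction intervals."""
    covered = sum(low <= y <= high for (low, high), y in zip(intervals, y_true))
    return covered / len(y_true)


# intervals and validation labels copied from the output above
intervals = [
    (-12.999999999999998, 19.159999999999997),
    (-12.999999999999998, 19.159999999999997),
    (-12.999999999999998, 19.159999999999997),
    (-8.819999999999999, 23.339999999999996),
    (5.060000000000002, 37.22),
    (5.060000000000002, 37.22),
]
y_true = [-18, -13, 2, 17, 22, 32]

coverage = empirical_coverage(intervals, y_true)
```

Here 4 of the 6 labels fall inside their intervals, giving a coverage of about 0.67, above the requested confidence of 0.6.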
## Conditional use of model-agnostic uncertainty
> **Tip**: As mentioned above, a number of protocols can be used to check whether a loaded model supports a specific uncertainty method. This allows conditional execution of code that wraps models lacking built-in uncertainty with model-agnostic uncertainty methods.
```python
import datasets

from copy import copy
from molflux.modelzoo import load_from_dicts, load_model, supports_prediction_interval

list_of_configs = [
    {
        'name': 'random_forest_regressor',
        'config': {
            'x_features': ['x1', 'x2'],
            'y_features': ['y'],
            'n_estimators': 500,
            'max_depth': 10,
        },
    },
    {
        'name': 'cat_boost_regressor',
        'config': {
            'x_features': ['x1', 'x2'],
            'y_features': ['y'],
        },
    },
]

models = load_from_dicts(list_of_configs)

train_dataset = datasets.Dataset.from_dict(
    {
        "x1": [0, 1, 2, 3, 4, 5],
        "x2": [0, -1, -2, -3, -4, -5],
        "y": [2, 4, 6, 8, 10, 12],
    }
)

for original_model in models.values():
    if supports_prediction_interval(original_model):
        model = copy(original_model)
    else:
        # plug a mapie regressor on top
        model = load_model(
            "mapie_regressor",
            estimator=original_model,
            cv=5,
            x_features=original_model.x_features,
            y_features=original_model.y_features,
        )

    model.train(train_dataset)
    predictions, prediction_intervals = model.predict_with_prediction_interval(
        train_dataset, confidence=0.5
    )
    print(original_model.name, predictions, prediction_intervals)
```
```
random_forest_regressor {'mapie_regressor[random_forest_regressor]::y': [3.6039999999999996, 4.27, 6.013333333333333, 8.077999999999998, 9.736666666666666, 10.44]} {'mapie_regressor[random_forest_regressor]::y::prediction_interval': [(2.56, 3.6399999999999997), (3.7239999999999998, 4.3919999999999995), (5.287999999999999, 6.715999999999999), (7.244, 8.66), (9.448, 10.456), (10.216000000000001, 11.424)]}
cat_boost_regressor {'cat_boost_regressor::y': [2.0485728335586764, 4.016775906939269, 6.004134017778156, 7.99636438644607, 9.983561893437876, 11.95137955029383]} {'cat_boost_regressor::y::prediction_interval': [(2.041126053116822, 2.056019614000531), (4.015887619159785, 4.017664194718752), (6.004080075864355, 6.004187959691958), (7.996322667137847, 7.996406105754292), (9.98270901869422, 9.984414768181534), (11.943918162476978, 11.958840938110681)]}
```