Basic usage
In this section, we illustrate how to use molflux.features. These examples will provide you with a starting point.
Browsing
First, we review which representations are available for use. These are conveniently categorised (for example, into core, rdkit, etc.). To view what’s available, you can do:
from molflux.features import list_representations
catalogue = list_representations()
print(catalogue)
{'core': ['character_count', 'exploded', 'sum'], 'openeye': ['aromatic_ring_count', 'canonical_oemol', 'canonical_smiles', 'circular', 'hermite', 'lingo', 'maccs', 'molecular_weight', 'net_charge', 'num_acceptors', 'num_donors', 'path', 'rotatable_bonds', 'tpsa', 'tree', 'x_log_p'], 'rdkit': ['atom_pair', 'atom_pair_unfolded', 'avalon', 'drfp', 'layered', 'maccs_rdkit', 'map_light', 'mhfp', 'mhfp_unfolded', 'morgan', 'morgan_unfolded', 'pattern', 'rdkit_descriptors_2d', 'topological', 'topological_torsion', 'topological_torsion_unfolded', 'toxicophores']}
This returns a dictionary of available representations, organised by category and name. There are a few to choose from.
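Because the catalogue is a plain dictionary mapping category names to lists of representation names, it can be searched with ordinary Python. The sketch below is illustrative only: `find_categories` is a hypothetical helper (not part of molflux), and the catalogue is an abbreviated copy of the output above so that the snippet runs without molflux installed.

```python
# Abbreviated copy of the catalogue printed above. In practice you would use:
#   from molflux.features import list_representations
#   catalogue = list_representations()
catalogue = {
    "core": ["character_count", "exploded", "sum"],
    "rdkit": ["atom_pair", "avalon", "maccs_rdkit", "morgan", "topological"],
}


def find_categories(catalogue: dict[str, list[str]], name: str) -> list[str]:
    """Return every category that offers a representation called `name`.

    Hypothetical helper for illustration; not a molflux function.
    """
    return [category for category, names in catalogue.items() if name in names]


print(find_categories(catalogue, "morgan"))           # ['rdkit']
print(find_categories(catalogue, "character_count"))  # ['core']
print(find_categories(catalogue, "not_a_feature"))    # []
```

An empty list is a quick hint that the package providing that representation may not be installed.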
By default, molflux.features comes with the core features. You can get more representations by pip installing packages which provide molflux representations. To see how you can add your own representation, see How to add your own representations.
Loading representations
Loading a molflux.features representation is straightforward:
from molflux.features import load_representation
representation = load_representation(name="morgan")
print(representation)
Representation(
name: "morgan",
tag: "morgan",
signature: self.featurise(*columns: Union[collections.abc.Iterable[str], collections.abc.Iterable[Any], collections.abc.Iterable[bytes]], radius: int = 3, n_bits: int = 2048, invariants: list[int] | None = None, from_atoms: list[int] | None = None, use_chirality: bool = False, use_bond_types: bool = True, use_features: bool = False, bit_info: dict | None = None, include_redundant_environments: bool = False, **kwargs: Any) -> dict[str, list[list[int]]],
description: """
Morgan fingerprint.
These fingerprints are similar to the well-known ECFP or FCFP fingerprints,
depending on which invariants are used. These are implemented based on the
original paper. The algorithm follows the description in the paper as
closely as possible with the exception of the chemical feature definitions used
for the “Feature Morgan” fingerprint - the RDKit implementation uses the
feature types Donor, Acceptor, Aromatic, Halogen, Basic, and Acidic with
definitions adapted from those in [1]_. It is possible to provide your
own atom types. The fingerprints are available as either explicit or sparse
count vectors or explicit bit vectors.
The algorithm used is described in the paper
Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. JCIM 50:742-54 (2010)
https://doi.org/10.1021/ci100050t
The original implementation was done using this paper:
D. Rogers, R.D. Brown, M. Hahn J. Biomol. Screen. 10:682-6 (2005)
and an unpublished technical report:
http://www.ics.uci.edu/~welling/teaching/ICS274Bspring06/David%20Rogers%20-%20ECFP%20Manuscript.doc
[1]_ https://doi.org/10.1002/(SICI)1097-0290(199824)61:1%3C47::AID-BIT9%3E3.0.CO;2-Z
""",
usage: """Generates Morgan fingerprints for each input molecule.
Args:
samples: The molecules to be fingerprinted.
radius: The number of iterations to grow the fingerprint.
n_bits: The size of the fingerprint. Defaults to `2048`.
invariants: The set of atom invariants to be used. Defaults to
`None`, which corresponds to ECFP-type invariants.
from_atoms: If provided, only the atoms in the vector will be used
as centers in the fingerprint. Defaults to `None`.
use_chirality: If set, additional information will be added to the
fingerprint when chiral atoms are discovered, generating
different fingerprints. Defaults to `False`.
use_bond_types: If set, bond types will be included as part of the
hash for calculating bits. Defaults to `True`.
use_features: Defaults to `False`.
bit_info: Defaults to `None`.
include_redundant_environments: If not None, the check for redundant
atom environments will not be done. Defaults to `False`.
Returns:
MACCS fingerprints, as lists of bits.
Examples:
>>> from molflux.features import load_representation
>>> representation = load_representation('morgan')
>>> samples = ['c1ccccc1']
>>> representation.featurise(samples, n_bits=16)
{'morgan': [[1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
"""
state: {'radius': 3, 'n_bits': 2048, 'invariants': None, 'from_atoms': None, 'use_chirality': False, 'use_bond_types': True, 'use_features': False, 'bit_info': None, 'include_redundant_environments': False}
)
By printing the loaded representation, you get more information about it. Each representation has a name and a tag (to uniquely identify it in case you would like to generate multiple copies of the same representation but with different configurations). You can also see the optional featurisation arguments (and their default values) in the signature. There is also a short description of the representation.
You can also load a representation from a config. A molflux.features config is a dictionary specifying the representation to be loaded. A config dictionary must have the following format:
representation_dict = {
    'name': '<name of the representation>',
    'config': '<kwargs for instantiating representation>',
    'presets': '<kwarg presets for featurising>',
}
The name key specifies the name of the representation to load from the catalogue. The config key specifies the arguments that are needed for instantiating the representation, and the presets key specifies some preset kwargs to apply upon featurisation (for example, the length of a fingerprint). If neither config nor presets is specified, the representation will use default values.
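Since config and presets are optional, a config can be as small as a single name entry. The following sketch builds config dictionaries in the format described above and leaves out any key that was not supplied. `make_config` is a hypothetical convenience function for illustration, not part of molflux.

```python
from typing import Any, Optional


def make_config(
    name: str,
    config: Optional[dict[str, Any]] = None,
    presets: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Build a representation config, omitting keys that were not given.

    Hypothetical helper for illustration; not a molflux function.
    """
    representation_dict: dict[str, Any] = {"name": name}
    if config is not None:
        representation_dict["config"] = config
    if presets is not None:
        representation_dict["presets"] = presets
    return representation_dict


print(make_config("character_count"))
# {'name': 'character_count'}
print(make_config("morgan", presets={"n_bits": 16, "radius": 3}))
# {'name': 'morgan', 'presets': {'n_bits': 16, 'radius': 3}}
```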
To load a representation from a config:
from molflux.features import load_from_dict
config = {
    'name': 'morgan',
    'presets': {
        'n_bits': 16,
        'radius': 3,
    },
}
representation = load_from_dict(config)
print(representation.state)
{'radius': 3, 'n_bits': 16, 'invariants': None, 'from_atoms': None, 'use_chirality': False, 'use_bond_types': True, 'use_features': False, 'bit_info': None, 'include_redundant_environments': False}
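Comparing this state with the one printed for the default morgan representation suggests that presets behave like an update applied on top of the default featurisation arguments. The sketch below reproduces the printed state with a plain dictionary merge. It illustrates the observed behaviour only and is not a description of how molflux implements presets internally.

```python
# Default featurisation arguments, copied from the `state` printed for the
# default `morgan` representation earlier in this section.
defaults = {
    "radius": 3,
    "n_bits": 2048,
    "invariants": None,
    "from_atoms": None,
    "use_chirality": False,
    "use_bond_types": True,
    "use_features": False,
    "bit_info": None,
    "include_redundant_environments": False,
}

# The presets from the config above.
presets = {"n_bits": 16, "radius": 3}

# Later keys win, so preset values replace the matching defaults
# while every other argument keeps its default value.
state = {**defaults, **presets}
print(state["n_bits"], state["radius"])  # 16 3
```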
For convenience, you can also load a group of representations all at once by specifying a list of configs.
from molflux.features import load_from_dicts
config = [
    {
        'name': 'character_count',
    },
    {
        'name': 'morgan',
        'presets': {
            'n_bits': 16,
            'radius': 4,
        },
    },
]
representations = load_from_dicts(config)
print(representations)
Representations(['character_count', 'morgan'])
Finally, you can load representations from a yaml file. You can use a single yaml file which includes configs for all the molflux tools, and molflux.features will know how to extract the relevant document it needs. To do so, you need to define a yaml file with the following example document:
---
version: v1
kind: representations
specs:
  - name: character_count
  - name: morgan
    presets:
      n_bits: 16
      radius: 4
...
It consists of a version (this is the version of the config format, for now just v1), the kind of config (in this case representations), and specs. specs is where the configs are defined. The yaml file can include configs for other molflux modules as well. To load this yaml file, you can simply do:
from molflux.features import load_from_yaml
# Replace with the path to your own yaml file
path_to_yaml_file = "path/to/config.yaml"
representations = load_from_yaml(path_to_yaml_file)
print(representations)
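To try this out, the yaml document first has to exist on disk. As a minimal sketch, the snippet below writes a reduced version of the document (two specs, no presets) to a temporary file using only the standard library. The file name `representations.yaml` is arbitrary, and the final `load_from_yaml` call is left as a comment so that the snippet runs without molflux installed. Passing the path as a string is an assumption made here for safety.

```python
import tempfile
from pathlib import Path

# A reduced version of the example document: two representations, defaults only.
yaml_document = """\
---
version: v1
kind: representations
specs:
  - name: character_count
  - name: morgan
...
"""

# Write the document to a fresh temporary directory.
path_to_yaml_file = Path(tempfile.mkdtemp()) / "representations.yaml"
path_to_yaml_file.write_text(yaml_document)
print(path_to_yaml_file.read_text())

# With molflux installed, the file could then be loaded with:
#   from molflux.features import load_from_yaml
#   representations = load_from_yaml(str(path_to_yaml_file))
```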
Featurisation
After loading a representation (or group of representations), you can apply it to molecules to compute the features. The input to molflux.features depends on the representation, but in general all representations can accept SMILES (or binary serialised molecules from rdkit or openeye). Molecules can be passed individually or as a list.
from molflux.features import load_representation
representation = load_representation("character_count")
data = ["CCCC", "c1ccc(cc1)C(C#N)OC2C(C(C(C(O2)COC3C(C(C(C(O3)CO)O)O)O)O)O)O"]
featurised_data = representation.featurise(data)
print(featurised_data)
{'character_count': [4, 59]}
This returns a dictionary with the representation tag as the key and the computed features as the value. For a group of representations, you can follow the same procedure:
from molflux.features import load_from_dicts
feature_config = [
    {
        'name': 'character_count',
    },
    {
        'name': 'morgan',
        'presets': {
            'n_bits': 16,
            'radius': 4,
        },
    },
]
representations = load_from_dicts(feature_config)
data = ["CCCC", "c1ccc(cc1)C(C#N)OC2C(C(C(C(O2)COC3C(C(C(C(O3)CO)O)O)O)O)O)O"]
featurised_data = representations.featurise(data)
print(featurised_data)
{'character_count': [4, 59], 'morgan': [[1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
This returns a dictionary with all the features, with the tags as the keys and the computed features as the values.
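The returned dictionary is organised by feature (one entry per tag), whereas downstream code often wants one row per molecule. As an illustrative sketch, the snippet below takes the output printed above as a literal and flattens it into one feature row per input molecule. `to_rows` is a hypothetical helper, not part of molflux.

```python
data = ["CCCC", "c1ccc(cc1)C(C#N)OC2C(C(C(C(O2)COC3C(C(C(C(O3)CO)O)O)O)O)O)O"]

# Literal copy of the featurised output printed above.
featurised_data = {
    "character_count": [4, 59],
    "morgan": [
        [1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    ],
}


def to_rows(featurised: dict[str, list]) -> list[list]:
    """Flatten per-tag features into one flat row per molecule.

    Hypothetical helper for illustration; not a molflux function.
    Scalar features contribute one column, list features one column per entry.
    """
    rows = []
    for values in zip(*featurised.values()):
        row = []
        for value in values:
            if isinstance(value, (list, tuple)):
                row.extend(value)
            else:
                row.append(value)
        rows.append(row)
    return rows


rows = to_rows(featurised_data)
for smiles, row in zip(data, rows):
    # 1 character count + 16 fingerprint bits = 17 columns per molecule
    print(smiles[:10], len(row))
```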
Note
The molflux package also builds on top of the above featurising methods to reproduce featurisation in production. See Productionising featurisation.
Integration with molflux.datasets
You can easily featurise your datasets from molflux.datasets using molflux.features representations. To learn more, see here.