Basic usage

In this section, we illustrate how to use molflux.features. These examples will provide you with a starting point.

Browsing

First, we review which representations are available. They are grouped into categories (for example, core and rdkit). To list them, do

from molflux.features import list_representations

catalogue = list_representations()

print(catalogue)
{'core': ['character_count', 'exploded', 'sum'], 'openeye': ['aromatic_ring_count', 'canonical_oemol', 'canonical_smiles', 'circular', 'hermite', 'lingo', 'maccs', 'molecular_weight', 'net_charge', 'num_acceptors', 'num_donors', 'path', 'rotatable_bonds', 'tpsa', 'tree', 'x_log_p'], 'rdkit': ['atom_pair', 'atom_pair_unfolded', 'avalon', 'drfp', 'layered', 'maccs_rdkit', 'map_light', 'mhfp', 'mhfp_unfolded', 'morgan', 'morgan_unfolded', 'pattern', 'rdkit_descriptors_2d', 'topological', 'topological_torsion', 'topological_torsion_unfolded', 'toxicophores']}

This returns a dictionary of the available representations, organised by category and name. By default, molflux.features ships with the core representations. Further representations become available when you pip install packages that provide molflux representations. To see how you can add your own representation, see How to add your own representations.
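
Because the catalogue is a plain dictionary mapping category names to lists of representation names, ordinary dictionary operations are enough to search it. The sketch below uses a trimmed, hard-coded copy of the output above so that it runs without any optional packages installed. Note that find_category is a hypothetical helper written for this example and is not part of molflux.

```python
# Trimmed copy of the catalogue printed above. In practice this would be
# the return value of list_representations().
catalogue = {
    "core": ["character_count", "exploded", "sum"],
    "rdkit": ["avalon", "maccs_rdkit", "morgan", "topological"],
}


def find_category(catalogue, name):
    """Hypothetical helper: return the first category listing `name`, else None."""
    for category, names in catalogue.items():
        if name in names:
            return category
    return None


print(find_category(catalogue, "morgan"))                # rdkit
print(find_category(catalogue, "not_a_representation"))  # None
```

A check like this can be useful before loading a representation, since a name missing from the catalogue usually means the package that provides it is not installed.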

Loading representations

To load a molflux.features representation, do

from molflux.features import load_representation

representation = load_representation(name="morgan")

print(representation)
Representation(
	name: "morgan",
	tag: "morgan",
	signature: self.featurise(*columns: Union[collections.abc.Iterable[str], collections.abc.Iterable[Any], collections.abc.Iterable[bytes]], radius: int = 3, n_bits: int = 2048, invariants: list[int] | None = None, from_atoms: list[int] | None = None, use_chirality: bool = False, use_bond_types: bool = True, use_features: bool = False, bit_info: dict | None = None, include_redundant_environments: bool = False, **kwargs: Any) -> dict[str, list[list[int]]],
	description: """
Morgan fingerprint.

These fingerprints are similar to the well-known ECFP or FCFP fingerprints,
depending on which invariants are used. These are implemented based on the
original paper. The algorithm follows the description in the paper as
closely as possible with the exception of the chemical feature definitions used
for the “Feature Morgan” fingerprint - the RDKit implementation uses the
feature types Donor, Acceptor, Aromatic, Halogen, Basic, and Acidic with
definitions adapted from those in [1]_. It is possible to provide your
own atom types. The fingerprints are available as either explicit or sparse
count vectors or explicit bit vectors.

The algorithm used is described in the paper
Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. JCIM 50:742-54 (2010)
https://doi.org/10.1021/ci100050t

The original implementation was done using this paper:
D. Rogers, R.D. Brown, M. Hahn J. Biomol. Screen. 10:682-6 (2005)
and an unpublished technical report:
http://www.ics.uci.edu/~welling/teaching/ICS274Bspring06/David%20Rogers%20-%20ECFP%20Manuscript.doc

[1]_ https://doi.org/10.1002/(SICI)1097-0290(199824)61:1%3C47::AID-BIT9%3E3.0.CO;2-Z
""",
	usage: """Generates Morgan fingerprints for each input molecule.

        Args:
            samples: The molecules to be fingerprinted.
            radius: The number of iterations to grow the fingerprint.
            n_bits: The size of the fingerprint. Defaults to `2048`.
            invariants: The set of atom invariants to be used. Defaults to
                `None`, which corresponds to ECFP-type invariants.
            from_atoms: If provided, only the atoms in the vector will be used
                as centers in the fingerprint. Defaults to `None`.
            use_chirality: If set, additional information will be added to the
                fingerprint when chiral atoms are discovered, generating
                different fingerprints. Defaults to `False`.
            use_bond_types: If set, bond types will be included as part of the
                hash for calculating bits. Defaults to `True`.
            use_features: Defaults to `False`.
            bit_info: Defaults to `None`.
            include_redundant_environments: If not None, the check for redundant
                atom environments will not be done. Defaults to `False`.

        Returns:
            Morgan fingerprints, as lists of bits.

        Examples:
            >>> from molflux.features import load_representation
            >>> representation = load_representation('morgan')
            >>> samples = ['c1ccccc1']
            >>> representation.featurise(samples, n_bits=16)
            {'morgan': [[1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
        """
	state: {'radius': 3, 'n_bits': 2048, 'invariants': None, 'from_atoms': None, 'use_chirality': False, 'use_bond_types': True, 'use_features': False, 'bit_info': None, 'include_redundant_environments': False}
)

Printing the loaded representation gives you more information about it. Each representation has a name and a tag; the tag uniquely identifies it in case you would like to create multiple copies of the same representation with different configurations. The signature shows the optional featurisation arguments and their default values, and the description gives a short summary of the representation.

You can also load a representation from a config. A molflux.features config is a dictionary specifying the representation to be loaded. A config dictionary must have the following format

representation_dict = {
    'name': '<name of the representation>',
    'config': '<kwargs for instantiating representation>',
    'presets': '<kwarg presets for featurising>'
}

The name key specifies the name of the representation to load from the catalogue. The config key specifies the arguments that are needed for instantiating the representation and the presets key specifies some preset kwargs to apply upon featurisation (for example, the length of a fingerprint). If neither is specified, the representation will use default values.
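
The three keys described above can be checked before a config is handed to the loader. The following is a minimal sketch of such a pre-flight check. The helper check_representation_config and its error messages are assumptions made for illustration; molflux performs its own validation, which may be stricter or more lenient than this.

```python
ALLOWED_KEYS = {"name", "config", "presets"}


def check_representation_config(cfg):
    """Hypothetical pre-flight check for a representation config dict."""
    if not isinstance(cfg.get("name"), str):
        raise ValueError("a config needs a string 'name'")
    unknown = set(cfg) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    # 'config' and 'presets' are optional, so fall back to empty kwargs.
    return {
        "name": cfg["name"],
        "config": dict(cfg.get("config", {})),
        "presets": dict(cfg.get("presets", {})),
    }


checked = check_representation_config({"name": "morgan", "presets": {"n_bits": 16}})
print(checked)
# {'name': 'morgan', 'config': {}, 'presets': {'n_bits': 16}}
```

Filling in the optional keys with empty dictionaries makes the "use default values" case explicit, which can be convenient when configs are logged or compared.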

To load a representation from a config

from molflux.features import load_from_dict

config = {
    'name': 'morgan',
    'presets':
        {
            'n_bits': 16,
            'radius': 3,
        },
}

representation = load_from_dict(config)

print(representation.state)
{'radius': 3, 'n_bits': 16, 'invariants': None, 'from_atoms': None, 'use_chirality': False, 'use_bond_types': True, 'use_features': False, 'bit_info': None, 'include_redundant_environments': False}
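
The printed state shows that presets replace only the defaults they name and leave every other argument untouched. The sketch below reproduces that observable behaviour with a plain dictionary merge. It is an illustration only and makes no claim about how molflux applies presets internally; the defaults are a trimmed copy of the state printed for the default morgan representation earlier in this section.

```python
# Trimmed copy of the default state printed earlier for "morgan".
defaults = {"radius": 3, "n_bits": 2048, "use_chirality": False}

# Presets taken from the config above.
presets = {"n_bits": 16, "radius": 3}

# In a dict merge, later keys win: presets override matching defaults
# and all remaining defaults are kept as they are.
state = {**defaults, **presets}

print(state)
# {'radius': 3, 'n_bits': 16, 'use_chirality': False}
```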

For convenience, you can also load a group of representations all at once by specifying a list of configs.

from molflux.features import load_from_dicts

config = [
        {
            'name': 'character_count',
        },
        {
            'name': 'morgan',
            'presets':
                {
                    'n_bits': 16,
                    'radius': 4,
                },
        }
]

representations = load_from_dicts(config)

print(representations)
Representations(['character_count', 'morgan'])

Finally, you can load representations from a yaml file. You can use a single yaml file which includes configs for all the molflux tools, and molflux.features will know how to extract the relevant document it needs. To do so, you need to define a yaml file with the following example document:

---
version: v1
kind: representations
specs:
    - name: character_count
    - name: morgan
      presets:
        n_bits: 16
        radius: 4
...

It consists of a version (this is the version of the config format, for now just v1), kind of config (in this case representations), and specs. specs is where the configs are defined. The yaml file can include configs for other molflux modules as well. To load this yaml file, you can simply do

from molflux.features import load_from_yaml

representations = load_from_yaml(path_to_yaml_file)

print(representations)
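
To make the idea of extracting the relevant document more concrete, here is a rough sketch of selecting specs by kind from a file that holds several documents. It works on already-parsed Python dictionaries, so no YAML library is needed. The function specs_for_kind and the "other_tool" document are hypothetical and exist only for this example; the specs mirror the config list passed to load_from_dicts earlier. This is not the molflux implementation.

```python
# Plausible result of parsing a multi-document YAML file: one dict per
# document. The "other_tool" document is a made-up placeholder standing in
# for the config of another module.
documents = [
    {"version": "v1", "kind": "other_tool", "specs": [{"name": "placeholder"}]},
    {
        "version": "v1",
        "kind": "representations",
        "specs": [
            {"name": "character_count"},
            {"name": "morgan", "presets": {"n_bits": 16, "radius": 4}},
        ],
    },
]


def specs_for_kind(documents, kind):
    """Hypothetical helper: collect specs from every document of this kind."""
    specs = []
    for doc in documents:
        if doc.get("kind") == kind:
            specs.extend(doc.get("specs", []))
    return specs


selected = specs_for_kind(documents, "representations")
print([spec["name"] for spec in selected])
# ['character_count', 'morgan']
```

The selected specs have the same shape as the list of config dictionaries shown earlier, which is why a single file can serve several tools without them interfering with one another.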

Featurisation

After loading a representation (or group of representations), you can apply them to molecules to compute the features. The input to molflux.features depends on the representation, but in general all representations can accept SMILES (or binary serialised molecules from rdkit or openeye). Molecules can be passed individually or as a list.

from molflux.features import load_representation

representation = load_representation("character_count")

data = ["CCCC", "c1ccc(cc1)C(C#N)OC2C(C(C(C(O2)COC3C(C(C(C(O3)CO)O)O)O)O)O)O"]
featurised_data = representation.featurise(data)

print(featurised_data)
{'character_count': [4, 59]}

This will return a dictionary with the representation tag as the key and the computed features as the value. For a group of representations, you can follow the same procedure

from molflux.features import load_from_dicts

feature_config = [
        {
            'name': 'character_count',
        },
        {
            'name': 'morgan',
            'presets':
                {
                    'n_bits': 16,
                    'radius': 4,
                },
        }
    ]

representations = load_from_dicts(feature_config)

data = ["CCCC", "c1ccc(cc1)C(C#N)OC2C(C(C(C(O2)COC3C(C(C(C(O3)CO)O)O)O)O)O)O"]
featurised_data = representations.featurise(data)

print(featurised_data)
{'character_count': [4, 59], 'morgan': [[1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

This will return a dictionary containing all the features, with the representation tags as the keys and the computed features as the values.
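
The returned dictionary is column-oriented: each tag maps to a list with one entry per input molecule. If you would rather work molecule by molecule, the columns can be zipped into rows with standard Python. The sketch below uses a hard-coded copy of the output above so that it runs on its own; the variable names are chosen for illustration.

```python
# Output copied from the example above.
featurised_data = {
    "character_count": [4, 59],
    "morgan": [
        [1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    ],
}

tags = list(featurised_data)

# zip(*columns) walks all columns in lockstep, yielding one tuple of
# feature values per molecule; pairing it with the tags gives one dict per row.
rows = [dict(zip(tags, values)) for values in zip(*featurised_data.values())]

for row in rows:
    # Print the character count and the number of set bits per molecule.
    print(row["character_count"], sum(row["morgan"]))
# 4 4
# 59 16
```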

Note

The molflux package also builds on top of the above featurising methods to reproduce featurisation in production. See Productionising featurisation.

Integration with molflux.datasets

You can featurise your datasets from molflux.datasets using molflux.features representations. To learn more, see the molflux.datasets documentation.