PhysicsML features#

The physicsml package provides its own featuriser to extract relevant features for 3d models. The featurisers are built on top of the molflux.features (see docs for more info).

The featuriser is built on top of either molecules from rdkit and openeye. It takes in molecule objects or their binary serialisation and uses the built-in rdkit or openeye functions to extract a bunch of different atom and bond features. By default, it will also extract the coordinates of the molecules.

Config template#

First, let’s take a look at the featurisation config template. This is based on the molflux config (see here).

featurisation_metadata = {
    "version": 1,
    "config": [
        {
            "column": <molecule_column>,
            "representations": [
                {
                    "name": "physicsml_features",
                    "config": {
                        "atomic_number_mapping": <mapping dict>,
                        "atomic_energies": <energies dict>,
                        "atom_features": <list of atom features>,
                        "bond_features": <list of bond features>,
                        "backend": <the backend name>,
                    },
                    "as": "{feature_name}",
                }
            ]
        }
    ]
}

The featuriser is called physicsml_features. To featurise the QM9 dataset for example, you can do

import logging
logging.disable(logging.CRITICAL)

from molflux.datasets import load_dataset_from_store
from molflux.core import featurise_dataset

dataset = load_dataset_from_store("gdb9_trunc.parquet")

featurisation_metadata = {
    "version": 1,
    "config": [
        {
            "column": "mol_bytes",
            "representations": [
                {
                    "name": "physicsml_features",
                    "config": {
                        "atomic_number_mapping": {
                            1: 0,
                            6: 1,
                            7: 2,
                            8: 3,
                            9: 4,
                        },
                        "atomic_energies": {
                            1: -0.6019805629746086,
                            6: -38.07749583990695,
                            7: -54.75225433326539,
                            8: -75.22521603087064,
                            9: -99.85134426752529
                        },
                        "backend": "rdkit",
                    },
                    "as": "{feature_name}"
                }
            ]
        }
    ]
}

# featurise the mols
featurised_dataset = featurise_dataset(
    dataset,
    featurisation_metadata=featurisation_metadata,
    num_proc=4,
    batch_size=100,
)

print(featurised_dataset)
Dataset({
    features: ['mol_bytes', 'mol_id', 'A', 'B', 'C', 'mu', 'alpha', 'homo', 'lumo', 'gap', 'r2', 'zpve', 'u0', 'u298', 'h298', 'g298', 'cv', 'u0_atom', 'u298_atom', 'h298_atom', 'g298_atom', 'physicsml_atom_idxs', 'physicsml_atom_numbers', 'physicsml_coordinates', 'physicsml_total_atomic_energy'],
    num_rows: 1000
})

You can see the additional feature columns added to the dataset. We discuss each one (and some more) below.

atomic_number_mapping#

The only required kwarg in the config is the atomic_number_mapping. This defines how the atomic numbers are mapped into indices for one-hot encoding. For example, in the ani1x dataset, this would look like {1: 0, 6: 1, 7: 2, 8: 3}. The computed features will be a list of mapped atomic numbers for each molecule, the list of the indices of the atoms in the molecule, and the coordinates of the atoms. For example, CH4 would be

{
    'physicsml_atom_numbers': [1, 0, 0, 0, 0],
    'physicsml_atom_idxs': [0, 1, 2, 3, 4],
    'physicsml_coordinates': [
        [...],  # C
        [...],  # H
        [...],  # H
        [...],  # H
        [...],  # H
    ]
}

atomic_energies#

This is used for adding the self energies of molecules in energy prediction models and is vital for achieving good performance. It essentially transforms the problem from predicting total energies to predicting interaction energies. The config can be specified simply as {atomic_num: energy}. For example with the ani1x dataset, we have

{
    1: -0.60198056,
    6: -38.0774958,
    7: -54.7522543,
    8: -75.2252160,
}

The computed feature would be a float (the sum of all the self energies in a molecule). For example, CH4 would be

{
    'physicsml_total_atomic_energy': -40.48541804,
}

Important

Make sure the energies are in the same unit as the total energy!

Since charged atoms have different self energies than their neutral counterparts, the featuriser also supports providing energies of charged species. This must be specified in a nested dictionary as {atomic_num: {formal_charge: energy}}. If not specified, the featuriser will assume that the energies provided are for the neutral atoms. The atomic-number-charge-energy dict can be provided for only some atomic numbers while others can just be specified as floats.

Note

If only a single atomic energy (float) is specified and if the featuriser encounters a charged atom, it will raise a warning and use the only specified energy for that atom. If the value is a dictionary of {charges: atomic energies} and the featuriser encounters a charged atom which is not specified in the dictionary, then it will raise a RuntimeError and fail.

In general, there are two ways to obtain atomic self energies for featurisation: From numerical QM solutions or via dataset least squares regression fitting. For the second option, we provide a convenience function to do so. It uses the molecule column, the energy column, and the backend to perform a least squares regression using the number and types of atoms in the molecules against the energy

from physicsml.utils import get_atomic_energies

get_atomic_energies(
    dataset["mol_bytes"],
    dataset['u0'],
    backend="rdkit"
)
{1: -0.6045351663510158,
 6: -38.07364113277528,
 7: -54.7480044289392,
 8: -75.22278796676991,
 9: -99.86075983009577}

atom_features#

Apart from atomic numbers, we optionally provide a way to compute atomic features which will be concatenated to the node features later on in the dataloader. There are a few to choose from:

  • 'formal_charge': The formal charge of the atom

  • 'degree': The degree of the atom

  • 'stereo': The stereochemistry of the atom

  • 'is_in_ring': Whether the atom is in a ring

  • 'is_aromatic': Whether the atom is in an aromatic ring

  • 'is_chiral': Whether the atom is chiral

You can specify any or all of these features as a list.

Note

You do not have to worry about the order of the strings in the list, the featuriser will order them alphabetically for you.

The computed features would be a list of list (one for each atom in the molecule) with the concatenated one-hot encoding of the features. For CH4, it would be

{
    'physicsml_atom_features': [
        [...],  # C features
        [...],  # H features
        [...],  # H features
        [...],  # H features
        [...],  # H features
    ]
}

Warning

We do not guarantee that the molecules being featurised have correct, sanitised atom features. That is the user’s responsibility.

bond_features#

We also optionally provide a way to compute bond features which can be concatenated to edge features later on in the models. There are a few to choose from:

  • 'order': The order of the bond

  • 'stereo': The stereochemistry of the bond

  • 'is_in_ring': Whether the bond is in a ring

  • 'is_chiral': Whether the bond is chiral

  • 'is_rotatable': Whether the bond is rotatable

You can specify any or all of these features as a list.

Note

You do not have to worry about the order of the strings in the list, the featuriser will order them alphabetically for you.

The computed features would be sparsely encoded as a list of list (one for each bond in the molecule) with the concatenated one-hot encoding of the features and the list of bond indices. For CH4, it would be

{
    'physicsml_bond_idxs': [
        [0, 1],
        [0, 2],
        [0, 3],
        [0, 4],
    ],
    'physicsml_bond_features': [
        [...],  # C-H (0, 1) features
        [...],  # C-H (0, 2) features
        [...],  # C-H (0, 3) features
        [...],  # C-H (0, 4) features
    ]
}

Warning

We do not guarantee that the molecules being featurised have correct, sanitised bond features. That is the user’s responsibility.

Important

If bond_features are computed and specified in the model config, then edge features (notice, edge not bond) will be created for all the edges in the input graph. The edges which are bona fide bonds will use the features computed here. The edges which are not bonds will use a feature vector of zeros with the same dimension as the bond_features.

backend#

We provide multiple backends for computing and extracting features. Currently, we support rdkit (public, free) and openeye (proprietary, paid).

Warning

Make sure that you load the datasets and featurise them with the same backend!