medchem.functional

medchem.functional.alert_filter(mols, alerts, alerts_db=None, n_jobs=1, progress=False, return_idx=False)

Filter a dataset of molecules, based on common structural alerts and specific rules.

True is good

Returning True means the molecule does not match any of the structural alerts.

See Also

alert_filter is a convenient functional API for the medchem.structural.CommonAlertsFilters class.

Parameters:

Name Type Description Default
mols Sequence[Union[str, Mol]]

List of molecules to filter

required
alerts List[str]

List of alert collections to screen for. See CommonAlertsFilters.list_default_available_alerts()

required
alerts_db Optional[Union[PathLike, str]]

Path to the alert file name. The internal default file (alerts.csv) will be used if not provided

None
n_jobs Optional[int]

Number of workers to use

1
progress bool

Whether to show progress bar

False
return_idx bool

Whether to return the filtered index

False

Returns:

Name Type Description
filtered_mask ndarray

boolean array (or index array) where true means the molecule IS OK (not found in the alert catalog).
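
The two return forms can be related with a small sketch. It assumes, as the description above suggests, that the index form lists the positions of the molecules whose mask entry is true. `mask_to_idx` is a hypothetical helper written for illustration and is not part of medchem; plain Python lists stand in for the ndarray.

```python
from typing import List, Sequence


def mask_to_idx(mask: Sequence[bool]) -> List[int]:
    """Return the positions whose mask entry is True (hypothetical helper)."""
    return [i for i, ok in enumerate(mask) if ok]


# Hypothetical mask for four molecules: True means no structural alert matched.
example_mask = [True, False, True, True]
example_idx = mask_to_idx(example_mask)
print(example_idx)
```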

medchem.functional.nibr_filter(mols, n_jobs=None, max_severity=10, progress=False, return_idx=False)

Filter a set of molecules based on the Novartis Institutes for BioMedical Research screening deck curation process. See Schuffenhauer, A. et al., Evolution of Novartis' small molecule screening deck design, J. Med. Chem. (2020).

The severity argument corresponds to the accumulated severity for a compound across all patterns in the catalog.

True is good

Returning True means the molecule does not match any of the structural alerts.

See Also

nibr_filter is a convenient functional API for the medchem.structural.NIBRFilters class.

Parameters:

Name Type Description Default
mols Sequence[Union[str, Mol]]

list of input molecules

required
n_jobs Optional[int]

number of parallel jobs to run. Sequential by default

None
max_severity int

maximum severity allowed. Default is <10

10
progress bool

whether to show progress bar

False
return_idx bool

Whether to return the filtered index

False

Returns:

Name Type Description
filtered_mask ndarray

boolean array (or index array) where true means the molecule IS NOT REJECTED (i.e., not found in the alert catalog).
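
The idea of an accumulated severity can be sketched as follows. This is only an illustration under two assumptions: that the accumulated value is a plain sum over the matched patterns, and that the comparison is strict, as the "<10" in the parameter table suggests. The pattern names, the severity values, and the helper `passes_severity` are all hypothetical and do not come from the NIBR catalog or from medchem.

```python
from typing import Dict, Iterable


def passes_severity(
    matched: Iterable[str],
    severity: Dict[str, int],
    max_severity: int = 10,
) -> bool:
    """Sum the severity of every matched pattern and compare strictly.

    Assumption: a molecule passes when the total is below max_severity.
    """
    total = sum(severity[name] for name in matched)
    return total < max_severity


# Made-up severities for two made-up pattern names.
toy_severity = {"pattern_a": 4, "pattern_b": 6}

print(passes_severity(["pattern_a"], toy_severity))
print(passes_severity(["pattern_a", "pattern_b"], toy_severity))
```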

medchem.functional.catalog_filter(mols, catalogs, return_idx=False, n_jobs=-1, progress=False, progress_leave=False, scheduler='processes', batch_size=100)

Filter a list of compounds according to a catalog of structural alerts and patterns

True is good

Returning True means the molecule does not match any of the structural alerts.

Parameters:

Name Type Description Default
mols Sequence[Union[str, Mol]]

list of input molecules

required
catalogs List[Union[str, FilterCatalog]]

list of catalogs (name or FilterCatalog)

required
return_idx bool

whether to return index or a boolean mask

False
n_jobs Optional[int]

number of parallel jobs to run. The default, -1, follows the joblib convention of using all available CPUs

-1
progress bool

whether to show progress bar

False
progress_leave bool

whether to leave the progress bar after completion

False
scheduler str

joblib scheduler to use

'processes'
batch_size int

batch size for parallel processing. Note that batch_size should be increased if the number of used CPUs gets very large.

100

Returns:

Name Type Description
filtered_mask ndarray

boolean array (or index array) where true means the molecule is not found in the catalog.

medchem.functional.chemical_group_filter(mols, chemical_group, exact_match=False, return_idx=False, n_jobs=None, progress=False, progress_leave=False, scheduler='threads')

Filter a list of compounds according to a chemical group instance.

Warning

This function will return the list of molecules that DO NOT match the chemical group.

See Also

Consider exploring the medchem.groups.ChemicalGroup class.

Parameters:

Name Type Description Default
mols Sequence[Union[str, Mol]]

list of input molecules

required
chemical_group ChemicalGroup

a chemical group instance with the required functional groups to use.

required
exact_match bool

whether to use an exact match of the chemical group patterns (will switch to SMILES)

False
return_idx bool

whether to return index or a boolean mask

False
n_jobs Optional[int]

number of parallel jobs to run. Sequential by default

None
progress bool

whether to show progress bar

False
progress_leave bool

whether to leave the progress bar after completion

False
scheduler str

joblib scheduler to use

'threads'

Returns:

Name Type Description
filtered_mask ndarray

boolean array (or index array) where true means the molecule DOES NOT MATCH the groups.
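
Because true marks the molecules that do not match the group, applying the mask keeps the non-matching molecules. A minimal sketch of that selection step, using a hand-written mask in place of a real call (the mask values and the helper `keep_passing` are hypothetical):

```python
from itertools import compress
from typing import List, Sequence


def keep_passing(items: Sequence[str], mask: Sequence[bool]) -> List[str]:
    """Keep the items whose mask entry is True (hypothetical helper)."""
    return list(compress(items, mask))


smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
# Hypothetical mask: True means the molecule does NOT match the chemical group.
mask = [True, False, True]

kept = keep_passing(smiles, mask)
print(kept)
```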

medchem.functional.rules_filter(mols, rules, return_idx=False, n_jobs=None, progress=False, progress_leave=False, scheduler='processes')

Filter a list of compounds according to a predefined set of rules

True is good

Returning True means the molecule passes all the rules.

See Also

Consider exploring the medchem.rules.RuleFilters class.

Parameters:

Name Type Description Default
mols Sequence[Union[str, Mol]]

list of input molecules

required
rules Union[List[Any], RuleFilters]

list of rules to apply to the input molecules.

required
return_idx bool

whether to return index or a boolean mask

False
n_jobs Optional[int]

number of parallel jobs to run. Sequential by default

None
progress bool

whether to show progress bar

False
scheduler str

joblib scheduler to use

'processes'

Returns:

Name Type Description
filtered_mask ndarray

boolean array (or index array) where true means the molecule MATCHES the rule constraints.

medchem.functional.complexity_filter(mols, complexity_metric='bertz', threshold_stats_file='zinc_15_available', limit='99', return_idx=False, n_jobs=None, progress=False, progress_leave=False, scheduler='processes')

Filter a list of compounds according to a complexity metric

True is good

Returning True means the molecule passes the complexity filters.

See Also

Consider exploring the medchem.complexity.ComplexityFilter class.

Parameters:

Name Type Description Default
mols Sequence[Union[str, Mol]]

list of input molecules

required
complexity_metric str

complexity metric to use. Use ComplexityFilter.list_default_available_filters to list default filters. The following complexity metrics are supported by default:

  • bertz: bertz complexity index
  • sas: synthetic accessibility score (zinc_15_available only)
  • qed: qed score (zinc_15_available only)
  • clogp: clogp for how greasy a molecule is compared to other in the same mw range (zinc_15_available only)
  • whitlock: whitlock complexity index
  • barone: barone complexity index
  • smcm: synthetic and molecular complexity
  • twc: total walk count complexity (zinc_15_available only)

'bertz'
threshold_stats_file str

complexity threshold statistics file to use

'zinc_15_available'
limit str

complexity outlier percentile to use

'99'
return_idx bool

whether to return index or a boolean mask

False
n_jobs Optional[int]

number of parallel jobs to run. Sequential by default

None
progress bool

whether to show progress bar

False
progress_leave bool

whether to leave the progress bar after completion

False
scheduler str

joblib scheduler to use

'processes'

Returns:

Name Type Description
filtered_mask ndarray

boolean array (or index array) where true means the molecule passes the complexity filter.

medchem.functional.bredt_filter(mols, return_idx=False, n_jobs=None, progress=False, progress_leave=False, scheduler='threads', batch_size=100)

Filter a list of compounds according to Bredt's rules

True is good

Returning True means the molecule does not violate Bredt's rules.

Parameters:

Name Type Description Default
mols Sequence[Union[str, Mol]]

list of input molecules

required
return_idx bool

whether to return index or a boolean mask

False
n_jobs Optional[int]

number of parallel jobs to run. Sequential by default

None
progress bool

whether to show progress bar

False
progress_leave bool

whether to leave the progress bar after completion

False
scheduler str

joblib scheduler to use

'threads'
batch_size int

batch size for parallel processing. Note that batch_size should be increased if the number of used CPUs gets very large.

100

Returns:

Name Type Description
filtered_mask ndarray

boolean array (or index array) where true means the molecule does not violate Bredt's rules.

medchem.functional.molecular_graph_filter(mols, max_severity=5, return_idx=False, n_jobs=None, progress=False, progress_leave=False, scheduler='threads')

Filter a list of compounds according to unstable molecular graph patterns. This list was derived from observations of molecular graphs produced by deep generative models that are technically valid but not stable.

The disallowed graphs are:

  • K3,3 or K2,4 structures
  • Cone of P4 or K4 with 3-ear
  • Node in more than one ring of length 3 or 4

True is good

Returning True means the molecule does not violate the molecular graph instability rules.

Parameters:

Name Type Description Default
mols Sequence[Union[str, Mol]]

list of input molecules

required
max_severity Optional[int]

maximum acceptable severity (1-10). Default is <5

5
return_idx bool

whether to return index or a boolean mask

False
n_jobs Optional[int]

number of parallel jobs to run. Sequential by default

None
progress bool

whether to show progress bar

False
progress_leave bool

whether to leave the progress bar after completion

False
scheduler str

joblib scheduler to use

'threads'

Returns:

Name Type Description
filtered_mask ndarray

boolean array (or index array) where true means the molecule does not violate the molecular graph instability rules.

medchem.functional.lilly_demerit_filter(mols, max_demerits=160, return_idx=False, n_jobs=None, progress=False, progress_leave=False, scheduler='threads', batch_size=5000, **kwargs)

Run Eli Lilly's demerit filter on the given list of molecules

True is good

Returning True means the molecule does not violate the demerit rules.

See Also

Consider exploring the LillyDemeritsFilters class in medchem.structural.lilly_demerits

Parameters:

Name Type Description Default
mols Sequence[Union[str, Mol]]

list of input molecules, preferably as SMILES

required
max_demerits Optional[int]

Cutoff to reject molecules. Defaults to 160.

160
return_idx bool

whether to return a mask or a list of valid indexes

False
n_jobs Optional[int]

number of parallel jobs to run. Sequential by default

None
progress bool

whether to show progress bar

False
progress_leave bool

whether to leave the progress bar after completion

False
scheduler str

joblib scheduler to use

'threads'
batch_size int

batch size for parallel processing.

5000
kwargs Any

parameters specific to the demerits.score function

{}

Returns:

Name Type Description
filtered_mask ndarray

boolean array (or index array) where true means the molecule is ok.

medchem.functional.protecting_groups_filter(mols, return_idx=False, protecting_groups=['fmoc', 'tert-butoxymethyl', 'tert-butyl carbamate', 'tert-butyloxycarbonyl'], n_jobs=None, progress=False, progress_leave=False, scheduler='threads')

Filter a list of compounds according to match to known protecting groups.

Warning

This function will return the list of molecules that DO NOT have the protecting groups.

See Also

This is syntactic sugar for calling chemical_group_filter with the protecting groups subset.

Parameters:

  • mols: list of input molecules
  • protecting_groups: types of protecting groups to consider. If not provided, all are used (not advised)
  • return_idx: whether to return index or a boolean mask
  • n_jobs: number of parallel jobs to run. Sequential by default
  • progress: whether to show progress bar
  • progress_leave: whether to leave the progress bar after completion
  • scheduler: joblib scheduler to use

Returns:

  • filtered_mask: boolean array (or index array) where true means the molecule DOES NOT MATCH the groups.

medchem.functional.macrocycle_filter(mols, max_cycle_size=10, return_idx=False, n_jobs=None, progress=False, scheduler='processes')

Find valid molecules that do not infringe the strict maximum cycle size.

True is good

Returning True means the molecule does not have rings larger than max_cycle_size

Parameters:

Name Type Description Default
mols Sequence[Union[str, Mol]]

list of input molecules

required
max_cycle_size int

strict maximum macrocycle size

10
return_idx bool

whether to return index or a boolean mask

False
n_jobs Optional[int]

number of parallel jobs to run. Sequential by default

None
progress bool

whether to show progress bar

False
scheduler str

joblib scheduler to use

'processes'

Returns:

Name Type Description
filtered_mask ndarray

boolean array (or index array) where true means the molecule is ok.

medchem.functional.atom_list_filter(mols, unwanted_atom_list=None, wanted_atom_list=None, return_idx=False, n_jobs=None, progress=False, scheduler='processes')

Find molecules that contain no atom from the set of unwanted atomic symbols and whose atoms all belong to the set of wanted atomic symbols.

True is good

Returning True means the molecule only has desirable atom types

Parameters:

Name Type Description Default
mols Sequence[Union[str, Mol]]

list of input molecules

required
unwanted_atom_list Optional[Sequence]

list of undesirable atomic symbols

None
wanted_atom_list Optional[Sequence]

list of desirable atomic symbols

None
return_idx bool

whether to return index or a boolean mask

False
n_jobs Optional[int]

number of parallel jobs to run. Sequential by default

None
progress bool

whether to show progress bar

False
scheduler str

joblib scheduler to use

'processes'

Returns:

Name Type Description
filtered_mask ndarray

boolean array (or index array) where true means the molecule is ok.
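
The combined condition described above can be sketched with set operations on atomic symbols. This is an illustration only, not the library's implementation: it assumes that leaving a list as None disables the corresponding check, and it works on a pre-extracted list of symbols rather than on a molecule object. The helper `atoms_ok` is hypothetical.

```python
from typing import Iterable, Optional


def atoms_ok(
    symbols: Iterable[str],
    unwanted: Optional[Iterable[str]] = None,
    wanted: Optional[Iterable[str]] = None,
) -> bool:
    """Return True when no symbol is unwanted and every symbol is wanted.

    Assumption: a list left as None is simply not checked.
    """
    present = set(symbols)
    if unwanted is not None and present & set(unwanted):
        return False  # at least one unwanted atom is present
    if wanted is not None and not present <= set(wanted):
        return False  # at least one atom is outside the wanted set
    return True


print(atoms_ok(["C", "C", "O"], unwanted=["Cl"], wanted=["C", "N", "O"]))
```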

medchem.functional.ring_infraction_filter(mols, hetcycle_min_size=4, return_idx=False, n_jobs=None, progress=False, scheduler='processes')

Find molecules that pass the ring infraction filter. This filter checks for rings that are too small to contain a heteroatom.

True is good

Returning True means the molecule does not infringe the ring infraction filter.

Parameters:

Name Type Description Default
mols Sequence[Union[str, Mol]]

list of input molecules

required
hetcycle_min_size int

Minimum ring size before more than 1 heteroatom or any non-single bond is allowed. This is a strict threshold (>)

4
return_idx bool

whether to return index or a boolean mask

False
n_jobs Optional[int]

number of parallel jobs to run. Sequential by default

None
progress bool

whether to show progress bar

False
scheduler str

joblib scheduler to use

'processes'

Returns:

Name Type Description
filtered_mask ndarray

boolean array (or index array) where true means the molecule is ok.

medchem.functional.num_atom_filter(mols, min_atoms=None, max_atoms=None, return_idx=False, n_jobs=None, progress=False, scheduler='processes')

Find molecules that match the number of atom range constraints

True is good

Returning True means the molecule does not infringe the number of atom filter.

Parameters:

Name Type Description Default
mols Sequence[Union[str, Mol]]

list of input molecules

required
min_atoms Optional[int]

strict minimum number of atoms (atoms > min_atoms)

None
max_atoms Optional[int]

strict maximum number of atoms (atoms < max_atoms)

None
return_idx bool

whether to return index or a boolean mask

False
n_jobs Optional[int]

number of parallel jobs to run. Sequential by default

None
progress bool

whether to show progress bar

False
scheduler str

joblib scheduler to use

'processes'

Returns:

Name Type Description
filtered_mask ndarray

boolean array (or index array) where true means the molecule is ok.
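
Both bounds are documented as strict, so a molecule with exactly min_atoms or exactly max_atoms atoms is rejected. A minimal sketch of that range check follows. It assumes that a bound left as None is not enforced, and it takes a precomputed atom count because the page does not say which atoms are counted. The helper `in_atom_range` is hypothetical.

```python
from typing import Optional


def in_atom_range(
    n_atoms: int,
    min_atoms: Optional[int] = None,
    max_atoms: Optional[int] = None,
) -> bool:
    """Strict range check: min_atoms < n_atoms < max_atoms.

    Assumption: a bound set to None is ignored.
    """
    if min_atoms is not None and not n_atoms > min_atoms:
        return False
    if max_atoms is not None and not n_atoms < max_atoms:
        return False
    return True


# Exactly on a bound fails; strictly inside the range passes.
print(in_atom_range(10, min_atoms=10, max_atoms=50))
print(in_atom_range(11, min_atoms=10, max_atoms=50))
```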

medchem.functional.num_stereo_center_filter(mols, max_stereo_centers=4, max_undefined_stereo_centers=2, return_idx=False, n_jobs=None, progress=False, scheduler='processes')

Find molecules that match the number of stereo center constraints. In general, molecules with too many undefined stereo centers are not desirable. This filter is useful for generated molecules.

True is good

Returning True means the molecule does not have issues with stereo centers.

Parameters:

Name Type Description Default
mols Sequence[Union[str, Mol]]

list of input molecules

required
max_stereo_centers int

strict maximum number of stereo centers (<). Default is 4

4
max_undefined_stereo_centers int

strict maximum number of undefined stereo centers (<). Default is 2

2
return_idx bool

whether to return index or a boolean mask

False
n_jobs Optional[int]

number of parallel jobs to run. Sequential by default

None
progress bool

whether to show progress bar

False
scheduler str

joblib scheduler to use

'processes'

Returns:

Name Type Description
filtered_mask ndarray

boolean array (or index array) where true means the molecule is ok.

medchem.functional.halogenicity_filter(mols, thresh_F=6, thresh_Br=3, thresh_Cl=3, return_idx=False, n_jobs=None, progress=False, scheduler='processes')

Find molecules that do not exceed the halogen count thresholds. This filter is useful for removing halogen biases in generated or enumerated chemical space during goal-directed optimization. The default thresholds are:

  • 6 for fluorine
  • 3 for bromine
  • 3 for chlorine

True is good

Returning True means the molecule does not have too many halogen atoms.

Parameters:

Name Type Description Default
mols Sequence[Union[str, Mol]]

list of input molecules

required
thresh_F int

maximum number of fluorine

6
thresh_Br int

maximum number of bromine

3
thresh_Cl int

maximum number of chlorine

3
return_idx bool

whether to return index or a boolean mask

False
n_jobs Optional[int]

number of parallel jobs to run. Sequential by default

None
progress bool

whether to show progress bar

False
scheduler str

joblib scheduler to use

'processes'

Returns:

Name Type Description
filtered_mask ndarray

boolean array (or index array) where true means the molecule is ok.
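
The threshold logic can be sketched by counting halogen symbols. This is a rough illustration, not the library's code: it reads "do not exceed" as an inclusive comparison (a count equal to the threshold passes), which is an assumption, and it operates on a pre-extracted list of atomic symbols. The helper `halogens_ok` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def halogens_ok(
    symbols: Iterable[str],
    thresh_F: int = 6,
    thresh_Br: int = 3,
    thresh_Cl: int = 3,
) -> bool:
    """Return True when no halogen count exceeds its threshold.

    Assumption: the comparison is inclusive (count <= threshold).
    """
    counts = Counter(symbols)  # missing symbols count as 0
    return (
        counts["F"] <= thresh_F
        and counts["Br"] <= thresh_Br
        and counts["Cl"] <= thresh_Cl
    )


# Symbols of a hypothetical molecule with two CF3 groups: six fluorines.
print(halogens_ok(["C", "C", "C"] + ["F"] * 6))
```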

medchem.functional.symmetry_filter(mols, symmetry_threshold=0.8, return_idx=False, n_jobs=None, progress=False, scheduler='processes')

Find molecules that are not symmetrical, given a symmetry threshold. This filter was designed to offset the symmetry issue in molecular design, where some models tend to generate highly symmetrical molecules due to substructure bias.

True is good

Returning True means the molecule is not too symmetrical

Parameters:

Name Type Description Default
mols Sequence[Union[str, Mol]]

list of input molecules

required
symmetry_threshold float

threshold to consider a molecule highly symmetrical

0.8
return_idx bool

whether to return index or a boolean mask

False
n_jobs Optional[int]

number of parallel jobs to run. Sequential by default

None
progress bool

whether to show progress bar

False
scheduler str

joblib scheduler to use

'processes'

Returns:

Name Type Description
filtered_mask ndarray

boolean array (or index array) where true means the molecule is ok.