

5    Data Analysis

The data analysis is designed for interpreting high-throughput screening (HTS) data and ADME-Tox data by extracting features that distinguish molecules which are inactive or show undesirable activity from active molecules. These features may be properties calculated by THINK, functional groups used in the functional group keys, or pharmacophores. The implementation uses a form of discriminant analysis appropriate for the types and distribution of the data (which is not usually a statistical normal distribution). This technique is appropriate for up to 10,000 molecules, except when the pharmacophore option is used, in which case memory limitations usually restrict the analysis to a maximum of about 1,000 molecules. Larger numbers of pharmacophores can be imported and analysed in relational databases.

The pharmacophores exhibited by each molecule must be calculated in advance of the data analysis and stored in a file named molecule.phm, where molecule is the name of the molecule. It is recommended that the SAVE OPTIONS=FILTER command is used when generating the pharmacophore files, to prevent files being created for very flexible molecules. It is also recommended that all salts are removed from the structures before the pharmacophores are calculated, since salts may contain centres (e.g. oxygen atoms) which would lead to the creation of additional pharmacophores partially or wholly outside the molecule that is to be used in the data analysis. This is most conveniently executed as part of a script which loops over all molecules:

OPEN FILE='P1'
JMOL = 0
WHILE JMOL < #NMOLES
  JMOL = JMOL + 1
  MOL = $MOLECULES(JMOL)
  Message = "Processing: " . MOL
  WRITE CONSOLE Message
  FILEOUT = "-" . MOL . ".phm"
  SAVE FILE='FILEOUT' MOLECULE='MOL' OPTIONS=FILTER
END

The results of the analysis are stored in a learn file (with a “.lrn” extension), which contains the field name or undesirable substructure, the F-test value, the acceptable range (minimum and maximum) and the confidence limit for each feature. The header line in the following example is NOT included in the file.

Functional group or field name   F-test   Min   Max      Confidence   Comment
ZC(=O)OH                         21.102   0.5   1.0E35   99.51        (203/962:0/32)
OCCN                             12.162   0.5   1.0E35   99.153       (117/759:0/32)
[CAR]OH                          10.499   0.5   1.0E35   99.02        (101/642:0/32)
Branches                         6.42     7     1.0E35   98.974       (193/541:1/32)

Learn files created by the data analysis include a comment (Is/It:As/At) where Is and As are the number of inactive and active molecules respectively selected by the feature, and It and At are the total number of inactive and active molecules.
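As an illustration of the record layout described above, the following Python sketch reads one learn-file line into named fields. It is not part of THINK: the function name parse_learn_line and the dictionary keys are hypothetical, and it assumes that fields are separated by whitespace, that feature names contain no spaces, and that the comment may follow the confidence value with or without a space.

```python
import re

# Assumed layout: name, F-test, min, max, confidence, then "(Is/It:As/At)".
# The comment is matched separately because it may be fused to the confidence.
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<f>\S+)\s+(?P<lo>\S+)\s+(?P<hi>\S+)\s+"
    r"(?P<conf>[0-9.]+)\s*"
    r"\((?P<i_s>\d+)/(?P<i_t>\d+):(?P<a_s>\d+)/(?P<a_t>\d+)\)\s*$"
)


def parse_learn_line(line):
    """Split one learn-file record into a dictionary (hypothetical helper)."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError("unrecognised learn-file line: %r" % line)
    return {
        "name": m.group("name"),
        "f_test": float(m.group("f")),
        "min": float(m.group("lo")),
        "max": float(m.group("hi")),
        "confidence": float(m.group("conf")),
        "inactive_selected": int(m.group("i_s")),  # Is
        "inactive_total": int(m.group("i_t")),     # It
        "active_selected": int(m.group("a_s")),    # As
        "active_total": int(m.group("a_t")),       # At
    }


# The four records from the example in this chapter.
SAMPLE = [
    "ZC(=O)OH 21.102 0.5 1.0E35 99.51 (203/962:0/32)",
    "OCCN 12.162 0.5 1.0E35 99.153(117/759:0/32)",
    "[CAR]OH 10.499 0.5 1.0E35 99.02 (101/642:0/32)",
    "Branches 6.42 7 1.0E35 98.974(193/541:1/32)",
]
records = [parse_learn_line(line) for line in SAMPLE]
```

In the example records, the inactive total on each line equals the previous total minus the inactives selected by the previous feature (962 - 203 = 759, 759 - 117 = 642, 642 - 101 = 541), which suggests that molecules selected by one feature are set aside before the next feature is evaluated.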

Learn files can be used as rejection criteria in de novo molecule generation or for highlighting rejected molecules in the spreadsheet. They can also be used to filter the molecules saved to a file or accepted during a search. In general, they have the same name as the activity field with which they are associated. THINK v1.25 does not support the use of learn files when selecting subsets of molecules based on property diversity.

5.1   Key Features

The algorithms used have been optimised to extract features that may be associated with undesirable activity. Consequently, it is important to specify whether desirable activity increases or decreases as the values stored in the activity field increase. In addition, because HTS data is often unreliable, THINK may be instructed to take only the significantly active and inactive molecules, rather than simply dividing the compounds into actives and inactives. This is done by specifying a significance value in the range 0-0.5, which selects only the most and least active molecules for the analysis and discards those in a mid-range “grey area”. A significance value of 0.5 would simply divide the activity range at the mid-point into active and inactive molecules, whereas 0.3 would take the top and bottom 30% of the compounds.
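The effect of the significance value can be sketched in Python as follows. This is an illustration only, not THINK's implementation: the function split_by_significance and its arguments are hypothetical, and it assumes that the significance value is applied as a fraction of the ranked compounds (the "top and bottom 30%" reading) rather than as a fraction of the activity range.

```python
def split_by_significance(activities, significance, high_is_active=True):
    """Return (actives, inactives), discarding the mid-range grey area.

    activities     -- dict mapping molecule name to activity value
    significance   -- fraction (0 < s <= 0.5) taken from each end of the ranking
    high_is_active -- True if desirable activity increases with the value
    """
    if not 0 < significance <= 0.5:
        raise ValueError("significance must be in the range 0-0.5")
    # Rank molecules from most active to least active.
    ranked = sorted(activities, key=activities.get, reverse=high_is_active)
    n = int(len(ranked) * significance)
    actives = ranked[:n]
    inactives = ranked[len(ranked) - n:]
    return actives, inactives


# Ten molecules with activities 1..10; 0.3 keeps three from each end.
example = {"m%d" % i: i for i in range(1, 11)}
actives, inactives = split_by_significance(example, 0.3)
```

With a significance value of 0.3, the four molecules in the middle of the ranking (m4 to m7) are left out of the analysis; with 0.5, every molecule is assigned to one of the two classes.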

5.2   F-test for Properties

For properties, the active molecules may be divided from the inactive molecules by a single value (as shown in Case I), or may be bracketed by a pair of values (Case II).

THINK assumes a normal data distribution to estimate these cutoff values for all the properties. The F-test for each feature (which is classically the variance explained by the feature divided by the unexplained variance) is calculated based on the actual discrimination achieved, according to the following formula:

where Is and As are the number of inactive and active molecules respectively selected by the cutoff value or range, and It and At are the total number of inactive and active molecules.
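The four counts defined above can be tallied as in the following Python sketch. It does not attempt the F-test itself. The function selection_counts is hypothetical, and treating both ends of the range as inclusive is an assumption; a single cutoff (Case I) is expressed by leaving one bound unset, and a bracketing pair (Case II) by giving both.

```python
def selection_counts(values, is_active, low=None, high=None):
    """Return (Is, It, As, At) for the molecules whose value lies in [low, high].

    values    -- property value for each molecule
    is_active -- True/False flag for each molecule, in the same order
    low, high -- bounds of the selected range; None means unbounded
    """
    i_s = i_t = a_s = a_t = 0
    for value, active in zip(values, is_active):
        selected = (low is None or value >= low) and (high is None or value <= high)
        if active:
            a_t += 1
            a_s += int(selected)
        else:
            i_t += 1
            i_s += int(selected)
    return i_s, i_t, a_s, a_t


# Six molecules: a single cutoff at 7 or more selects two inactives and one active.
values = [2, 9, 7, 3, 8, 5]
flags = [True, False, False, True, True, False]
counts = selection_counts(values, flags, low=7)
```

The result has the same form as the (Is/It:As/At) comment written to the learn file.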

5.3   F-test for Functional Groups

For functional group-based discrimination, THINK determines whether a functional group occurs more frequently in the actives or the inactives, and hence whether it is desirable or undesirable. The actual discrimination achieved by the functional group is determined in the same way as for properties.

5.4   F-test for Pharmacophores

The number of times each pharmacophore is exhibited is used in the data analysis in the same way as a numerical property such as the number of rings. The number of pharmacophores being processed is often very large, and consequently the analysis is significantly slower than one which uses just properties and/or functional group keys.

5.5   Prediction

Discriminating features are iteratively extracted in order of decreasing F-test result. The confidence and risk for the feature are computed using the following formulae, which are based on a worst-case application of Bayes' theorem:

where Is and As are the number of inactive and active molecules respectively selected by the feature. Note that this analysis approach is predicting inactivity, not activity, hence a large statistical confidence value is associated with inactivity and a low risk value with an undesirable (low activity) molecule.

