THINK includes functionality to search molecules stored in a SMILES or MDL SD file for a 2D substructure, for a 3D pharmacophore, or by similarity (based on functional groups). The results of the search are normally stored in a SMILES file for 2D searches and in an SDF file for 3D searches. It can sometimes be helpful to consider both the exact match search and the R-group search to be special cases of substructure searching. The site search is a variant of 3D searching. Using the display features of THINK, the hits can be viewed with the query superimposed for 2D searches, or superimposed on the pharmacophore for 3D searches. When queries are read from files, the option to omit implicit hydrogens can be essential.
The substructure searching allows a query consisting of a molecular fragment to be found in a subset of the atoms within the molecule being searched. In the exact match search, all the atoms in the molecule being searched must be matched; this means that the exact match search can be used to check for duplicates. More advanced substructure searching options use atom wildcards consisting of reserved atom types from the following list:
| Atom type | Query |
| --- | --- |
| A | Any atom |
| A- | Any atom with formal charge of -1 |
| A2- | Any atom with formal charge of -2 |
| A+ | Any atom with formal charge of +1 |
| A2+ | Any atom with formal charge of +2 |
| A3+ | Any atom with formal charge of +3 |
| M | Metal atom |
| R | Carbon or hydrogen |
| X | Oxygen or nitrogen |
| Q | Any atom except hydrogen or carbon |
| Z | Any atom except hydrogen |
| HAL | Any halogen |
| CAK | Chain carbon |
| CRG | Ring carbon |
| CAL | Non-aromatic carbon |
| CAR | Aromatic carbon |
| NAK | Chain nitrogen |
| NRG | Ring nitrogen |
| NAL | Non-aromatic nitrogen |
| NAR | Aromatic nitrogen |
| OAK | Chain oxygen |
| ORG | Ring oxygen |
| OAL | Non-aromatic oxygen |
| OAR | Aromatic oxygen |
| SAK | Chain sulphur |
| SRG | Ring sulphur |
| SAL | Non-aromatic sulphur |
| SAR | Aromatic sulphur |
| GP01 | Any element from periodic table group 1 |
| GP02 | Any element from periodic table group 2 |
| GP03 | Any element from periodic table group 3 |
| GP04 | Any element from periodic table group 4 |
| GP05 | Any element from periodic table group 5 |
| GP06 | Any element from periodic table group 6 |
| GP07 | Any element from periodic table group 7 |
| GP08 | Any element from periodic table group 8 |
| GP09 | Any first row transition element |
| GP10 | Any second row transition element |
| GP11 | Any third row transition element |
| GP12 | Any lanthanide |
| GP13 | Any actinide |
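The element-based wildcards in the table can be illustrated with a small lookup. This is only a sketch and not THINK's implementation: the `matches` helper and the `WILDCARDS` table are hypothetical names, the halogen set is assumed to be F, Cl, Br, I and At, and the charge, ring, aromaticity and periodic-group wildcards are omitted because they need molecular context beyond the element symbol.

```python
# Hypothetical sketch: element-only wildcards from the table above.
# Charge (A-, A2+ ...), ring/chain (CAK, CRG ...) and group (GP01 ...)
# wildcards are deliberately left out.
HALOGENS = {"F", "Cl", "Br", "I", "At"}  # assumed halogen set

WILDCARDS = {
    "A": lambda el: True,                   # any atom
    "R": lambda el: el in {"C", "H"},       # carbon or hydrogen
    "X": lambda el: el in {"O", "N"},       # oxygen or nitrogen
    "Q": lambda el: el not in {"H", "C"},   # any atom except hydrogen or carbon
    "Z": lambda el: el != "H",              # any atom except hydrogen
    "HAL": lambda el: el in HALOGENS,       # any halogen
}


def matches(query_type: str, element: str) -> bool:
    """Return True if `element` satisfies the query atom type."""
    test = WILDCARDS.get(query_type)
    if test is None:
        # Not a wildcard known to this sketch: require the same element.
        return query_type == element
    return test(element)
```

For example, `matches("Z", "C")` is true while `matches("Z", "H")` is false, which is the behaviour the R-group search relies on to avoid single-hydrogen R-groups.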
The similarity search uses a functional group fingerprint where each bit is set when the corresponding functional group is present in the molecule. The list of functional groups is defined using SMILES and stored in the file “key.smi” (stored in the THINK_EXEC directory). A Tanimoto similarity index is derived from the functional group fingerprints of the query and the molecule being searched according to the following formula:
$$T = \frac{N_C}{N_Q + N_M - N_C}$$
where $N_C$ is the number of common bits set in the query and the molecule, $N_Q$ is the number of bits set in the query and $N_M$ is the number of bits set in the molecule.
Molecules are accepted as hits when the similarity index exceeds the specified cutoff.
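As an illustration, the similarity test can be sketched as follows, using the standard Tanimoto form NC / (NQ + NM - NC). The fingerprints are represented here as sets of bit positions, and the function names are hypothetical; how THINK stores its fingerprints internally is not described in this chapter.

```python
def tanimoto(query_bits: set, mol_bits: set) -> float:
    """Tanimoto index from two fingerprints given as sets of set-bit positions."""
    nc = len(query_bits & mol_bits)   # bits set in both
    nq = len(query_bits)              # bits set in the query
    nm = len(mol_bits)                # bits set in the molecule
    denominator = nq + nm - nc
    # Two empty fingerprints share nothing; treat the index as zero (assumption).
    return nc / denominator if denominator else 0.0


def is_hit(query_bits: set, mol_bits: set, cutoff: float) -> bool:
    """A molecule is a hit when the index exceeds the cutoff."""
    return tanimoto(query_bits, mol_bits) > cutoff
```

With a query setting bits {1, 2, 3} and a molecule setting bits {2, 3, 4}, NC = 2 and the index is 2 / (3 + 3 - 2) = 0.5.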
The R-group search is a prerequisite step for generating files containing R-groups prior to enumerating libraries, unless the user wishes to define all the R-groups manually. The implementation in THINK v1.25 is appropriate for generating R-groups for core-based and coreless libraries but cannot generate peptide building blocks with two generic connections (this would require the SMILES to be written in an atom order matching that of the query). When enumerating a coreless library, THINK uses one group in place of a core. As a result, this section only considers core-based libraries but is equally applicable to coreless libraries. For most libraries, the choice of the core group has implications for the allowed R-groups. Occasionally, it is necessary to perform a substructure search as a preliminary step to identify certain R-groups.
The most convenient query is usually a reaction with reactants including substitution positions and product(s) which form the core group for enumeration.
[1]C(=O)Cl+[2]N(H)H>[1]C(=O)N(H)[2]
Each reactant is an R-group query consisting of a substructure with the desired connection points. During the search, atoms in the reagent that are matched to the substructure are deleted, and atoms matched to the connection points are replaced with generic or explicit connection atoms (see section 2.4). THINK v1.25 does not support graphical entry of the query and consequently it is usually created as a SMILES string, with the connection points indicated by [0] for a generic connection atom, [n] for an explicit connection atom, or [wildcard] for a connection restricted to a wildcard atom type.
For instance, a query to search for reagents of the form [R1](Cl)C=O, where [R1] is the required R-group, would have the form [0](Cl)C=O if the R-group is to have a generic connection atom (so the R-group can be used in any position) or [n](Cl)C=O for an explicit connection atom (which restricts the R-group to a single position).
Each connection point in the query is treated as a wildcard atom type during the search. The Z wildcard will match any atom type except hydrogen (this prevents the R-group search creating R-groups consisting of single hydrogen atoms). Use of the [wildcard] connection allows a more powerful search to be performed by restricting the connection point to a subset of atom types.
If the query contains multiple connection points, it is recommended that they are specified as explicit connections using [n]. Where this is not possible (for instance if the [wildcard] form is required), the connection numbers may be defined by setting the group numbers of the connection atoms. Once the query has been read into THINK, the group number of an atom can be changed using the MODIFY CHANGE=atom GROUP=n command. If a search is performed using a query that contains two or more generic connections, THINK will detect the situation and use explicit connection atoms in the resulting R-groups. However, the order in which these connections are allocated is undefined, so the search may not give the desired results. It is therefore recommended that explicit connections are always used in these circumstances.
The search eliminates duplicate R-groups by omitting the second and subsequent copies of the R-group from the results file. In THINK v1.25 molecules with more than one occurrence of the substructure are skipped.
The 3D search has been designed principally for pharmacophore searching. In THINK the query consists of the substructure (if any) of the specified molecule plus the inter-centre distances.
The 3D search starts by performing a 2D search to find the matching 2D substructure within the molecule being searched. Up to 1000 separate occurrences of the substructure will be processed; each is known as a substructure solution. A path test is used to eliminate substructure solutions that cannot possibly meet the query. This considers the connected path or chain between atoms in the molecule which correspond to atoms in the query that have a distance constraint. The maximum possible distance is calculated for this chain (i.e. the distance if the chain were fully extended). The substructure solution is discarded when this distance is less than the minimum required by the distance constraint. For 1,2, 1,3 and 1,4 connection paths, the minimum possible distance for the chain is also estimated and used to eliminate substructure solutions.
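The upper-bound part of the path test can be sketched as below. This is a simplified illustration, not THINK's algorithm: it assumes every bond in the path has the same length (1.54) and the same bond angle (109.5 degrees) and that the fully extended chain is a planar all-trans zigzag. The function names and default values are hypothetical.

```python
import math


def max_extended_distance(n_bonds: int,
                          bond_length: float = 1.54,
                          bond_angle_deg: float = 109.5) -> float:
    """End-to-end distance of a planar all-trans chain of n_bonds equal bonds."""
    half_angle = math.radians(bond_angle_deg) / 2.0
    # Each bond advances the chain by b*sin(angle/2) along its axis.
    along = n_bonds * bond_length * math.sin(half_angle)
    # With an odd number of bonds the two ends lie on opposite sides of the axis.
    across = bond_length * math.cos(half_angle) if n_bonds % 2 else 0.0
    return math.hypot(along, across)


def passes_path_test(n_bonds: int, min_required: float,
                     tolerance: float = 0.0) -> bool:
    """Keep a substructure solution only if the extended chain can reach far enough."""
    return max_extended_distance(n_bonds) + tolerance >= min_required
```

A path of three bonds cannot span a 10.0 distance constraint under these assumptions, so such a solution would be discarded before any conformational analysis is attempted.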
Every distance check in a 3D search includes a small tolerance. The tolerance used for each check is the larger of the two following values: the CUSTOMISE TOLERANCE setting, or the sum of the radii of the two query atoms defining the distance (when this is enabled using the CUSTOMISE RADIUS=TOLERANCE command).
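The choice of tolerance described above amounts to taking a maximum. A minimal sketch follows; the function and parameter names are hypothetical and simply stand in for the CUSTOMISE TOLERANCE value and the CUSTOMISE RADIUS=TOLERANCE switch.

```python
def distance_tolerance(customise_tolerance: float,
                       radius_a: float,
                       radius_b: float,
                       radius_tolerance_enabled: bool) -> float:
    """Tolerance for one distance check between two query atoms."""
    if radius_tolerance_enabled:
        # Use whichever is larger: the global setting or the sum of the radii.
        return max(customise_tolerance, radius_a + radius_b)
    return customise_tolerance


def within_tolerance(observed: float, target: float, tolerance: float) -> bool:
    """True when the observed distance is acceptably close to the query distance."""
    return abs(observed - target) <= tolerance
```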
A conformational analysis is then performed using the current settings (see Chapter 4). A conformer which meets the geometry requirements is fitted to the query (using the usual Ferro-Hermans fitting algorithm, preceded by an approximate match based on best-plane projections) and saved. The serial numbers of the atoms in the conformer are modified to match the query, or set to >512 for those atoms which do not correspond to any in the query.
The conformational analysis stage of a 3D search continues until all conformers have been scanned, checking all possible substructure solutions for each conformer. A new hit is created for each acceptable substructure solution for each conformer. If the results are being written to an SD file, all hits are saved in the file. If the results file uses the SMILES format, the best hit (the one that most closely matches the query) will be saved in the file. The maximum number of hits created for each conformer is controlled by the common variable ICFHMX, and the maximum number of hits for each molecule by the common variable IMLHMX. Both variables are normally set to zero, meaning that no limit is imposed.
Prior to performing the conformational analysis, the bonds being rotated are ordered so that bonds which do not change inter-atomic distances that match those in the query are incremented first, and bonds for ring conformational analysis are changed last. This accelerates the conformational analysis because bond rotations that do not change critical inter-atomic distances may be skipped until a conformer is found which matches the query. Ring conformational changes are the most computationally intensive, and by placing these at the end of the list they are repeated the minimum number of times.
When the search results file is viewed in THINK with the query, the conformers are not automatically refitted to the query; consequently, rotating or translating the query will leave the hits misaligned with it in subsequent viewing.
The site search is a useful technique for virtual screening in order to select molecules that have a high probability of interacting with a protein. For a given protein receptor site, all the ways in which molecules might interact are not usually obvious. The query consists of a set of pharmacophore centres which complement those in the receptor and represent the ideal interactions with the receptor (see chapter 9 for information on creating a site query). These centres may consist of any mixture of the 10 standard (HDON, HACC, POS, NEG, ACID, BASE, AROM, LIP, MET, LPD) and 2 user-defined (USR1, USR4) centre types, and form the query for a site search. A molecule is required to fit any pharmacophore defined by 2, 3 or 4 of these centres (depending on the CUSTOMISE settings), whereas in a normal 3D search the molecule would be required to fit all the pharmacophores defined by the centres. In addition, the pharmacophore must exhibit the minimum pharmacophore score (see section 9.5.2) set by CUSTOMISE WARP=value and each centre must originate from different residues if CUSTOMISE RESIDUE=DIFFERENT has been issued.
In normal usage it would also be appropriate to have the active site residues present in a separate molecule in order to score the solutions (and eliminate those containing atoms which collide with the receptor). The receptor site residues are normally read from a PDB file and the complementary centres may either be in a PDB or SDF file. “PLUS” centres are automatically renamed to “POS” centres as they are read into THINK. The serial numbers of the centres usually map to receptor site atoms with which they interact (to aid visualisation), and this can require use of an extension to SD files in order to support serial numbers exceeding 999 (see section 2.2.2).
In an analogous manner to 2D and 3D searches (see sections 8.1 and 8.4), for each molecule the program first constructs a list of atoms that may match each centre. Permutations of 2, 3 or 4 centres are then constructed and the connected paths between these atoms evaluated to determine whether the observed separation (within the allowed tolerances) might be achievable in a conformational search. As in 3D searches, a tolerance is applied to each distance check. This generates the list of substructure solutions.
In addition, the area of the 3-centre pharmacophore or volume of the 4-centre pharmacophore is compared with the number of non-hydrogen atoms in the molecule. Where the volume (or area) is less than the product of the number of non-hydrogen atoms and the CUSTOMISE VOLUME (or CUSTOMISE AREA) setting, the permutation is ignored. The conformational analysis is performed only if there are some permutations that might meet the distance and volume (or area) criteria.
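The area and volume filter can be sketched as follows. This is an illustrative sketch under stated assumptions: the centres are given as (x, y, z) tuples, the area is that of the triangle through three centres, the volume is that of the tetrahedron through four centres, and `area_setting` and `volume_setting` are hypothetical stand-ins for the CUSTOMISE AREA and CUSTOMISE VOLUME values.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def triangle_area(p, q, r):
    """Area of the triangle through three centres (half the cross-product norm)."""
    c = _cross(_sub(q, p), _sub(r, p))
    return 0.5 * math.sqrt(_dot(c, c))


def tetrahedron_volume(p, q, r, s):
    """Volume of the tetrahedron through four centres (scalar triple product / 6)."""
    return abs(_dot(_sub(q, p), _cross(_sub(r, p), _sub(s, p)))) / 6.0


def permutation_allowed(centres, n_heavy_atoms, area_setting, volume_setting):
    """Ignore permutations whose area or volume is too small for the molecule."""
    if len(centres) == 3:
        return triangle_area(*centres) >= n_heavy_atoms * area_setting
    if len(centres) == 4:
        return tetrahedron_volume(*centres) >= n_heavy_atoms * volume_setting
    return True  # 2-centre permutations have no area or volume test in this sketch
```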
For conformers which match a pharmacophore, the conformer is fitted to the receptor using the matched pharmacophore. For 2-centre pharmacophores, an additional pair of points is required to fit the conformer. These are generated from the centroid of the molecule and the centroid of the complementary centres defining the pharmacophore. The fitted conformer is scored using an extended ChemScore function:
$$\Delta G = \Delta G_0 + \Delta G_{hbond} N_{hbond} + \Delta G_{lipo} N_{lipo} + \Delta G_{rot} N_{rot} + \Delta G_{bad} N_{bad} + E_{VdW} + E_{tors}$$
where:
$\Delta G_0$, $\Delta G_{hbond}$, $\Delta G_{lipo}$, $\Delta G_{rot}$ and $\Delta G_{bad}$ are constants (-5.48; -3.34; -0.117; 2.56; 0)
$N_{hbond}$ is the number of qualifying interactions (on geometric criteria)
$N_{lipo}$ is the number of lipophilic-lipophilic contacts
$N_{rot}$ is the number of frozen rotatable bonds in the molecule
$N_{bad}$ is the number of lipophilic-hydrophilic contacts
and
$E_{VdW}$ is the VdW interaction energy between the molecule and the protein, computed by an expression in which:
$\varepsilon_1$ and $\varepsilon_2$ are constants for the atoms
$r_1$ and $r_2$ are the VdW radii for the atoms
$r$ is the distance between the atoms
and
$E_{tors}$ is the torsional energy of the molecule, computed by an expression in which:
$V_{sng}$ and $V_{cnj}$ are constants (1.35; 3.72)
$\theta$ is the torsion angle
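To show how the terms defined above combine, the following sketch assumes that the score is a simple linear sum of the constant-weighted counts plus the VdW and torsional energies, and takes the hydrogen-bond constant as -3.34. The VdW and torsional energies are supplied as precomputed inputs because their expressions are not reproduced here. The function name and argument names are hypothetical.

```python
# Constants as listed in the text (hydrogen-bond value taken as -3.34).
DG_0 = -5.48
DG_HBOND = -3.34
DG_LIPO = -0.117
DG_ROT = 2.56
DG_BAD = 0.0


def extended_chemscore(n_hbond: int, n_lipo: int, n_rot: int, n_bad: int,
                       e_vdw: float = 0.0, e_tors: float = 0.0) -> float:
    """Estimated free energy of binding, assuming a linear sum of all terms."""
    return (DG_0
            + DG_HBOND * n_hbond   # qualifying interactions
            + DG_LIPO * n_lipo     # lipophilic-lipophilic contacts
            + DG_ROT * n_rot       # frozen rotatable bonds
            + DG_BAD * n_bad       # lipophilic-hydrophilic contacts
            + e_vdw                # precomputed VdW interaction energy
            + e_tors)              # precomputed torsional energy
```

Lower (more negative) scores are better under this convention, which is consistent with hits being accepted when the score falls below a cutoff.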
Conformers that score less than the initial cutoff (common variable GSTMIN, default 1000) are then refined using a Simplex minimiser to improve the estimated free energy of binding. The minimiser will adjust the position, orientation and exact conformation of the ligand within the active site. If the resulting conformer score is below the search cutoff value (common variable GSTHIT), the conformer is accepted as a hit. The search proceeds with the next substructure solution or continues the conformational analysis after finding a hit conformer. At the end of the conformational analysis the best hit (based on score) is saved to the results file. If all acceptable hits are required, the common variable ISTALL should be set to one, and the results file must use a 3D format (e.g. an SD file).
Use of the Simplex minimiser to refine each hit conformer is important when attempting to reproduce crystal ligand-protein binding geometry. It is sometimes necessary to decrease the torsional increment by increasing the number of points about one or more classes of rotatable bonds (see section 4.1). It is conceivable that the optimised ligand will no longer fit the pharmacophore (within the tolerances) unless it is constrained to do so. This was suggested by C A Baxter et al. In THINK, when the deviation of the matched atoms exceeds the tolerance there is an additional contribution to the free energy consisting of $K(d-t)^2$, where K is the value of the common variable GSTRES (default 1000), d is the deviation and t is the tolerance. Setting GSTRES to 0 disables the constraint.
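The restraint term is a one-sided quadratic penalty: it is zero while the deviation stays within the tolerance and grows as K times the square of (d - t) beyond it. A minimal sketch, with a hypothetical function name and with `gstres` standing in for the common variable GSTRES:

```python
def restraint_penalty(deviation: float, tolerance: float,
                      gstres: float = 1000.0) -> float:
    """Extra free-energy contribution when matched atoms drift beyond the tolerance."""
    excess = deviation - tolerance
    if excess <= 0.0 or gstres == 0.0:
        # Within tolerance, or the constraint has been disabled.
        return 0.0
    return gstres * excess ** 2
```

With the default K of 1000, a deviation that exceeds the tolerance by 0.5 adds 250 to the score, which is large compared with the other terms and so strongly discourages the minimiser from leaving the pharmacophore.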
8.6 Checkpoint File and Search Tracing
If a search is interrupted due to an unexpected halt in the application (e.g. a power failure), it can be resumed from the point at which it was interrupted. This is done by means of a checkpoint file called "search.dat" which is written to the current working directory during each search. The file stores information on the progress of the search together with details of the query being used, the file being searched and the name of the results file. If the search is restarted in a subsequent THINK session using the same query and file names, the first part of the search will be skipped and progress will resume from the data stored in the checkpoint file. If the whole search is to be repeated, it is important that any existing checkpoint file is first deleted.
The checkpoint file will be deleted if a search is manually interrupted through <CTRL-C> or the Cancel button on the progress report.
It is possible to record the fate of the molecules being searched to the THINK console window and log file. When searching a large number of molecules, this additional output can be voluminous and time-consuming. Consequently, the option should only be used when attempting to analyse why a molecule was not found, and it is recommended that the number of molecules being searched is reduced to the smallest possible subset. The additional output is enabled through the common variable ITRACE. This is a bit mask, and setting bit 0 (value 1, in other words setting ITRACE to an odd number) will enable the extra output.
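The bit-mask convention can be illustrated with a few lines of Python. This sketch only models bit 0 as described above; the meaning of any other ITRACE bits is not covered here, and the function names are hypothetical.

```python
TRACE_MOLECULES = 1 << 0  # bit 0, value 1


def tracing_enabled(itrace: int) -> bool:
    """True when bit 0 is set, i.e. when ITRACE is odd."""
    return bool(itrace & TRACE_MOLECULES)


def enable_tracing(itrace: int) -> int:
    """Set bit 0 without disturbing any other bits."""
    return itrace | TRACE_MOLECULES


def disable_tracing(itrace: int) -> int:
    """Clear bit 0 without disturbing any other bits."""
    return itrace & ~TRACE_MOLECULES
```

Setting the bit with a bitwise OR, rather than assigning 1 outright, preserves whatever other bits may already be set in ITRACE.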