Chapter 8


9    Creating Site Queries

The approach used to generate a site query for a protein depends on whether the coordinates of a ligand docked into the active site are available. If they are not then it is essential to have a list of the residues which constitute the active site (THINK does have functionality to search for potential binding sites).

9.1 Complementary Centres

THINK will identify the key atoms within the active site and insert the appropriate complementary centres, also known as site points, after checking atom overlap and accessibility as described below. The centres are placed in a new molecule called INTERACT. It is recommended that this molecule is saved in a PDB file for re-use in later searches.

The following table shows the atom types of the key atoms and their corresponding complementary centres (see section 7.1 for a description of the codes used to describe the centres). The position codes (A,B,C) identify where the centres will be inserted, and are described after the table. Each centre is allocated a VdW radius of 1.0Å, which is used to determine the tolerance used in the distance checks during site searches (see section 8.5).

Atom type
No. of non-H
connections
Tautomeric?
Complementary centres
Position
code
OSP2
1
N
HDON, HACC
A
Y
HDON, HACC, POS
B
2
N
HDON
A
O-
1
N
HDON, POS
A
OAR
2
N
HDON
A
OSP
1
N
HDON, POS
B
Y
HDON, HACC, POS
B
NSP2
1
N
HACC
B
Y
HDON, HACC
B
2
N
HACC
A
Y
HDON, HACC
A
ND
1
Y/N
HDON, HACC
B
2
N
HDON
A
Y
HDON, HACC
A
N+
1
Y/N
HACC, NEG
A
2
Y/N
HACC, NEG
A
3
Y/N
HACC, NEG
A
ND+
1
Y/N
HACC
B
2
Y/N
HACC
A
NSP3
1
Y/N
HDON, HACC, NEG
A
2
Y/N
HDON, HACC, NEG
A
3
Y/N
HDON, HACC, NEG
A
4
Y/N
NEG
A
NAR2
2
N
HDON
A
Y
HDON, HACC
A
NAR3
2
N
HACC
A
Y
HDON, HACC
A
AR5
-
-
AROM
C
AR6
-
-
AROM
C

With position code A, the complementary centres are inserted at a single position 3.0Å from the defining protein atom projected along a vector defined by the sum of the non-hydrogen connections to that atom.

Position code B inserts two sets of complementary centres. The two insertion points and the defining atom form an equilateral triangle with the centre at the apex. The sides of the triangle are 3.0Å and the base (between the two insertion points) is 2.0Å. The midpoint of the base is positioned on the vector projected along the non-hydrogen connection to the defining atom.

Position code C is used with aromatic rings and inserts two centres 3.2Å above and below the plane of the ring (positioned directly above/below the defining AR5 or AR6 atom).

THINK creates several complementary centres at a single insertion point because the defining atom often has more than one centre attribute. For instance, O- is both a hydrogen bond acceptor and a negative charge centre; consequently it requires two complementary centres (HACC and POS).

Insertion points which are inside the VdW radius of any atom in the protein molecule are eliminated. This can be adjusted by modifying the common variable RADCEN (which is zero by default). A negative value permits some overlap with the protein while a positive value might allow for the absence of hydrogens.

An accessibility check is applied to each insertion point by examining the vector from the defining protein atom to the insertion point (the insertion vector). If the vector points away from the centre of the active site, the defining atom is unlikely to be accessible by a ligand and therefore its complementary centres are not created. The check is actually performed by determining the angle between the insertion vector and the vector from the centroid of the active site to the protein atom. If this angle is less than 60° the insertion point is rejected. The angle can be changed by storing the cosine of the minimum permitted angle in the common variable COZCEN.

9.2   Existing Ligand-Protein Complex

The most common existing ligand-protein complexes used in validation are X-ray crystal structures, although interactively docked ligands and ligands docked using other software may also be used. THINK can deduce the active site from the position of the ligand, so this technique can be used with PDB files that do not contain a definition of the site.

If the ligand and the protein exist in the same PDB file it may be necessary to edit the PDB file and insert records containing the NAME keyword to allocate names to the molecules and indicate where additional molecules start. In some PDB files, where the ligand is not a peptide, THINK will automatically recognise the separate molecules providing the appropriate HETNAM records are present. When editing PDB files, it is important to move any HETATM and CONECT records to the appropriate place to be associated with the correct molecules defined by the NAME records. If there are connections between the protein and ligand, these must be removed by editing the file.

It is recommended that all non-essential water molecules are deleted before the site query is generated. In some cases, however, the ligand may bind to the protein via one or more essential water molecules. If the user decides to retain these essential molecules, the remaining water molecules must be removed by editing the PDB file (remembering to move any CONECT records into the right place).

In order to create the complementary centres or site points, THINK first needs to define the active site by identifying residues that contain at least one atom in contact with the ligand. In most cases it is appropriate to use a contact radius which is larger than the VdW radius by about 25% to allow residues that are close to, but not in VdW contact with, the ligand to be included in the site. The ratio of the contact radius to the VdW radius is specified as part of the command to create the complementary centres.

MODIFY INTERACTION=^1ABC LIGAND=ligand

THINK then creates complementary centres for all key atoms within the residues in the active site, subject to the overlap and accessibility checks described in section 9.1 and places them in a new molecule called INTERACT. For large sites, this can give a large number of centres which significantly increases the time required for a site search. However, the presence of a ligand allows THINK to allocate a weighting to each centre reflecting its distance from the ligand, and hence the likelihood that its originating protein atom is involved in the ligand binding. The weights are stored as group numbers for the centres, and are described in section 9.5.3.

9.3   List of Active Site Residues

If the PDB file does not contain a bound ligand, the set of residues in the active site must be supplied when a site query is created. The set may be defined by:

The comma-separated list must not contain any spaces

9.3.1  Site Records in PDB file

When the active site is defined by SITE records in the PDB file, THINK will interpret these records and store the definition as a list of residues using symbols of the form SITE-name(n) where name is the PDB identifier for the SITE and n is a counter for the list of residues. If this information is missing from the PDB file, it can be specified using commands. For example, to create a site with an X identifier defined by three residues with sequence numbers 905, 982 and 909, then the following commands would be used:

LET SITE-X(1) := (905)
LET SITE-X(2) := (982)
LET SITE-X(3) := (909)
LET SITE-X(4) := " "
MODIFY INTERACT=_SITE-X

The blank stored in SITE-X(4) terminates the list of residues. It is essential if the SITE-X symbol is being re-used with a new definition - if it was omitted, the commands would be interpreted as re-defining only the first three elements of SITE-X, not the whole symbol. If these commands are issued in the console window then it is not necessary to specify global symbols (using ":=") rather than local ones. However, if the definition is stored in a logfile global symbols must be used.

9.3.1  Finding Binding Sites

When the binding site is unknown, THINK can be used to find all plausible binding sites and stores the results in a symbolic array of the form SITE-x(n) where x is the site number and n is a counter for each residue in the site. The following algorithm is used:

  1. Construct a grid of points around the protein
  2. Mark each grid point that lies inside the protein as an "inside" point
  3. Mark each point in the grid that lies well outside the protein (ie is not close to the protein surface) as an "outside" point. This is done by centring a 4Å sphere on each unmarked point and examining the grid points that fall within the sphere. If the sphere does not contain any "inside" points, all the points in the sphere are marked as "outside"
  4. Any points that have not been marked as "inside" or "outside" by this stage could potentially lie within an active site. THINK organises these into groups of contiguous unmarked points, rejecting any group that does not contain at least 3 points, and identifies the residues that lie around the edge of the group as those defining the active site. A residue must have a heteroatom within 4.6Å of 3 or more points in the group to be included in the active site. Any active site that contains fewer than 3 residues is rejected.
  5. Sort the active sites into descending order of size and create the SITE-x arrays, with the largest site in the SITE-1 array.

The following command is used:
MODIFY SITE=protein

9.4 Extended Active Site

After the INTERACT molecule containing the site points has been created, THINK modifies the protein by eliminating residues which are so far from the active site that they cannot make any contribution. Reducing the protein to an "extended site" speeds up the search times, and is essential when THINK is used in a distributed computing environment because there is less data to transfer between the servers and the clients. Therefore, it is recommended that the extended site is saved in a PDB file for re-use.

An extended site comprises all residues that contain at least one atom within a given distance d of any complementary centre in the INTERACT molecule. The distance d is defined as (8.0+Ra+Rc) where Ra and Rc are the VdW radii of the atom and centre respectively. The value of the additional distance, 8.0Å, is stored in the common variable RADSIT and may be altered if desired.

The following command would create the site points for a protein called 821P with a bound ligand in a molecule called BOUND. The resulting query molecule (INTERACT) and extended site are saved in PDB files so they can be re-used in multiple searches.

MODIFY INTERACT=^821P LIGAND=BOUND RATIO=1.25
FILE SAVE=821P-Q1.PDB MOLECULE=INTERACT
FILE SAVE=821P-SITE.PDB MOLECULE=821P

9.5   Centre Selection (Weighting)

The choice of pharmacophores to be used in the search is influenced by the weightings of the complementary centres (site points). There are three different types of weightings:

These weightings are combined to select the best pharmacophores from all possible permutations of the site points during the search.

The weightings are stored as group numbers for the complementary centres, and may be set automatically by THINK and/or manually by the user. The group numbers are also stored in PDB files (as an extension to the standard format - see section 2.3.2). They are often useful when manually editing the query file to reduce the number of centres, as they can be used to guide the choice of centres to be removed.

9.5.1  Required Groups

Once the complementary centres have been created, they can be reviewed to locate the points most likely to be involved in the ligand binding. These may be grouped to create a list of alternatives for each point in a pharmacophore. This can be important when molecules are required to interact with one or more residues. When THINK constructs the pharmacophores from the site query, it will check for these lists of alternatives and construct the pharmacophores appropriately.

Up to four sets of complementary centres may be defined, identified by their group numbers. The group number is used as a bit mask, which means that the sets may overlap by using group numbers that correspond to two or more bits being set (adding the individual group numbers together).

Required group Bit Group number
1 4 16
2 5 32
3 6 64
4 7 128

For instance, a centre with group number 16 would be included in required group 1 only, whereas a centre with group number 48 (16+32) would be included in required groups 1 and 2.

THINK automatically constructs the lists of alternative pharmacophore points by scanning the group numbers of all complementary centres. All centres that do not fall into one of the above groups are placed into group 0. This group is used to provide pharmacophore points that cannot be selected from one of the user-defined groups. THINK ensures that all the centres used in the pharmacophore have been derived from different atoms in the protein or ligand by checking their serial numbers.

9.5.2  Pharmacophore Score

Once the pharmacophores have been constructed (taking the required groups into account) THINK allocates each an initial score of 100. This score is reduced by the distance, acid and base weightings to give the final score for each pharmacophore. Only pharmacophores with scores greater than the common variable PSCOMN will be used during the search. This value is changed using the CUSTOMISE WARP command.

9.5.3  Distance Weightings

If the site query is derived from an existing ligand-protein complex, THINK automatically allocates a distance-based weighting to each complementary centre generated. This reflects the distance of the centre from the ligand, and hence the likelihood that its originating protein atom is involved in the ligand binding. These weights are also stored as group numbers and can be used in conjunction with the required groups described above by adding the group numbers together.

Group number Interpretation
1
Centre within 1.0Å of a ligand atom
2
Centre in contact with ligand atom (separation is less than 1.0+Ra where Ra is the VdW radius of the ligand atom)
3
Centre is within 6.0Å of a non-hydrogen ligand atom and there are no intervening non-hydrogen protein atoms between the centre and 3 or more non-hydrogen atoms in the ligand (ie "line of sight")
4
Centre is more than 6.0Å from all non-hydrogen ligand atoms, or there is no line of sight to the ligand
5
Centre has fewer than 5 residues infront of it (based on the plane perpendicular to the vector from the centre to the originating atom).

When a query is created from a binding site weightings are still applied although these critically depend on the centroid of the residues defining the site.

Group number Interpretation
1
Site point within 6.0Å of site centroid
2
Site point within 8.0Å of site centroid
3
There are no intervening non-hydrogen protein atoms between the site point and the site centroid (ie "line of sight")
4
There is no line of sight to the site centroid
5
Centre has fewer than 5 residues infront of it (based on the plane perpendicular to the vector from the centre to the originating atom).

For instance, a site point may be allocated a group number of 19, which means that it is included in required group 1 (value 16) and has a distance weighting of 3.

When a centre with a distance weighting is used in a pharmacophore, the pharmacophore's score is divided by the distance weighting (group number). For example, a pharmacophore containing four centres with distance weightings of 1, 2, 2 and 3, the score for that pharmacophore would be 8.3 (100 divided by 1*2*2*3). The pharmacophore score is important because those scoring below PSCOMN are ignored.

9.5.4  Acid and Base Weighting

The acid and base weightings are used in conjunction with the distance weightings to accept only acidic and basic groups that lie very close to similar groups in the original ligand, and therefore are likely to interact in the same way.

Most acidic groups contain a hydrogen donor, so an acid group in a potential ligand may be matched to an ACID or HDON site point. Similarly, basic groups normally contain a hydrogen acceptor, so may be matched to a BASE or HACC site point. THINK prefers to match an acid group to an ACID point, and a basic group to a BASIC point, so penalises pharmacophores where the groups are matched to HDON or HACC points by reducing the pharmacophore score.

The penalty for acid groups is stored in the common variable PSACID, and for basic groups in the variable PSBASE. Both variables contain a value in the range 0-1. For each HDON point in the pharmacophore matched to an acidic group, the pharmacophore score is multipled by the value in PSACID. Similarly, for each HACC point matched to a basic group the score is multipled by the value in PSBASE.

9.6   Reproducing Ligand-Protein Complexes

Attempting to reproduce a known ligand-protein complex can be a useful way of validating the results generated by THINK. In some cases creating the site query using the default options and then running the site search using the ligand as the search molecule reproduces the geometry of the original ligand, but this is not always the case. For instance, some ligands may bind using only weak interactions (no hydrogen bonds or charge interactions), or may bind through a water molecule that was missed when the site query was created. It is possible that the crystallographic data shows the mean position of the ligand, where there are no strong interactions, and that the ligand may move between positions where hydrogen bonds etc may be formed. THINK also takes no account of the flexibility of the side chains within the active site.

9.6.1  Visual Inspection

THINK can identify the key interactions between the ligand and the protein using the following command:

DISPLAY MODE=3D MOLECULE=protein QUERY=ligand OPTIONS=INTERACTIONS

which lists the interactions to tbe console window and draws dashed lines between atoms to represent the interactions. The interactions are a subset of the contacts between the ligand and the protein: only contacts between atoms that are complementary pairs of interaction centres are included:

For proteins without hydrogen atoms, THINK will detect hydrogen bond donor atoms despite the absence of the hydrogens (normally a hydrogen bond donor requires at least one attached hydrogen atom unless the CUSTOMISE DONOR=NOHYDROGEN command has been issued). Any water molecules present in the protein will be included in the list of interactions, and THINK will show hydrogen-bonding pathway through water molecules (see section 2.3.3 for details of how water molecules are handled when a PDB file is read). If a search fails to reproduce a known complex, it may be worth re-examining the site with the water molecules included in the protein in case any are involved in the binding.

When determining the contacts, THINK uses an enlarged VdW radius for each atom. By default, a factor of 1.25 is applied to each radius. However, if "super-CPK" radii are in effect, these will be used instead (super-CPK radii are enabled by setting the ratio of the CPK to VdW radius to a value greater than 1.25 using the CUSTOMISE RATIO keyword).

9.6.2  Check Site Points

Comparing the complementary centres (site points) in the query with the set of interactions reported from the ligand-protein complex may reveal some interactions that do not have corresponding site points. These points were probably eliminated when the query was created because:

The rules controlling the insertion of site points are described in section 9.1. Changing the common variable RADCEN to a negative value will allow THINK to create site points inside the protein.

The accessibility check may not be appropriate for a large or irregularly shaped active site, in which case it may be relaxed by setting the common variable COZCEN to a value closer to 1.0 (the default value is 0.5). The check can be disabled by setting COZCEN to zero.

9.6.3  Check Pharmacophores and Conformers

It is not uncommon for a ligand to have only two or three points that interact with the protein. For these examples the number of centres in the pharmacophores must be reduced from the default value of four. Use of a 3- or 4-centre pharmacophore is normally sufficient to force the ligand into the desired conformation, but a 2-centre pharmacophore often leaves scope for much of the molecule to adopt a different conformation. Consequently, comparing the site search hits with the docked ligand may not be encouraging.

The results of a site search may be analysed in the spreadsheet. To obtain the maximum information, the protein, the original bound ligand (obtained from the crystallographic data) and the 3D coordinates of the search hits must be read into THINK before the spreadsheet is calculated. This means that the search results must be stored in an SD file.

The spreadsheet is generated by the command:

LIST INFO=PROPERTIES OUTPUT=WINDOW QUERY=ligand

It may be necessary to restrict the properties calculation to the hit molecules by adding the keyword "MOLECULE=hits" to the command. THINK automatically computes the following RMS deviations of each molecule from the ligand (as specified with the QUERY keyword) and stores the values in fields within the spreadsheet:

Field name Description
RMS-All Deviation of all atoms, based on atom order within the molecules - only computed if both molecules have the same number of atoms
RMS-NoH Deviation of all non-hydrogen atoms in hit from corresponding atom (based on atom order) in ligand - only computed if both molecules have the same number of atoms
RMS-Match Deviation of all atoms with matching serial numbers (excluding those in group 999)

The RMS-ALL and RMS-NOH fields are intended for use in validation studies when the original ligand (as taken from the crystallographic data) is used as the molecule being searched. These fields show the deviation of the hit conformers from the original conformation.

The RMS-MATCH field can be used with any hit molecule found by a search using a site query derived from the original ligand, regardless of the number of atoms in the molecule. It uses the serial numbers of the atoms to establish their correspondence (the serial number of an atom in the hit molecule is taken from the matching atom in the query, which is itself copied from the defining atom in the original ligand). Atoms in group 999 are omitted as they were not matched to an atom in the site query.

There is a fourth RMS field in the spreadsheet, called "RMS". This shows the RMS deviation of the matched atoms in the hit from the corresponding atoms in the query (not the original ligand). Its value is computed during the site search and is saved in the search results file.

If the active site is present when the spreadsheet is computed, THINK will re-calculate the docking score for each hit molecule (see section 8.5) and include it, together with a breakdown of the score, in the following fields:

Field name Description
G-TOT Total docking score
G-HB Hydrogen-bonding component (DGhbond*Nhbond)
G-LIP Lipophilic contacts (DGlipo*Nlipo)
G-ROT Frozen rotatable bonds (DGrot*Nrot)
G-NB Non-bonded component (EVdW)
G-BAD Lipophilic-hydrophilic contacts (DGbad*Nbad)

It may be necessary to explore various search options using trial and error in an attempt to understand why THINK is failing to dock a known ligand. The time taken to do this can be reduced by eliminating site points that do not interact with the protein. Such points can be identified by their group number: those in group 4 are too far away from the protein to contribute to any interaction. It is also recommended that the trace option is enabled (by setting the common variable ITRACE) during the trials.

LET #ITRACE = 1025

Setting ITRACE to 1025 will provide additional output which usually indicates at what stage in the search the molecule is eliminated (see Appendix D). If the explanation of why the search failed is "Atoms not matched", it implies one of the following:

In other cases, the explanation is "no suitable conformers" which is indicative of either the torsional increments being too large or the conformers being eliminated due to contacts either within the molecule or between the molecule and the protein. It is not unusual to reduce the ratio of the CPK to VdW radius from 0.6 to 0.4 to remove some of the contacts. For crowded conjugated systems, the torsional sampling about conjugated bonds may be increased to 4.

In the event of a conformer meeting the geometry constraints but failing to generate an adequate score adequately, further explanations will be provided. Significant numbers of bad contacts between the ligand and the protein may lead to the explanation "score too high to refine". In such cases, it is possible to increase the maximum number of cycles of minimisation by increasing NCYCMX. With 3- and 4-centre pharmacophores, it may be preferable to increase the number of torsion increments to provide a better starting geometry rather than increasing GSTMIN above the default of 1000. (GSTMIN is the maximum permitted score for a molecule to enter the refinement stage: increasing this value will lead to more molecules being accepted for refinement, thereby increasing the overall search time). Using a SEARCH CUTOFF value that is equal to GSTMIN will eliminate the refinement step, which can be useful to provide output for visually checking that the appropriate solutions would otherwise be generated.


Appendix A