THINK reads and writes small molecule connection tables in the following formats:
The PDB format is designed for use with proteins and peptides. In this manual, the term SDF file will be used to refer to both molfile and SDfile formats.
Although THINK uses both 2D and 3D coordinates, it is not necessary for these to be included in the SMILES or SDF file. If the appropriate coordinates are not available, THINK will transparently generate 2D coordinates when required for display, and 3D coordinates when required for display, property calculation or conformational analysis.
The molecules in a combinatorial library may be automatically enumerated as they are saved to a file. This chapter includes a section on enumeration.
THINK also provides the ability to read and write selected parameter files, namely atom types, bond and angle parameters, ring coordinates and torsion angles (for use in conformational analysis).
File names that contain spaces should be enclosed in double quotes in scripts or when entered on a command line. File names should not start with "+" or "-", as these are special characters that force an append to, or replacement of, an existing file (which can be useful in scripts).
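The file name rules above can be checked before a script is generated. The following Python fragment is a minimal sketch and is not part of THINK; the function name script_file_name and the .smi extension in the usage note are illustrative assumptions.

```python
def script_file_name(name: str) -> str:
    """Return a file name as it could be written in a script line.

    Hypothetical helper: it only applies the two rules stated in the text.
    """
    if name[:1] in ("+", "-"):
        # A leading + or - is treated as a special character, so reject it.
        raise ValueError(f"file name must not start with '+' or '-': {name!r}")
    # Names containing spaces are enclosed in double quotes.
    return f'"{name}"' if " " in name else name
```

For example, script_file_name("my library.smi") returns the name enclosed in double quotes, while a name without spaces is returned unchanged.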
The SMILES format was developed by Daylight Chemical Information Systems, Inc (www.daylight.com/smiles) and is a compact line notation that uses a hydrogen-omitted connection table. A full description can be found on their website. The format has been extended for THINK to handle molecule names and field data. Blank lines and lines beginning with ! are ignored.
A SMILES file uses one line for each structure. For the common elements of organic molecules (C, N, O, S, P, B, F, Cl, Br, I), known as the basic SMILES elements, the element symbol is used to indicate its presence, with parentheses indicating branches:
eg CCC(=O)O
This example also indicates the use of = to signify a double bond. Triple bonds are shown by # and, where necessary, single bonds by - and aromatic bonds by :. For delocalised bonds, ~ has been reserved but is not generally supported by THINK. Lowercase letters are expected for atoms in aromatic rings, when aromatic bonds are assumed, although Kekule representations may also be used.
Elements other than the basic SMILES elements should be enclosed within square brackets []. Brackets are also required if a formal charge or explicit hydrogen count is used:
eg [O-2] or C[NH2+]C
In such cases, where the value exceeds unity, an integer must follow the +, - or H. A ring closure bond is indicated by an integer after each of the pair of atoms being bonded:
eg c1ccccc1
When a ring closure is used with square brackets, the ring closure digit(s) follow the ].
Atom types instead of element symbols may be used inside square brackets. Where these have the potential for conflict with other SMILES representations, they should also be enclosed within double quotes:
eg [CSP3] or [N+]
For combinatorial chemistry, the ability to store R-groups and define R-group queries is important. The atom type 0 is used for a generic connection atom and types 1-9 are used for explicit connection atoms (see section 2.4). Since these are non-standard atom types, they must be enclosed in square brackets [] when used in SMILES files.
For R-group queries (see section 8.3), the < is used between the atoms which are to be retained as the R-group and those which are discarded eg C1<N1([H])[H] which would generate R-groups from primary amines with a generic connection atom (atom type 0) replacing the nitrogen. Note that in these circumstances it is important to specify the hydrogens connected to the nitrogen. Bonds between the two sets of atoms are shown using ring closure digits.
The group number for disconnected fragments may be specified by an integer (in the range 0-9) immediately following the . character. This is useful when defining queries for R-group searches eg C1.3C2<C1N12[H], which will generate two R-groups from secondary amines.
Numeric field values may be stored on the same line after the SMILES string in the form:
field=value
where field is the name of the field and value is a real number or integer. Note that there are no spaces around the =. The reserved field NAME is used to indicate the molecule name. (THINK v1.25 does not support text variables.)
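The line layout described above (a SMILES string followed by field=value items) can be illustrated with a short Python sketch. This is not THINK's own reader: it assumes that items are separated by white space and that the molecule name contains no spaces, neither of which is stated in this manual. The function name and the example field logP are hypothetical.

```python
def parse_think_smiles_line(line: str):
    """Split one line into (smiles, name, fields); return None for skipped lines."""
    text = line.strip()
    if not text or text.startswith("!"):
        return None  # blank lines and lines beginning with ! are ignored
    # Assumption: the SMILES string is the first white-space separated item.
    smiles, *tokens = text.split()
    name = None
    fields = {}
    for token in tokens:
        field, sep, value = token.partition("=")
        if not sep:
            continue  # not a field=value item; skipped in this sketch
        if field.upper() == "NAME":
            name = value  # reserved field holding the molecule name
        else:
            fields[field] = float(value)  # numeric field values only
    return smiles, name, fields
```

Because the SMILES string is taken as the first item on the line, an = used as a double bond (as in CCC(=O)O) is not mistaken for a field separator.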
There is no support for storing atom coordinates.
These formats were created by MDL Information Systems, Inc (www.mdli.com) and are widely used by various programs. A full description can be found on their website.
Although THINK can read both molfiles and SDfiles, it can only write SDfiles.
The heart of a molecule definition within a molfile or SDfile is the Ctab connection table block. This is a fixed-format block of records containing information about the atoms, bonds and properties (eg formal charges) of the molecule.
A molfile consists of a Ctab block preceded by three header lines, and stores a single molecule. The molecule name may be stored in the first header record, or in the second header record in place of the molecule's internal registry number (starting in column 47). If the name occurs in both records, the name from the second record will be given priority. The CUSTOMISE NAME=field command allows this priority to be reversed or a field to be used for the molecule name.
An SDfile consists of multiple molfile blocks, each followed by a record containing $$$$. Any field data associated with a molecule is placed between the molfile block and the terminating $$$$ record. The data for each field must be preceded by a record starting with > and containing the name of the field enclosed in angle brackets <>.
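The SDfile layout described above can be sketched in Python as follows. This is an illustrative reader, not THINK's implementation; it assumes that each field's data is terminated by a blank line, as is usual in MDL SDfiles, and the function names are hypothetical.

```python
import re


def split_record(lines):
    """Separate one SDfile record into its molfile block and field data."""
    molfile, fields = [], {}
    current = None        # name of the field whose data is being read
    seen_field = False    # True once the first field header has been met
    for line in lines:
        if line.startswith(">"):
            match = re.search(r"<([^>]*)>", line)  # field name inside <>
            current = match.group(1) if match else None
            seen_field = True
            if current is not None:
                fields[current] = []
        elif current is not None:
            if line.strip():
                fields[current].append(line)
            else:
                current = None  # assumption: a blank line ends the field data
        elif not seen_field:
            molfile.append(line)  # header lines and Ctab block
    data = {name: "\n".join(values) for name, values in fields.items()}
    return {"molfile": molfile, "fields": data}


def read_sdfile_records(text: str):
    """Split SDfile text into records at each $$$$ terminator."""
    records, block = [], []
    for line in text.splitlines():
        if line.startswith("$$$$"):
            records.append(split_record(block))
            block = []
        else:
            block.append(line)
    return records
```

Each returned record keeps the molfile block as a list of lines, so that the fixed-format Ctab records can be processed separately.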
THINK uses the following FORTRAN format to store the atom data:

1 2D for 2D files; 3D for 3D files. An extension, also used by Chem-X, allows both 2D and 3D coordinates to be stored within the atom records in the Ctab block. The second set of coordinates is placed in columns 76-105 using the format 3F10.4.
2 Element symbol for normal atoms; atom type name for dummy atoms. For dummy atom types with 4-character names (eg AROM, BASE), the 4th character is written into the preceding blank space.
3 Code indicating formal charge: 4-q, where q is the formal charge
4 Code indicating chiral parity
5 Code indicating hydrogen count: nh+1, where nh is the number of missing hydrogen atoms
6 The atom mapping number is used as the serial number (if non-zero) and if necessary is allowed to exceed the 3 allocated characters to overwrite the reactant/product field. The reaction component type (reactant, product or reagent) is used for the group number
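The charge and hydrogen count codes in notes 3 and 5 can be expressed as simple conversions. The Python below is a sketch only; it assumes that an uncharged atom is written with code 0, as in the standard MDL convention (where code 4 is reserved for a doublet radical), and that a hydrogen count code of 0 means "not specified". The function names are hypothetical.

```python
def charge_code(q: int) -> int:
    """Formal charge q -> charge code (4-q for charged atoms)."""
    if q == 0:
        return 0  # assumption: uncharged atoms use code 0 (MDL convention)
    if not -3 <= q <= 3:
        raise ValueError("formal charge outside the range -3 to +3")
    return 4 - q


def charge_from_code(code: int) -> int:
    """Charge code -> formal charge; code 0 is read as uncharged."""
    return 0 if code == 0 else 4 - code


def hydrogen_code(nh: int) -> int:
    """Hydrogen count nh -> hydrogen count code (nh+1)."""
    return nh + 1


def hydrogens_from_code(code: int):
    """Hydrogen count code -> nh, or None when the code is 0 (assumed unspecified)."""
    return None if code == 0 else code - 1
```

For instance, a formal charge of +1 is stored as code 3 and a charge of -2 as code 6.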
The extensions allow SDfiles to be used to store R-groups and R-group queries, although the more compact SMILES format would normally be used. However, THINK v1.25 cannot save R-group queries in a SMILES format file.
The query constructs used by Chem-X, ISIS and other software are not currently supported.
THINK v1.25 does not support the use of text fields.
The PDB format is the de facto standard for 3D protein coordinates. Unlike MDL SDfiles, the amino acid or residue information is included. See www.rcsb.org/pdb for further information on this format.
A subset of the standard keywords is processed:
| Keyword | Description |
| ATOM | Atoms with standard amino acid residue names |
| HETATM | Atoms in non-standard residues |
| CONECT | Connectivity records for non-standard residues |
| COMPND | For molecule name |
| HET | To identify non-standard residues |
| HETNAM | Names of non-standard residues |
| HETSYN | Synonyms for names of non-standard residues |
| HELIX SHEET TURN | Sets of residues comprising secondary structures |
| SITE | List of residues comprising an important site in the molecule (eg active site) |
An additional keyword, NAME, is also interpreted. This is an extension used by THINK, and must be manually inserted into the PDB file using a text editor (see section 2.3.2 below).
Unlike the SDF and SMILES formats, hydrogens are not automatically added when reading PDB files unless the HYDROGENS option is specified.
A symbol is automatically created for each helix, sheet strand or turn defined in the PDB file. The symbol contains the range of residues that define the secondary structure unit, identified by their sequence numbers and, where necessary, the chain ID eg (124:127) or (266:272)B. The symbol name has the form HELIXnn-dd, SHEETnn-dd or TURNnn-dd, where nn is the number associated with the secondary structure (read from columns 8:10) and dd is the secondary structure's identifier (read from columns 12:14).
THINK will automatically create a symbol for each site definition found in the file. The symbol consists of an array called SITE-xxx where xxx is the site identifier (read from columns 12:14 of the SITE record). Each residue in the site is identified by its name, sequence number and, where necessary, chain ID and is placed in a separate array element.
The NAME keyword is used to define the molecule names. NAME records are added to the PDB file using a text editor. If used, a separate NAME record must be inserted at the beginning of every molecule in the file. If a molecule has any CONECT records, these should appear in the same section of the file as the molecule's ATOM or HETATM records (ie before the NAME record for the next molecule). This may require moving some of the CONECT records.
A NAME record should contain the keyword NAME in columns 1:6 and the molecule name in columns 7 onwards.
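Since NAME records are inserted by hand, a small script can help keep the column layout correct. The following Python sketch writes and reads a NAME record using the columns given above; the function names are hypothetical and the code is not part of THINK.

```python
def format_name_record(name: str) -> str:
    """Keyword NAME padded to columns 1:6, molecule name from column 7."""
    return f"{'NAME':<6}{name}"


def parse_name_record(line: str):
    """Return the molecule name, or None if the line is not a NAME record."""
    if line[:6].strip() != "NAME":
        return None
    return line[6:].strip()
```

For example, format_name_record("ligand A") yields the keyword, two spaces of padding and then the name starting in column 7.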
Older versions of THINK (v1.12 and earlier) would read the molecule name from the COMPND record. This has been superseded by the introduction of the new format of COMPND records with version 2.0 of the PDB format (see www.rcsb.org/pdb and Molecule Names below).
Atom radii may be read from and written to the temperature or B-Factor field (columns 61:66) in ATOM and HETATM records. This may be useful to set an atomic tolerance for site search queries.
Atom group numbers are stored in columns 67:69 in ATOM and HETATM records.
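The column positions given above for the atom radius and group number translate directly into string slices. The Python below is a minimal sketch, assuming 1-based inclusive column numbers as used in this manual; the function name is hypothetical.

```python
def read_radius_and_group(line: str):
    """Read the radius (columns 61:66) and group number (columns 67:69).

    Returns None for records other than ATOM and HETATM; a missing or
    blank field is returned as None.
    """
    if line[:6].strip() not in ("ATOM", "HETATM"):
        return None
    radius_text = line[60:66].strip()  # columns 61:66, B-factor field
    group_text = line[66:69].strip()   # columns 67:69, group number
    radius = float(radius_text) if radius_text else None
    group = int(group_text) if group_text else None
    return radius, group
```

Short records that stop before column 61 are handled without error, since slicing past the end of a Python string gives an empty string.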
2.3.3 Molecules and Molecule Names
If NAME records have been added to the PDB file, they will be used to define the molecule names. In the absence of NAME records THINK will take the molecule names from the following records:
| Keyword | Columns in record | Description |
| HEADER | 63:66 | PDB protein ID code |
| COMPND MOLECULE: | 20:70 | Macromolecule name |
| COMPND SYNONYM: | 19:70 | Synonym for macromolecule name |
| COMPND EC: | 14:70 | Enzyme Commission number associated with molecule |
| HET | 31:70 | Description of the HET group |
| HETNAM | 16:70 | Chemical name of HET group |
| HETSYN | 16:70 | Chemical name of HET group |
Continuation COMPND records will be ignored, and only the first MOLECULE, SYNONYM or EC record will be used.
THINK will automatically create a separate molecule for each ligand, and a single molecule containing all the water molecules in the PDB file, provided that there is a HET record for each ligand and there are no NAME records in the file. The name of each ligand molecule will be taken from the HET record or its associated HETNAM or HETSYN records. The water molecule will be called xxxx-WATER, where xxxx is the PDB protein ID code. If the file contains two or more ligands with identical names, THINK will add a counter of the form "(n)" to the molecule name to distinguish the molecules.
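The name given to the water molecule can be predicted from the HEADER record, using the columns listed in the table above. The following Python sketch is illustrative only; the function name and the ID code 1ABC in the usage note are hypothetical.

```python
def water_molecule_name(header_line: str) -> str:
    """Build the xxxx-WATER name from the ID code in HEADER columns 63:66."""
    if not header_line.startswith("HEADER"):
        raise ValueError("not a HEADER record")
    pdb_id = header_line[62:66].strip()  # columns 63:66, 1-based inclusive
    return f"{pdb_id}-WATER"
```

A HEADER record carrying the ID code 1ABC in columns 63:66 would therefore give the name 1ABC-WATER.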
It is recommended that NAME records are used when the ligand binding uses one or more water molecules. The water molecules involved should be included as part of the protein when constructing a site query. NAME records should also be used when the ligand is a peptide (since it will not appear in a HET group so THINK will not detect it as a separate molecule unless a NAME record is used).
Connectivity and bond order are deduced from the amino acid residue names and atom names. Consequently, the use of non-standard amino acids (including nucleic acids) or non-standard atom names will result in incorrect connectivity. Although THINK processes the CONECT records in the file, these do not include bond order, and THINK does not automatically generate this information (for instance from the inter-atomic distances). The command MODIFY REBUILD=CONNECTIONS can be used to regenerate the connectivity, but this deduces the bonds and bond order from the atom coordinates and will therefore give the correct connectivity only if the atom geometries are accurate.
Normally the molecules saved to a file are identical to those in memory. To generate and save enumerated molecules it is first necessary to read all the R-groups into memory (including any core or central group). The enumeration code permutes the molecules which have been read from the specified files, pairing the explicit connection atoms (those with atom types 1-9) and converting generic connection atoms (those with atom type 0) to explicit connections according to various rules.
There are three conceptually different types of library that might be built:
When a generic reaction has been used to search for R-groups, the product(s) are the core group(s) and can be used during the enumeration. Alternatively, a SMILES file containing just the core group(s) can be created with explicit connection atoms (using atom types 1-n, n ≤ 9). When creating R-group files (see section 8.3), it is important to use file name conventions that avoid confusing the reagents and the various R-groups that can be derived from them. The scope for confusion is greatest when more than one R-group may be derived from the same reagent.
The R-groups are grouped according to the file from which they were read. The order of the R-groups is specified as part of the enumeration instructions issued when the molecules are saved. Regardless of the type of library being built, THINK will always treat the first file specified as containing the core group. The R-groups from the second file will be connected to the core group; those from the third file etc may be connected to the core or to the preceding R-groups, depending upon the type of library.
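The size of an enumerated library, and the order in which files contribute to each product, can be pictured as one choice from each file with the core taken from the first file. The Python below is a sketch of that idea only, assuming every combination is generated; it does not join any atoms and does not reproduce THINK's connection rules. The function name and group labels are hypothetical.

```python
from itertools import product


def enumerate_combinations(groups_by_file):
    """Return every combination taking one group from each file.

    groups_by_file[0] is treated as the file containing the core group(s);
    the remaining entries are the R-group files in the order specified.
    """
    return list(product(*groups_by_file))
```

With one core, two groups in the second file and three in the third, this gives six combinations, each listing the core first followed by its R-groups in file order.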
The following rules governing the conversion of generic to explicit connections may be useful when attempting to locate errors in R-groups, or to deduce possible library representations:
Errors will be reported for R-groups with more than two generic connections, or if there are unpaired connection atoms (except for the first and last connection atom when building peptide chains).
In order to use the same file of R-groups in several different libraries, the concept of a generic connection (atom type 0) is available. Each R-group should only contain one generic connection and it is critical to specify the R-groups in the correct order following the core group in the enumeration command. If an R-group has more than one connection to the core it is advisable to use explicit connections, and often easier to place such groups as the last R-group. When a mixture of explicit and generic connections are used in R-groups, then each generic connection is converted to the next unused connection taken in the order of the R-groups (and this may leave gaps).
In principle, the core may also use generic connections, in which case they are converted into connection atoms 1-9 based on the order in which they occur. As this introduces scope for error it is not recommended; instead the core group should have explicit connection atoms with types 1-9 referring to the R-groups.
Many simple two-component libraries such as amides can be represented without a core. When using THINK, either group may be specified as the core and the other group as the first R-group. Both groups may have a single generic connection or explicit connections (which must match).
Although peptide libraries can be represented as a backbone core with the amino acid side-chains as the R-groups, it is often more convenient to use the backbone and side-chain together as a building block with two connections. For this to be used in all positions (including the core) it must have two generic connections eg [0]NC(C)C(=O)[0], and it is usually appropriate to use a capping group (eg OH) at the end of the chain as the last R-group.
The implementation includes two important but subtle rules
The connections are numbered according to the order in which they occur within the R-group. It is important that the amino acid R-groups are specified in a consistent direction; problems will arise if some are specified as N to C and others as C to N.
THINK includes the ability to read and write the following parameter files:
All the files use fixed formats, as described in Appendix A.
If the user wishes to change the parameters, it is envisaged that the existing data will first be saved to a file, which is then edited before being read back into THINK. This helps to ensure that the correct data and formatting are supplied.