Methods Background: Topological Indices

This chapter presents brief background information on some of the indices computed by Molconn-Z. The user is referred to the materials cited below for more detailed information.

**Definition:** Molecular connectivity is a method of molecular structure
quantitation in which weighted counts of substructure fragments are
incorporated into numerical indices. Structural features such as size,
branching, unsaturation, heteroatom content and cyclicity are encoded.

- L. B. Kier and L. H. Hall, "Molecular Connectivity in
Structure-Activity Analysis", John Wiley and Sons, New York (1986).

- L. B. Kier and L. H. Hall,
*Eur. J. Med. Chem.*,**12**, 307 (1977).

- L. B. Kier and L. H. Hall,
*J. Pharm. Sci.*,**70**, 583 (1981).

- L. H. Hall and L. B. Kier,
*Bull. Environ. Contam. Toxicol.*,**32**, 354 (1984).

- L. H. Hall, "
*Computational Aspects of Molecular Connectivity and its Role in Structure-Activity Modeling*" in Computational Chemical Graph Theory, D. H. Rouvray, Ed., Nova Press, (1989).

- L. H. Hall and L. B. Kier, "
*The Molecular Connectivity Chi Indices and Kappa Shape Indices in Structure-Property Modeling*", in Reviews of Computational Chemistry, Volume 2, D. B. Boyd and K. Lipkowitz, eds. (1991).

δ = σ - h,

where σ is the count of electrons in σ orbitals and h is the count of hydrogen atoms. This simple δ value of an atom is equal to the number of neighboring atoms in the molecular skeleton. The δ values of each atom are subsequently used in calculating the simple molecular connectivity indices. The valence electron descriptor is given as

δ^{v} = Z^{v} - h,

where Z^{v} is the count of the valence electrons.
A more general expression for δ^{v} which includes atoms in the 2nd, 3rd,
and 4th rows of the periodic chart is

δ^{v} = (Z^{v} - h)/(Z - Z^{v} - 1),

where Z is the count of all electrons, the atomic number. The valence delta values are used in calculating the valence molecular connectivity indices.
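The two delta definitions above can be sketched in a few lines. This is a minimal illustration, not Molconn-Z code: the function names and the example atoms are hypothetical, and the σ-electron counts are supplied by hand.

```python
def simple_delta(sigma_electrons: int, h: int) -> int:
    """Simple delta: count of sigma electrons minus bonded hydrogens."""
    return sigma_electrons - h


def valence_delta(z_valence: int, h: int, z: int) -> float:
    """General valence delta: (Z^v - h) / (Z - Z^v - 1)."""
    return (z_valence - h) / (z - z_valence - 1)


# Hypothetical example atoms: (label, sigma electrons, Z^v, Z, h)
examples = [
    ("-CH3", 4, 4, 6, 3),
    (">CH-", 4, 4, 6, 1),
    ("-OH", 2, 6, 8, 1),
    ("-Cl", 1, 7, 17, 0),
]

for label, sigma, zv, z, h in examples:
    print(label, simple_delta(sigma, h), round(valence_delta(zv, h, z), 2))
```

For first-row atoms the denominator Z - Z^{v} - 1 equals 1, so the general expression reduces to Z^{v} - h; for chlorine it gives 7/9, which rounds to the 0.78 listed in the output tables later in this chapter.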

The molecular connectivity indices or chi indices are symbolized
^{m}χ_{t}.
Substructures for a molecular skeleton are defined by the decomposition
of the skeleton into fragments of:

a) atoms (zero order, m = 0);

b) one bond paths (first order, m = 1);

c) two bond fragments (second order, m = 2);

d) three contiguous bond fragments (third order Path, m = 3, t = P);

and so forth. Other fragments include the cluster
(three atoms attached to a central atom, m = 3, t = C);
the path/cluster (equivalent to the isopentane skeleton, m = 4, t = PC);
the chain fragment (cycles of 3, 4, 5 . . . atoms, m = 3, 4, 5. . ., t = CH).

For each order and fragment type, a connectivity index may be calculated.
This calculation is made by multiplying the δ (or δ^{v}) values for each
atom in a fragment within a molecule. This product is then converted
to the reciprocal square root and called the connectivity subgraph
term ^{m}c_{i}. These terms are then summed over all the subgraphs
(of order m and type t) in the entire molecule, Ns, to calculate
the molecular connectivity index ^{m}χ_{t} of order m and type t:

^{m}c_{i} = Π_{k=1}^{m+1} (δ_{k})^{-0.5} and ^{m}χ_{t} = Σ_{i=1}^{Ns} ^{m}c_{i}

The valence molecular connectivity indices, ^{m}χ_{t}^{v}, are obtained in the same way with δ^{v} in place of δ.

The calculations are made from input information which includes the connection matrix, designation of the atom type, and the count of bonded hydrogens to each atom. The valence delta values are determined in the program from the formalism given above.

[Figure: structural formula and the corresponding hydrogen-suppressed graph (molecular skeleton)]

To calculate ^{1}χ^{v}, dissect the molecule into all its one-bond fragments,
shown below; each bond fragment is labeled with the
δ^{v} values.

For each fragment compute the subgraph contribution,

(δ^{v}_{i} * δ^{v}_{j})^{-0.5},

and sum over all the bond fragments:

^{1}χ^{v} = 3.6175.
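The subgraph-term recipe can be sketched as follows. The molecule here is 2-propanol, chosen only for illustration (it is not the structure of the worked example above), with valence delta values 1, 3, 1 and 5 for CH_{3}, CH, CH_{3} and OH; all names in the code are hypothetical.

```python
import math


def subgraph_term(deltas):
    """Reciprocal square root of the product of the delta values in one subgraph."""
    return math.prod(deltas) ** -0.5


def chi_index(delta_by_atom, subgraphs):
    """Sum the subgraph terms over all subgraphs of one order and type."""
    return sum(
        subgraph_term([delta_by_atom[atom] for atom in subgraph])
        for subgraph in subgraphs
    )


# Assumed example: 2-propanol skeleton, atoms 1 = CH3, 2 = CH, 3 = CH3, 4 = OH
delta_v = {1: 1.0, 2: 3.0, 3: 1.0, 4: 5.0}
one_bond_fragments = [(1, 2), (2, 3), (2, 4)]

chi1_v = chi_index(delta_v, one_bond_fragments)
print(round(chi1_v, 4))  # 1.4129
```

The same `chi_index` function works for higher orders once the list of subgraphs (paths, clusters, path/clusters, chains) has been enumerated; that enumeration is the hard part and is not attempted here.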

In an analogous manner the other connectivity indices can be calculated for path, cluster, path/cluster and ring type subgraphs. The finding of all the subgraphs of a given type becomes a difficult computational problem by hand. That is, of course, the reason for the development of Molconn-Z.

**Definition:** The Kappa shape indices are the basis of a method of
molecular structure quantitation in which attributes of molecular
shape are encoded into three indices (Kappa values). These
Kappa values are derived from counts of one-bond, two-bond and
three-bond fragments, each count being made relative to fragment
counts in reference structures which possess a maximum and minimum
value for that number of atoms.

- L. B. Kier,
*Quant. Struct.-Act. Relat.*,**4**, 109, (1985). A Shape Index from Molecular Graphs.

- L. B. Kier,
*Quant. Struct.-Act. Relat.*,**5**, 1, 7, (1985). Distinguishing Atom Differences in a Molecular Graph Shape Index.

- L. B. Kier,
*Quant. Struct.-Act. Relat.*,**6**, 8, (1987).

- L. B. Kier,
*Acta Pharm. Jugosl.*,**36**, 171, (1986). Indices of Molecular Shape from Chemical Graphs.

- L. B. Kier, "
*Indices of Molecular Shape from Chemical Graphs*" in Computational Chemical Graph Theory, D. H. Rouvray, Ed., Nova Press, (1989).

- L. H. Hall and L. B. Kier, "
*The Molecular Connectivity Chi Indices and Kappa Shape Indices in Structure-Property Modeling*", in Reviews of Computational Chemistry, Volume 2, D. B. Boyd and K. Lipkowitz, eds. (1991).

Below are the graphs for the structures corresponding to the maximum count of paths for orders 1, 2, and 3 in 6-atom molecules.

(a) (b) (c)

α = r(x)/r[C(sp^{3})] - 1

where r(x) is the covalent radius of atom x and r[C(sp^{3})] is the covalent radius of an sp^{3} carbon atom.

In some small molecules, certain of the ^{m}P quantities may not be
defined or are considered to be zero. This presents problems in
applying the kappa algorithm. We consider one approach to solving
this problem. The calculation of a ^{1}κ value is possible for any
molecule except those represented by a single point, i.e., methane.
In this case, ^{1}κ = 0 from the general equation for kappa-1.
In general, for a straight chain molecule, ^{1}κ = A (A being the count of atoms), and so an
extrapolated value of ^{1}κ = 1.000 is adopted for methane.

The calculation of ^{2}κ values leads to non-zero values in all cases
except for any graph representation of one or two atoms, such as
methane and ethane. In both cases, ^{2}κ from the general equation
is zero. In general, for straight chain molecules, ^{2}κ = A - 1.
Values of ^{2}κ = 1.000 for ethane and
^{2}κ = 0 for methane are extrapolated and proposed as useful in
cases where these molecules are part of a structure-activity analysis.

In the case of ^{3}κ values calculated from the general equations,
the values for methane, ethane, and propane are zero, while the value for butane is 4.000,
the same as for pentane. By linear extrapolation, more useful ^{3}κ
values for these molecules are derived: methane ^{3}κ = 0, ethane
^{3}κ = 1.450, propane ^{3}κ
= 2.000, and butane ^{3}κ = 3.378. These estimates are used to
provide numerical values for small molecules that may appear in a series.
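The general kappa equations are not reproduced in this section. As a minimal sketch, the forms below follow the Kier shape-index papers cited above and should be read as an assumption relative to this text. They do reproduce the values quoted here: ^{1}κ = A and ^{2}κ = A - 1 for straight chains, and ^{3}κ = 4.000 for both butane and pentane. The function names, the zero-path guard and the `EXTRAPOLATED` lookup are illustrative only.

```python
def kappa1(a: int, p1: int) -> float:
    """First-order kappa from atom count A and one-bond path count 1P."""
    return 0.0 if p1 == 0 else a * (a - 1) ** 2 / p1 ** 2


def kappa2(a: int, p2: int) -> float:
    """Second-order kappa from atom count A and two-bond path count 2P."""
    return 0.0 if p2 == 0 else (a - 1) * (a - 2) ** 2 / p2 ** 2


def kappa3(a: int, p3: int) -> float:
    """Third-order kappa; the assumed form differs for odd and even A."""
    if p3 == 0:
        return 0.0  # the text treats these undefined cases as zero
    if a % 2 == 1:
        return (a - 1) * (a - 3) ** 2 / p3 ** 2
    return (a - 3) * (a - 2) ** 2 / p3 ** 2


def chain_path_count(a: int, m: int) -> int:
    """Number of paths of m bonds in an unbranched chain of A atoms."""
    return max(a - m, 0)


# Extrapolated small-molecule values quoted in the text, keyed by (order, A)
EXTRAPOLATED = {
    (1, 1): 1.000,
    (2, 1): 0.0, (2, 2): 1.000,
    (3, 1): 0.0, (3, 2): 1.450, (3, 3): 2.000, (3, 4): 3.378,
}

for a in range(1, 7):
    row = [
        kappa1(a, chain_path_count(a, 1)),
        kappa2(a, chain_path_count(a, 2)),
        kappa3(a, chain_path_count(a, 3)),
    ]
    print(a, row)
```

A caller wanting the adjusted values for the straight-chain molecules methane through butane would consult `EXTRAPOLATED` first and fall back to the general functions otherwise.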

- L. H. Hall and L. B. Kier,
*Quant. Struct.-Act. Relat.*,**9**, 115 (1990).

- L.H. Hall, "Computational Aspects of Molecular Connectivity
and its Role in Structure-Activity Modeling" in
*Computational Chemical Graph Theory*, D. H. Rouvray, Ed., Nova Press, (1989).

- L. H. Hall and L. B. Kier, "
*The Molecular Connectivity Chi Indexes and Kappa Shape Indexes in Structure-Property Modeling*", in Reviews of Computational Chemistry, Volume 2, D. B. Boyd and K. Lipkowitz, eds., (1991).

Each atom in the skeleton structure of a molecule is identified by the valence delta values described above. Beginning with any atom i, all contiguous paths of atoms, emanating from that atom to each other atom j, are identified. The lowest order path is the atom itself. This process is followed by finding all first order paths (bonds containing atom i) and so on, ultimately including the longest path(s) terminating on atom i. A numerical value is calculated for each of these paths and is an entry in the Topological State Matrix T. Each entry is calculated according to the formula

t_{ij} = (GM_{ij})^{a}(d_{ij})^{b},

where GM_{ij} is the geometric mean of the valence delta values of the atoms in the path from i to j and d_{ij} is the number of atoms in that path.

The default option value <4> for the Topological State Algorithm in the Molconn-Z program uses a = +1 and b = -2:

t_{ij} = GM_{ij}/(d_{ij})^{2}.

Topological state matrix T for 2-propanol:

Atom i | Paths from atom i | t_{i1} | t_{i2} | t_{i3} | t_{i4} | Topological State Index, T_{i} |
---|---|---|---|---|---|---|
1 | CH_{3}-, CH_{3}CH-, CH_{3}CHCH_{3}, CH_{3}CHOH | 1.000 | 1.154 | 2.080 | 1.216 | |
2 | -CH<, >CHCH_{3}, >CHOH | 1.154 | 0.333 | 1.154 | 0.516 | |
3 | CH_{3}, CH_{3}CHOH | 2.080 | 1.154 | 1.000 | 1.216 | |
4 | -OH | 1.216 | 0.516 | 1.216 | 0.200 | |

Topological state matrix T for isobutane:

Atom i | Paths from atom i | t_{i1} | t_{i2} | t_{i3} | t_{i4} | Topological State Index, T_{i} |
---|---|---|---|---|---|---|
1 | CH_{3}-, CH_{3}CH-, CH_{3}CHCH_{3}, CH_{3}CHCH_{3} | 1.000 | 1.154 | 2.080 | 2.080 | |
2 | -CH<, >CHCH_{3}, >CHCH_{3} | 1.154 | 0.333 | 1.154 | 1.154 | |
3 | CH_{3}, CH_{3}CHCH_{3} | 2.080 | 1.154 | 1.000 | 2.080 | |
4 | -CH_{3} | 2.080 | 1.154 | 2.080 | 1.000 | |

In these examples it can be observed that topological equivalence is indicated by
the T_{i} values. In 2-propanol the topological equivalence of the two methyl groups
is shown by the fact that T_{1} = T_{3}; no other values are equal in accordance with the
fact that no other atoms are topologically equivalent. In isobutane, three T_{i}
values are equal, T_{1} = T_{3} = T_{4}, in keeping with the fact that the three methyl
groups are equivalent and the central methine group is unique. In this sense,
the topological state index values represent the topological equivalence
(topological symmetry) of the molecule. The pattern of T_{i} values for a portion
of a molecule appears to be characteristic of that fragment and may be used as
a basis for quantitative measures of fragment similarity.
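The matrix entries for 2-propanol can be reproduced with a short sketch. Several points are assumptions inferred from the tabulated numbers rather than stated in the text: GM_{ij} is taken as the geometric mean of the valence delta values along the path, d_{ij} as the number of atoms in the path, and the exponents a = -1, b = +1 are the pair that matches the tables (the default option <4>, a = +1 and b = -2, gives different numbers). Using the row sum as T_{i} is likewise an assumption. The path search only handles acyclic skeletons.

```python
import math


def path_between(adj, start, goal):
    """Return the unique atom path between two atoms of an acyclic skeleton."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in adj[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    raise ValueError("atoms are not connected")


def topological_state_matrix(adj, delta_v, a, b):
    """Entries t_ij = GM_ij**a * d_ij**b for every ordered atom pair."""
    t = {}
    for i in adj:
        for j in adj:
            path = path_between(adj, i, j)
            d = len(path)  # number of atoms in the path
            gm = math.prod(delta_v[k] for k in path) ** (1.0 / d)
            t[(i, j)] = gm ** a * d ** b
    return t


# 2-propanol: 1 = CH3, 2 = CH, 3 = CH3, 4 = OH
adj = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}
delta_v = {1: 1.0, 2: 3.0, 3: 1.0, 4: 5.0}

t = topological_state_matrix(adj, delta_v, a=-1, b=1)
row_sums = {i: sum(t[(i, j)] for j in adj) for i in adj}  # assumed T_i

for i in adj:
    print(i, [round(t[(i, j)], 3) for j in adj])
```

With these assumptions the two methyl groups receive identical rows, which is the topological equivalence discussed above.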

- L. B. Kier and L. H. Hall, "Molecular Structure Description: The Electrotopological State",
Academic Press (1999).

- L. B. Kier and L. H. Hall, in
*Advances in Drug Design*, Vol. 22, ed. B. Testa, Academic Press (1992). The Electrotopological State Index: An Atom-Centered Index for QSAR.

- L. B. Kier and L. H. Hall,
*Pharmaceutical Res.*,**7**, 801, (1990). An Electrotopological State Index for Atoms in Molecules.

- L. H. Hall, B. K. Mohney and L. B. Kier,
*J. Chem. Inf. Comput. Sci.*,**31**, 76, (1991). The Electrotopological State: Structure Information at the Atomic Level for Molecular Graphs.

- L. H. Hall, B. K.
Mohney and L. B. Kier,
*Quant. Struct.-Act. Relat.*,**10**, 43, (1991). The Electrotopological State: An Atom Index for QSAR.

- L. B. Kier and L. H. Hall,
*J. Math. Chem.*,**7**, 229 (1991). An Index of Electrotopological State of Atoms in Molecules.

- L. B. Kier and L. H. Hall, "An Index of Atom Electrotopological State",
in
*QSAR in the Design of Bioactive Compounds*", A telesymposium, A. Biaggi, ed., J. R. Prous Publishers, S.A. (1992).

- L. H. Hall, B. K. Mohney, and L. B. Kier, "Comparison of Electrotopological State
Indexes with Molecular Orbital Parameters: Inhibition of MAO by Hydrazides", Quant. Struct.-Act. Relat.,
**12**, 44-48 (1993).

I_{i} = [(2/N)^{2} δ^{v} + 1]/δ

where N is the principal quantum number of the valence shell of the atom. The intrinsic state encodes the
valence state electronegativity of the atom as well as its local topology through the use of the molecular
connectivity simple and valence delta values, δ and δ^{v}.

The perturbation on atom i, arising from the presence
of all other atoms j, is a function of the difference between the intrinsic state values: I_{i} - I_{j}. The perturbation is
diminished over distance; the functional dependence of the diminution is taken to be the square of the count of atoms
in the shortest path between atoms i and j (r_{ij}).
The perturbations are summed over the whole molecule:

ΔI_{i} = Σ_{j} (I_{i} - I_{j})/r_{ij}^{2}

The electrotopological state, called the E-state, of atom i, S_{i}, is given as the sum
of the intrinsic state and the perturbations:

S_{i} = I_{i} + ΔI_{i}

The E-state index values S_{i} are output to the .S file if the appropriate records are
selected and the .S file is selected in the Menu.

An example calculation of the contributions to the electrotopological state indices is given for alanine.

Intrinsic state values:

Atom i | I_{i} | Atom i | I_{i} |
---|---|---|---|
1 | 2.000 | 4 | 6.000 |
2 | 1.333 | 5 | 4.000 |
3 | 1.667 | 6 | 7.000 |

(I_{i} - I_{j})/r_{ij}^{2} Matrix:

i | 1 | 2 | 3 | 4 | 5 | 6 | ΔI_{i} = row sum |
---|---|---|---|---|---|---|---|
1 | 0.0 | 0.1667 | 0.0370 | -0.2500 | -0.2222 | -0.3125 | -0.5810 |
2 | -0.1667 | 0.0 | -0.0833 | -0.5185 | -0.6667 | -0.6296 | -2.0648 |
3 | -0.0370 | 0.0833 | 0.0 | -1.0833 | -0.2593 | -1.3333 | -2.6296 |
4 | 0.2500 | 0.5185 | 1.0833 | 0.0 | 0.1250 | -0.1111 | 1.8657 |
5 | 0.2222 | 0.6667 | 0.2593 | -0.1250 | 0.0 | -0.1875 | 0.8356 |
6 | 0.3125 | 0.6296 | 1.3333 | 0.1111 | 0.1875 | 0.0 | 2.5741 |
sum | | | | | | | 0.0000 |

The summary of calculated E-State values for alanine follows from S_{i} = I_{i} + ΔI_{i}.
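The alanine calculation can be sketched end to end. The atom numbering is an assumption inferred from the tabulated intrinsic states (1 = CH_{3}, 2 = CH, 3 = C of the carboxyl group, 4 = OH, 5 = NH_{2}, 6 = carbonyl O), as are the function names; the intrinsic state is taken as [(2/N)^{2} δ^{v} + 1]/δ with N = 2 for every atom here, and r_{ij} counts the atoms in the shortest path.

```python
from collections import deque


def intrinsic_state(delta_v: float, delta: float, n: int = 2) -> float:
    """Intrinsic state I = [(2/N)^2 * delta_v + 1] / delta."""
    return ((2.0 / n) ** 2 * delta_v + 1.0) / delta


def atoms_in_shortest_paths(adj, start):
    """Breadth-first search returning r = number of atoms on the shortest path."""
    r = {start: 1}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in r:
                r[v] = r[u] + 1
                queue.append(v)
    return r


def e_state(adj, intrinsic):
    """Return (perturbation, S) dictionaries: S_i = I_i + sum_j (I_i - I_j)/r_ij^2."""
    perturbation, s = {}, {}
    for i in adj:
        r = atoms_in_shortest_paths(adj, i)
        perturbation[i] = sum(
            (intrinsic[i] - intrinsic[j]) / r[j] ** 2 for j in adj if j != i
        )
        s[i] = intrinsic[i] + perturbation[i]
    return perturbation, s


# Assumed alanine skeleton and (delta_v, delta) pairs
adj = {1: [2], 2: [1, 3, 5], 3: [2, 4, 6], 4: [3], 5: [2], 6: [3]}
deltas = {1: (1, 1), 2: (3, 3), 3: (4, 3), 4: (5, 1), 5: (3, 1), 6: (6, 1)}

intrinsic = {i: intrinsic_state(dv, d) for i, (dv, d) in deltas.items()}
perturbation, s = e_state(adj, intrinsic)

for i in adj:
    print(i, round(intrinsic[i], 3), round(perturbation[i], 4), round(s[i], 3))
```

Because every pairwise term appears once with each sign, the perturbations sum to zero and the E-state values sum to the total of the intrinsic states.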

The electrotopological state indices have been used in developing QSAR relations for a variety of biological properties in addition to certain physicochemical properties.

- G. E. Kellogg, L. B. Kier, P. Gaillard and L. H. Hall, "The E-State Fields.
Applications to 3D QSAR",
*J. Comp. Aid. Molec. Des.***10**, 513-520, (1996).

The method described in the previous section applies to each skeletal atom (together with its attached hydrogen atoms). It is useful, however, to develop an E-state index for the hydrogen atoms alone. This is especially useful for hydrogen atoms which are described as polar. For this reason we have adopted the same formalism as given above but with a somewhat different approach for the intrinsic state value. We have taken I(H) to be primarily dependent upon the attached atom and have used the following expression:

I(H) = (δ^{v} - δ)^{2}/δ

This expression gives rise to the values in the following table.

X-H | (δ^{v} - δ) of X | I(H) |
---|---|---|
-OH | 4 | 16.0 |
=NH | 3 | 9.0 |
-NH_{2} | 2 | 4.0 |
≡CH | 2 | 4.0 |
-NH- | 2 | 2.0 |
=CH_{2} | 1 | 1.0 |
=CH- | 1 | 1.0 |
-CH< | 0 | 0.0 |
-CH_{2}- | 0 | 0.0 |
-CH_{3} | 0 | 0.0 |

As an example of the hydrogen E-state index values, consider the following for 3,3-dimethyl-4,5-dichlorohexanol. Polar hydrogen atoms have very large positive values; less polar hydrogens have proportionally smaller values. Even hydrogen atoms located on carbon atoms near halogen atoms are shown to be influenced.

Atom ID | Group | Valence Delta | HES, S_{i} | Intrinsic State, I_{i} | Connected Atoms |
---|---|---|---|---|---|
1 | CH_{3} | 1.00 | -1.382 | 0.000 | 2 |
2 | CH | 3.00 | 0.089 | 0.000 | 1 3 4 |
3 | Cl | 0.78 | 0.000 | 0.000 | 2 |
4 | CH | 3.00 | 0.073 | 0.000 | 2 5 6 |
5 | Cl | 0.78 | 0.000 | 0.000 | 4 |
6 | C | 4.00 | 0.000 | 0.000 | 4 7 8 9 |
7 | CH_{3} | 1.00 | -1.224 | 0.000 | 6 |
8 | CH_{3} | 1.00 | -1.224 | 0.000 | 6 |
9 | CH_{2} | 2.00 | -2.361 | 0.000 | 6 10 |
10 | CH_{2} | 2.00 | -4.374 | 0.000 | 9 11 |
11 | OH | 5.00 | 25.979 | 16.000 | 10 |

These values for I are used to compute an E-State index for each hydrogen atom in the molecule. These are stored in the .S file as HES.

The formalism for the hydrogen E-State is parallel to that for the E-State index of each atom (or
hydride group). For the hydrogen E-State the intrinsic state value is set to zero. The
perturbation term (HI_{ij}) depends on the valence
state electronegativity of the hydride group as given by the Kier-Hall (relative)
electronegativity: RKHE = (δ^{v} - δ)/n^{2}, where n is the principal quantum number
of the valence electrons of the atom. For the hydrogen E-State the perturbation term is
HI_{ij} = (RKHE_{i} - RKHE_{j})/r_{ij}^{2}, in which r_{ij} is the number of atoms
in the shortest path between atoms i and j. In addition to the perturbation
contribution for every other hydride group, an additional contribution arises from the bond
between the hydrogen atom and the atom to which it is bonded. Thus, for the hydrogen E-State
value on atom i, HS(i) = Σ_{j} HI_{ij} + ((-0.2) - RKHE_{i}). The
electronegativity of hydrogen is taken as -0.2 (on a scale in which C(sp^{3}) is set
to zero). For simplicity of use the actual value computed by Molconn-Z is taken as the
negative of the value defined above to make all the values positive: HS(i) <-- -HS(i).

In the manner described here, Molconn-Z computes the hydrogen E-State values for each atom, a parallel set of values to the E-State values for each atom. The hydrogen E-State values are not highly correlated with the E-State values. The hydrogen E-State values are useful for characterizing both the polar and nonpolar portions of the structure. This new formalism repairs a bug in the earlier algorithm in which molecular symmetry was not always kept.

- L. H. Hall and L. B. Kier, "Electrotopological State
Indices for Atom Types: A Novel Combination of Electronic, Topological,
and Valence State Information",
*J. Chem. Inf. Comput. Sci.*,**35**, 1039-1045 (1995).

- L. H. Hall, L. B. Kier and B. B. Brown, "Molecular Similarity
Based on Novel Atom Type Electrotopological State Indices",
*J. Chem. Inf. Comput. Sci.*,**35**, 1074-1080 (1995).

- L. H. Hall and C. T. Story, "Boiling Point of a Set of Alkanes,
Alcohols and Chloroalkanes: QSAR with Atom Type Electrotopological
State Indices using Artificial Neural Networks",
*SAR and QSAR in Environ. Res.* (manuscript submitted).

For data sets with a wide variety of atom types, it is useful to have E-state indices for each atom type. Molconn-Z computes the E-state for each skeletal atom (with attached hydrogens) and then sums the values for all the atoms of a given atom type. These atom type E-state indices are then output to the .S file for use in analysis.

Two papers have been written and a third submitted which discuss the development and use of the atom type E-state indices. In the first paper the definitions are given along with an application to the boiling point of alkanes and alcohols. In the second paper the atom type E-state indices are used to define a molecular similarity measure and as a basis for database searching. In the third paper atom type E-state indices are used to model the boiling point of a set of 372 alkanes, alcohols and chloroalkanes using artificial neural networks.

The following table shows the results of a sample computation for the molecule 3,3-dimethyl-4,5-dichlorohexanol. Part of the Molconn-Z output in the .L file is given here to show the E-state indices. Following that table is the table of atom type E-state indices.

Atom ID | Group | Valence Delta | E-State, S_{i} | Intrinsic State, I_{i} | Connected Atoms |
---|---|---|---|---|---|
1 | CH_{3} | 1.00 | 1.87336 | 2.000 | 2 |
2 | CH | 3.00 | -0.05721 | 1.333 | 1 3 4 |
3 | Cl | 0.78 | 5.83490 | 4.111 | 2 |
4 | CH | 3.00 | -0.08674 | 1.333 | 2 5 6 |
5 | Cl | 0.78 | 6.04310 | 4.111 | 4 |
6 | C | 4.00 | -0.08584 | 1.250 | 4 7 8 9 |
7 | CH_{3} | 1.00 | 2.01366 | 2.000 | 6 |
8 | CH_{3} | 1.00 | 2.01366 | 2.000 | 6 |
9 | CH_{2} | 2.00 | 0.69269 | 1.500 | 6 10 |
10 | CH_{2} | 2.00 | 0.16650 | 1.500 | 9 11 |
11 | OH | 5.00 | 8.73082 | 6.000 | 10 |

A simple set of symbols was developed for the atom type E-state.
For the methyl groups the symbol is SsCH3.
For methylene it is SssCH2; for the terminal double-bonded CH_{2},
it is SdCH2 and for the keto oxygen, it is SdO. A set of the symbols
is given in Appendix III, Section A.
For the molecule given above, the atom type E-state indices are as follows:

Atom Type | E-state Value |
---|---|
SsCH3 | 5.901 |
SssCH2 | 0.860 |
SsssCH | -0.144 |
SssssC | -0.086 |
SsOH | 8.731 |
SsCl | 11.878 |
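The atom type indices are sums of the per-atom E-state values listed in the atom table for this molecule. A minimal sketch of that grouping step follows; the assignment of each atom to a symbol is done by hand here and is an assumption, since the program's own typing rules are given in Appendix III.

```python
from collections import defaultdict

# (assumed atom type symbol, E-state value S_i) for atoms 1 to 11 of
# 3,3-dimethyl-4,5-dichlorohexanol, copied from the atom table above
atoms = [
    ("SsCH3", 1.87336),
    ("SsssCH", -0.05721),
    ("SsCl", 5.83490),
    ("SsssCH", -0.08674),
    ("SsCl", 6.04310),
    ("SssssC", -0.08584),
    ("SsCH3", 2.01366),
    ("SsCH3", 2.01366),
    ("SssCH2", 0.69269),
    ("SssCH2", 0.16650),
    ("SsOH", 8.73082),
]


def atom_type_e_states(typed_values):
    """Sum the E-state values of all atoms sharing an atom type symbol."""
    totals = defaultdict(float)
    for symbol, value in typed_values:
        totals[symbol] += value
    return dict(totals)


totals = atom_type_e_states(atoms)
for symbol, total in totals.items():
    print(symbol, round(total, 3))
```

Atom types that do not occur in the molecule simply do not appear in the result; a descriptor file with fixed columns would fill those with zero.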

HCsats: Carbon sp3 bonded to other saturated carbon atoms

HCsatu: Carbon sp3 bonded to unsaturated carbon atoms

HdsCH: Carbon atoms in the vinyl group, =CH-

HdCH2: Carbon atoms in the terminal vinyl group, =CH2

Havin: Carbon atoms in the vinyl group, =CH-, bonded to an aromatic carbon

HaaCH: Carbon sp2 which are part of an aromatic system

All of the above descriptors are found in record #31 in the .S file. In addition there is a general descriptor for non-polar hydrogen atoms which is the sum of the Hydrogen E-State values for all non-polar C-H bonds, Hother, also found in record #31 in the .S file.

Descriptors that represent the potential for internal hydrogen bonding are determined as follows. A donor and an acceptor are separated by n bonds along a path; the donor is characterized by its Hydrogen E-State value and the acceptor by its E-State value. The internal hydrogen bond descriptor, SHBintn (where n is the path length), is computed as the product of the Hydrogen E-State value and the E-State value. The most likely occurrence of internal hydrogen bonding is for paths of length 3 and 4. The complete hydrogen-bonded internal ring also includes the X-H bond as well as the hydrogen bond; thus, the ring size is n + 2. Currently, descriptors are computed for values of n from 2 to 10, although only those for n = 3 and 4 are considered most useful. The counts of the corresponding occurrences are also included as nHBintn. See Appendix I for the specific output location of these descriptors.

I_{ij} = (I_{i} + I_{j})/2

BES_{ij} = I_{ij} + ΔI_{ij}/r̄_{ij}

where r̄_{ij} is computed as the average r

These computed values for individual bonds are then collected for each type of bond in the molecule. Appendix III lists the current set of bond types, numbering 895 at present.

The symbol for each bond type consists of an indicator for bond order,
e1, e2, e3 or ea. The two atom types are encoded next. Finally, if necessary,
an indication of unused bonds is given. For the bond between -CH_{3} and -CH_{2}-,
the symbol is e1C1C2. For the bond between =CH- and -CH_{2}-, it is e1C2C2d.
For the bond between =CH- and =CH-, there are two possibilities: e1C2C2d
and e2C2C2s. The list of bond types and their symbols is given in
Appendix III.

For the molecule 3,3-dimethyl-4,5-dichlorohexanol, the following tables of values are given in the Molconn-Z .L file.

*--> Bond E-State indices from bond types

Bond No. | Atom ID i | Atom ID j | Atom Type i | Atom Type j | Bond Order | Bond Type No. | Bond E-State | Symbol |
---|---|---|---|---|---|---|---|---|
1 | 1 | 2 | -CH3 | >CH- | 1 | 69 | 1.383 | e1C1C3 |
2 | 2 | 3 | >CH- | -Cl | 1 | 290 | 3.734 | e1C3Cl |
3 | 2 | 4 | >CH- | >CH- | 1 | 264 | .389 | e1C3C3 |
4 | 4 | 5 | >CH- | -Cl | 1 | 290 | 3.973 | e1C3Cl |
5 | 4 | 6 | >CH- | >C< | 1 | 268 | .359 | e1C3C4 |
6 | 6 | 7 | >C< | -CH3 | 1 | 73 | 1.491 | e1C1C4 |
7 | 6 | 8 | >C< | -CH3 | 1 | 73 | 1.491 | e1C1C4 |
8 | 6 | 9 | >C< | -CH2- | 1 | 138 | .728 | e2C2C4 |
9 | 9 | 10 | -CH2- | -CH2- | 1 | 132 | .843 | e1C2C2 |
10 | 10 | 11 | -CH2- | -OH | 1 | 148 | 5.220 | e1C2O1 |

*--> Bond Type Indices (NET, ETS):

Bond Type No. | Bond Count | Bond E-State | Symbol |
---|---|---|---|
69 | 1 | 1.38273 | e1C1C3 |
73 | 2 | 2.98245 | e1C1C4 |
132 | 1 | .84259 | e1C2C2 |
138 | 1 | .72820 | e2C2C4 |
148 | 1 | 5.21986 | e1C2O1 |
264 | 1 | .38944 | e1C3C3 |
268 | 1 | .35870 | e1C3C4 |
290 | 2 | 7.70714 | e1C3Cl |

These indices have not yet been described in the literature. The bond type E-state indices are output to the .E file by Molconn-Z. Two categories may be selected, inorganic or organic, in the Menu.

For molecular connectivity indices, when a data set consists of diverse molecular structures, the collinearity is usually not large enough to pose a significant problem. However, when the set of molecules covers a large range in molecular size and/or there is not a high degree of structure diversity, collinearity can become a problem in various forms of statistical analysis.

There are well known methods for introducing orthogonality into a set of variables. Principal component
analysis (PCA) is often used. In this method, the set of variables, x_{i}, is transformed
by linear combinations into a new set, called x_{i}'. The new variables are obtained
as the eigenvectors of either the variance/covariance matrix or the correlation matrix of the
original variables. In this rather straightforward manner, each new principal component is a
linear combination of all the original x_{i} variables. Such an arrangement facilitates
subsequent statistical analysis but interpretation is obscured because each principal component
is a linear combination of every variable in the data set.

A second method is to use a Gram-Schmidt type orthogonalization (GSO). For a set of variables
(x_{i}, i = 1, n), x_{1} is selected as the first new variable,
x_{1}'. x_{2}' is created to be orthogonal to x_{1}'. A simple way
to obtain x_{2}' is to regress x_{2} against x_{1}' (= x_{1}).
The set of residuals from this regression, res_{2}, is set equal to x_{2}'.
Because res_{2} is that part of x_{2} which is not correlated with x_{1},
res_{2} (= x_{2}') is orthogonal to x_{1}. One can
proceed in this manner to create a set of orthogonal indices, x_{i}'. This method
is a bit more cumbersome than using principal components. The first two or three of these
x_{i}' indices may also have more interpretability than the principal components.
However, ultimately, interpretability is largely lost. Further, in both of these methods,
the set of orthogonal indices is dependent upon the specific data set from which they were
generated. Adding a few new observations tends to diminish the orthogonality as well as the
interpretation given to the orthogonal indices.
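The residual-based orthogonalization can be sketched with ordinary least squares on one predictor at a time. This is an illustration under stated assumptions: the data values are invented, the function names are hypothetical, and "orthogonal" is taken to mean zero covariance between the new variables.

```python
def mean(values):
    return sum(values) / len(values)


def covariance_sum(u, v):
    """Sum of products of deviations from the means (unnormalized covariance)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def residuals(y, x):
    """Residuals from the simple regression y = intercept + slope * x."""
    slope = covariance_sum(x, y) / covariance_sum(x, x)
    intercept = mean(y) - slope * mean(x)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def orthogonalize(columns):
    """Keep the first variable; replace each later one by its residuals
    after regressing, in turn, on every new variable created so far."""
    new_columns = [list(columns[0])]
    for column in columns[1:]:
        r = list(column)
        for basis in new_columns:
            r = residuals(r, basis)
        new_columns.append(r)
    return new_columns


# Invented, strongly collinear example data (five observations, three indices)
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.8, 5.3]
x3 = [2.0, 1.0, 4.0, 3.0, 6.0]

new_x = orthogonalize([x1, x2, x3])
print([round(covariance_sum(new_x[0], new_x[1]), 12),
       round(covariance_sum(new_x[0], new_x[2]), 12),
       round(covariance_sum(new_x[1], new_x[2]), 12)])
```

The result depends on the order in which the variables are taken and on the observations used, which is the data-set dependence noted above.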

Another method which we have been using approaches the problem somewhat differently. A major aspect of the collinearity problem is the significant contribution of molecular size to the constitution of the chi indices. In fact, when the data set spans a large size range, the collinearity tends to increase significantly. For this reason, we have chosen to create a largely, but not exactly, orthogonal index which is largely independent of molecular size. After size is removed, the salient aspects of structure encoded in the chi indices remain. These important aspects (branching, cyclization, heteroatom content, heteroatom position) are emphasized in these new indices.

For each order of chi path index for a given molecule, we created an index to encode size only
for each order of path. This quantity is defined for the graph of the same size as the given
molecule but for the unbranched, acyclic graph: ^{m}χ_{N}
for the simple path index and ^{m}χ_{N}^{v} for the valence path index. The new, largely size-independent
chi index of order m is called the **difference chi index**. The difference chi
index is defined for both the simple path index and the valence path index of order m.

d^{m}χ_{N} = ^{m}χ_{P}
- ^{m}χ_{N}, for simple path indices

d^{m}χ_{N}^{v} =
^{m}χ_{P}^{v} -
^{m}χ_{N}^{v}, for valence path indices

In this manner, the size factor in the chi path index is subtracted out, leaving an index which is very nearly independent of size. For alkanes, the relation is exact. However, for heteroatom-containing indices, the relation is not exact. The size-independence of each difference index depends somewhat on the specific set of molecules. However, the definition of the difference chi index is independent of the data set. The difference chi indices are output to the .S file if it is selected in the Menu and if the appropriate records are selected for the .S file.
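For the first-order simple index the subtraction can be sketched directly, since the one-bond paths are just the bonds. The sketch assumes the simple delta of each atom is its degree in the hydrogen-suppressed graph and builds the unbranched reference chain with the same atom count; names are illustrative, and higher orders would need path enumeration that is not shown.

```python
from collections import Counter


def chi1(bonds):
    """First-order simple chi: sum over bonds of (delta_i * delta_j)**-0.5,
    with delta taken as the vertex degree in the skeleton."""
    degree = Counter()
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    return sum((degree[i] * degree[j]) ** -0.5 for i, j in bonds)


def chain_bonds(n_atoms):
    """Bond list of the unbranched, acyclic graph with n_atoms atoms."""
    return [(k, k + 1) for k in range(1, n_atoms)]


def difference_chi1(bonds):
    """Difference chi index of order 1: molecule minus same-size chain."""
    n_atoms = len({atom for bond in bonds for atom in bond})
    return chi1(bonds) - chi1(chain_bonds(n_atoms))


isobutane = [(1, 2), (2, 3), (2, 4)]
print(round(chi1(isobutane), 4), round(difference_chi1(isobutane), 4))
```

An unbranched alkane skeleton is its own reference graph, so its first-order difference index is exactly zero; branching makes the value negative.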

The difference chi indices are illustrated in the following table for Path-1 and Path-3.

Graph | ^{1}χ_{N} | d^{1}χ | ^{1}χ_{N}^{v} | d^{1}χ^{v} | ^{3}χ_{N} | d^{3}χ_{P} | ^{3}χ_{N}^{v} | d^{3}χ_{P}^{v} |
---|---|---|---|---|---|---|---|---|

It is possible, then, to remove that part of each chi index which encodes only the sigma
electrons and leave that part which encodes the pi and lone pair electrons. This process simply involves subtraction
of the simple index from the corresponding valence index of the same order. These indices are called the **delta
chi indices** and are defined as follows.

Δ^{m}χ_{t} = ^{m}χ_{t}^{v} - ^{m}χ_{t}

where t stands for the types of chi indices: path (P), cluster (C), path-cluster (PC), and chain (CH).

The following table gives examples for the delta chi indices for four molecules and the
three indices: Δ^{1}χ_{t}, Δ^{3}χ_{P}, Δ^{4}χ_{PC}.

Graph | Δ^{1}χ_{t} | Δ^{3}χ_{P} | Δ^{4}χ_{PC} |
---|---|---|---|

- L. H. Hall and L. B. Kier, "The Molecular Connectivity Chi Indexes and Kappa Shape Indexes in Structure-Property
Relations", in
*Reviews of Computational Chemistry*, Volume 2, Chap 9, pp 367-422, Donald Boyd and Ken Lipkowitz, eds., VCH Publishers, Inc. (1991).

- L. B. Kier and L. H. Hall, "Differential Molecular Connectivity in Database Fragment Searching",
*Pharm. Res.*,**6**, 497 (1989).

- L. B. Kier and L. H. Hall, "A Differential Molecular Connectivity Index",
*Quant. Struct.-Act. Relat.*,**10**, 134 (1991).

The delta chi indices are not included in the .S output file. The user can easily produce the delta chi indices in the processing of the .S file in conjunction with the statistical analysis of the data set.

It should be pointed out that it is also possible to construct the sum of a simple chi index and its valence counterpart:

^{m}χ_{t}^{v} + ^{m}χ_{t}

A given **sum chi index** is, in general, orthogonal to the delta chi index of the same order.
Thus, it is possible to reduce collinearity in the data set by converting from the chi indices to the set of
delta and sum chi indices.
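Since the delta chi indices are not written to the .S file, a user would form them from the simple and valence indices. A minimal sketch for the first-order index of a single molecule follows; 2-propanol is an assumed example, the delta values are entered by hand, and the names are illustrative.

```python
def chi1(delta, bonds):
    """First-order chi from a delta value per atom and a list of bonds."""
    return sum((delta[i] * delta[j]) ** -0.5 for i, j in bonds)


# Assumed example: 2-propanol, atoms 1 = CH3, 2 = CH, 3 = CH3, 4 = OH
bonds = [(1, 2), (2, 3), (2, 4)]
simple_delta = {1: 1, 2: 3, 3: 1, 4: 1}
valence_delta = {1: 1, 2: 3, 3: 1, 4: 5}

chi = chi1(simple_delta, bonds)
chi_v = chi1(valence_delta, bonds)

delta_chi = chi_v - chi  # delta chi index: valence minus simple
sum_chi = chi_v + chi    # sum chi index: valence plus simple

print(round(delta_chi, 4), round(sum_chi, 4))
```

For a hydrocarbon with no pi or lone pair electrons the two delta sets coincide and the delta chi index is zero; here the hydroxyl oxygen is the only source of the difference.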

Petitjean also defined a graph shape index: I = (D - R)/D. This index characterizes the shape of various graphs. For any graph with all equivalent vertices, D = R and I = 0. Hence, a purely monocyclic graph has I = 0. For an acyclic graph, either D is even and D = 2R or D is odd and D = 2R - 1. For n-hexane R = 3 and D = 5 and I = 2/5; for n-heptane R = 3 and D = 6 and I = 3/6 = 1/2.

The vertex eccentricities are output to the .S file if it is selected in the Menu and if the appropriate records are selected for the .S file.
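The radius, diameter and shape index can be sketched from vertex eccentricities obtained by breadth-first search. The sketch assumes the eccentricity of a vertex is its largest topological distance (in bonds) to any other vertex, with R the smallest and D the largest eccentricity, and uses the form I = (D - R)/D given above; the helper names are illustrative.

```python
from collections import deque


def eccentricities(adj):
    """Largest shortest-path distance (in bonds) from each vertex."""
    result = {}
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        result[start] = max(dist.values())
    return result


def petitjean(adj):
    """Return (R, D, I) with I = (D - R) / D, and I = 0 for a single vertex."""
    ecc = eccentricities(adj)
    radius, diameter = min(ecc.values()), max(ecc.values())
    index = 0.0 if diameter == 0 else (diameter - radius) / diameter
    return radius, diameter, index


def chain(n):
    """Adjacency of an unbranched chain with vertices 1..n."""
    return {k: [m for m in (k - 1, k + 1) if 1 <= m <= n] for k in range(1, n + 1)}


def ring(n):
    """Adjacency of a monocyclic graph with vertices 1..n."""
    return {k: [(k - 2) % n + 1, k % n + 1] for k in range(1, n + 1)}


for name, graph in [("n-hexane", chain(6)), ("n-heptane", chain(7)), ("cyclohexane", ring(6))]:
    print(name, petitjean(graph))
```

The chain and ring helpers only build hydrogen-suppressed skeletons; heteroatoms and bond orders play no role in this index.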

- M. Petitjean, "Applications of the Radius-Diameter Diagram to the Classification of Topological and Geometrical
Shapes of Chemical Compounds",
*J. Chem. Inf. Comput. Sci.*,**32**, 331-337 (1992).

A major step towards this goal has been recently published. A target
property or activity value together with the QSAR equations based on chi and/or kappa indices yields target values
for ^{1}χ, ^{2}χ and
^{3}χ_{P} indices. Using these low order molecular connectivity indices,
we have developed a formal scheme for conversion of the chi indices into path counts and subsequently
into counts of atom types (vertex degrees). These atom types can be assembled directly into molecular
structures.

chi indices --> path counts --> vertex degrees (atom types) --> graphs

The initial papers in the series deal with the molecular skeletons using only the simple chi indices. Incorporation of heteroatoms and bonding schemes are added from the experience of the user and the nature of the data sets and the property of interest. Current work is aimed at a broader scheme which will include heteroatoms and their placement directly.

The equations which relate the basic graph quantities have been derived in a rigorous manner. These relating equations contain path counts, vertex degree counts and the number of rings:

^{1}D = - ^{4}D + ^{2}p - ^{1}p + 3 -3R

^{2}D = 3^{4}D - 2^{2}p + 3^{1}p - 3 +3R

^{3}D = -3^{4}D + ^{2}p - ^{1}p + 1 -R

where ^{i}D is the number of vertices of degree i; ^{i}p is the number of paths of length i edges;
and R is the number of rings in the graph. See references 1 and 2 below for the details of the process.
From the set of vertex degrees, Faradzhev has shown that it is possible to construct all the graphs
consistent with that set of vertex degrees. The references below give detailed examples of the overall
scheme and each of its steps.

The inverse QSAR process is not part of the Molconn software. However, the necessary chi indices along with the path counts (nxpi) and vertex degree counts (ndi) are all computed by Molconn-Z. See Appendix I for the list of variables in the .S file.

- L. B. Kier, L. H. Hall and J. W. Frazer, "Design of Molecules from Quantitative
Structure-Activity Relationship Models. 1. Information Transfer between Path and Vertex Degree
Counts",
*J. Chem. Inf. Comput. Sci.*,**33**, 143 (1992).

- L. H. Hall, L. B. Kier and J. W. Frazer, "Design of Molecules from Quantitative
Structure-Activity Relationship Models. 2. Derivation and Proof of Information Transfer Relating
Equations",
*J. Chem. Inf. Comput. Sci.*,**33**, 143 (1992).

- L. H. Hall and L. B. Kier, "Design of Molecules from Quantitative Structure-Activity Relationship Models. 3. Role of Higher Order Path Counts: Path Three",
*J. Chem. Inf. Comput. Sci.*,**33**, 598-603 (1993).

- L. B. Kier and L. H. Hall, "The Generation of Molecular
Structures from a Graph-Based Equation",
*Quant. Struct.-Act. Relat.*,**12**, 383-388 (1993).

For our purposes here we will give the overall formal scheme as illustrated by one simple example taken from our first paper on the inverse QSAR investigation.

 | QSAR Equations | Target Property Value |
---|---|---|
a. | mv = 42.87 + 27.43 * ^{1}χ + 7.69 * ^{2}χ - 3.50 * ^{3}χ_{ } | mv = 158. to 162. |
b. | mv = 44.40 + 25.49 * ^{1}χ + 7.49 * ^{2}χ - 0.058 * ^{4}χ_{P} | |
c. | mv = 48.09 + 23.82 * ^{1}χ + 8.03 * ^{2}χ - 9.70 * ^{6}χ_{P} | |
d. | mv = 40.50 + 35.72 * ^{1}χ - 4.64 * ^{3}χ_{P} - 17.72 * ^{6}χ_{P} | |
e. | mv = 47.67 + 31.14 * ^{1}χ + 0.675 * ^{4}χ_{P} - 7.22 * ^{6}χ_{P} | |

^{1}χ = 3.498 to 3.616

^{2}χ = 3.275 to 3.413

^{1}p = 7, ^{2}p = 8, 9, 10

A = 8, R = 0

 | ^{1}D | ^{2}D | ^{3}D | ^{4}D |
---|---|---|---|---|
1. | 4 | 2 | 2 | 0 |
2. | 5 | 0 | 3 | 0 |
3. | 4 | 3 | 0 | 1 |
4. | 5 | 1 | 1 | 1 |
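The four vertex degree sets above follow from the relating equations once ^{1}p, ^{2}p and R are fixed. The sketch below scans the possible values of ^{4}D and keeps the solutions with no negative counts. It is an illustration only: the function name is hypothetical, the upper bound on ^{4}D is simply the atom count, and the relation A = ^{1}p + 1 - R for a connected graph is used to obtain A.

```python
def vertex_degree_sets(p1, p2_values, rings):
    """Enumerate (1D, 2D, 3D, 4D) consistent with the relating equations."""
    n_atoms = p1 + 1 - rings  # connected graph: bonds = atoms - 1 + rings
    solutions = []
    for p2 in p2_values:
        for d4 in range(n_atoms + 1):
            d1 = -d4 + p2 - p1 + 3 - 3 * rings
            d2 = 3 * d4 - 2 * p2 + 3 * p1 - 3 + 3 * rings
            d3 = -3 * d4 + p2 - p1 + 1 - rings
            if min(d1, d2, d3) >= 0:
                solutions.append((d1, d2, d3, d4))
    return solutions


# Target path counts from the example: 1p = 7, 2p = 8, 9 or 10, R = 0
solutions = vertex_degree_sets(p1=7, p2_values=(8, 9, 10), rings=0)
for number, degrees in enumerate(solutions, start=1):
    print(number, degrees, "atoms:", sum(degrees))
```

Each set sums to A = 8 atoms. Turning a set of vertex degrees into actual graphs is the separate construction step attributed to Faradzhev above and is not attempted here.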