Molconn-Z 4.00 Manual: Chapter 3S

LESSON 3: Using Standard Molconn-Z Descriptors in the Sybyl Molecular Spreadsheet

This lesson assumes that you have just started SYBYL. If you are already running the custom version of SYBYL, Zap all existing molecules and Delete all Backgrounds before beginning this lesson. Here we will open a Sybyl molecular database of halocarbons and model their anaesthetic potency and toxicity with a large set of standard (1D) QSAR descriptors.

Open a SYBYL Molecular Spreadsheet and Database

From the File pulldown on the menubar select Molecular Spreadsheet and New.... The rows will represent Molecules. In the DATABASE_FILE dialog box, choose halocarbons.mdb and press Open. After the spreadsheet has been initialized and appears, import the biological data for columns 1 and 2: from the File menu on the Molecular Spreadsheet choose the Import... item. In the resulting Import dialog choose Format: Tripos and enter ad50ld50.tripos in the File: text area. Press Import to load the first two columns of the MSS with the halocarbon anaesthetic potency and toxicity data.

Add QSAR Descriptors to the Sybyl Spreadsheet

From the spreadsheet menubar select the AutoFill button (and choose a new Column). Select MCONNIDX as the New column type to call the MolconnZ MSS Dialog box. More than 300 descriptors are available, but we will choose a subset to reduce wasted computational effort and to make the results easier to interpret:

Turn on "Simple/Valence/Cluster/Difference chi path indexes" and press the selection ... button. In the MolconnZ MSS (Chi Indices) Dialog press the All buttons under Valence and Path-Cluster to select these types of Chi indices. Press OK.

Turn on "Kappa and related indices" and press the selection ... button. In the MolconnZ MSS (Kappa Indices) Dialog verify that ka1, ka2, ka3 and phia are selected and press OK.

Turn off "Counts and Complexity indices". (All of these molecules are of similar size and complexity.)

Turn off "Topological State, Shape Wiener and Shannon indices". (These molecules have generally similar shape.)

Turn on "Hydrogen Bond-related Counts and EState indices" and press the selection ... button to call the MolconnZ MSS (H-Bond Descriptors, E-States and Counts) Dialog. Hydrogens beta to halogens are potential weak hydrogen bond donors, so select SwHBd. Also select Hmax, Gmax, Hmin and Gmin. Deselect any HBint descriptors and press OK.

Turn off "Vertex and Edge-type counts". (These are not often useful in normal QSAR analyses.)

Turn off "Atom-type Counts". (Not really necessary in this analysis.)

Turn on "Atom-type Sum E-state indices" and press the selection ... button to call the MolconnZ Select Types Dialog. This dialog lists all of the Molconn-Z atom types; if some of them are already selected, press the All button twice to clear them. We will choose only the types that appear in our data set and leave the others off, although this is not strictly necessary because the PLS routine will ignore any descriptor that is consistently zero. The following types are likely to be relevant:

block 1:  HCHnX   HCsats   Hother

block 3:  sCH3   ssCH2  sssCH   ssssC

block 5:  sF

block 6:  sCl

block 7:  sBr    sI

Press OK to select these Atom-type E-State descriptors.
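
The remark that PLS ignores a descriptor that is consistently zero can be illustrated with a minimal Python sketch. The values below are invented for illustration, and `constant_columns` is a hypothetical helper, not part of SYBYL or Molconn-Z. A column whose values never vary has zero variance, so it cannot be autoscaled and carries no information for the regression.

```python
from statistics import pvariance

# Hypothetical atom-type E-state sums for four molecules (values invented).
table = {
    "SsF": [12.4, 0.0, 24.9, 6.1],
    "SsCl": [0.0, 5.3, 0.0, 10.8],
    "unused_type": [0.0, 0.0, 0.0, 0.0],  # atom type absent from every molecule
}

def constant_columns(columns):
    """Return the names of columns whose values never vary (zero variance)."""
    return [name for name, values in columns.items() if pvariance(values) == 0]

print(constant_columns(table))
```

Leaving such types unselected simply keeps the spreadsheet smaller; it should not change the model.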

Turn off "Group-type Counts" and "Group-type Sum E-state indices". (No functional groups in these molecules.)

Note that if the check box for a category on the main MSS Dialog is off, it does not matter which descriptors are selected under that category.

The parameters under the Algorithm Options... button on the MSS Dialog are fine as they are, so press OK to begin filling the spreadsheet. Occasionally a Sybyl dialog will appear asking whether the Fill method is Cell or Column; use Column, although the choice does not seem to matter. When Molconn-Z has finished there should be 44 columns in total in the MSS.
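
As a quick check on the expected total of 44 columns, the tally can be sketched in Python. The count of 22 chi columns is not stated in the text; it is inferred from the column suffixes in the regression output later in this lesson (the chi columns start at Xv0_3 and the kappa columns start at ka1_25), so treat it as an assumption.

```python
# Column tally for the filled spreadsheet.
column_groups = {
    "biological data (columns 1 and 2)": 2,
    "chi indices (inferred: columns 3-24)": 22,
    "kappa indices (ka1, ka2, ka3, phia)": 4,
    "H-bond / E-state (SwHBd, Hmax, Gmax, Hmin, Gmin)": 5,
    "atom-type sum E-state (blocks 1, 3, 5, 6, 7)": 11,
}

total = sum(column_groups.values())
print(total)
```

If your spreadsheet shows a different total, compare the selections in each sub-dialog against the steps above.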

Run PLS on the Data

From the QSAR pulldown, select Partial Least Squares... to call the Partial Least Squares Analysis dialog box. The Dependent Column should be 1, and Column 2 should be deleted from the List of Columns to Use because it holds measured biological data for a parallel analysis. Select Leave-1-Out Validation, Use SAMPLS: off, Column Filtering: off, 10 Components, and Scaling: Autoscale. This run will take less than 1 minute, so you may run it either interactively or in batch. If you run in batch, get a report on the results with QSAR, Report QSAR...; enter qsar.lis as the File name to receive the QSAR report. To review the report, open a Unix window and edit or list the file. If you run it interactively, be sure to choose Yes for Keep this analysis? In this run the optimum number of components is reported as 10 (see below) and the cross-validated r² is 0.926.

Summary output
Standard Error of Predictions (Crossvalidated)  
Run #        Comp1     Comp2     Comp3     Comp4     Comp5     Comp6     Comp7  
--- -      --------- --------- --------- --------- --------- --------- ---------
  1 1 AD50     0.795     0.460     0.442     0.443     0.425     0.409     0.376
Run #        Comp8     Comp9    Comp10  
--- -      --------- --------- ---------
  1 1 AD50     0.381     0.383     0.386
Optimum # of components is 10.
R squared                         0.926
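
For readers unfamiliar with the quantities in this listing, the following are definitions commonly used for leave-one-out cross-validation. They are offered as an assumption about what the listing reports; SYBYL's exact formulas, in particular the n - c - 1 denominator, are not stated in this manual.

```latex
\mathrm{PRESS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_{(i)} \right)^2, \qquad
q^2 = 1 - \frac{\mathrm{PRESS}}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}, \qquad
\mathrm{SEP} = \sqrt{\frac{\mathrm{PRESS}}{n - c - 1}}
```

Here y_i is the measured AD50 of molecule i, \hat{y}_{(i)} is its prediction from a model built without that molecule, \bar{y} is the mean measured value, n is the number of molecules, and c is the number of components.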

Examination of the output shows that Sybyl has had difficulty identifying the optimum number of components. The standard error for Comp2 is not much different from that for Comp10, so two components could equally well have been selected. The generally accepted approach is to take as the optimum the point beyond which additional components give little (ca. 2-5%) improvement, so reporting the best model as having two or three components is quite reasonable.
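
The 2-5% rule can be applied to the listing by hand, or sketched in Python as below. This is one possible reading of the rule, not SYBYL's own selection algorithm; `pick_components` and its `min_gain` parameter are hypothetical names introduced for illustration.

```python
# Cross-validated standard errors of prediction, copied from the summary output.
sep = [0.795, 0.460, 0.442, 0.443, 0.425, 0.409, 0.376, 0.381, 0.383, 0.386]

def pick_components(errors, min_gain=0.05):
    """Return the smallest component count after which adding one more
    component improves the error by less than min_gain (a fraction)."""
    for k in range(1, len(errors)):
        gain = (errors[k - 1] - errors[k]) / errors[k - 1]
        if gain < min_gain:
            return k  # keep k components; component k+1 adds too little
    return len(errors)

print(pick_components(sep, 0.05))  # 5% threshold
print(pick_components(sep, 0.02))  # 2% threshold
```

With these values the 5% threshold stops at two components and the 2% threshold at three, in line with the recommendation above.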

Re-run the PLS analysis with No Validation and 3 Components. This analysis produces a conventional r² of 0.916.

Standard Error of Estimate           0.372
R squared                         0.916
F values     ( n1= 3, n2=38 )   137.489
Prob.of R2=0 ( n1= 3, n2=38 )     0.000
Regression Equation(s)
AD50  =   8.589 - (0.161) * Xv0_3 - (0.412) * Xv1_4 - (0.151) * Xv2_5
      - (0.293) * Xvp3_6 - (0.482) * Xvp4_7 - (0.025) * Xvc3_14
      + (0.658) * Xvc4_15 - (0.117) * Xvpc4_16 - (0.013) * ka1_25
      - (0.535) * ka2_26 - (0.086) * ka3_27 - (0.573) * phia_28
      + (0.026) * SwHBd_29 + (0.026) * Hmax_30 + (0.027) * Hmin_32
      + (0.014) * Gmin_33 - (1.454) * SHCHnX_34 - (0.131) * SHother_36
      + (0.066) * SsCH3_37 - (0.210) * SssssC_40 + (0.011) * SsF_41
      - (0.190) * SsBr_43 - (0.275) * SsI_44
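
The reported F value can be cross-checked against R squared and the two degrees of freedom printed with it. The sketch below assumes the usual regression F ratio, F = (R²/n1) / ((1 - R²)/n2), which is an assumption about how SYBYL computes this figure; `f_statistic` is a hypothetical helper name.

```python
# Values reported in the non-validated PLS output.
r2_reported = 0.916
n1, n2 = 3, 38          # degrees of freedom printed with the F value
f_reported = 137.489

def f_statistic(r2, df_model, df_resid):
    """F ratio for testing R^2 = 0 with the given degrees of freedom."""
    return (r2 / df_model) / ((1.0 - r2) / df_resid)

# R^2 is printed to three decimals, so bracket it by its rounding interval.
f_low = f_statistic(r2_reported - 0.0005, n1, n2)
f_high = f_statistic(r2_reported + 0.0005, n1, n2)

print(round(f_low, 2), round(f_high, 2))
print(f_low <= f_reported <= f_high)
```

The reported value falls inside the interval allowed by rounding, so the printed statistics are mutually consistent under this assumption. If n2 = n - n1 - 1, the degrees of freedom would also imply 42 molecules in the data set.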

Interpretation of Results

A closer look at the above results suggests that some of the descriptors may be contributing little more than noise to the QSAR model. Those whose coefficients have an absolute value of around 0.01 or less are certainly suspect, and those below ca. 0.05 are probably not contributing much either. Try some experiments with reduced descriptor sets to see if you can come up with a better model with fewer descriptors. Automated variable selection is an ongoing area of research in many groups. Note that some of the descriptors are highly correlated with one another. Caution: you cannot delete individual Molconn-Z columns from this table. The methodology used to import them (the SPL column type) links them, so that if one is deleted they will all be deleted.
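
The coefficient screen described above can be sketched in Python. The coefficients are copied from the regression equation; `weak_terms` is a hypothetical helper, and the thresholds are the rough guides given in the text rather than a formal significance test.

```python
# Coefficients copied from the regression equation above (sign included).
coefficients = {
    "Xv0_3": -0.161, "Xv1_4": -0.412, "Xv2_5": -0.151,
    "Xvp3_6": -0.293, "Xvp4_7": -0.482, "Xvc3_14": -0.025,
    "Xvc4_15": 0.658, "Xvpc4_16": -0.117, "ka1_25": -0.013,
    "ka2_26": -0.535, "ka3_27": -0.086, "phia_28": -0.573,
    "SwHBd_29": 0.026, "Hmax_30": 0.026, "Hmin_32": 0.027,
    "Gmin_33": 0.014, "SHCHnX_34": -1.454, "SHother_36": -0.131,
    "SsCH3_37": 0.066, "SssssC_40": -0.210, "SsF_41": 0.011,
    "SsBr_43": -0.190, "SsI_44": -0.275,
}

def weak_terms(coefs, threshold):
    """Names of descriptors whose coefficient magnitude is below threshold."""
    return [name for name, value in coefs.items() if abs(value) < threshold]

print(weak_terms(coefficients, 0.05))
print(weak_terms(coefficients, 0.015))
```

The first list gives candidates to leave out of a reduced model; the second picks out the terms closest to the 0.01 guide.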