LESSON 3: Using Standard Molconn-Z Descriptors in the Sybyl Molecular Spreadsheet

This lesson assumes that you have just started SYBYL. If you are already running the custom version of SYBYL, Zap all existing molecules and Delete all Backgrounds before beginning this lesson. Here we will open a Sybyl molecular database of halocarbons and model their anaesthetic potency and toxicity with a large set of standard (1D) QSAR descriptors.

  1. Open a SYBYL Molecular Spreadsheet and Database
  2. From the File pulldown on the menubar select Molecular Spreadsheet and New.... The rows will represent Molecules. In the DATABASE_FILE dialog box, choose halocarbons.mdb and press Open. Once the spreadsheet has been initialized and appears, import the biological data for columns 1 and 2: from the File menu on the Molecular Spreadsheet choose the Import... item. On the resulting Import dialog choose Format: Tripos and enter ad50ld50.tripos in the File: text area. Press Import to load the first 2 columns of the MSS with the halocarbon anaesthetic potency and toxicity data.

  3. Add QSAR Descriptors to the Sybyl Spreadsheet
  4. From the spreadsheet menubar select the AutoFill button (and choose a new Column). Select MCONNIDX as the New column type to call the MolconnZ MSS Dialog box. There are over 300 descriptors available, but we will choose a subset to reduce wasted computational effort and to make it easier to interpret the results:

    Note that if the check box for a category on the Main MSS Dialog is not on, it doesn't matter which descriptors are activated under that category.

    The parameters under the Algorithm Options... button on the MSS Dialog are fine as is, so press OK to begin filling the Spreadsheet. Occasionally a Sybyl dialog will appear asking whether the Fill method is Cell or Column; use Column, although it doesn't seem to matter. When Molconn-Z has finished there should be 44 columns in total in the MSS.

  5. Run PLS on the Data
  6. From the QSAR pulldown, select Partial Least Squares... to call the Partial Least Squares Analysis dialog box. The Dependent Column should be 1, and Column 2 should be deleted from the List of Columns to Use, because it holds measured biological data for a parallel analysis. Select Leave-1-Out Validation, Use SAMPLS: off, Column Filtering: off, 10 Components, and Scaling: Autoscale. This run will take less than 1 minute, so you may run it either interactively or in batch. If you run in batch, get a report on the results with QSAR, Report QSAR...; enter qsar.lis as the File name to receive the QSAR report. To review the report, spawn or create a Unix window and edit or list the report. If you run it interactively, be sure to choose Yes for Keep this analysis? In this run the optimum number of components is reported as 10 (see below) and the cross-validated r2 is 0.939.

    Summary output
    Standard Error of Predictions (Crossvalidated)  
    Run #        Comp1     Comp2     Comp3     Comp4     Comp5     Comp6     Comp7  
    --- -      --------- --------- --------- --------- --------- --------- ---------
      1 1 AD50     0.795     0.460     0.442     0.443     0.425     0.409     0.376
    Run #        Comp8     Comp9    Comp10  
    --- -      --------- --------- ---------
      1 1 AD50     0.381     0.383     0.386
    Optimum # of components is 10.
    R squared                         0.926
    

    Examination of the output reveals that Sybyl has had a very hard time defining the Optimum number of components. Clearly the standard error for Comp2 is not much different from that for Comp10, so Comp2 could have been selected. The generally accepted approach is to take as the optimum number of components the point where adding further components gives little (ca. 2-5%) improvement, so reporting the best model as having two or three components is quite reasonable.
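The rule of thumb above can be checked against the cross-validated standard errors in the summary output. The following is a minimal sketch, not part of SYBYL: the function name `pick_components` and the `min_gain` parameter are illustrative assumptions, and the values are copied from the run shown above.

```python
# Cross-validated standard errors for Comp1..Comp10, copied from the output above.
SEP = [0.795, 0.460, 0.442, 0.443, 0.425, 0.409, 0.376, 0.381, 0.383, 0.386]


def pick_components(sep, min_gain):
    """Return the smallest component count after which adding one more
    component improves the standard error by less than min_gain
    (a fraction, e.g. 0.05 for 5%). Hypothetical helper, not a SYBYL call."""
    for k in range(len(sep) - 1):
        # Relative improvement gained by going from k+1 to k+2 components.
        gain = (sep[k] - sep[k + 1]) / sep[k]
        if gain < min_gain:
            return k + 1
    return len(sep)


if __name__ == "__main__":
    for threshold in (0.05, 0.02):
        n = pick_components(SEP, threshold)
        print(f"threshold {threshold:.0%}: {n} components")
```

With a 5% threshold the sketch stops at two components, and with a 2% threshold at three, which matches the range suggested in the text. It stops at the first small improvement, so it deliberately ignores the later dip at Comp7.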

    Re-run the PLS analysis with No Validation and 3 Components (the n1= 3 entry in the output below is the number of components). This analysis produces the conventional r2 of 0.916.

    Standard Error of Estimate           0.372
    R squared                         0.916
    F values     ( n1= 3, n2=38 )   137.489
    Prob.of R2=0 ( n1= 3, n2=38 )     0.000
    Regression Equation(s)
    AD50  =   8.589 - (0.161) * Xv0_3 - (0.412) * Xv1_4 - (0.151) * Xv2_5
    
          - (0.293) * Xvp3_6 - (0.482) * Xvp4_7 - (0.025) * Xvc3_14
    
          + (0.658) * Xvc4_15 - (0.117) * Xvpc4_16 - (0.013) * ka1_25
    
          - (0.535) * ka2_26 - (0.086) * ka3_27 - (0.573) * phia_28
    
          + (0.026) * SwHBd_29 + (0.026) * Hmax_30 + (0.027) * Hmin_32
    
          + (0.014) * Gmin_33 - (1.454) * SHCHnX_34 - (0.131) * SHother_36
    
          + (0.066) * SsCH3_37 - (0.210) * SssssC_40 + (0.011) * SsF_41
    
          - (0.190) * SsBr_43 - (0.275) * SsI_44
    
    Summary output
    Standard Error of Estimate           0.372
    R squared                         0.916
    F values     ( n1= 3, n2=38 )   137.489
    Prob.of R2=0 ( n1= 3, n2=38 )     0.000
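As a sanity check on the summary above, the F value can be approximately recomputed from the reported R squared and the two degrees of freedom. This is a sketch under the assumption that the output uses the usual regression F ratio, F = (R2 / n1) / ((1 - R2) / n2); the function name `f_from_r2` is illustrative and not a SYBYL command.

```python
def f_from_r2(r2, n1, n2):
    """Regression F ratio from R squared, with n1 model degrees of freedom
    and n2 residual degrees of freedom (assumed formula, see lead-in)."""
    return (r2 / n1) / ((1.0 - r2) / n2)


if __name__ == "__main__":
    # Values copied from the summary output: R squared 0.916, n1 = 3, n2 = 38.
    f_value = f_from_r2(0.916, 3, 38)
    print(f"recomputed F = {f_value:.1f} (reported: 137.489)")
```

The recomputed value agrees with the reported 137.489 to within about 1%; the small difference is what one would expect from R squared being rounded to three decimals in the report.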
    
  7. Interpretation of Results
  8. A closer look at the above results suggests that some of the descriptors may be contributing little more than noise to the QSAR model. Those whose coefficients have a magnitude of less than around 0.01 are certainly suspect, and those with magnitudes less than ca. 0.05 are probably not contributing much either. Try some experiments with reduced descriptor sets to see if you can come up with a better model with fewer descriptors. Automated Variable Selection is an ongoing area of research in many groups. Note that some of the descriptors are highly correlated with one another. Caution!! You cannot delete Molconn-Z columns from this table! The methodology used to import them (the SPL column type) links them such that if one is deleted, they will all be deleted.
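To find the candidates for removal quickly, the regression equation from the report can be parsed and its terms filtered by coefficient magnitude. The sketch below is an illustration only: the helper names `parse_coefficients` and `small_terms` are assumptions, and the equation text is pasted from the output above rather than read from qsar.lis.

```python
import re

# Regression equation copied from the No Validation run above.
EQUATION = """
AD50  =   8.589 - (0.161) * Xv0_3 - (0.412) * Xv1_4 - (0.151) * Xv2_5
      - (0.293) * Xvp3_6 - (0.482) * Xvp4_7 - (0.025) * Xvc3_14
      + (0.658) * Xvc4_15 - (0.117) * Xvpc4_16 - (0.013) * ka1_25
      - (0.535) * ka2_26 - (0.086) * ka3_27 - (0.573) * phia_28
      + (0.026) * SwHBd_29 + (0.026) * Hmax_30 + (0.027) * Hmin_32
      + (0.014) * Gmin_33 - (1.454) * SHCHnX_34 - (0.131) * SHother_36
      + (0.066) * SsCH3_37 - (0.210) * SssssC_40 + (0.011) * SsF_41
      - (0.190) * SsBr_43 - (0.275) * SsI_44
"""

# One term looks like "- (0.161) * Xv0_3"; the intercept has no parentheses
# and is therefore skipped.
TERM = re.compile(r"([+-])\s*\((\d+\.\d+)\)\s*\*\s*(\w+)")


def parse_coefficients(text):
    """Map each descriptor name to its signed coefficient."""
    return {
        name: (-1.0 if sign == "-" else 1.0) * float(value)
        for sign, value, name in TERM.findall(text)
    }


def small_terms(coefs, cutoff):
    """Descriptor names whose coefficient magnitude is below cutoff,
    smallest magnitude first."""
    names = [n for n, c in coefs.items() if abs(c) < cutoff]
    return sorted(names, key=lambda n: abs(coefs[n]))


if __name__ == "__main__":
    coefs = parse_coefficients(EQUATION)
    print(len(coefs), "descriptors in the equation")
    print("below 0.05:", small_terms(coefs, 0.05))
```

Because the Molconn-Z columns cannot be deleted from the table, a list like this is best used to decide which columns to leave out of the List of Columns to Use in the next PLS run.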