CHAPTER 4 Using Molconn-Z

General Program Description

Molconn-Z is a file-oriented program with a terminal command-line interface. The program requires two input files (a molecular structure file and a control file) and produces an output file whose structure is defined in the control file. It is executed on the command line of a character terminal (a DOS window in the Windows version) with the following syntax:

molconnz <name of control file> <name of molecular structure file> <name of output file>

for example:

molconnz demo_control.dat demo.sdf demo.sdf.out

CONTROL FILE

The control file contains keywords which define the molecular structure file format, the algorithms used, the information output to the output file, etc. The keywords are described in the table below. An example of this file is shown here.

NORECORDS 
ADD 1 
ADD 3 
ADD 4 
ADD 8 
ADD 9 
HEADERS top 
INPUTFORMAT oelibsmiles 
WARNINGS off 
INDEX on 
ERRORS skip 
GO

Each control file must end with the keyword "GO", placed after all other options. In this control file we only want to view the descriptors associated with records 1, 3, 4, 8, and 9 (the ADD lines above), hence the keyword NORECORDS followed by an ADD for each record number. These are the simple and valence Chi indices (see Appendix I). Other options are ALLRECORDS (which selects all records) and USEFULRECORDS (which selects a list of records that have been determined by experience to be generally useful). A new option with 4.09 is the keyword CSVUSEFUL, which selects the USEFUL record set and prints a single line for each compound with commas separating the descriptor data. New with 4.10, the group-type H E-State descriptors (35) are added to CSVUSEFUL.

We also want the name of each descriptor in a header at the "top" of the file. The other HEADERS options are "all", which writes a header for each molecule, and "off". INDEX "on" writes an index number to stderr for each processed molecule. Our molecular structure file will be in the SMILES format, read by the OELIB code. The 5 options for INPUTFORMAT are "daylightsmiles", "oelibsmiles", "oelibsdf", "oelibmol" and "oelibmol2".

WARNINGS are turned off, and ERRORS "skip" causes the code to skip the calculations on a structure that produces an error and proceed to the next one. Since a common source of error is a "disconnected graph" due to salt structures, a new ERRORS option, "fix", is added with 4.10; it analyzes the separate components of the salt. The output will then contain additional molecule numbers, for example 3_A and 3_B.

There are 4 calculation parameters that can be set using keywords: MAXORDER for the longest chain length, CTYPE for the type of formalism used in the connectivity calculations, STYPE for the type of formalism used in the topological calculations, and STATEFUNCTION for the distance function used in the topological calculations (see Chapter 2 for further details).

Table 3: Control File KEYWORDS

keyword	options	function
ALLRECORDS	n/a	set all record selectors TRUE
NORECORDS	n/a	set all record selectors FALSE
USEFULRECORDS*	n/a	set the record selector for "useful" records TRUE (all others FALSE)
CSVUSEFUL	n/a	set the record selector for "useful" records TRUE (all others FALSE) and print a single line with data separated by commas
ADD	n (an integer record selector)	set identified record TRUE
REMOVE	n (an integer record selector)	set identified record FALSE
HEADERS	off*	do not write a header/identifier in output
	top	write a header/identifier at top of output
	all	write a header/identifier in output for each molecule
MAXORDER	n (an integer) 99*	longest search path length for paths
CTYPE	0*	use reciprocal-squareroot formalism for connectivity calculations
	1	use geometric mean formalism for connectivity calculations
STYPE	0	use reciprocal-squareroot formalism for topological calculations
	1*	use geometric mean formalism for topological calculations
STATEFUNCTION	1	use distance/geometric mean topological function
	2	use geometric mean/distance topological function
	3	use distance**2/geometric mean topological function
	4*	use geometric mean/distance**2 topological function
	5	use distance**3/geometric mean topological function
	6	use geometric mean/distance**3 topological function
INPUTFORMAT	daylightsmiles	use (licensed) Daylight toolkit to interpret molecule structure
	oelibsmiles	use OElib SMILES routine to interpret molecule structure
	oelibsdf	use OElib SDF (MDL) routine to interpret molecule structure
	oelibmol	use OElib MOL (MDL) routine to interpret molecule structure
	oelibmol2	use OElib MOL2 (Sybyl) routine to interpret molecule structure
WARNINGS	on*	write non-serious warning messages to stderr
	off	do not write non-serious warning messages to stderr
ERRORS	exit	exit on serious error
	skip*	skip current calculation, move to next molecule on serious error
	continue	continue current calculation, even with compromised input data
	fix	fix "disconnected graph" which occurs with salts, by evaluating each part of the salt
INDEX	on*	write index number for each processed molecule to stderr
	off	do not write index number for each processed molecule to stderr
GO	n/a	end of options input, begin calculations

* - default settings for program parameters.
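
The keywords in Table 3 can also be written out by a small script rather than by hand. The following Python sketch is not part of Molconn-Z; the helper name build_control_file and the ALLOWED_OPTIONS dictionary are hypothetical, and only the keywords and option values themselves come from the table. It selects individual records with NORECORDS plus ADD and always places GO last.

```python
# Option values copied from Table 3; the helper itself is a hypothetical sketch.
ALLOWED_OPTIONS = {
    "HEADERS": {"off", "top", "all"},
    "INPUTFORMAT": {"daylightsmiles", "oelibsmiles", "oelibsdf",
                    "oelibmol", "oelibmol2"},
    "WARNINGS": {"on", "off"},
    "ERRORS": {"exit", "skip", "continue", "fix"},
    "INDEX": {"on", "off"},
}


def build_control_file(records, **options):
    """Return control-file text that selects only the given record numbers."""
    lines = ["NORECORDS"]
    lines += [f"ADD {int(n)}" for n in records]
    for keyword, value in options.items():
        keyword = keyword.upper()
        if keyword not in ALLOWED_OPTIONS:
            raise ValueError(f"unsupported keyword: {keyword}")
        if value not in ALLOWED_OPTIONS[keyword]:
            raise ValueError(f"{keyword} does not accept {value!r}")
        lines.append(f"{keyword} {value}")
    lines.append("GO")  # GO must follow all other options
    return "\n".join(lines) + "\n"


# Reproduces the example control file shown earlier in this chapter.
print(build_control_file([1, 3, 4, 8, 9], headers="top",
                         inputformat="oelibsmiles", warnings="off",
                         index="on", errors="skip"), end="")
```

Rejecting unknown option values in the script is a design choice of this sketch; how Molconn-Z itself reacts to an unrecognized keyword is not described here.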

MOLECULAR STRUCTURE FILE

The program expects an input molecular structure file in one of the following formats. The formats are described in more detail in Chapter 5.

Daylight SMILES format read by OElib code from OpenEye (this does not require any additional code or license).

Daylight SMILES format read by Daylight SMILES Toolkit (this requires a run-time "smiles" license from Daylight).

MDL SDFile format read by OElib code from OpenEye (this does not require any additional code or license).

Tripos Sybyl/MOL2 format read by OElib code from OpenEye (this does not require any additional code or license).

In a typical application the user would include in the molecular structure file all the molecules that are part of an investigation. Thus, the input molecular structure file can contain one or many structures. Other molecules may, of course, be added later or processed separately. It is critical that the keyword INPUTFORMAT match the actual format of the molecular structure file named on the command line. That is, if INPUTFORMAT is set to "oelibsdf", the file must be in SDF format no matter what it is named.
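
Because the file name is not used to decide the format, a quick look at the file contents before a long run can catch a mismatched INPUTFORMAT. The Python sketch below is a heuristic only and is not part of Molconn-Z; the function name guess_inputformat is hypothetical. It relies on three general properties of the file formats: MOL2 files contain an "@<TRIPOS>MOLECULE" line, SDF records end with a "$$$$" line, and a single MOL block ends with "M  END". Anything else is assumed to be SMILES, which could equally be read with "daylightsmiles".

```python
def guess_inputformat(text):
    """Guess a plausible INPUTFORMAT value from file contents (heuristic only)."""
    lines = [line.rstrip() for line in text.splitlines()]
    if any(line.startswith("@<TRIPOS>MOLECULE") for line in lines):
        return "oelibmol2"
    if "$$$$" in lines:
        return "oelibsdf"      # SDF records end with a "$$$$" line
    if "M  END" in lines:
        return "oelibmol"      # a single MOL block has no "$$$$" delimiter
    return "oelibsmiles"       # fall back: assume one SMILES string per line


def format_matches(control_value, text):
    """True when the INPUTFORMAT value agrees with the guessed file format."""
    guess = guess_inputformat(text)
    if guess == "oelibsmiles":
        return control_value in ("oelibsmiles", "daylightsmiles")
    return control_value == guess
```

A False result from format_matches is only a prompt to inspect the file; it does not prove that the run would fail.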

OUTPUT FILE

The structure of the output file depends on the keywords used in the control file. For example, use of the keywords USEFULRECORDS and HEADERS "top" will provide output for the most interesting descriptors with a listing of the descriptor names. Note that not all molecules may have a value for a particular descriptor, in which case it will be 0.0. One should be careful using the keywords ALLRECORDS and HEADERS "all", as this could produce a very large output file if your database of molecules is large. The output file will look distinctly different if the keyword CSVUSEFUL is used. In this case all descriptor values for each compound will be on a single line, separated only by commas. This type of output may help read the data into certain data analysis packages (such as OpenOffice or Excel).

There is a slight difference in record #1 between the CSVUSEFUL output and a non-CSV output. The CSVUSEFUL output includes an error code number as the second value (after the first comma). In a normal run this value will be "0". If there is a problem with the run, the value of this field will not be "0" and can be looked up in the following table (see Appendix IV for more error information):

ESLC_MOLCONNZ_ERROR_CIRCUITBUFFEROVERFLOW 1
ESLC_MOLCONNZ_ERROR_PATHBUFFEROVERFLOW 2
ESLC_MOLCONNZ_ERROR_DISCONNECTEDGRAPH 3
ESLC_MOLCONNZ_ERROR_EDGESOVERFLOW 4
ESLC_MOLCONNZ_ERROR_MONOATOMIC_MOLECULE 5
ESLC_MOLCONNZ_ERROR_UNPARAMETERIZED_ELEMENT 6
ESLC_MOLCONNZ_ERROR_FILE_READ_FAILURE 90

When an error is encountered, the line of output will be substantially reduced. In the case of CSVUSEFUL, there will only be 4 values (the second being non-zero), and in the case of printing RECORD 1 (not using CSVUSEFUL) there will be 12 entries, the last of which will be "not_available". Since RECORD 1 does not contain a field for errorcode, the errorcode (see table above) is recorded in the nclass field of RECORD 1, but only when there is an error. Otherwise nclass holds its usual value, the number of classes of topologically (symmetry) equivalent graph vertices. The order of the descriptors is also slightly different, as shown here:


CSVUSEFUL 
molnumber,errorcode,molname,nvx,nedges,nrings,ncircuits,nclass,nelem,ntpaths,molweight 
 
NORECORDS 
ADD 1 
moleculenumber narecs nvx nedges nrings ncircuits nclass nelem ntpaths molweight molname formula
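
The error code in a CSVUSEFUL line can be checked with a few lines of script. The Python sketch below is an illustration under stated assumptions, not part of Molconn-Z: the names ERROR_NAMES and check_csv_line are hypothetical, and it assumes that the first two comma-separated values are always molnumber and errorcode, as in the field order listed above. The molecule number is kept as text because ERRORS "fix" produces numbers such as 3_A.

```python
# Error codes and names copied from the table in this section.
ERROR_NAMES = {
    1: "ESLC_MOLCONNZ_ERROR_CIRCUITBUFFEROVERFLOW",
    2: "ESLC_MOLCONNZ_ERROR_PATHBUFFEROVERFLOW",
    3: "ESLC_MOLCONNZ_ERROR_DISCONNECTEDGRAPH",
    4: "ESLC_MOLCONNZ_ERROR_EDGESOVERFLOW",
    5: "ESLC_MOLCONNZ_ERROR_MONOATOMIC_MOLECULE",
    6: "ESLC_MOLCONNZ_ERROR_UNPARAMETERIZED_ELEMENT",
    90: "ESLC_MOLCONNZ_ERROR_FILE_READ_FAILURE",
}


def check_csv_line(line):
    """Return (molnumber, error_name); error_name is None for a clean run."""
    values = line.rstrip("\n").split(",")
    molnumber, code = values[0], int(values[1])
    if code == 0:
        return molnumber, None
    return molnumber, ERROR_NAMES.get(code, f"unknown error code {code}")


def failed_molecules(lines):
    """Collect (molnumber, error_name) pairs for every line with an error."""
    results = (check_csv_line(line) for line in lines if line.strip())
    return [(mol, err) for mol, err in results if err is not None]
```

If the output file begins with a header line, that line must be skipped before calling failed_molecules, since its second value is not a number.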

Typically, an analysis with Molconn-Z might involve evaluating all records on a few representative molecules, followed by selecting a limited number of descriptors to evaluate a large database of molecules. This limited number of descriptors will probably be pared down even further to fewer than 10. The descriptors are output in the order defined by the records (see Appendix I) and in the order of the molecules listed in the input structure file.

When a large database is used and INDEX is set to "on", it may be desirable to save the indexing information to a file so that you can determine which molecules in the database have problems that need correcting. This information is printed to the "stderr" port and therefore can be collected in a separate file using the following command:

UNIX/LINUX: $MCONN_RUN/molconnz control.dat database1.smi database1.s >& molconnz.log

Windows 2000/XP: MOLCONNZ control.dat database1.smi database1.s 2> molconnz.log

Note that the LICENSE errors (which are the most common) are also printed to the "stderr" port, so any time your output file is missing information, you should check the terminal or the stderr log file.
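
The same redirection can be done from a script, which avoids the difference between the UNIX and Windows shell syntax. The Python sketch below is a minimal example and not part of Molconn-Z; the function names are hypothetical, and it assumes that the executable is called molconnz and can be found on the search path. Only the argument order (control file, molecular structure file, output file) is taken from this chapter.

```python
import subprocess


def molconnz_command(control, structure, output, exe="molconnz"):
    """Build the argument list in the order the program expects."""
    return [exe, control, structure, output]


def run_molconnz(control, structure, output,
                 log_path="molconnz.log", exe="molconnz"):
    """Run Molconn-Z and collect everything written to stderr in log_path."""
    with open(log_path, "w") as log:
        result = subprocess.run(
            molconnz_command(control, structure, output, exe), stderr=log)
    return result.returncode  # exit status as reported by the program


# Example call (requires an installed Molconn-Z, so it is left commented out):
# run_molconnz("control.dat", "database1.smi", "database1.s")
```

The meaning of the exit status is not documented here, so the log file remains the place to look for index numbers and LICENSE errors.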

Typical Molconn-Z Session

The following steps are generally followed in using Molconn-Z:

1- Creation of the Input Molecular Structure File

This file can be created in a text editor for the SMILES format (or, if you are really talented, for the SDF and MOL2 formats as well). Otherwise, it can be created with third-party software such as Isis, Sybyl, ChemDraw, or BABEL.

2- Creation of the Control File

This is created using a text editor. The format of the molecular structure file given as argument 2 MUST match the keyword INPUTFORMAT. Use other keywords to modify the output file format.

3- Execution of Program Molconn-Z

On the command line of a UNIX or DOS/Console shell window:

molconnz <name of control file> <name of molecular structure file> <name of output file> [>& molconnz.log]

4- Using Statistical Analysis Software to Evaluate Value of Descriptors

To make the best use of the descriptor information calculated by Molconn-Z, it is critical that the user analyze the Molconn-Z data with sophisticated statistical software such as SAS or MATLAB.
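
Before the data reach a statistics package, descriptors that have the same value for every molecule (for example 0.0 throughout) can be dropped, since they carry no information for the data set at hand. The Python sketch below illustrates this step only and is not part of Molconn-Z. It assumes comma-separated output whose first line holds the descriptor names; whether CSVUSEFUL output contains such a line depends on the HEADERS setting and should be checked against your own output file. The function names are hypothetical.

```python
import csv


def load_descriptor_table(handle):
    """Read comma-separated output that starts with a line of column names."""
    return list(csv.DictReader(handle))


def drop_constant_columns(rows, keep=("molnumber", "errorcode", "molname")):
    """Remove columns whose text value is identical for every molecule.

    Columns named in `keep` are identifiers and are always retained.
    Values are compared as text, so "0.0" and "0.00" count as different.
    """
    if not rows:
        return rows
    varying = [name for name in rows[0]
               if name in keep or len({row[name] for row in rows}) > 1]
    return [{name: row[name] for name in varying} for row in rows]


# Example use with a file produced by a CSVUSEFUL run:
# with open("demo2_csv.s", newline="") as handle:
#     table = drop_constant_columns(load_descriptor_table(handle))
```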

Demo Molconn-Z Sessions

Using the demo files described in Chapter 3, we can test the output and function of Molconn-Z.

First, copy the files in the molconnz4.03/demo directory into a working directory on your computer. (Note: for the Windows version, the file names for the test will need to be different since Windows does not distinguish between demo1.s and demo1.S.)

DEMO1

The file demo1.smi is simply benzene. To run (the $MCONN_RUN is not necessary if this directory is in the $PATH):

$MCONN_RUN/molconnz demo_control.dat demo1.smi demo1.s

Compare demo1.s (new) and demo1.S (archival) for differences.

DEMO2

The file demo2.smi contains 100 molecules of varying complexity. To run:

$MCONN_RUN/molconnz demo_control.dat demo2.smi demo2.s

Compare demo2.s with demo2.S.

To see the CSV output format, edit a copy of demo_control.dat (named demo_control_csv.dat in the command below) to remove all the lines related to records (NORECORDS and all ADD lines), then add the keyword CSVUSEFUL. To run:

$MCONN_RUN/molconnz demo_control_csv.dat demo2.smi demo2_csv.s

Compare demo2_csv.s with demo2_csv.S.

DEMO3

Edit demo_control.dat and change the INPUTFORMAT keyword from oelibsmiles to oelibsdf.

The file demo3.sdf contains 12 simple molecules. To run:

$MCONN_RUN/molconnz demo_control.dat demo3.sdf demo3.s

Compare demo3.s with demo3.S.

DEMO4

The file demo4.sdf contains 50 molecules of moderate complexity. To run:

$MCONN_RUN/molconnz demo_control.dat demo4.sdf demo4.s

Compare demo4.s with demo4.S.

DEMO5

Edit demo_control.dat and change the INPUTFORMAT keyword to oelibmol2.

The file demo5.mol2 contains 10 fairly simple molecules. To run:

$MCONN_RUN/molconnz demo_control.dat demo5.mol2 demo5.s

Compare demo5.s with demo5.S.

DEMO6

Edit demo_control.dat and change the INPUTFORMAT keyword to oelibmol.

The files demo6a.mol and demo6b.mol contain phenol, and a substituted phenol, respectively. To run demo6a:

$MCONN_RUN/molconnz demo_control.dat demo6a.mol demo6a.s

Compare demo6a.s with demo6a.S.
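
Each demo ends by comparing a new output file with the archival one. On UNIX the diff command does this directly; the Python sketch below is an alternative that also works on Windows. It is not part of Molconn-Z, and the function name diff_outputs is hypothetical. An empty result means the two files match line for line.

```python
import difflib


def diff_outputs(new_text, archival_text,
                 new_name="demo1.s", archival_name="demo1.S"):
    """Return unified-diff lines; an empty list means the files match."""
    return list(difflib.unified_diff(
        archival_text.splitlines(), new_text.splitlines(),
        fromfile=archival_name, tofile=new_name, lineterm=""))


# Example use (file names as in DEMO1; rename them on Windows as noted above):
# with open("demo1.s") as new, open("demo1.S") as old:
#     for line in diff_outputs(new.read(), old.read()):
#         print(line)
```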
