molconnz <name of control file> <name of molecular structure file> <name of output file>
For example:
molconnz demo_control.dat demo.sdf demo.sdf.out
CONTROL FILE
The control file contains keywords which define the molecular structure file format, the algorithms used, the information output to the output file, etc. The keywords are described in the table below. An example of this file is shown here.
NORECORDS
ADD 1
ADD 3
ADD 4
ADD 8
ADD 9
HEADERS top
INPUTFORMAT oelibsmiles
WARNINGS off
INDEX on
ERRORS skip
GO

Each control file must contain a "GO" after all other options. In this control file we only want to view the descriptors associated with records 1, 3, 4, 8, and 9, hence the keyword NORECORDS followed by an ADD for each record number. These are the simple and valence Chi indices (see Appendix I). Other options are ALLRECORDS (which selects all records) and USEFULRECORDS (which selects a list of records that have been determined by experience to be generally useful). A new option with 4.09 is the keyword CSVUSEFUL, which selects the USEFUL record set and prints a single line for each compound with commas separating the descriptor data. New with 4.10, the group-type H E-State descriptors (35) are added to CSVUSEFUL.

We also want the name of each descriptor in a header at the "top" of the file. The other HEADERS options are to print headers for "all" records, or to turn them "off". Setting INDEX to "on" writes an index number to stderr for each processed molecule.

Our molecular structure file will be in the SMILES format, but read by the OELIB code. The 5 options for INPUTFORMAT are "daylightsmiles", "oelibsmiles", "oelibsdf", "oelibmol" and "oelibmol2".

WARNINGS are turned off, and ERRORS set to "skip" causes the code to skip the calculations on a structure that fails and proceed to the next. Since a common source of error is a "disconnected graph" due to salt structures, a new ERRORS option, "fix", is added with 4.10; it analyses the separate components of the salt. The output will then contain additional molecule numbers, for example 3_A and 3_B.
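When many record selections have to be tried, a control file like the one above can be generated rather than edited by hand. The following is a minimal sketch, not part of Molconn-Z: the function name `build_control_file` and its parameters are hypothetical, and it assumes the control file holds one keyword per line, as in the example.

```python
def build_control_file(records, headers="top", input_format="oelibsmiles",
                       warnings="off", index="on", errors="skip"):
    """Return control-file text that selects only the given record numbers.

    Hypothetical helper: assumes one keyword (plus option) per line.
    """
    lines = ["NORECORDS"]                    # start with every record switched off
    lines += [f"ADD {n}" for n in records]   # switch on the requested records
    lines += [
        f"HEADERS {headers}",
        f"INPUTFORMAT {input_format}",
        f"WARNINGS {warnings}",
        f"INDEX {index}",
        f"ERRORS {errors}",
        "GO",                                # GO comes after all other options
    ]
    return "\n".join(lines) + "\n"


# Same record selection as the example control file.
text = build_control_file([1, 3, 4, 8, 9])
print(text)
```

Writing `text` to a file such as demo_control.dat gives a control file that can be passed as the first argument to molconnz.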
There are 4 calculation parameters that can be set using keywords: MAXORDER for the longest chain length, CTYPE for the type of formalism used in the connectivity calculations, STYPE for the type of formalism used in the topological calculations, and STATEFUNCTION for the distance function used in the topological calculations (see Chapter 2 for further details).
Table 3: Control File KEYWORDS
keyword | options | function |
---|---|---|
ALLRECORDS | n/a | set all record selectors TRUE |
NORECORDS | n/a | set all record selectors FALSE |
USEFULRECORDS* | n/a | set the record selector for "useful" records TRUE (all others FALSE) |
CSVUSEFUL | n/a | set the record selector for "useful" records TRUE (all others FALSE) and print a single line with data separated by commas |
ADD | n (an integer record selector) | set identified record TRUE |
REMOVE | n (an integer record selector) | set identified record FALSE |
HEADERS | off* | do not write a header/identifier in output |
HEADERS | top | write a header/identifier at top of output |
HEADERS | all | write a header/identifier in output for each molecule |
MAXORDER | n (an integer; 99*) | longest search path length for paths |
CTYPE | 0* | use reciprocal-squareroot formalism for connectivity calculations |
CTYPE | 1 | use geometric mean formalism for connectivity calculations |
STYPE | 0 | use reciprocal-squareroot formalism for topological calculations |
STYPE | 1* | use geometric mean formalism for topological calculations |
STATEFUNCTION | 1 | use distance/geometric mean topological function |
STATEFUNCTION | 2 | use geometric mean/distance topological function |
STATEFUNCTION | 3 | use distance**2/geometric mean topological function |
STATEFUNCTION | 4* | use geometric mean/distance**2 topological function |
STATEFUNCTION | 5 | use distance**3/geometric mean topological function |
STATEFUNCTION | 6 | use geometric mean/distance**3 topological function |
INPUTFORMAT | daylightsmiles | use (licensed) Daylight toolkit to interpret molecule structure |
INPUTFORMAT | oelibsmiles | use OElib SMILES routine to interpret molecule structure |
INPUTFORMAT | oelibsdf | use OElib SDF (MDL) routine to interpret molecule structure |
INPUTFORMAT | oelibmol | use OElib MOL (MDL) routine to interpret molecule structure |
INPUTFORMAT | oelibmol2 | use OElib MOL2 (Sybyl) routine to interpret molecule structure |
WARNINGS | on* | write non-serious warning messages to stderr |
WARNINGS | off | do not write non-serious warning messages to stderr |
ERRORS | exit | exit on serious error |
ERRORS | skip* | skip current calculation, move to next molecule on serious error |
ERRORS | continue | continue current calculation, even with compromised input data |
ERRORS | fix | fix "disconnected graph" which occurs with salts, by evaluating each part of the salt |
INDEX | on* | write index number for each processed molecule to stderr |
INDEX | off | do not write index number for each processed molecule to stderr |
GO | n/a | end of options input, begin calculations |
* - default settings for program parameters.
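Because a misspelled keyword or option may only show up as missing output, it can help to check a control file against Table 3 before a long run. The sketch below is one possible check, not a Molconn-Z feature: `check_control_file` is a hypothetical name, and the code assumes one keyword per line, upper-case keywords, and whitespace between keyword and option.

```python
# Option sets copied from Table 3; None means the keyword takes no option.
FIXED_OPTIONS = {
    "ALLRECORDS": None, "NORECORDS": None, "USEFULRECORDS": None,
    "CSVUSEFUL": None, "GO": None,
    "HEADERS": {"off", "top", "all"},
    "CTYPE": {"0", "1"},
    "STYPE": {"0", "1"},
    "STATEFUNCTION": {"1", "2", "3", "4", "5", "6"},
    "INPUTFORMAT": {"daylightsmiles", "oelibsmiles", "oelibsdf",
                    "oelibmol", "oelibmol2"},
    "WARNINGS": {"on", "off"},
    "ERRORS": {"exit", "skip", "continue", "fix"},
    "INDEX": {"on", "off"},
}
INTEGER_KEYWORDS = {"ADD", "REMOVE", "MAXORDER"}  # each takes one integer


def check_control_file(text):
    """Return a list of problems found in control-file text (empty if none)."""
    problems = []
    entries = [ln.split() for ln in text.splitlines() if ln.strip()]
    for number, tokens in enumerate(entries, start=1):
        keyword, args = tokens[0], tokens[1:]
        if keyword in INTEGER_KEYWORDS:
            if len(args) != 1 or not args[0].isdigit():
                problems.append(f"entry {number}: {keyword} needs one integer")
        elif keyword not in FIXED_OPTIONS:
            problems.append(f"entry {number}: unknown keyword {keyword}")
        elif FIXED_OPTIONS[keyword] is None:
            if args:
                problems.append(f"entry {number}: {keyword} takes no option")
        elif len(args) != 1 or args[0] not in FIXED_OPTIONS[keyword]:
            problems.append(f"entry {number}: invalid option for {keyword}")
    if not entries or entries[-1] != ["GO"]:
        problems.append("control file must end with GO")
    return problems


print(check_control_file("NORECORDS\nADD 1\nHEADERS top\nGO\n"))
```

Entry numbers count non-blank lines only. The program itself may be more lenient than this check (for example about letter case), so a reported problem is a prompt to look at the line rather than proof that the run will fail.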
MOLECULAR STRUCTURE FILE
The program expects an input molecular structure file which can be in one of three formats. The formats are described in more detail in Chapter 5.
Daylight SMILES format read by Daylight SMILES Toolkit (this requires a run-time "smiles" license from Daylight).
MDL SDFile format read by OElib code from OpenEye (this does not require any additional code or license).
Tripos Sybyl/MOL2 format read by OElib code from OpenEye (this does not require any additional code or license).
OUTPUT FILE
The structure of the output file depends on the
keywords used in the control file. For example, use of the the keywords
USEFULRECORDS and HEADERS "top" will provide output for most interesting
descriptors with a listing of the descriptor names. Note that not all molecules
may have a value for a particular descriptor, in which case it will be
0.0. One should be careful using the keywords ALLRECORDS and HEADERS "all"
as this could produce a very large output file if your database of molecules
is large. The output file will look distintively different if the keyword
CSVUSEFUL is used. In this case all descriptor values for each compound
will be in a single line separated only by commas. This type of output
may help read the data into certain data analysis packages (like OpenOffice or EXCEL).
There is a slight difference in record #1 between the CSVUSEFUL output and a non-CSV output. The CSVUSEFUL output will include an error code number as the second value (after the first comma). In a normal run this value will be "0". If there is a problem with the run, the value of this field will not be "0" and can be determined from the following table:
error | code |
---|---|
ESLC_MOLCONNZ_ERROR_CIRCUITBUFFEROVERFLOW | 1 |
ESLC_MOLCONNZ_ERROR_PATHBUFFEROVERFLOW | 2 |
ESLC_MOLCONNZ_ERROR_DISCONNECTEDGRAPH | 3 |
ESLC_MOLCONNZ_ERROR_EDGESOVERFLOW | 4 |
ESLC_MOLCONNZ_ERROR_MONOATOMIC_MOLECULE | 5 |
ESLC_MOLCONNZ_ERROR_UNPARAMETERIZED_ELEMENT | 6 |
ESLC_MOLCONNZ_ERROR_FILE_READ_FAILURE | 90 |
When an error is encountered, the line of output will be substantially reduced. In the case of CSVUSEFUL, there will only be 4 values (the second being non-zero), and in the case of printing RECORD 1 (not using CSVUSEFUL) there will be 12 entries, the last of which will be "not_available". Since RECORD 1 does not contain a field for errorcode, the errorcode (see table above) is recorded in the nclass field of RECORD 1 only when there is an error. Otherwise the nclass field holds nclass, the number of classes of topologically (symmetry) equivalent graph vertices.
Also the order of the descriptors is slightly different as shown here:
control file | order of descriptors |
---|---|
CSVUSEFUL | molnumber,errorcode,molname,nvx,nedges,nrings,ncircuits,nclass,nelem,ntpaths,molweight |
NORECORDS ADD 1 | moleculenumber narecs nvx nedges nrings ncircuits nclass nelem ntpaths molweight molname formula |
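The error code in the second CSVUSEFUL field makes it straightforward to list the molecules that failed. The following sketch shows one way to do that; it is an illustration only. The function name `failed_molecules` is hypothetical, and the two sample lines are invented placeholders that follow the field order molnumber,errorcode,molname described in this section, not real program output.

```python
import csv

# Error codes as listed in this section (ESLC_MOLCONNZ_ERROR_ prefix dropped).
ERROR_NAMES = {
    1: "CIRCUITBUFFEROVERFLOW",
    2: "PATHBUFFEROVERFLOW",
    3: "DISCONNECTEDGRAPH",
    4: "EDGESOVERFLOW",
    5: "MONOATOMIC_MOLECULE",
    6: "UNPARAMETERIZED_ELEMENT",
    90: "FILE_READ_FAILURE",
}


def failed_molecules(csv_lines):
    """Yield (molnumber, molname, error name) for lines with a non-zero error code."""
    for row in csv.reader(csv_lines):
        if len(row) < 3:
            continue                      # blank or incomplete line
        try:
            code = int(row[1])            # errorcode is the second value
        except ValueError:
            continue                      # e.g. a header line of descriptor names
        if code != 0:
            yield row[0], row[2], ERROR_NAMES.get(code, f"unknown code {code}")


# Invented placeholder lines, for illustration only.
sample = [
    "1,0,mol_a,6,6,1,1,1,1,0,0.0",
    "2,3,mol_b,0",
]
for molnumber, molname, error in failed_molecules(sample):
    print(molnumber, molname, error)
```

To scan a real output file, pass an open file object instead of `sample`. A molecule name that itself contains a comma would shift the fields, which this sketch does not try to handle.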
Typically, an analysis with Molconn-Z might involve evaluating all records on a few representative molecules, followed by selecting a limited number of descriptors to evaluate a large database of molecules. This limited number of descriptors will probably be pared down even further to fewer than 10. The descriptors are output in the order defined by the records (see Appendix I) and in the order of the molecules listed in the input structure file.
For the cases where a large database is used, and INDEX is set to "on", it may be desirable to save this indexing information to a file so that you may determine which molecules in the database have problems that may need correcting. This information is printed to the "stderr" port and therefore it can be collected in a separate file using the following command:
UNIX/LINUX: $MCONN_RUN/molconnz control.dat database1.smi database1.s >& molconnz.log
Windows 2000/XP: MOLCONNZ control.dat database1.smi database1.s 2> molconnz.log
You should note that the LICENSE errors (which are the most common) are also
printed to the "stderr" port, so anytime your output file is missing information,
you should check the terminal or the stderr log file.
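When runs are launched from a script rather than a shell, the same stderr capture can be done programmatically. The sketch below is an assumption-laden illustration: `run_with_stderr_log` is a hypothetical helper, it redirects stderr only (like the `2>` form above), and the commented call assumes molconnz is on the PATH and reuses the file names from the commands above.

```python
import subprocess
import sys


def run_with_stderr_log(command, log_path):
    """Run command with stderr written to log_path; return the exit status."""
    with open(log_path, "w") as log:
        completed = subprocess.run(command, stderr=log)
    return completed.returncode


# Intended use (not executed here, since it needs a Molconn-Z installation):
# run_with_stderr_log(
#     ["molconnz", "control.dat", "database1.smi", "database1.s"],
#     "molconnz.log",
# )

# Stand-in command so the sketch runs anywhere: writes one line to stderr.
status = run_with_stderr_log(
    [sys.executable, "-c", "import sys; sys.stderr.write('index 1')"],
    "molconnz_example.log",
)
print(status)
```

After a run, the log file holds the INDEX numbers and any LICENSE or other error messages, so it is the first place to look when the output file is missing information.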
molconnz <name of control file> <name of molecular structure file> <name of output file> [>& molconnz.log]
First, copy the files in the molconnz4.03/demo directory into a working directory on your computer. (Note: for the Windows version, the file names for the test will need to be different since Windows does not distinguish between demo1.s and demo1.S.)
DEMO1
The file demo1.smi is simply benzene. To run (the $MCONN_RUN prefix is not necessary if this directory is in your $PATH):
$MCONN_RUN/molconnz demo_control.dat demo1.smi demo1.s
Compare demo1.s (new) and demo1.S (archival) for differences.
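The comparison of a new output file with its archival copy can be done with any diff tool. As one possibility, the sketch below uses Python's standard difflib module; the function name `output_differences` is hypothetical, and the comparison is purely textual, so harmless formatting differences (for example in the last digit of a number) would also be reported.

```python
import difflib
from pathlib import Path


def output_differences(new_path, archival_path):
    """Return unified-diff lines between two output files (empty list if identical)."""
    new_lines = Path(new_path).read_text().splitlines()
    old_lines = Path(archival_path).read_text().splitlines()
    return list(difflib.unified_diff(
        old_lines, new_lines,
        fromfile=str(archival_path), tofile=str(new_path),
        lineterm="",                     # lines were split without newlines
    ))


# Intended use after running DEMO1 (needs the demo files in the working directory):
# for line in output_differences("demo1.s", "demo1.S"):
#     print(line)
```

An empty result means the new output matches the archival output line for line. On Windows the two file names must differ by more than letter case, as noted above.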
DEMO2
The file demo2.smi contains 100 molecules of varying complexity. To run:
$MCONN_RUN/molconnz demo_control.dat demo2.smi demo2.s
Compare demo2.s with demo2.S.
To see the CSV output format, edit demo_control.dat to remove all the lines related to records (NORECORDS and all ADD lines), then add the keyword CSVUSEFUL. To run:
$MCONN_RUN/molconnz demo_control_csv.dat demo2.smi demo2_csv.s
Compare demo2_csv.s with demo2_csv.S.
DEMO3
Edit demo_control.dat and change the INPUTFORMAT keyword from oelibsmiles to oelibsdf.
The file demo3.sdf contains 12 simple molecules. To run:
$MCONN_RUN/molconnz demo_control.dat demo3.sdf demo3.s
Compare demo3.s with demo3.S.
DEMO4
The file demo4.sdf contains 50 molecules of moderate complexity. To run:
$MCONN_RUN/molconnz demo_control.dat demo4.sdf demo4.s
Compare demo4.s with demo4.S.
DEMO5
Edit demo_control.dat and change the INPUTFORMAT keyword to oelibmol2.
The file demo5.mol2 contains 10 fairly simple molecules. To run:
$MCONN_RUN/molconnz demo_control.dat demo5.mol2 demo5.s
Compare demo5.s with demo5.S.
DEMO6
Edit demo_control.dat and change the INPUTFORMAT keyword to oelibmol.
The files demo6a.mol and demo6b.mol contain phenol, and a substituted phenol, respectively. To run demo6a:
$MCONN_RUN/molconnz demo_control.dat demo6a.mol demo6a.s
Compare demo6a.s with demo6a.S.