DP-BIND Supplementary information

DP-Bind: a web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins


Main page: http://lcg.rit.albany.edu/dp-bind

Description of the three machine learning algorithms

For our two-class classification problem (DNA-binding versus non-binding residues), we applied three machine learning algorithms: support vector machine (SVM) (Vapnik, 1998), kernel logistic regression (KLR) (Zhu and Hastie, 2005), and penalized logistic regression (PLR) (le Cessie and van Houwelingen, 1992). SVM is a margin-maximizing classifier that performs a linear classification in the feature space, which corresponds to a non-linear classification in the original data space. The feature space is obtained by transforming data from the original data space with a kernel function. KLR and PLR are also margin-maximizing classifiers. Their optimization problem is similar to that of SVM, except that they use the logistic (negative binomial log-likelihood) loss function instead of the hinge loss function of SVM. Both KLR and PLR provide classification results based on the conditional class probability. The difference is that PLR performs the classification in the original data space, whereas KLR performs it in the feature space by using a kernel function. In other words, KLR is a non-linear version of PLR. For both SVM and KLR we used the radial basis function kernel, which showed the best performance on our dataset.
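The ingredients named above can be written out in a few lines. The following is a minimal sketch, not the server's implementation: it shows the radial basis function kernel, the hinge loss minimized by SVM, the logistic loss minimized by logistic regression models such as KLR and PLR, and the conditional class probability that a logistic model returns. The function names and the default `gamma` value are assumptions made for illustration; the kernel parameters actually used are not given here.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2).

    gamma is an illustrative default, not a value from the study.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def hinge_loss(y, f):
    """SVM loss for a label y in {-1, +1} and a decision value f."""
    return max(0.0, 1.0 - y * f)

def logistic_loss(y, f):
    """Logistic-regression loss (negative log-likelihood) for y in {-1, +1}."""
    return math.log1p(math.exp(-y * f))

def class_probability(f):
    """Conditional probability of the positive class implied by f."""
    return 1.0 / (1.0 + math.exp(-f))

# A residue far on the correct side of the boundary costs nothing under the
# hinge loss, but still carries a small penalty under the logistic loss.
print(hinge_loss(+1, 2.0), logistic_loss(+1, 2.0))
```

Both losses shrink as the margin `y * f` grows, which is why the two families of classifiers behave similarly; only the logistic model yields a probability directly.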

Method of performance evaluation

We used leave-one-protein-out cross-validation to train and test each classifier. In this procedure, 61 protein complexes are used for training and the remaining complex is used for testing. This process is repeated 62 times so that each protein complex is tested once. The optimal parameters of each classifier were those that achieved the best average accuracy across the 62 cross-validation experiments. To assess a given classifier's performance, we computed the mean and standard deviation of the following performance measures across the 62 cross-validation experiments:
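The splitting and parameter-selection steps described in this paragraph can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the protein identifiers are placeholders, and `select_best_parameter` and the parameter labels passed to it are hypothetical names introduced here.

```python
from statistics import mean

def leave_one_protein_out(protein_ids):
    """Yield (training_ids, test_id) pairs, holding out one protein at a time."""
    for i, test_id in enumerate(protein_ids):
        training_ids = protein_ids[:i] + protein_ids[i + 1:]
        yield training_ids, test_id

def select_best_parameter(accuracies_by_parameter):
    """Return the parameter setting with the highest mean accuracy.

    accuracies_by_parameter maps a parameter setting to the list of
    per-round accuracies obtained with that setting.
    """
    return max(accuracies_by_parameter,
               key=lambda p: mean(accuracies_by_parameter[p]))

# Placeholder identifiers standing in for the 62 protein complexes.
proteins = [f"protein_{k:02d}" for k in range(1, 63)]
splits = list(leave_one_protein_out(proteins))
print(len(splits))        # number of cross-validation rounds
print(len(splits[0][0]))  # number of training complexes in each round

# Hypothetical accuracies for two parameter settings (not study results).
best = select_best_parameter({"setting_a": [0.70, 0.74], "setting_b": [0.78, 0.76]})
print(best)
```

Splitting by protein rather than by residue keeps all residues of the test complex out of the training set.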

(i) accuracy: ACC=(TP+TN)/(TP+FP+TN+FN)

(ii) sensitivity: SN=TP/(TP+FN)

(iii) specificity: SP=TN/(FP+TN)

where TP is the number of true positives (correctly predicted DNA-binding residues), FN is the number of false negatives (DNA-binding residues predicted as non-binding), TN is the number of true negatives (correctly predicted non-binding residues), and FP is the number of false positives (non-binding residues predicted as DNA-binding).
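As a worked illustration of the three formulas above, the following sketch computes ACC, SN, and SP from the four counts. The function name `performance_measures` and the example counts are hypothetical and are not taken from the study.

```python
def performance_measures(tp, fn, tn, fp):
    """Compute accuracy, sensitivity, and specificity from residue counts."""
    total = tp + fp + tn + fn
    return {
        "ACC": (tp + tn) / total,  # fraction of all residues predicted correctly
        "SN": tp / (tp + fn),      # fraction of binding residues recovered
        "SP": tn / (fp + tn),      # fraction of non-binding residues recovered
    }

# Hypothetical counts for a single test protein.
measures = performance_measures(tp=30, fn=10, tn=50, fp=10)
for name, value in measures.items():
    print(f"{name} = {value:.3f}")
```

In a cross-validation setting, this function would be applied once per test protein and the resulting values averaged.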

Output format

Figure 1. A sample prediction result, consisting of three parts: a header describing its format, the input FASTA sequence, and a prediction result in a columnar format. The result was truncated to fit the page. Refer to help for details.

Performance measures from previous studies

Table 1. Performance measures from previous studies. For DBS-PRED, results from SeqPredNet are shown since this version is implemented as the web server. Similarly, results from PDNA-RDN are shown for DBS-PSSM. In three studies (DBS-PRED, DBS-PSSM, and BindN) as well as our current study, specificity is calculated as TN/(FP+TN), whereas it is calculated as TP/(FP+TP) in Yan et al. (2006).

Method	Accuracy (%)	Sensitivity (%)	Specificity (%)
DBS-PRED (Ahmad et al., 2004)	64.5	68.6	63.4
DBS-PSSM (Ahmad and Sarai, 2005)	64.0	67.1	63.3
BindN (Wang and Brown, 2006)	70.31	69.40	70.47
Yan et al. (2006)	78	44	41
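The caption of Table 1 notes that specificity is defined in two different ways across the studies, so the values in the last column are not directly comparable. The sketch below contrasts the two definitions on one set of counts. The function names and the counts are hypothetical; the quantity TP/(FP+TP) is what is commonly called precision or positive predictive value.

```python
def specificity_true_negative_rate(tn, fp):
    """Specificity as TN/(FP+TN): fraction of non-binding residues recovered."""
    return tn / (fp + tn)

def specificity_as_precision(tp, fp):
    """Specificity as TP/(FP+TP): fraction of predicted binding residues
    that are truly binding."""
    return tp / (fp + tp)

# Hypothetical counts, used only to show that the two definitions differ.
tp, fp, tn = 30, 10, 50
print(f"TN/(FP+TN) = {specificity_true_negative_rate(tn, fp):.3f}")
print(f"TP/(FP+TP) = {specificity_as_precision(tp, fp):.3f}")
```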

How classifiers perform on different structural classes of proteins and on individual proteins

Table 2. The average accuracy of SVM classifiers tested on the four major structural classes of proteins (the balanced training/test set; for details see Kuznetsov et al., 2006). In each cell the fraction of residues predicted correctly is shown.

CATH structural class	seq-SVM accuracy (BLOSUM62 encoding), average ± std. deviation	pssm-SVM accuracy, average ± std. deviation
alpha	0.72 ± 0.102	0.81 ± 0.098
alpha/beta	0.67 ± 0.084	0.72 ± 0.098
beta	0.67 ± 0.042	0.79 ± 0.070
few regular structure	0.71 ± 0.061	0.88 ± 0.065

Table 3. The accuracy of SVM classifiers for each individual protein chain (as determined by leave-one-out cross-validation on the balanced dataset used to train and test the classifiers; for details see Kuznetsov et al., 2006). In each cell the fraction of residues predicted correctly is shown.

Protein PDB ID and chain	seq-SVM accuracy (BLOSUM62 encoding)	pssm-SVM accuracy
1a02F	0.90909	0.95455
1a02J	0.72727	0.95455
1a02N	0.70455	0.86364
1a74A	0.67969	0.59375
1aayA	0.73438	0.82812
1azqA	0.71053	0.76316
1b3tA	0.72727	0.75758
1bf5A	0.65217	0.69565
1bhmA	0.56452	0.72581
1bl0A	0.67241	0.87931
1c0wB	0.61111	0.77778
1cdwA	0.44595	0.64865
1cf7A	0.67647	0.70588
1cjgA	0.57292	0.76042
1cmaA	0.82	0.72
1d02A	0.65	0.7
1d66A	0.56667	0.86667
1dp7P	0.875	0.875
1ecrA	0.67241	0.77586
1fjlA	0.73684	0.89474
1gatA	0.56522	0.65217
1gccA	0.63636	0.70455
1gdtA	0.60938	0.83594
1hcqA	0.7459	0.72131
1hcrA	0.80769	0.88462
1hddC	0.63333	0.91667
1hloA	0.73077	0.90385
1hryA	0.57692	0.65385
1hwtC	0.64286	0.77857
1if1A	0.63725	0.78431
1ignA	0.83333	0.77778
1ihfA	0.78125	0.8125
1ihfB	0.7	0.83333
1j59A	0.65789	0.76316
1lmb4	0.80435	0.76087
1mdyA	0.63636	0.95455
1meyC	0.69355	0.85484
1mhdA	0.76923	0.65385
1mnmB	0.77778	0.88889
1mnmD	0.875	0.95833
1mseC	0.69318	0.75
1octC	0.7625	0.7375
1parB	0.78571	0.64286
1pdnC	0.6	0.83333
1perL	0.77027	0.85135
1pnrA	0.65385	0.88462
1pueE	0.64	0.8
1pviB	0.62903	0.67742
1pyiA	0.85	0.9
1repC	0.66667	0.65152
1srsA	0.85366	0.86585
1svcP	0.64706	0.70588
1tc3C	0.75	0.88889
1tf3A	0.75	0.80952
1troA	0.72222	0.80556
1tsrB	0.64286	0.82143
1ubdC	0.66176	0.73529
1xbrA	0.62963	0.7037
1yrnA	0.5	0.88462
1ysaC	0.84615	0.94231
1yuiA	0.65385	0.5
2bopA	0.8125	0.9375
2drpA	0.65385	0.75
2gliA	0.625	0.675
2hdcA	0.69444	0.65278
3croL	0.675	0.85
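Per-chain accuracies such as those in Table 3 can be summarized by a mean, a standard deviation, and a histogram. The sketch below does this for the first five rows of the table only, so its output does not reproduce any figure reported here. The helper `histogram`, the bin width of 0.1, and the use of the sample standard deviation are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, stdev

# First five rows of Table 3: chain -> (seq-SVM accuracy, pssm-SVM accuracy).
rows = {
    "1a02F": (0.90909, 0.95455),
    "1a02J": (0.72727, 0.95455),
    "1a02N": (0.70455, 0.86364),
    "1a74A": (0.67969, 0.59375),
    "1aayA": (0.73438, 0.82812),
}

seq_acc = [s for s, _ in rows.values()]
pssm_acc = [p for _, p in rows.values()]

def histogram(values, n_bins=10):
    """Count values in equal-width bins over [0, 1]; 1.0 goes in the last bin."""
    counts = Counter(min(int(v * n_bins + 1e-9), n_bins - 1) for v in values)
    return [counts.get(b, 0) for b in range(n_bins)]

# Number of chains for which the PSSM-based classifier is more accurate.
improved = sum(1 for s, p in rows.values() if p > s)

print(f"seq-SVM:  {mean(seq_acc):.3f} +/- {stdev(seq_acc):.3f}")
print(f"pssm-SVM: {mean(pssm_acc):.3f} +/- {stdev(pssm_acc):.3f}")
print(f"pssm-SVM more accurate on {improved} of {len(rows)} chains")
print("seq-SVM bins: ", histogram(seq_acc))
print("pssm-SVM bins:", histogram(pssm_acc))
```

Extending `rows` to all chains in the table would give the full per-classifier summary.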

Histograms that provide a graphic summary of the results reported in Table 3

Figure 2.

Figure 3.

References

Ahmad,S., Gromiha,M.M. and Sarai,A. (2004) Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics, 20, 477-486.

Ahmad,S. and Sarai,A. (2005) PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics, 6, 33-38.

Kuznetsov,I.B., Gou,Z., Li,R. and Hwang,S. (2006) Using evolutionary and structural information to predict DNA-binding sites on DNA-binding proteins. Proteins, 64, 19-27.

le Cessie,S. and van Houwelingen,J.C. (1992) Ridge estimators in logistic regression. Appl. Statist., 41, 191-201.

Vapnik,V.N. (1998) Statistical Learning Theory. John Wiley and Sons, New York.

Wang,L. and Brown,S.J. (2006) BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Res., 34, W243-248.

Yan,C., Terribilini,M., Wu,F., Jernigan,R.L., Dobbs,D. and Honavar,V. (2006) Predicting DNA-binding sites of proteins from amino acid sequence. BMC Bioinformatics, 7, 262.

Zhu,J. and Hastie,T. (2005) Kernel logistic regression and the import vector machine. J. Comp. Graph. Stat., 14, 185-205.