DP-Bind: a web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins

Supplementary information

Main page: http://lcg.rit.albany.edu/dp-bind

 

Description of the three machine learning algorithms

For our two-class classification problem (DNA-binding versus non-binding residues), we applied three machine learning algorithms: support vector machine (SVM) (Vapnik, 1998), kernel logistic regression (KLR) (Zhu and Hastie, 2005), and penalized logistic regression (PLR) (le Cessie and van Houwelingen, 1992). SVM is a margin-maximizing classifier that performs a linear classification in the feature space, which corresponds to a non-linear classification in the original data space. The feature space is obtained by transforming data from the original data space with a kernel function. KLR and PLR are likewise margin-maximizing classifiers. Their optimization problem is similar to that of SVM, except that they use the logistic loss (the negative binomial log-likelihood) instead of the hinge loss (L1 loss) of SVM. Both KLR and PLR classify residues based on the conditional class probability. The difference is that PLR performs the classification in the original data space, whereas KLR performs it in the feature space by using a kernel function. In other words, KLR is a non-linear version of PLR. For both SVM and KLR we used the radial basis function kernel, which showed the best performance on our dataset.
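As an illustration only (this is not the server's code), the sketch below writes out the radial basis function kernel together with the standard textbook forms of the SVM loss and the logistic-regression loss, both expressed as functions of the margin m = y·f(x) with labels y in {-1, +1}. The function names and the `gamma` parameter name are assumptions made for this example.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def hinge_loss(margin):
    """SVM loss: zero once the margin reaches 1, linear below that."""
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Logistic regression loss: log(1 + exp(-margin))."""
    return math.log1p(math.exp(-margin))

def class_probability(score):
    """Conditional probability of the positive class given a decision value f(x)."""
    return 1.0 / (1.0 + math.exp(-score))
```

Both losses penalize small or negative margins, but only the logistic form yields a decision value that maps directly to a class probability, which is the property KLR and PLR rely on.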

Method of performance evaluation

We used leave-one-protein-out cross-validation to train and test each classifier. In this procedure, 61 protein complexes are used for training and the remaining complex is used for testing. This process is repeated 62 times so that each protein complex is tested once. The optimal parameters of each classifier were taken to be those that achieved the best average accuracy across the 62 cross-validation experiments. To assess a classifier's performance, we computed the mean and standard deviation of the following performance measures across the 62 cross-validation experiments:

(i) accuracy: ACC=(TP+TN)/(TP+FP+TN+FN)

(ii) sensitivity: SN=TP/(TP+FN)

(iii) specificity: SP=TN/(FP+TN)

where TP is the number of true positives (correctly predicted DNA-binding residues), FN is the number of false negatives (DNA-binding residues predicted as non-binding), TN is the number of true negatives (correctly predicted non-binding residues), and FP is the number of false positives (non-binding residues predicted as DNA-binding).
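The evaluation procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: `train_and_predict` is a hypothetical callable standing in for any of the three classifiers, the data layout (a dict mapping a protein id to features and labels) is invented for the example, and the sample standard deviation is used because the text does not say which estimator was applied. A protein with no binding residues would make SN undefined; that case is not handled here.

```python
from statistics import mean, stdev

def confusion_counts(true_labels, predicted_labels):
    """Count TP, FP, TN, FN for binary labels (1 = DNA-binding, 0 = non-binding)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def performance(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity as defined in (i)-(iii)."""
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "SN": tp / (tp + fn),
        "SP": tn / (fp + tn),
    }

def leave_one_protein_out(proteins, train_and_predict):
    """Hold out each protein once and return one dict of measures per fold."""
    results = []
    for test_id in proteins:
        training = {pid: data for pid, data in proteins.items() if pid != test_id}
        test_features, test_labels = proteins[test_id]
        predictions = train_and_predict(training, test_features)
        results.append(performance(*confusion_counts(test_labels, predictions)))
    return results

def summarize(results, measure):
    """Mean and sample standard deviation of one measure across folds."""
    values = [fold[measure] for fold in results]
    return mean(values), stdev(values)

# Toy data: three proteins with one feature per residue (purely illustrative).
toy = {
    "p1": ([0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0]),
    "p2": ([0.7, 0.6, 0.3, 0.2], [1, 0, 0, 0]),
    "p3": ([0.4, 0.9, 0.1, 0.3], [1, 1, 0, 0]),
}

def threshold_rule(training, features):
    """Stand-in classifier: ignores the training set and thresholds the feature."""
    return [1 if x > 0.5 else 0 for x in features]

folds = leave_one_protein_out(toy, threshold_rule)
acc_mean, acc_sd = summarize(folds, "ACC")
```

With the 62 complexes of the study, the loop would run 62 times, each time training on the other 61.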

Output format

Figure 1. A sample prediction result, consisting of three parts: a header describing the format, the input FASTA sequence, and the prediction result in columnar format. The result was truncated to fit the page. Refer to the server's help page for details.

Performance measures from previous studies

Table 1. Performance measures from previous studies. For DBS-PRED, results from SeqPredNet are shown, since this is the version implemented as the web server. Similarly, results from PDNA-RDN are shown for DBS-PSSM. In three studies (DBS-PRED, DBS-PSSM, and BindN), as well as in our current study, specificity is calculated as TN/(FP+TN), whereas in Yan et al. (2006) it is calculated as TP/(FP+TP).

 

Method                             Accuracy (%)   Sensitivity (%)   Specificity (%)
DBS-PRED (Ahmad et al., 2004)      64.5           68.6              63.4
DBS-PSSM (Ahmad and Sarai, 2005)   64.0           67.1              63.3
BindN (Wang and Brown, 2006)       70.31          69.40             70.47
Yan et al. (2006)                  78             44                41
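The two definitions of specificity mentioned in the Table 1 caption measure different things, so the values in the last column are not directly comparable across rows. The sketch below applies both definitions to the same confusion counts; the counts are hypothetical and chosen only to show how far the two numbers can diverge when non-binding residues greatly outnumber binding ones. TP/(FP+TP) is the quantity more commonly called precision or positive predictive value.

```python
def specificity_tn_rate(tp, fp, tn, fn):
    """Specificity as TN/(FP+TN): fraction of non-binding residues predicted correctly."""
    return tn / (fp + tn)

def specificity_tp_fraction(tp, fp, tn, fn):
    """Specificity as TP/(FP+TP): fraction of predicted binding residues that are correct."""
    return tp / (fp + tp)

# Hypothetical counts for an unbalanced set: 100 binding and 900 non-binding residues.
counts = dict(tp=40, fp=60, tn=840, fn=60)

tn_rate = specificity_tn_rate(**counts)          # 840 / 900
tp_fraction = specificity_tp_fraction(**counts)  # 40 / 100
```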

How classifiers perform on different structural classes of proteins and on individual proteins

Table 2. The average accuracy of SVM classifiers tested on the four major structural classes of proteins (balanced training/test set; for details see Kuznetsov et al., 2006). Each cell shows the fraction of residues predicted correctly.

CATH structural class    seq-SVM accuracy (BLOSUM62 encoding)    pssm-SVM accuracy
                         average ± std. deviation                average ± std. deviation
alpha                    0.72 ± 0.102                            0.81 ± 0.098
alpha/beta               0.67 ± 0.084                            0.72 ± 0.098
beta                     0.67 ± 0.042                            0.79 ± 0.070
few regular structure    0.71 ± 0.061                            0.88 ± 0.065


Table 3. The accuracy of SVM classifiers for each individual protein chain (as determined by leave-one-out cross-validation on the balanced dataset used to train and test the classifiers; for details see Kuznetsov et al., 2006). Each cell shows the fraction of residues predicted correctly.

Protein PDB id    seq-SVM accuracy (BLOSUM62 encoding)    pssm-SVM accuracy
1a02F             0.90909                                 0.95455
1a02J             0.72727                                 0.95455
1a02N             0.70455                                 0.86364
1a74A             0.67969                                 0.59375
1aayA             0.73438                                 0.82812
1azqA             0.71053                                 0.76316
1b3tA             0.72727                                 0.75758
1bf5A             0.65217                                 0.69565
1bhmA             0.56452                                 0.72581
1bl0A             0.67241                                 0.87931
1c0wB             0.61111                                 0.77778
1cdwA             0.44595                                 0.64865
1cf7A             0.67647                                 0.70588
1cjgA             0.57292                                 0.76042
1cmaA             0.82                                    0.72
1d02A             0.65                                    0.7
1d66A             0.56667                                 0.86667
1dp7P             0.875                                   0.875
1ecrA             0.67241                                 0.77586
1fjlA             0.73684                                 0.89474
1gatA             0.56522                                 0.65217
1gccA             0.63636                                 0.70455
1gdtA             0.60938                                 0.83594
1hcqA             0.7459                                  0.72131
1hcrA             0.80769                                 0.88462
1hddC             0.63333                                 0.91667
1hloA             0.73077                                 0.90385
1hryA             0.57692                                 0.65385
1hwtC             0.64286                                 0.77857
1if1A             0.63725                                 0.78431
1ignA             0.83333                                 0.77778
1ihfA             0.78125                                 0.8125
1ihfB             0.7                                     0.83333
1j59A             0.65789                                 0.76316
1lmb4             0.80435                                 0.76087
1mdyA             0.63636                                 0.95455
1meyC             0.69355                                 0.85484
1mhdA             0.76923                                 0.65385
1mnmB             0.77778                                 0.88889
1mnmD             0.875                                   0.95833
1mseC             0.69318                                 0.75
1octC             0.7625                                  0.7375
1parB             0.78571                                 0.64286
1pdnC             0.6                                     0.83333
1perL             0.77027                                 0.85135
1pnrA             0.65385                                 0.88462
1pueE             0.64                                    0.8
1pviB             0.62903                                 0.67742
1pyiA             0.85                                    0.9
1repC             0.66667                                 0.65152
1srsA             0.85366                                 0.86585
1svcP             0.64706                                 0.70588
1tc3C             0.75                                    0.88889
1tf3A             0.75                                    0.80952
1troA             0.72222                                 0.80556
1tsrB             0.64286                                 0.82143
1ubdC             0.66176                                 0.73529
1xbrA             0.62963                                 0.7037
1yrnA             0.5                                     0.88462
1ysaC             0.84615                                 0.94231
1yuiA             0.65385                                 0.5
2bopA             0.8125                                  0.9375
2drpA             0.65385                                 0.75
2gliA             0.625                                   0.675
2hdcA             0.69444                                 0.65278
3croL             0.675                                   0.85


Histograms that provide a graphic summary of the results reported in Table 3

Figure 2.

Figure 3.
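A histogram of this kind can be produced by counting per-chain accuracies in equal-width bins. The sketch below is only an illustration: the bin count of 10 is an assumption (the binning used in the figures is not stated), and the input is limited to the pssm-SVM accuracies of the first six chains in Table 3.

```python
def histogram(values, n_bins=10):
    """Count values in n_bins equal-width bins over [0, 1]; 1.0 goes in the last bin."""
    counts = [0] * n_bins
    for v in values:
        index = min(int(v * n_bins), n_bins - 1)
        counts[index] += 1
    return counts

# pssm-SVM accuracies for 1a02F, 1a02J, 1a02N, 1a74A, 1aayA and 1azqA (Table 3).
pssm_accuracy = [0.95455, 0.95455, 0.86364, 0.59375, 0.82812, 0.76316]
counts = histogram(pssm_accuracy)
```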

 

References

Ahmad,S., Gromiha,M.M. and Sarai,A. (2004) Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics, 20, 477-486.

Ahmad,S. and Sarai,A. (2005) PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics, 6, 33-38.

Kuznetsov,I.B., Gou,Z., Li,R. and Hwang,S. (2006) Using evolutionary and structural information to predict DNA-binding sites on DNA-binding proteins. Proteins, 64, 19-27.

le Cessie,S. and van Houwelingen,J.C. (1992) Ridge estimators in logistic regression. Appl. Statist., 41, 191-201.

Vapnik,V.N. (1998) Statistical Learning Theory. John Wiley and Sons, New York.

Wang,L. and Brown,S.J. (2006) BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Res., 34, W243-248.

Yan,C., Terribilini,M., Wu,F., Jernigan,R.L., Dobbs,D. and Honavar,V. (2006) Predicting DNA-binding sites of proteins from amino acid sequence. BMC Bioinformatics, 7, 262.

Zhu,J. and Hastie,T. (2005) Kernel logistic regression and the import vector machine. J. Comp. Graph. Stat., 14, 185-205.