DP-Bind: a web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins

Main page: http://lcg.rit.albany.edu/dp-bind

Summary

DP-Bind is a web server for sequence-based prediction of DNA-binding residues in proteins that bind double-stranded DNA. DP-Bind implements three machine learning methods: support vector machine (SVM), kernel logistic regression (KLR), and penalized logistic regression (PLR). These methods were trained on a non-redundant dataset of 62 experimentally determined structures of protein-dsDNA complexes. Prediction can be performed using either the input sequence alone or a profile of evolutionary conservation of the input sequence, in the form of a PSI-BLAST position-specific scoring matrix (PSSM) generated automatically by the web server. The outputs of the three individual methods are combined into a consensus prediction to help identify positions predicted with a high level of confidence. DP-Bind reports two consensus predictions. The first is the majority consensus, obtained by majority voting: if two methods predict a given position as 'DNA-binding' and the third predicts it as 'non-binding', the majority consensus label is 'DNA-binding'. The second is the strict consensus, obtained by unanimous agreement: if one method disagrees with the other two, no consensus label is assigned to that sequence position (denoted N/A). Thus, the strict consensus retains only the subset of high-confidence predictions on which all three methods agree.

The average performance of predictors that utilize evolutionary information

  Predictor            Accuracy         Sensitivity      Specificity
  pssm-SVM             76.0% +/- 9.1    76.9% +/- 18.6   74.8% +/- 12.5
  pssm-KLR             77.2% +/- 9.3    76.4% +/- 18.5   76.6% +/- 11.2
  pssm-PLR             73.0% +/- 8.8    73.3% +/- 18.4   71.8% +/- 12.8
  Majority consensus   76.4% +/- 9.0    76.9% +/- 18.6   75.3% +/- 12.0
  Strict consensus*    80.0% +/- 9.4    79.1% +/- 19.4   78.6% +/- 12.7


The average performance of BLOSUM62 predictors that do not utilize evolutionary information

  Predictor            Accuracy         Sensitivity      Specificity
  seq-SVM              68.2% +/- 6.6    70.4% +/- 16.5   66.8% +/- 9.2
  seq-KLR              68.6% +/- 5.5    66.8% +/- 15.4   68.9% +/- 7.8
  seq-PLR              67.8% +/- 6.9    69.0% +/- 13.3   67.0% +/- 9.0
  Majority consensus   69.1% +/- 6.2    69.9% +/- 16.1   68.2% +/- 8.6
  Strict consensus*    72.2% +/- 7.2    73.1% +/- 16.3   71.4% +/- 9.8

*Please note that in the strict consensus some sequence positions are not assigned a label when one of the prediction methods disagrees with the other two. As a result, the total number of residues used to assess the performance of the strict consensus is usually smaller than that used to assess the performance of the individual methods and the majority consensus. See Hwang et al (2007) for details.

For more details on performance evaluation, comparison to other methods for predicting DNA-binding sites, and how our predictors perform on proteins from particular structural classes, please refer to the Supplementary Information.

More information on the three prediction algorithms can be found in the following references:

1. Input sequence format

This web server accepts one or more amino acid sequences in FASTA format, which consists of a header line and amino acid sequence string lines, as follows:
>1A02:F C-FOS
MKRRIRRERNKMAAAKSRNRRRELTDTLQAETDQLEDEKSALQTEIANLLKEKEKL
>1A74:A INTRON-ENCODED ENDONUCLEASE I-PPOI
MALTNAQILAVIDSWEETVGQFPVITHHVPLGGGLQGTLHCYEIPLAAPYGVGFAKNGPT
RWQYKRTINQVVHRWGSHTVPFLLEPDNINGKTCTASHLCHNTRCHNPLHLCWESLDDNK
GRNWCPGPNGGCVHAVVCLRQGPLYGPGATVAGPQQRGSHFVV
>seq1
qilavidsweetvgqfpvithhvplggglqgtlhcyeiplaapygvgfakngpt
>
INQVVHRWGSHTVPFLLEPDNINGKTCTASHLCHNTRCHNPLHLCWESLDDNK
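
Input in this format could be read along the following lines. This is a minimal Python sketch for illustration only, not the server's own code; the function name `parse_fasta` is hypothetical. A header with no identifier, as in the last record above, is kept as an empty string.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples.

    Hypothetical helper for illustration; not part of DP-Bind.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines of one record are concatenated.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>1A02:F C-FOS
MKRRIRRERNKMAAAKSRNRRRELTDTLQAETDQLEDEKSALQTEIANLLKEKEKL
>seq1
qilavidsweetvgqfpvithhvplggglqgtlhcyeiplaapygvgfakngpt
>
INQVVHRWGSHTVPFLLEPDNINGKTCTASHLCHNTRCHNPLHLCWESLDDNK
"""
records = parse_fasta(example)
```

Sequence lines that appear before any header are ignored in this sketch; how the server treats such input is not stated here.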

1.1. FASTA header line

The FASTA header line begins with a ">" character, followed by an optional identifier and description, as follows:
  >1A02:F C-FOS
If you have a raw sequence that consists of sequence string alone, without a FASTA header line, just add a header line consisting of either:

1.2. FASTA sequence string

The sequence string should use the standard IUB/IUPAC one-letter amino acid codes, which include
twenty characters for the twenty amino acids:
  A  Alanine          M  Methionine
  C  Cysteine         N  Asparagine
  D  Aspartate        P  Proline
  E  Glutamate        Q  Glutamine
  F  Phenylalanine    R  Arginine
  G  Glycine          S  Serine
  H  Histidine        T  Threonine
  I  Isoleucine       V  Valine
  K  Lysine           W  Tryptophan
  L  Leucine          Y  Tyrosine
and the following three characters:
  B  Aspartate or Asparagine 
  Z  Glutamate or Glutamine
  X  Unknown
These one-letter codes can be in either upper-case or lower-case.
Under sequence-based encoding, this web server treats B and Z as if they were Aspartate (D) and Glutamate (E), respectively. For PSSM-based encoding, no such conversion by the web server is necessary.
Residues in X whose amino acid types are unknown are excluded from prediction. Specifically this web server will not predict the binding labels of:

2. Number and length of sequences

The maximum number of sequences that can be submitted at a time is 1. The maximum allowed sequence length is 1000.
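
The limits above, together with the residue-code rules in section 1.2, can be sketched as a pre-submission check. This is an illustrative assumption about how a client might validate input, not the server's implementation; the names `prepare_sequence` and `check_submission` are hypothetical, and rejecting any other character with an error is also an assumption.

```python
MAX_SEQUENCES = 1     # limit stated in section 2
MAX_LENGTH = 1000     # limit stated in section 2
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")
AMBIGUOUS = {"B": "D", "Z": "E"}  # applied for sequence-based encoding only


def prepare_sequence(seq, sequence_based=True):
    """Return (normalized sequence, 0-based positions of X residues).

    Positions of X are reported because such residues are excluded
    from prediction. Hypothetical helper, not part of DP-Bind.
    """
    seq = seq.upper()  # codes may be upper-case or lower-case
    if len(seq) > MAX_LENGTH:
        raise ValueError(f"sequence longer than {MAX_LENGTH} residues")
    normalized, excluded = [], []
    for i, aa in enumerate(seq):
        if aa in STANDARD:
            normalized.append(aa)
        elif aa in AMBIGUOUS:
            normalized.append(AMBIGUOUS[aa] if sequence_based else aa)
        elif aa == "X":
            normalized.append(aa)
            excluded.append(i)
        else:
            # Assumption: anything else is treated as invalid input.
            raise ValueError(f"invalid residue code {aa!r} at position {i + 1}")
    return "".join(normalized), excluded


def check_submission(sequences):
    """Raise ValueError if more sequences are given than the stated limit."""
    if len(sequences) > MAX_SEQUENCES:
        raise ValueError(f"at most {MAX_SEQUENCES} sequence(s) per submission")
```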

3. Encoding methods

Here, encoding refers to a way to convert the letter string of an amino acid sequence into a numerical representation so that a numerical computation can be done with it. This web server uses the following three encoding methods, all of which convert the one-letter amino acid code at each residue into a vector with 20 entries.
We advise users to use the default method, PSSM-based encoding, which yields the most accurate predictions. If users nevertheless want to choose a sequence-based encoding instead, the average accuracy of the BLOSUM62 encoding has been shown to be slightly higher than that of the binary encoding (Kuznetsov et al., 2006).

3.1. PSSM-based encoding

On average, the PSSM-based encoding gives the most accurate prediction results. It is, however, the slowest method because PSI-BLAST may take several minutes to finish for a single protein.
In this method, the vector is derived from the position-specific scores in the PSSM (Position-Specific Scoring Matrix) generated by PSI-BLAST. A position-specific score describes how well an amino acid type is evolutionarily conserved at that position across the proteins that are homologous to the input protein.
Suppose that a PSSM was generated by PSI-BLAST for a given protein. Each row in the PSSM contains the log-odds scores of the 20 amino acid types at one position. For example, a row for residue A in a PSSM may contain the following vector:
(5, -4, -3, -4, -3, -3, -3, -2, -4, -4, -3, -2, -4, -4, -4, 4, -2, -5, 2, -3)
Each entry in this vector is scaled between 0 and 1 using a logistic function to produce a normalized vector:
(0.9933, 0.0180, 0.0474, 0.0180, 0.0474, 0.0474, 0.0474, 0.1192, 0.0180, 0.0180, 0.0474, 0.1192, 0.0180, 0.0180, 0.0180, 0.9820, 0.1192, 0.0067, 0.8808, 0.0474)
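
The normalized values shown above are consistent with the standard logistic function 1 / (1 + exp(-x)) applied to each score. The short Python sketch below reproduces them under that assumption; the function name `logistic` is illustrative, and the exact form used by the server is not stated here.

```python
import math


def logistic(x):
    """Standard logistic function, mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Example PSSM row from the text above.
pssm_row = [5, -4, -3, -4, -3, -3, -3, -2, -4, -4,
            -3, -2, -4, -4, -4, 4, -2, -5, 2, -3]

# Scale each entry and round to four decimals, as in the text.
scaled = [round(logistic(score), 4) for score in pssm_row]
```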

3.2. Sequence-based BLOSUM62 encoding

In this method, the vector is derived from the scores in the BLOSUM62 substitution matrix. A substitution score describes how evolutionarily compatible one amino acid type is with another, as estimated from alignments of related protein sequences clustered at 62% sequence identity. For example, the row for residue A in the BLOSUM62 matrix is the following vector:
(4, 0, -2, -1, -2, 0, -2, -1, -1, -1, -1, -2, -1, -1, -1, 1, 0, 0, -3, -2)
Each entry in this vector is scaled between 0 and 1 using a logistic function to give rise to a normalized vector:
(0.9820, 0.5000, 0.1192, 0.2689, 0.1192, 0.5000, 0.1192, 0.2689, 0.2689, 0.2689, 0.2689, 0.1192, 0.2689, 0.2689, 0.2689, 0.7311, 0.5000, 0.5000, 0.0474, 0.1192)
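
To make the vector easier to read, the sketch below labels each entry with an amino acid code and applies the same scaling. The column order (alphabetical by one-letter code) is an assumption: it is not stated in the text, although it agrees with the standard BLOSUM62 scores for alanine. The standard logistic function is likewise assumed.

```python
import math

# Assumed column order: alphabetical by one-letter code (not stated in the text).
COLUMNS = "ACDEFGHIKLMNPQRSTVWY"

# BLOSUM62 row for residue A, as given in the text above.
blosum62_row_a = [4, 0, -2, -1, -2, 0, -2, -1, -1, -1,
                  -1, -2, -1, -1, -1, 1, 0, 0, -3, -2]

# Map each amino acid code to its logistic-scaled score, rounded to 4 decimals.
scaled_by_residue = {
    aa: round(1.0 / (1.0 + math.exp(-score)), 4)
    for aa, score in zip(COLUMNS, blosum62_row_a)
}
```

Under this assumed ordering, alanine paired with itself scores highest (0.9820) and alanine paired with tryptophan scores lowest (0.0474).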

3.3. Sequence-based binary encoding

In this method, each entry in the vector corresponds to one of the 20 amino acid types. The single entry corresponding to the observed amino acid type is set to 1, and the remaining 19 entries are set to 0. For example, the residue A is encoded by the following vector:
(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
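
A binary encoding of this kind can be sketched in a few lines of Python. The ordering of the 20 positions (alphabetical by one-letter code) and the function name `one_hot` are assumptions made for illustration; the text only shows that A occupies the first position.

```python
# Assumed ordering of the 20 vector positions (alphabetical by one-letter code).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(aa):
    """Return the 20-entry binary vector for a one-letter amino acid code."""
    aa = aa.upper()
    if aa not in ALPHABET:
        raise ValueError(f"not a standard amino acid code: {aa!r}")
    return [1 if aa == code else 0 for code in ALPHABET]


vector_a = one_hot("A")
```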

4. Retrieval of the results of the prediction

We recommend that users enter their e-mail address so that the web server can automatically e-mail the results when the prediction finishes.
Alternatively, users may select the second option to get a URL link to their job and check the prediction results manually.
Prediction results are kept on the web server for one day after submission and are then deleted.

5. Output format

Figure 1 shows a sample prediction result that consists of three parts. Part 1 describes the output format itself. Part 2 shows the submitted sequence in FASTA format. Part 3 shows the results of the prediction.

Labels in columns S_LBL, K_LBL, and P_LBL correspond to binding labels predicted by the three prediction methods: Support Vector Machine (SVM), Kernel Logistic Regression (KLR), and Penalized Logistic Regression (PLR), respectively. The labels 1 and 0 stand for DNA-binding and non-binding residue, respectively.

Binding labels in columns MAJ_CON and STR_CON show the consensus predictions:

1) MAJ_CON is the majority consensus label obtained by the majority vote of the three labels in columns S_LBL, K_LBL, and P_LBL (i.e., MAJ_CON is the label predicted by at least two methods).
2) STR_CON is the strict consensus label obtained by a unanimous agreement of the three labels in columns S_LBL, K_LBL, and P_LBL. If one of the three labels in S_LBL, K_LBL, P_LBL is different from the other two, the strict consensus is not assigned (NA).
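
The two consensus rules can be expressed compactly. The following Python sketch is an illustration of the rules as described, not the server's code; the function names are hypothetical, and labels are assumed to be the integers 1 (DNA-binding) and 0 (non-binding).

```python
def majority_consensus(s_lbl, k_lbl, p_lbl):
    """Label predicted by at least two of the three methods (MAJ_CON)."""
    return 1 if (s_lbl + k_lbl + p_lbl) >= 2 else 0


def strict_consensus(s_lbl, k_lbl, p_lbl):
    """Label on which all three methods agree, otherwise "NA" (STR_CON)."""
    return s_lbl if s_lbl == k_lbl == p_lbl else "NA"


# Two methods say DNA-binding, one says non-binding.
maj = majority_consensus(1, 1, 0)
strict = strict_consensus(1, 1, 0)
```

In this example the majority consensus is 1, while the strict consensus is not assigned because the three labels are not unanimous.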

Columns S_PRB, K_PRB, and P_PRB show the probabilities of the predicted labels. These probabilities lie in the range 0.5 to 1.0. The higher the probability, the greater the confidence in the predicted label. A probability close to 0.5 indicates a low-confidence prediction.

We suggest using the majority consensus as the default method. If the user wishes to use an individual method, we suggest either the KLR or the SVM classifier, which show the highest average accuracy. The strict consensus should be used as a supplement to identify residues predicted with high confidence.


Figure 1. A sample prediction result, consisting of three parts: [Part 1] a header describing its format, [Part 2] submitted FASTA sequence, and [Part 3] a prediction result in a columnar format. The result was truncated to fit the page.

6. Expected wait time

The expected wait time depends on the encoding method and the number of previously submitted jobs in the queue. The PSSM encoding can take 5 to 10 minutes or more per sequence. Please use the 'Check the status of the job queue' link at the top of the input form to see the estimated wait time (this is a very rough estimate).

If you do not receive results within 48 hours, please contact Igor Kuznetsov.

7. An example

An example prediction result is provided below for protein 1AZQ chain A, whose actual binding labels can be obtained from the PDB. Note that this protein is included in the training set. Thus, the performance measures for the web server prediction will differ from those obtained in the cross-validation experiments.
Figure 2 shows the web server prediction output using the sequence-based BLOSUM62 encoding option. Figure 3 shows the web server prediction output using the PSSM-based encoding option.
Appended in the rightmost column are the actual binding labels (Binding, Non-binding) obtained from the atomic coordinates in the corresponding PDB file.
Predicted labels from Support Vector Machine (S_LBL), Kernel Logistic Regression (K_LBL), Penalized Logistic Regression (P_LBL), Majority Consensus (MAJ_CON), and Strict Consensus (STR_CON) are shaded in light blue.
The number of TP (True Positives), TN (True Negatives), FP (False Positives), and FN (False Negatives) can be counted for each prediction as follows.

The following performance measures can be obtained for 1AZQ:A

  Encoding                  Predictor                       Accuracy   Sensitivity   Specificity
  Sequence-based BLOSUM62   Support Vector Machine          77.3%      84.2%         74.5%
                            Kernel Logistic Regression      75.8%      84.2%         72.3%
                            Penalized Logistic Regression   71.2%      73.7%         70.2%
                            Majority Consensus              78.8%      84.2%         76.6%
                            *Strict Consensus               78.9%      86.7%         75.7%
  PSSM-based                Support Vector Machine          78.8%      84.2%         76.6%
                            Kernel Logistic Regression      83.3%      94.7%         78.7%
                            Penalized Logistic Regression   72.7%      47.4%         83.0%
                            Majority Consensus              80.3%      84.2%         78.7%
                            *Strict Consensus               84.6%      90.0%         83.3%
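
The underlying counts are not listed here. As a hedged illustration, the Python sketch below uses the standard definitions of accuracy, sensitivity, and specificity in terms of TP, TN, FP, and FN. The counts passed to the hypothetical `performance` function are back-calculated assumptions chosen to be consistent with the sequence-based Support Vector Machine row (66 residues in total); they are not values reported by the authors.

```python
def performance(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity) as fractions."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # fraction of binding residues recovered
    specificity = tn / (tn + fp)   # fraction of non-binding residues recovered
    return accuracy, sensitivity, specificity


# Assumed counts, consistent with 77.3% / 84.2% / 74.5% in the table above.
acc, sens, spec = performance(tp=16, tn=35, fp=12, fn=3)
percentages = [round(100 * value, 1) for value in (acc, sens, spec)]
```

For the strict consensus, positions labelled NA would be left out of all four counts, which is why its totals are smaller than those of the other predictors.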


Figure 2. BLOSUM62-based Web prediction for protein chain 1AZQ:A, along with actual binding labels appended.





Figure 3. PSSM-based Web prediction for protein chain 1AZQ:A, along with actual binding labels appended.

8. Citation

If you use this web-server, please cite the following articles:

For more info please refer to the Supplementary Information.