------------------------ BIAS v.1.0 -------------------------

Copyright (c) 2004-2005 by Igor B. Kuznetsov. All Rights Reserved.
IKuznetsov (at) albany.edu
(please replace (at) with @ to obtain a complete E-mail address)

Licensor hereby grants to any person a permission to use, copy and
distribute any part of this software program for educational and
non-profit purposes, without fee, and without a written agreement,
provided that this copyright notice appears in all copies. You cannot
modify this source code without an explicit written permission of the
author. Any unauthorized commercial distribution of any part of this
software program is strictly prohibited.

This software is provided on an "AS IS" BASIS and WITHOUT WARRANTY,
either express or implied, including, without limitation, the
warranties of NON-INFRINGEMENT, MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY OF THE SOFTWARE
IS WITH YOU.

Limitation of Liability. Under no circumstances and under no legal
theory shall the Licensor be liable to any person for any direct,
indirect, special, incidental, or consequential damages of any
character arising as a result of the use of this software.

------------------------------------------------------------------------

Compiling PBIAS:

Extract source files from the tar file:

    tar -xvf pbias_source.tar

Place the following source files in the same directory:
pbias.cpp, fsegments_discrete.cpp, math.c, fasta_file.c, fsegments.h,
lib.h.

If you are using the GNU compiler, type:

    g++ -o pbias pbias.cpp -lm

Compiling NBIAS:

Extract source files from the tar file:

    tar -xvf nbias_source.tar

Place the following source files in the same directory:
nbias.cpp, fsegments_dna.cpp, math.c, fasta_file.c, fsegments.h,
lib.h.
If you are using the GNU compiler, type:

    g++ -o nbias nbias.cpp -lm

NOTE: The GNU compiler for Windows can be downloaded from www.mingw.org

------------------------------------------------------------------------

PBIAS usage:

The program requires two arguments: the name of the input file and the
name of the output file. The input file must be in FASTA format. If the
input file contains more than one sequence, only the first one will be
used.

Optional arguments:

-W:size - size of the scanning window (integer). In this case a
    sliding window of fixed length `size` will be used instead of
    linking positions into clusters. Two overlapping or adjacent
    scanning windows will be merged if the p-value for both is below
    the threshold given by the -P switch.
-E:distance - linkage distance (integer) used to merge positions into
    clusters. Fragments closer than this number of positions will be
    merged (4 by default).
-S:number - the original sequence will be shuffled `number` times
    (integer) to estimate the significance of the clusters.
-G:number - SPROT (-T) or PDB (-X) frequencies will be used to
    generate `number` (integer) random sequences used to estimate the
    significance of the clusters.
-T - use SPROT frequencies (default).
-X - use PDB frequencies.
-I - residue frequencies will be estimated from the input sequence.
    This option can be used to account for the effect of global
    compositional bias when estimating the significance of local
    clusters.
-F - output the masked sequence in FASTA format. Segments with p-value
    less than the threshold given by the -M switch will be masked
    with X characters.
-L - same as -F, but segments will be masked using lower-case letters.
-N - do not merge overlapping/adjacent scanning windows.
-K - estimate the linkage distance from the expected number of
    occurrences. In this case the linkage distance, d, is estimated as
    round(2./(3.*p)), where p is the probability of observing a
    residue from the sub-alphabet given by the -A switch.
-A:string - amino acid sub-alphabet (default: GPSTDNH).
-P:threshold - p-value threshold for merging adjacent scanning windows
    (0.05 by default). Used in conjunction with the -W switch.
-M:threshold - p-value threshold for masking FASTA output (0.05 by
    default).

Example:

    PBIAS test.fas test_out -E:5 -X -F -A:ILVF -M:0.01

In this case, the sequence in the input file `test.fas` will be
analyzed. All segments will be reported in the output file
`test_out.seg`. The program will search for clusters of residues I, L,
V and F. A linkage distance of 5 and PDB frequencies will be used. The
input sequence with masked segments that have p-value <= 0.01 will be
reported in the file `test_out.fas`. Segments will be masked with Xs.

------------------------------------------------------------------------

Format of the output file *.SEG:

The first two lines give the arguments used to run the program:
sequence length, linkage distance, sub-alphabet, type of residue
frequencies, probability of observing a residue from the sub-alphabet.

Then, the input sequence in FASTA format is printed.

The last part gives a table with all clusters of the residues from the
user-supplied sub-alphabet (defined with the -A switch) found in the
input sequence:

Column 1 - number of the segment (from 1 to n, where n is the total
    number of segments).
Columns 2 and 3 - start and end positions of the segment.
Column 4 - length of the segment.
Column 5 - number of residues from the sub-alphabet found in the
    segment.
Column 6 - p-value from Eq. 2.

At the end of each row the sequence of the segment is printed.

The last line in the file gives the estimates of the global
compositional bias (Eq. 11). It shows the total number of residues from
the sub-alphabet found in the input sequence and two p-values.
The first is the exact p-value from the binomial distribution (Eq. 12);
the second is the p-value obtained using the normal approximation to
the binomial distribution (in certain cases of very long sequences the
exact estimate cannot be computed due to overflow errors). A (+) or (-)
sign at the end of the line denotes excess or lack of the residues from
the sub-alphabet.

------------------------------------------------------------------------

NBIAS usage:

The same as PBIAS, except for the following switches:

-R:string - user-supplied sub-alphabet of residue types
-A:decimal number - probability of A
-T:decimal number - probability of T
-G:decimal number - probability of G
-C:decimal number - probability of C
-B:number - generate `number` (integer) random sequences used to
    estimate the significance of the clusters.

By default, NBIAS estimates residue frequencies from the input
sequence.

------------------------------------------------------------------------

A companion Perl script, MPBIAS.PL, is provided to perform masking of
homopolymer tracts in a protein sequence. The approach is similar to
that implemented in the CAST algorithm (Promponas et al., 2000,
Bioinformatics, 16(10):915-922). This script searches for sub-sequences
in which one of the 20 amino acid types is significantly
over-represented and then replaces all positions in such sub-sequences
with the 'X' character. Type 'perldoc mpbias.pl' to see command line
arguments for the script.

PLEASE CITE THE FOLLOWING ARTICLE:

I.Kuznetsov and S.Hwang (2006) A novel sensitive method for the
detection of user-defined compositional bias in biological sequences.
Bioinformatics, 22(9):1055-1063
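
------------------------------------------------------------------------

Illustration of the -K linkage distance estimate:

The Python sketch below is not part of the BIAS distribution. It only
illustrates the formula d = round(2./(3.*p)) quoted for the -K switch.
The function names and the example sequence are hypothetical. Two
things are assumed here rather than taken from the source code: that p
is the fraction of sub-alphabet residues in the input sequence (as it
would be with the -I switch), and that rounding follows C's round(),
i.e. halves are rounded up for positive values.

```python
import math


def subalphabet_probability(sequence, subalphabet="GPSTDNH"):
    """Fraction of positions in `sequence` occupied by sub-alphabet residues.

    Assumption: frequencies are estimated from the sequence itself,
    as described for the -I switch.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    members = set(subalphabet.upper())
    hits = sum(1 for residue in seq if residue in members)
    return hits / len(seq)


def linkage_distance(p):
    """Linkage distance d = round(2 / (3 p)) from the -K description.

    Assumption: halves round up, as C's round() does for positive
    values (Python's built-in round() rounds halves to even instead).
    """
    if p <= 0.0:
        raise ValueError("p must be positive")
    return int(math.floor(2.0 / (3.0 * p) + 0.5))


if __name__ == "__main__":
    # Hypothetical example sequence, not taken from the BIAS test data.
    example = "MGSSHHHHHHSSGLVPRGSHMASMTGGQQMGRGS"
    p = subalphabet_probability(example)
    print("p =", p, "d =", linkage_distance(p))
```

A rarer sub-alphabet (smaller p) gives a larger linkage distance, so
sparse residues can still be linked into one cluster, while a common
sub-alphabet gives a short distance.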