The multiple sequence alignment is built up in essentially three steps (MaxHom; Sander & Schneider, Proteins, 1991, 9, 56-68).
- The protein database (currently SWISS-PROT) is searched by a fast alignment program (currently BLASTP).
- In sweep 1, sequences are aligned consecutively to the search sequence by a standard dynamic programming method. After each sequence has been added, a profile is compiled and used to align the next sequence.
- In sweep 2, after all sequences with significant homology have been picked from the BLASTP output, the profile is recompiled and the dynamic programming algorithm aligns the sequences consecutively once again, this time using the conservation profile derived at the end of sweep 1.
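The profile compilation used in both sweeps can be illustrated with a minimal sketch. It is not the MaxHom implementation: the function names (compile_profile, column_score), the use of plain residue frequencies as profile values, and the toy sequences are all assumptions made for illustration, and the dynamic programming step itself is left out (the sequences are taken to be aligned already).

```python
from collections import Counter


def compile_profile(aligned_seqs):
    """Return per-column residue frequencies for equal-length aligned sequences.

    Gap characters ('-') are ignored when counting residues in a column.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for col in range(length):
        residues = [s[col] for s in aligned_seqs if s[col] != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


def column_score(column, residue):
    """Hypothetical score: the frequency of the residue in that profile column."""
    return column.get(residue, 0.0)


# Toy data (assumed): the search sequence first, then two pre-aligned homologues.
alignment = ["MKVL"]
for new_seq in ["MKIL", "MRV-"]:
    profile = compile_profile(alignment)  # recompiled after every addition
    # A real implementation would align new_seq to the profile here by
    # dynamic programming; this sketch assumes it is already aligned.
    alignment.append(new_seq)

final_profile = compile_profile(alignment)  # the profile carried into sweep 2
```

In this sketch the profile grows more informative with every sequence added, which is why a second pass with the final profile can place the early sequences more reliably than the first pass did.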
PSI-BLAST is a fast, yet sensitive database search program.
We run iterated PSI-BLAST on a subset of the BIG database with SWISS-PROT + TrEMBL + PDB sequences. The number of iterations, the cut-off thresholds and the details of which sequences are used from BIG have been optimised in our group.
Secondary structure is predicted by a system of neural networks with an expected average accuracy of more than 78% for the three states helix, strand and loop (Rost, 2000, unpublished). Evaluated on the same data set, PROFsec reaches a three-state accuracy 6-8 percentage points higher than PHDsec.
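Three-state per-residue accuracy is the percentage of residues whose predicted state matches the observed one. The sketch below shows that calculation; the function name, the one-letter state codes (H for helix, E for strand, L for loop) and the example strings are assumptions for illustration, not part of the server's output format.

```python
def three_state_accuracy(observed, predicted):
    """Percentage of residues predicted in the correct state (H, E or L)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have equal length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Made-up example: 8 of 10 residues agree.
observed = "HHHHLLEEEE"
predicted = "HHHLLLEEEL"
q3 = three_state_accuracy(observed, predicted)
```

A difference of 6-8 percentage points on this scale means 6-8 more residues per hundred assigned to the correct state.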
Transmembrane helices in integral membrane proteins are predicted by a system of neural networks. A shortcoming of the network system is that it often predicts helices that are too long; these are cut by an empirical filter. The final prediction (Rost et al., Protein Science, 1995, 4, 521-533; evaluation of accuracy) has an expected per-residue accuracy of about 95%. The rate of false positives, i.e., transmembrane helices predicted in globular proteins, is about 2% (Rost et al. 1996).
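The idea of cutting over-long helices can be sketched as follows. The actual filter rules are not given here, so everything in this block is an assumption: the function name, the length threshold of 35 residues, the one-letter codes (H for helix, L for loop), and the choice to split a long helix once at its midpoint.

```python
import re


def split_long_helices(prediction, max_len=35, helix="H", loop="L"):
    """Split each helix run longer than max_len by placing one loop residue
    at its midpoint. Only a single cut is made per run (hypothetical rule)."""

    def cut(match):
        run = match.group(0)
        if len(run) <= max_len:
            return run
        mid = len(run) // 2
        return run[:mid] + loop + run[mid + 1:]

    return re.sub(re.escape(helix) + "+", cut, prediction)


# A 40-residue helix becomes two helices of 20 and 19 residues.
raw = "LL" + "H" * 40 + "LL"
filtered = split_long_helices(raw)
```

The string length is preserved, so the filtered prediction still lines up residue by residue with the sequence.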
The neural network prediction of transmembrane helices (PHDhtm) is refined by a dynamic programming-like algorithm. This method correctly predicted all transmembrane helices for 89% of the 131 proteins used in a cross-validation test; more than 98% of the transmembrane helices were correctly predicted. The output of this method is used to predict topology, i.e., the orientation of the N-terminus with respect to the membrane. The expected accuracy of the topology prediction is > 86%. Prediction accuracy is higher than average for eukaryotic proteins and lower than average for prokaryotic proteins. PHDtopology is more accurate than all other methods tested on identical data sets (Rost, Casadio & Fariselli, 1996a and 1996b; evaluation of accuracy).
Euclid is a tool for the automatic classification of sequences in functional classes using their database annotations. The Euclid system is based on a simple learning procedure from examples provided by human experts (Tamames, J., Ouzounis, C., Casari, G., Sander, C. & Valencia, A. (1998). EUCLID: automatic classification of proteins in functional classes by their database annotations. Bioinformatics, 14, 542-3.)
The following description is from the original COILS site:
COILS is a program that compares a sequence to a database of known parallel two-stranded coiled-coils and derives a similarity score. By comparing this score to the distribution of scores in globular and coiled-coil proteins, the program then calculates the probability that the sequence will adopt a coiled-coil conformation.
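The last step, turning a similarity score into a probability, can be sketched as below. This is an illustration under stated assumptions, not the COILS code: both score distributions are modelled as Gaussians, the prior ratio of globular to coiled-coil residues is a parameter (the default of 30 is an assumption here), and the means and standard deviations in the example are placeholders rather than published values.

```python
import math


def gaussian(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def coiled_coil_probability(score, cc_mean, cc_sd, glob_mean, glob_sd, ratio=30.0):
    """Probability that a residue with this score lies in a coiled coil.

    ratio is the assumed number of globular residues per coiled-coil residue.
    """
    g_cc = gaussian(score, cc_mean, cc_sd)
    g_glob = gaussian(score, glob_mean, glob_sd)
    denominator = ratio * g_glob + g_cc
    if denominator == 0.0:  # both densities underflowed; no evidence either way
        return 0.0
    return g_cc / denominator


# Placeholder distribution parameters, chosen only to make the example run.
params = dict(cc_mean=1.6, cc_sd=0.2, glob_mean=0.8, glob_sd=0.2)
p_low = coiled_coil_probability(0.8, **params)   # score typical of globular proteins
p_high = coiled_coil_probability(1.6, **params)  # score typical of coiled coils
```

Because of the prior ratio, a score has to sit well inside the coiled-coil distribution before the probability approaches 1.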
The following description is from the original SEG documentation (JC Wootton & S Federhen, 1996, Meth Enzymology, 266, 554-571):
SEG divides sequences into contrasting segments of low-complexity and high-complexity. Low-complexity segments defined by the algorithm represent "simple sequences" or "compositionally-biased regions".
Locally-optimized low-complexity segments are produced at defined levels of stringency, based on formal definitions of local compositional complexity. The segment lengths and the number of segments per sequence are determined automatically by the algorithm.
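A rough way to picture local compositional complexity is the Shannon entropy of the residue composition in a sliding window. The sketch below covers only that first idea and is not the SEG algorithm: it omits SEG's extension and local optimization stages, and the function names, the window length of 12 and the trigger value of 2.2 bits are assumptions for illustration.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_mask(seq, window=12, trigger=2.2):
    """Flag every residue that lies in at least one window whose entropy
    is at or below the trigger value (simplified, hypothetical rule)."""
    mask = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) <= trigger:
            for i in range(start, start + window):
                mask[i] = True
    return mask


# Toy sequence (assumed): a poly-Q stretch followed by 12 distinct residues.
seq = "Q" * 12 + "ACDEFGHIKLMN"
mask = low_complexity_mask(seq)
masked = "".join("x" if flagged else aa for aa, flagged in zip(seq, mask))
```

Lowering the trigger value in this sketch flags fewer, more strongly biased regions, which corresponds to the notion of stringency levels in the description above.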