TOOLS FOR PREDICTION AND ANALYSIS OF PROTEIN-CODING GENE STRUCTURE
GenView 2: A computing system for protein-coding gene prediction
Prediction of human protein-coding genes in newly sequenced DNA has become very important in large genome sequencing projects. The problem is complicated by the exon-intron structure of eukaryotic genes. Classical methods for predicting splice signals and coding regions fail to identify gene structure reliably. Exon prediction methods miss most short exons and cannot reliably define exon/intron boundaries, while splice site prediction finds some real splice sites along with a large number of false sites. The most realistic approach appears to be a combined one that uses information about potential splice sites together with coding potential. Several methods for coding region prediction have been built on this approach (for a review see Milanesi and Rogozin, 1998).
The GenView system (Milanesi et al., 1993) predicts splice signals with a classification approach and coding regions with dicodon statistics. A potential gene structure is then constructed by dynamic programming. Compared with other systems, the main goal of GenView is to minimize the number of false exons predicted in the analyzed DNA (overprediction). GenView has two modes: exon prediction and gene prediction. It has been adapted for human, mouse and Diptera sequences. A mode that reveals sequencing errors (frameshifts and substitutions resulting in stop codons) can help in the analysis of sequences containing artifactual nucleotide substitutions, insertions and deletions.
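The dynamic programming step can be illustrated with a much simplified sketch. It is not GenView's actual algorithm: it assumes each candidate exon already carries a single combined score (a hypothetical stand-in for the splice signal and dicodon evidence), ignores reading-frame compatibility between exons, and simply selects the non-overlapping set of candidates with the highest total score. The names Exon and best_exon_chain are illustrative only.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class Exon(NamedTuple):
    start: int    # first nucleotide of the candidate exon (inclusive)
    end: int      # last nucleotide of the candidate exon (inclusive)
    score: float  # hypothetical combined splice-site + coding score


def best_exon_chain(candidates: List[Exon]) -> List[Exon]:
    """Pick non-overlapping candidate exons with maximal total score.

    Simplified illustration (weighted interval scheduling); frame
    compatibility and intron length limits are deliberately ignored.
    """
    exons = sorted(candidates, key=lambda e: e.end)
    ends = [e.end for e in exons]
    n = len(exons)

    best = [0.0] * (n + 1)    # best[i]: best total score among the first i exons
    take = [False] * (n + 1)  # take[i]: whether exon i is used to reach best[i]
    prev = [0] * (n + 1)      # prev[i]: number of exons ending before exon i starts

    for i, exon in enumerate(exons, start=1):
        # Count earlier exons that end strictly before this one starts.
        p = bisect_right(ends, exon.start - 1, 0, i - 1)
        prev[i] = p
        with_exon = best[p] + exon.score
        if with_exon > best[i - 1]:
            best[i] = with_exon
            take[i] = True
        else:
            best[i] = best[i - 1]

    # Trace back from the end to recover the chosen exons.
    chain = []
    i = n
    while i > 0:
        if take[i]:
            chain.append(exons[i - 1])
            i = prev[i]
        else:
            i -= 1
    chain.reverse()
    return chain
```

Given four candidates at 100-200, 150-300, 210-340 and 350-450, the function drops whichever overlapping candidate lowers the total score and returns the remaining exons in sequence order. A realistic gene assembler would add constraints such as a consistent reading frame across exons.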
For human sequences (Milanesi et al., 1993), the Best EXON prediction mode has very low overprediction, less than 1% (Specificity > 99%), but about 70% of true exons are missed (underprediction; Accuracy is about 30%). If an exon is predicted, the probability that it is a real exon is high. The Best GENE prediction mode has higher overprediction, about 4% (Specificity about 95%), and misses about 30% of true exons (Accuracy is 70%).
For Diptera sequences (Rogozin et al., 1995), the Best EXON prediction mode has very low overprediction, less than 2% (Specificity > 98%), but about 50% of true exons are missed (underprediction; Accuracy is about 50%). If an exon is predicted, the probability that it is a real exon is high. The Best GENE prediction mode has higher overprediction, about 13% (Specificity about 87%), and misses about 27% of true exons (Accuracy is 73%).
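The relationship between the figures quoted above can be made concrete with a small sketch. It assumes, as a reading of the text rather than a definition taken from the cited papers, that Specificity is the fraction of predicted exons that are real (100% minus overprediction), that Accuracy is the fraction of true exons that are found (100% minus underprediction), and that a predicted exon counts as correct only when both of its boundaries match a true exon exactly. The function name exon_level_metrics is hypothetical.

```python
from typing import Dict, Iterable, Tuple

ExonCoords = Tuple[int, int]  # (start, end) of an exon, both inclusive


def exon_level_metrics(true_exons: Iterable[ExonCoords],
                       predicted_exons: Iterable[ExonCoords]) -> Dict[str, float]:
    """Exon-level measures as fractions between 0 and 1.

    Assumption: a prediction is correct only if both boundaries match.
    """
    true_set = set(true_exons)
    pred_set = set(predicted_exons)

    correct = len(true_set & pred_set)  # predicted exons that are real
    false_pred = len(pred_set) - correct  # predicted exons that are not real
    missed = len(true_set) - correct      # real exons that were not predicted

    return {
        # share of predictions that are real exons
        "specificity": correct / len(pred_set) if pred_set else 0.0,
        # share of predictions that are false exons
        "overprediction": false_pred / len(pred_set) if pred_set else 0.0,
        # share of true exons that were found ("Accuracy" in the text)
        "accuracy": correct / len(true_set) if true_set else 0.0,
        # share of true exons that were missed
        "underprediction": missed / len(true_set) if true_set else 0.0,
    }
```

For example, if a gene has ten true exons and a conservative mode predicts only three of them, all with correct boundaries, the sketch reports a specificity of 1.0 and an accuracy of 0.3. This is the same trade-off described for the Best EXON mode: few false exons, at the cost of many missed ones.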
Milanesi, L., Rogozin, I.B. (1998) Prediction of human gene structure. In: "Guide to Human Genome Computing" (2nd ed.) (M.J. Bishop, ed.), Academic Press, Cambridge, pp. 215-259.
Milanesi, L., Kolchanov, N.A., Rogozin, I.B., Ischenko, I.V., Kel, A.E., Orlov, Yu.L., Ponomarenko, M.P., Vezzoni, P. (1993) GenView: a computing tool for protein-coding regions prediction in nucleotide sequences. In: "Proceedings of the Second International Conference on Bioinformatics, Supercomputing and Complex Genome Analysis" (H.A. Lim, J.W. Fickett, C.R. Cantor and R.J. Robbins, eds.), World Scientific Publishing, Singapore, pp. 573-588.
Rogozin, I.B., Kolchanov, N.A., Milanesi, L. (1995) A computing system for protein-coding regions prediction in Diptera nucleotide sequences. Drosophila Information Service, v. 76, pp. 185-187.