The quality of automated gene prediction in microbial organisms has improved steadily over the past decade, but there is still room for improvement. Increasing the number of correct identifications, both of genes and of the translation initiation site for each gene, and reducing the overall number of false positives are all desirable goals.
Drawing on years of experience in manually curating genomes for the Joint Genome Institute, we developed a new gene prediction algorithm called Prodigal (PROkaryotic DYnamic programming Gene-finding ALgorithm). Prodigal focuses specifically on three goals: improved gene structure prediction, improved translation initiation site recognition, and reduced false positives.
Prodigal is known as a very fast, highly accurate gene finder that also performs well on high GC content genomes. It is based on log-likelihood functions and does not use Hidden or Interpolated Markov Models.
Features
Prodigal provides fast, accurate protein-coding gene predictions in GFF3, Genbank, or Sequin table format.
Prodigal runs smoothly on finished genomes, draft genomes, and metagenomes.
Prodigal analyzes the E. coli K-12 genome in 10 seconds on a modern MacBook Pro.
Prodigal handles gaps and partial genes.
Prodigal predicts the correct translation initiation site for most genes, and can output information about every potential start site in the genome, including confidence score, RBS motif, and much more.
Prodigal is an extremely fast gene recognition tool that can analyze an entire microbial genome in 30 seconds or less.
Prodigal correctly locates the 3' end of every gene in the experimentally verified Ecogene data set.
Prodigal's false positive rate compares favorably with other gene identification programs, and usually falls under 5%.
Prodigal performs well even in high GC genomes, with a perfect match (5'+3') to over 90% of the Pseudomonas aeruginosa curated annotations.
Prodigal can run in metagenomic mode and analyze sequences even when the organism is unknown.
The basic steps of the Prodigal algorithm can be summarized as follows:
Because high GC genomes contain relatively few A and T bases, they have far fewer stop codons. As a result, long ORFs occur simply by chance in high GC genomes, and many of them aren't real genes at all. Prodigal addresses this problem with GC frame plot based training, wherein it examines all the ORFs in a genome looking for a bias for G or C in the 1st, 2nd, and 3rd positions of each codon.
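The idea of looking for a G or C bias by codon position can be illustrated with a small Python sketch. This is not Prodigal's implementation: the function names, the "most often GC-richest position" summary, and the assumption that each ORF string is already in frame are all simplifications made here for illustration.

```python
from collections import Counter


def gc_by_codon_position(orf: str) -> list[float]:
    """Return the GC fraction at codon positions 1, 2, and 3 of an in-frame ORF."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    fractions = []
    for pos in range(3):
        bases = orf[pos:usable:3]  # every third base, starting at this codon position
        gc = sum(1 for base in bases if base in "GC")
        fractions.append(gc / len(bases) if bases else 0.0)
    return fractions


def preferred_gc_position(orfs: list[str]) -> int:
    """Return the codon position (1-3) that is most often the GC-richest.

    Hypothetical summary statistic; expects a non-empty list of ORFs.
    """
    wins = Counter()
    for orf in orfs:
        fractions = gc_by_codon_position(orf)
        wins[fractions.index(max(fractions)) + 1] += 1
    return wins.most_common(1)[0][0]
```

For example, gc_by_codon_position("ATGGCGACC") gives GC fractions of 1/3, 2/3, and 1.0 for the three codon positions, so this ORF would vote for position 3.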
Prodigal gathers dicodon (hexamer) statistics for all the genes in its initial dynamic programming model.
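A minimal sketch of how hexamer statistics could be turned into a log-likelihood coding score is shown below. It assumes a log-odds ratio of gene hexamer frequency to background frequency with a simple pseudocount. The function names, the pseudocount, and the choice of background are illustrative assumptions, not details taken from Prodigal itself.

```python
import math
from collections import Counter


def hexamer_counts(seqs):
    """Count every overlapping 6-mer in a list of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - 5):
            counts[seq[i:i + 6]] += 1
    return counts


def hexamer_log_odds(gene_seqs, background_seqs, pseudocount=1.0):
    """Return log(P(hexamer | gene) / P(hexamer | background)) per observed hexamer."""
    gene = hexamer_counts(gene_seqs)
    back = hexamer_counts(background_seqs)
    n_possible = 4 ** 6  # assumes sequences contain only A, C, G, T
    gene_total = sum(gene.values()) + pseudocount * n_possible
    back_total = sum(back.values()) + pseudocount * n_possible
    scores = {}
    for hexamer in set(gene) | set(back):
        p_gene = (gene[hexamer] + pseudocount) / gene_total
        p_back = (back[hexamer] + pseudocount) / back_total
        scores[hexamer] = math.log(p_gene / p_back)
    return scores


def coding_score(seq, scores):
    """Sum the log-odds of the in-frame hexamers of a candidate gene."""
    seq = seq.upper()
    # Step by 3 so that each hexamer starts on a codon boundary (a dicodon).
    return sum(scores.get(seq[i:i + 6], 0.0) for i in range(0, len(seq) - 5, 3))
```

Hexamers that are more common in the training genes than in the background get positive scores, so a candidate made mostly of gene-like dicodons accumulates a high coding score.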
Once Prodigal has scored all potential candidates in a given ORF, it then uses a "sharpening" of the coding score, wherein it penalizes all potential start candidates that lie downstream from a higher-scoring start.
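The sharpening step can be sketched as follows. The text above does not state the size of the penalty, so this example assumes one plausible rule: a start that scores below the best start upstream of it loses the difference between the two scores. The function name and the forward-strand coordinate convention are also assumptions.

```python
def sharpen_start_scores(starts):
    """Penalize starts lying downstream of a higher-scoring start in the same ORF.

    `starts` is a list of (position, coding_score) pairs on the forward strand,
    so a larger position means further downstream (a shorter gene).
    """
    sharpened = []
    best_upstream = float("-inf")  # best unpenalized score seen so far
    for position, score in sorted(starts):
        if score < best_upstream:
            # Assumed penalty: subtract the gap to the best upstream score.
            score -= best_upstream - score
        else:
            best_upstream = score
        sharpened.append((position, score))
    return sharpened
```

With starts at positions 100, 130, and 160 scoring 50, 40, and 55, the start at 130 drops to 30 because the longer candidate at 100 scores higher, while the other two scores are unchanged.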
A static length factor is added to the coding score. This factor is higher in low GC genomes, and lower in high GC genomes.
For every open reading frame containing a gene with a coding score above a certain threshold, the translation initiation site with the highest coding score is recorded.
A final dynamic programming pass is performed over the set of all start-stop pairs in the genome.
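To give a feel for dynamic programming over start-stop pairs, the sketch below uses classic weighted interval scheduling: it picks the set of non-overlapping candidate genes with the highest total score. This is a deliberately simplified stand-in, not Prodigal's actual model. It ignores strand, treats any overlap as forbidden, and uses a hypothetical (start, stop, score) tuple layout with inclusive integer coordinates.

```python
from bisect import bisect_right


def best_gene_set(candidates):
    """Pick non-overlapping (start, stop, score) candidates with maximal total score."""
    genes = sorted(candidates, key=lambda g: g[1])  # order by stop coordinate
    stops = [g[1] for g in genes]
    # best[i] is the best total score using only the first i genes.
    best = [0.0] * (len(genes) + 1)
    # choice[i] is the prefix length to jump back to if gene i was picked.
    choice = [None] * (len(genes) + 1)
    for i, (start, stop, score) in enumerate(genes, 1):
        # Number of earlier genes that end strictly before this gene starts.
        j = bisect_right(stops, start - 1, 0, i - 1)
        if best[j] + score > best[i - 1]:
            best[i] = best[j] + score
            choice[i] = j
        else:
            best[i] = best[i - 1]
    # Trace back through the table to recover the chosen genes.
    picked = []
    i = len(genes)
    while i > 0:
        if choice[i] is None:
            i -= 1
        else:
            picked.append(genes[i - 1])
            i = choice[i]
    return best[-1], picked[::-1]
```

Given candidates (1, 300, 10.0), (250, 600, 25.0), (301, 500, 8.0), and (601, 900, 12.0), the sketch returns a total of 37.0 from the genes at 250-600 and 601-900, since the single high-scoring candidate beats the two shorter ones it overlaps.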