A major goal of annotation is to describe those sequences and eventually determine how universal they are in the promoters of specific genes. The first step is to describe such sequences in a reference species and use that information for further comparison.
Genome annotation is the process of analyzing the raw sequence of a genome: identifying its genes, determining what those genes do, and describing other relevant genetic and genomic features such as mobile elements, repetitive elements, duplications, and polymorphisms. Once a genome is sequenced, it must be annotated to make sense of it.
Structural annotation consists of the identification of genomic elements:
ORFs and their localization
gene structure
coding regions
location of regulatory motifs
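As an illustration of locating ORFs, the following is a minimal sketch rather than a production gene finder. It assumes the standard start codon ATG and the stop codons TAA, TAG, and TGA, scans only the three forward-strand reading frames, and ignores introns. The function name `find_orfs` and the `min_codons` threshold are hypothetical choices made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs.

    Coordinates are 0-based and `end` is exclusive. The codon count
    includes both the start and the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG not yet closed by a stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Toy sequence: ATG AAA TAG sits in reading frame 2.
print(find_orfs("CCATGAAATAGGG"))
```

A real structural annotation would also scan the reverse complement and, for eukaryotes, would have to model exon and intron boundaries.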
Functional annotation consists of attaching biological information to genomic elements:
biochemical function
biological function
regulation and interactions involved
expression
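One way to picture "attaching biological information to genomic elements" is a record that carries the coordinates found by structural annotation plus an open-ended set of functional labels. The sketch below is only an assumption about how such a record might look. The class `GenomicElement`, its field names, and every value in the usage example are hypothetical placeholders, not real annotations.

```python
from dataclasses import dataclass, field

@dataclass
class GenomicElement:
    """A structurally annotated element with room for functional labels."""
    seqid: str   # chromosome or contig name
    start: int   # 0-based start coordinate
    end: int     # exclusive end coordinate
    strand: str  # "+" or "-"
    kind: str    # e.g. "gene" or "regulatory_motif"
    annotations: dict = field(default_factory=dict)

    def annotate(self, category, value):
        """Attach one piece of biological information under a category."""
        self.annotations.setdefault(category, []).append(value)

# Placeholder values, invented purely for illustration.
gene = GenomicElement("chr1", 1000, 2500, "+", "gene")
gene.annotate("biochemical function", "placeholder: kinase activity")
gene.annotate("expression", "placeholder: root tissue")
print(gene.annotations)
```

Keeping the categories as free-form keys mirrors the list above: biochemical function, biological function, regulation and interactions, and expression can each hold several entries.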
Annotation is necessary because DNA sequencing produces sequences of unknown function. Over the last three decades, genome annotation has evolved from the computational annotation of long protein-coding genes on single genomes (one per species), together with the experimental annotation of short regulatory elements on a small number of them, into the population-level annotation of single nucleotides on thousands of individual genomes (many per species).
In silico, genome annotation consists of describing the function of the product of a predicted gene. This can be achieved using bioinformatics software that combines several kinds of features, including:
(1) signal sensors (e.g., for TATA box, start and stop codon, or poly-A signal detection)
(2) content sensors (e.g., for G+C content, codon usage, or dicodon frequency detection)
(3) similarity detection (e.g., between proteins from closely related organisms, mRNA from the same organism, or reference genomes)
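The first two kinds of sensor listed above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any particular annotation program works: the content sensors simply compute G+C fraction and codon counts, and the signal sensor looks for an exact match to a short signal such as the canonical poly-A signal AATAAA. All function names are hypothetical.

```python
from collections import Counter

def gc_content(seq):
    """Content sensor: fraction of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def codon_usage(cds):
    """Content sensor: count codons in frame 0 of a coding sequence."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))

def find_signal(seq, signal):
    """Signal sensor: 0-based starts of every exact (possibly overlapping) match."""
    seq, signal = seq.upper(), signal.upper()
    hits = []
    pos = seq.find(signal)
    while pos != -1:
        hits.append(pos)
        pos = seq.find(signal, pos + 1)
    return hits

print(gc_content("ATGC"))
print(codon_usage("ATGAAAAAATAG"))
print(find_signal("GGAATAAACC", "AATAAA"))
```

Practical gene finders typically score such signals probabilistically rather than by exact match, so the exact-string search here should be read as the simplest possible stand-in.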
Genome annotation consists of three main steps:
identifying portions of the genome that do not code for proteins,
identifying elements on the genome, a process called gene prediction, and
attaching biological information to these elements.
Annotation Approaches
Nucleotide annotation: The first step of nucleotide annotation is to find a sequence that has the features of a gene. Many eukaryotic genes contain specific features, such as introns that separate exons, that can serve as markers for the discovery process. It is therefore important that gene-finding software recognizes such features properly. A number of programs are available that perform these searches. A key component of each of these programs is a set of sensor algorithms that identify the key structural features. A program might also include other sensors that detect a transcriptional start site or recognize specific GC content.
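As a sketch of a structural sensor for introns, the code below relies on the GT-AG rule: most eukaryotic introns begin with GT and end with AG. It lists every GT...AG span within a length range as a candidate intron. The function name and the `min_len` and `max_len` defaults are assumptions chosen so the toy example fits on one line; real introns are usually far longer, and real programs weigh much more sequence context than these two dinucleotides.

```python
def candidate_introns(seq, min_len=6, max_len=50):
    """Return (start, end) spans that begin with GT and end with AG.

    Coordinates are 0-based with an exclusive end, so seq[start:end]
    is the candidate intron including both dinucleotides.
    """
    seq = seq.upper()
    # Donor sites: positions where an intron could start.
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    # Acceptor sites: exclusive end positions where an intron could stop.
    acceptor_ends = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [
        (d, a)
        for d in donors
        for a in acceptor_ends
        if min_len <= a - d <= max_len
    ]

seq = "AAGTCCCCAGTT"
for start, end in candidate_introns(seq):
    print(start, end, seq[start:end])
```

Because GT and AG are so common, this sensor on its own produces many false candidates, which is why it is combined with other evidence in practice.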
Naming the genes: Once a sequence has been defined as a gene, the next step is to name it. The naming of genes relies upon the significant amount of research that predated genome projects.
Non-gene RNA sequences: Programs are also available that search for non-gene RNA sequences that are important components of the genome. These sequences include the ribosomal RNAs and tRNAs that are essential for protein translation. In addition, the small nuclear RNAs important to processes such as RNA splicing are necessary components of the genome.
A novel search for controlling element motifs: All genes are controlled by sequences upstream of the transcriptional start site. A number of these sequences are important because they represent the sites to which transcription factors, proteins that control gene expression, bind.
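A search for a known controlling motif upstream of a transcriptional start site (TSS) can be sketched as follows. The example assumes the motif is given as an IUPAC consensus string and uses the commonly cited TATA box consensus TATAWAW (W = A or T) purely as an illustration. The function name, the `window` size, and the toy sequence are hypothetical, and only a subset of IUPAC codes is supported.

```python
import re

# Subset of IUPAC nucleotide codes, translated to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}

def motif_hits_upstream(seq, tss, consensus, window=50):
    """Find consensus matches in the `window` bases upstream of `tss`.

    Returns (offset, matched_text) pairs, where offset is the match start
    relative to the TSS (negative means upstream).
    """
    body = "".join(IUPAC[c] for c in consensus.upper())
    # A lookahead with a capture group also reports overlapping matches.
    pattern = re.compile("(?=(" + body + "))")
    start = max(0, tss - window)
    upstream = seq[start:tss].upper()
    return [(m.start() + start - tss, m.group(1)) for m in pattern.finditer(upstream)]

# Toy sequence: a TATA-like motif followed by spacer bases, then the TSS.
seq = "GGGG" + "TATAAAA" + "C" * 25 + "ATGCCC"
print(motif_hits_upstream(seq, tss=36, consensus="TATAWAW"))
```

Running the same search over the promoters of many genes in a reference species is one simple way to begin asking how widely a given motif is shared.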