How To Discover a Novel Gene Through BLAST
What is a Novel Gene?
Within a gene, the segments that are retained in the mature RNA are known as exons. A gene (genomic nucleotide sequence) that has not yet been annotated or described, or whose exons have not yet been identified in the DNA, is said to be a novel gene.
In practice, a gene is considered novel when its sequence shows no similarity to the genomic nucleotide sequences of other organisms in the database. Several terms are used to characterize such genes: novel gene, orphan gene, and de novo gene.
Orphan gene - A gene that is found only in a single species or lineage and is classified based on its phylogeny.
Novel gene - A gene that has emerged in a specific time frame and is classified based on its age.
De novo gene - A gene that is classified based on its mechanism of emergence, that is, a gene that arose from previously non-coding sequence.
How Are Novel Genes Identified Traditionally vs. Computationally?
Traditionally, novel genes and proteins were identified using techniques from biochemistry and molecular biology. For that purpose, DNA or complementary DNA (cDNA) was cloned from libraries and then sequenced using biochemical techniques.
Computational biologists have since provided online tools such as BLAST (Basic Local Alignment Search Tool), which searches nucleotide and protein query sequences against sequence databases.
BLAST is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. The query is searched against the whole of GenBank or against the other datasets hosted by NCBI. There are four types of BLAST that are summarized in the table given below.
If you want to learn and master BLAST, you can subscribe to our Gray Bioinformatics plan and take our interactive video lectures by visiting BioCode.
How to Find a Novel Gene through BLAST Database Searching?
Discovering a novel gene takes only a few steps. The four steps are given below:
1) Start With a Sequence of a Known Protein:
Begin the search by choosing the sequence of a known protein.
You can choose any known protein for this step. I chose a known protein (BMP-2) of a reptile species (Alligator mississippiensis).
The accession ID of the protein is AFS17419.1.
2) Search Through DNA Database (e.g. HTGS, dbEST, or genomic sequence from a specific organism):
In the second step, search the chosen protein against a DNA database, which could be dbEST, HTGS, or the genome sequence of a specific organism.
I ran TBLASTN with the chosen protein (AFS17419.1) as the query against dbEST, restricted to a specific organism (Equus caballus).
The search returned hits (ESTs) that are possibly homologous to the query protein.
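The same search can also be run from the command line with the BLAST+ `tblastn` program instead of the web interface. The sketch below only assembles the command and does not execute it. The FASTA file name, the output file name, the E-value cutoff, and the remote database name `est` are assumptions made for illustration, not values taken from the search described above.

```python
import shlex


def build_tblastn_command(query_fasta, database="est",
                          organism="Equus caballus", evalue=1e-5,
                          out_path="tblastn_hits.tsv"):
    """Return the argument list for a remote TBLASTN search (not executed here).

    query_fasta, database, evalue and out_path are illustrative assumptions.
    """
    return [
        "tblastn",
        "-query", query_fasta,                      # protein query in FASTA format
        "-db", database,                            # nucleotide database to search
        "-remote",                                  # run the search on NCBI servers
        "-entrez_query", f"{organism}[Organism]",   # restrict hits to one organism
        "-evalue", str(evalue),                     # expectation-value cutoff
        "-outfmt", "6",                             # tab-separated tabular output
        "-out", out_path,
    ]


# Hypothetical file holding the BMP-2 protein sequence (AFS17419.1).
cmd = build_tblastn_command("AFS17419.1.fasta")
print(shlex.join(cmd))
```

Building the command as a list keeps the organism filter, which contains a space, as a single argument; the list could then be passed to `subprocess.run` once BLAST+ is installed.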
3) Find Matches:
In this step, examine the hits returned by TBLASTN and check whether they encode known proteins or proteins that are only related to the query. A hit that encodes a related, rather than an already known, protein is a possible novel gene.
My TBLASTN hits given above show that many genes encode proteins related to the query protein (maximum 95.87% identity).
I chose the first five homologs to show here; they need to be analyzed further to confirm whether any of them is a novel gene.
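Picking the top homologs can be scripted when the hits are saved in BLAST tabular format (`-outfmt 6`), whose twelve default columns are listed in `FIELDS` below. This is a minimal sketch: apart from the accession CX594952.1 and its 95.87% identity, every value in `SAMPLE_ROWS` is made up for illustration and the other hit names are hypothetical placeholders.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Illustrative rows only: alignment coordinates, E-values and bit scores are invented.
SAMPLE_ROWS = [
    ("AFS17419.1", "EST_PLACEHOLDER_B", "88.10", "126", "15", "0", "1", "126", "4", "381", "1e-60", "201"),
    ("AFS17419.1", "CX594952.1", "95.87", "121", "5", "0", "1", "121", "3", "365", "2e-70", "230"),
    ("AFS17419.1", "EST_PLACEHOLDER_C", "74.50", "102", "26", "0", "5", "106", "10", "315", "3e-40", "150"),
]
SAMPLE_TEXT = "\n".join("\t".join(row) for row in SAMPLE_ROWS)


def top_hits(tabular_text, n=5):
    """Parse tabular BLAST output and return the n hits with the highest % identity."""
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=FIELDS, delimiter="\t")
    hits = []
    for row in reader:
        row["pident"] = float(row["pident"])   # percent identity as a number
        row["evalue"] = float(row["evalue"])   # E-value as a number
        hits.append(row)
    return sorted(hits, key=lambda hit: hit["pident"], reverse=True)[:n]


for hit in top_hits(SAMPLE_TEXT, n=5):
    print(hit["sseqid"], hit["pident"], hit["evalue"])
```

Sorting by percent identity mirrors the selection made in this step; sorting by E-value or bit score would be an equally reasonable choice.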
4) Search Your DNA or Protein Against a Protein Database (nr) to Confirm You Have Identified a Novel Gene:
You now have a gene (DNA sequence) that is possibly novel. Search it against the non-redundant protein database with BLASTX and analyze the hits. The non-redundant database (nr) contains non-identical sequences from GenBank CDS translations, PDB, Swiss-Prot, PIR, and PRF. The strengths of nr are that it is comprehensive and frequently updated.
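BLASTX can compare a DNA query with a protein database because it first translates the nucleotide sequence in all six reading frames (three on each strand). The sketch below illustrates only that translation step, using the standard genetic code; it is not BLAST's implementation, and the short example sequence is made up.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; codons with unknown bases become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to the translated protein strings."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Made-up nine-base sequence, used only to show the six frames.
for label, protein in six_frame_translations("ATGGCCTAA").items():
    print(label, protein)
```

Each of the six translated strings is what gets aligned against the protein database, which is why an EST on either strand and in any frame can still match its encoded protein.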
I ran BLASTX with the first EST (CX594952.1), the one with the highest identity, against the non-redundant (nr) protein database.
The hits given in the figure below show the proteins that match the translation of this EST.
I observed that no protein is 100% identical to the searched DNA sequence, which makes it a novel gene.
This indicates that the EST/DNA sequence (CX594952.1) is novel and encodes a novel protein.
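The criterion used in this step (no hit in nr that is 100% identical to the query) can be written as a small check. This is a sketch of that single rule under stated assumptions: the function name, the hit names, and the identity values are hypothetical and are not the actual BLASTX results.

```python
def is_candidate_novel(hits, identity_threshold=100.0):
    """Return True when no hit reaches the identity threshold.

    `hits` is a list of dicts with a numeric 'pident' (percent identity) key.
    An empty list also returns True, since no known protein matched at all.
    """
    return all(hit["pident"] < identity_threshold for hit in hits)


# Hypothetical BLASTX hits; names and identities are illustrative only.
example_hits = [
    {"sseqid": "PROTEIN_PLACEHOLDER_1", "pident": 98.4},
    {"sseqid": "PROTEIN_PLACEHOLDER_2", "pident": 96.1},
]

print(is_candidate_novel(example_hits))
print(is_candidate_novel(example_hits + [{"sseqid": "PROTEIN_PLACEHOLDER_3",
                                           "pident": 100.0}]))
```

The threshold is a parameter so the same check can be repeated with a stricter or looser cutoff when reviewing the hits.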
This is how you can confirm your result and see whether the gene is novel or not. These are the four basic steps for finding a novel gene computationally through BLAST database searching. A summary of my findings is given below.
Finding a novel gene through BLAST is easier when you know which species you are targeting and from which species you are taking the source protein. A novel gene is much easier to find if the target species is distant from the source species and has not been studied much. For example, I chose the source protein from a reptile and searched it against a distant class, a mammal.
If you want to learn more about BLAST searching and phylogenetics, we have many interactive videos available in the Gray Bioinformatics plan, which you can subscribe to right now by visiting https://www.biocode.ltd/