Phylogenetic trees are constructed from data on homologous traits, analogous traits, and molecular evidence, where polymeric molecules (DNA, RNA, and proteins) are used to establish relationships.
Basic steps
Data selection – Amino acid or nucleotide
In the case of a gene phylogeny, we first need to decide whether to work with nucleotide or amino acid data.
Either type of data can be used to generate a tree.
Some argue that it is better to use amino acid data because the redundancy of the genetic code means we will be able to recover more conserved sites in our alignment. However, any analysis we perform with amino acid data is more time-consuming than its nucleotide counterpart. This is because each site has 20 possible amino acid states, as opposed to only 4 nucleotide states.
Other scientists prefer to use nucleotide data. As mentioned above, nucleotide analyses are faster. In addition, nucleotide data contain more information about how our sequences have evolved: since 3 nucleotides code for 1 amino acid, substitutions that leave the amino acid unchanged are still visible at the nucleotide level.
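The trade-off described above can be illustrated with a minimal Python sketch. It builds the standard genetic code table and translates two made-up coding sequences (invented here purely for illustration) that differ only at third codon positions. The function names are our own and do not come from any particular package.

```python
from itertools import product

# Standard genetic code, with codons enumerated using bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a coding DNA sequence (length divisible by 3) into protein."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))

def count_differences(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

# Two made-up coding sequences that differ only at third codon positions.
seq_a = "ATGGCTCTGAAAGGT"
seq_b = "ATGGCCCTTAAGGGA"

prot_a = translate(seq_a)
prot_b = translate(seq_b)

print("nucleotide differences:", count_differences(seq_a, seq_b))
print("proteins:", prot_a, prot_b)
print("amino acid differences:", count_differences(prot_a, prot_b))
```

In this toy case the nucleotide sequences differ at 4 sites while the translated proteins are identical, so the amino acid alignment looks fully conserved and the nucleotide alignment retains the extra signal.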
Alignment
Alignment programs shift our data by inserting gaps to line up all the homologous (or conserved) sites into vertical columns. There are many alignment programs; the most common and well-supported are:
MUSCLE
MAFFT
Mesquite
It is best to try at least 2 different parameter settings, if not more, and then view the resulting alignments to determine which is better.
A good program for visualizing our alignment, and converting it into different file formats (e.g. Nexus, PHYLIP, etc.) is Mesquite.
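Viewing the alignment by eye remains the main check, but a rough numerical summary can help when comparing the output of two parameter settings. The sketch below is one possible approach, not a feature of any of the programs named above: it parses FASTA text and counts fully conserved, gap-free columns and the overall gap fraction. The two small alignments are invented for illustration and represent the same three sequences aligned in two different ways.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dict."""
    records = {}
    name = None
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def summarize_alignment(records):
    """Return (conserved_columns, gap_fraction) for aligned sequences."""
    seqs = list(records.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must all have the same length")
    columns = list(zip(*seqs))
    # A column counts as conserved if it has no gaps and a single residue.
    conserved = sum(1 for col in columns if "-" not in col and len(set(col)) == 1)
    gaps = sum(s.count("-") for s in seqs)
    return conserved, gaps / (length * len(seqs))

# Made-up example: the same three sequences with two different gap placements.
alignment_a = """
>taxon1
ATGCT-ACG
>taxon2
ATGCTTACG
>taxon3
ATG-TTACG
"""
alignment_b = """
>taxon1
ATGCTACG-
>taxon2
ATGCTTACG
>taxon3
ATGTTACG-
"""

summary_a = summarize_alignment(parse_fasta(alignment_a))
summary_b = summarize_alignment(parse_fasta(alignment_b))
print("alignment A:", summary_a)
print("alignment B:", summary_b)
```

Both alignments contain the same number of gaps, but the first lines up 7 conserved columns and the second only 4. A count like this is only a crude guide and does not replace inspecting the alignment.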
Model Selection
Our phylogenetic tree will be more accurate when we use the correct model of evolution. Each model has a set of parameters that describe the substitution rates in our data. In other words, a program estimates which model best captures the way our data set is evolving or changing. This model is used later to build our tree.
When using nucleotide data, use jModelTest. For amino acid data, submit our jobs to the ProtTest server. For both, response time will vary depending on the quantity and divergence of our sequences.
Once the model test has been performed, look at the output and select the model with the lowest AIC (Akaike Information Criterion) and/or BIC (Bayesian Information Criterion). Both criteria reward a good fit to the data and penalize extra parameters, so a lower AIC/BIC value means that less information is estimated to be lost when this specific model is used to describe our data.
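To make the selection rule concrete, the following sketch computes AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L for three nucleotide models. The log-likelihoods and the alignment length are hypothetical numbers chosen for illustration, not output from jModelTest. The parameter counts cover only the free substitution-model parameters, on the assumption that branch-length parameters are the same for every model and therefore do not change the ranking.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_sites) - 2 * log_likelihood

# Hypothetical results: model name -> (log-likelihood, free model parameters).
candidates = {
    "JC69": (-2450.0, 0),
    "HKY85": (-2390.0, 4),
    "GTR": (-2388.5, 8),
}
n_sites = 500  # assumed alignment length

aic_scores = {m: aic(lnl, k) for m, (lnl, k) in candidates.items()}
bic_scores = {m: bic(lnl, k, n_sites) for m, (lnl, k) in candidates.items()}

best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)

for model in candidates:
    print(f"{model:6s} AIC={aic_scores[model]:.1f} BIC={bic_scores[model]:.1f}")
print("lowest AIC:", best_aic, "| lowest BIC:", best_bic)
```

With these made-up numbers, GTR has the highest likelihood, but its four extra parameters improve the fit too little to offset the penalty, so HKY85 has the lowest AIC and the lowest BIC.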
Tree building
Maximum likelihood (ML) treats the best tree as the one under which our observed data are most probable, given a certain model. ML takes into account all the data we have generated so far in order to construct our final tree. It is a commonly used tree-building method that gives us a single tree as our output.
Possible ML servers with interfaces to submit our job:
RAxML
PhyML
GARLI
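The programs above search over whole trees, which is far too involved to reproduce here. The idea of maximizing a likelihood under a model can still be shown in the simplest possible case: two sequences joined by a single branch under the Jukes-Cantor (JC69) model. The sketch below is an illustration under that assumption, not how any of the listed programs is implemented. It scans a grid of candidate distances and compares the best one with the known closed-form JC69 estimate. The two sequences are invented.

```python
import math

def jc69_log_likelihood(distance, n_same, n_diff):
    """Log-likelihood of a pairwise alignment under JC69.

    distance is the expected number of substitutions per site (must be > 0).
    """
    decay = math.exp(-4.0 * distance / 3.0)
    p_same = 0.25 + 0.75 * decay  # probability that a site is unchanged
    p_diff = 0.25 - 0.25 * decay  # probability of one specific different base
    # Each site also carries a factor 1/4 for the base frequency at one end.
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)

def jc69_analytic_distance(n_same, n_diff):
    """Closed-form ML distance under JC69: -3/4 ln(1 - 4p/3)."""
    p = n_diff / (n_same + n_diff)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Made-up aligned sequences, 20 sites long.
seq_x = "ACGTACGTACGTACGTACGT"
seq_y = "ACGTACGAACGTTCGTACCA"

n_diff = sum(a != b for a, b in zip(seq_x, seq_y))
n_same = len(seq_x) - n_diff

# Grid search over candidate distances from 0.001 to 2.000.
grid = [i / 1000 for i in range(1, 2001)]
best_distance = max(grid, key=lambda d: jc69_log_likelihood(d, n_same, n_diff))

print("differing sites:", n_diff)
print("grid-search ML distance:", best_distance)
print("closed-form ML distance:", jc69_analytic_distance(n_same, n_diff))
```

The grid search and the closed-form estimate agree to within the grid spacing. A full ML analysis applies the same principle while also optimizing the tree topology, every branch length, and the model parameters.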
Making it pretty
Once we have created our tree, it is time to make it publication-ready.
To visualize, re-root and perform minor edits to our tree, use FigTree.
If we need to change the taxa names, font, or size, use Adobe Illustrator or a similar image manipulation program. Make sure our taxa names can be clearly read and the bootstrap values are visible above each node.
Not all data will require such robust analysis. But we will not know for certain how much better or different a tree produced from a more robust analysis will be until this analysis is performed.
Tree-Building Methods
Distance-Based Methods
UPGMA Method
Neighbor-Joining Method (NJ)
Weighted Neighbor-Joining (Weighbor)
Fitch-Margoliash (FM) and Minimum Evolution (ME) Methods
Character-Based Methods
Maximum parsimony (MP)
Maximum Likelihood (ML)
Tree-Rooting Methods
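Of the distance-based methods listed above, UPGMA is the simplest to sketch: it repeatedly merges the two closest clusters and sets the distance from the new cluster to every other cluster to the size-weighted average of the two merged clusters' distances. The minimal Python version below returns only the topology as a Newick string, without branch lengths. The function name, the input layout, and the distance matrix are our own illustrative choices.

```python
def upgma(names, dist):
    """Cluster taxa with UPGMA and return a Newick string (topology only).

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) to the distance between a and b
    """
    # Each current cluster maps its Newick label to the number of taxa in it.
    sizes = {name: 1 for name in names}
    d = dict(dist)
    while len(sizes) > 1:
        # Find the closest pair of current clusters.
        pair = min(
            (frozenset((a, b)) for a in sizes for b in sizes if a < b),
            key=lambda p: d[p],
        )
        a, b = sorted(pair)
        merged = f"({a},{b})"
        # Size-weighted average distance from the new cluster to each other one.
        for c in sizes:
            if c not in pair:
                d[frozenset((merged, c))] = (
                    sizes[a] * d[frozenset((a, c))]
                    + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes[a] + sizes[b]
        del sizes[a], sizes[b]
    return next(iter(sizes)) + ";"

# Made-up pairwise distances between four taxa.
raw = {
    ("A", "B"): 2.0,
    ("A", "C"): 6.0,
    ("A", "D"): 10.0,
    ("B", "C"): 6.0,
    ("B", "D"): 10.0,
    ("C", "D"): 10.0,
}
dist = {frozenset(pair): value for pair, value in raw.items()}

tree = upgma(["A", "B", "C", "D"], dist)
print(tree)
```

For this matrix, A and B are joined first, then C, then D. UPGMA assumes a constant rate of change along all lineages, which is one reason the other methods in the list exist.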