# Clustering Based Method (Tree Construction)

# BioCodeKb - Bioinformatics Knowledgebase

The most popular and frequently used methods of tree building can be classified into two major categories: phenetic methods based on distances and cladistic methods based on characters. The former measure the pairwise distance (dissimilarity) between genes, whose actual value depends on the distance definition used, and construct the tree entirely from the resulting distance matrix. The latter evaluate the possible trees and seek the one that is optimal under a chosen criterion of evolutionary change.

The construction of a graphical phylogenetic tree reveals similarities as well as dissimilarities among organisms [4]. To group similar organisms into clusters in the tree, various clustering methods are applied. Such a tree also shows the appearance of organisms in relation to time.

**UPGMA Method**

This method follows a clustering procedure:

(1) Assume that initially each species is a cluster on its own.

(2) Join the two closest clusters and recalculate the distance from the joined pair to every other cluster by taking the average.

(3) Repeat this process until all species are connected in a single cluster.

Strictly speaking, this algorithm is phenetic and does not aim to show evolutionary descent. It assigns equal weight to every distance and assumes a molecular clock, that is, a constant rate of evolution across lineages. WPGMA is a similar algorithm but weights the distances differently. The UPGMA method is simple and fast.

The basic assumption of the UPGMA method is that all taxa evolve at a constant rate and are therefore equally distant from the root, implying that a molecular clock is in effect. Real data rarely meet this assumption, so UPGMA often produces erroneous tree topologies. Nevertheless, owing to its speed of calculation, it is used extensively in clustering analysis of DNA microarray data.
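The clustering procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `upgma`, the nested-tuple tree representation, and the example distance matrix are all assumptions made for this sketch.

```python
def upgma(labels, matrix):
    """Cluster taxa with UPGMA.

    labels: list of taxon names.
    matrix: symmetric list-of-lists of pairwise distances.
    Returns (tree, root_height), where tree is a nested tuple of labels.
    """
    n = len(labels)
    # Step 1: every taxon starts as its own cluster.
    # Each cluster is stored as (subtree, number_of_leaves).
    clusters = {i: (labels[i], 1) for i in range(n)}
    # Distances between active clusters, keyed by (smaller_id, larger_id).
    dist = {(i, j): float(matrix[i][j])
            for i in range(n) for j in range(i + 1, n)}
    next_id = n
    height = 0.0
    while len(clusters) > 1:
        # Step 2: join the two closest clusters.
        (a, b), d_ab = min(dist.items(), key=lambda item: item[1])
        tree_a, n_a = clusters.pop(a)
        tree_b, n_b = clusters.pop(b)
        height = d_ab / 2.0  # the new node sits halfway between the pair
        # Distance from the merged cluster to each remaining cluster is the
        # average over all leaf pairs (a size-weighted mean of the old rows).
        new_dist = {}
        for c in clusters:
            d_ac = dist[tuple(sorted((a, c)))]
            d_bc = dist[tuple(sorted((b, c)))]
            new_dist[(c, next_id)] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        # Drop every entry that involves the two merged clusters.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new_dist)
        clusters[next_id] = ((tree_a, tree_b), n_a + n_b)
        next_id += 1  # Step 3: repeat until one cluster remains
    (tree, _), = clusters.values()
    return tree, height


# Hypothetical, clock-like example distances for four taxa.
labels = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 4, 6],
    [2, 0, 4, 6],
    [4, 4, 0, 6],
    [6, 6, 6, 0],
]
tree, root_height = upgma(labels, matrix)
print(tree, root_height)
```

Replacing the size-weighted mean with a plain `(d_ac + d_bc) / 2` would give the WPGMA variant mentioned above.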

**Neighbor Joining Method (NJ)**

The neighbor-joining (NJ) method is somewhat similar to UPGMA in that it builds a tree by using stepwise reduced distance matrices. However, the NJ method does not assume the taxa to be equidistant from the root. It corrects for unequal evolutionary rates between sequences by using a conversion step. This conversion requires the calculation of "r-values" and "transformed r-values".
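The text does not give explicit formulas for the conversion, so the following is one common way of writing it rather than necessarily the exact definition intended here. The symbols are assumed notation: $d_{ij}$ is the distance between taxa $i$ and $j$, $n$ is the number of taxa currently in the matrix, $r_i$ is the r-value of taxon $i$, and $d'_{ij}$ is the transformed value.

$$
r_i = \frac{1}{n-2}\sum_{k} d_{ik}, \qquad d'_{ij} = d_{ij} - (r_i + r_j)
$$

The pair with the smallest $d'_{ij}$ is joined. Multiplying $d'_{ij}$ by $(n-2)$ gives the widely used criterion $Q_{ij} = (n-2)\,d_{ij} - \sum_k d_{ik} - \sum_k d_{jk}$, so both forms select the same pair. In this notation the branch length from taxon $i$ to the new internal node $u$ is $d_{iu} = \tfrac{1}{2} d_{ij} + \tfrac{1}{2}(r_i - r_j)$, and $d_{ju} = d_{ij} - d_{iu}$.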

This algorithm does not assume a molecular clock and adjusts for rate variation among branches. It begins with an unresolved star-like tree. Each pair of taxa is evaluated for being joined, and the sum of all branch lengths of the resulting tree is calculated. The pair that yields the smallest sum is considered the closest neighbors and is therefore joined. A new branch is inserted between the joined pair and the rest of the tree, and the branch lengths are recalculated. This process is repeated until only one terminal is present. The NJ method is comparatively rapid and generally gives better results than the UPGMA method. However, it produces only one tree and neglects other possible trees, which might be as good as the NJ tree, if not significantly better. Moreover, since errors in distance estimates are exponentially larger for longer distances, under some conditions this method will yield a biased tree.
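The joining loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: it selects pairs with the standard Q criterion, which is equivalent to choosing the pair that minimizes the total branch length, and the function name `neighbor_joining`, the nested-tuple output format, and the example matrix are hypothetical choices made for this sketch.

```python
def neighbor_joining(labels, matrix):
    """Build an unrooted tree by neighbor joining.

    Returns (subtree_1, subtree_2, length_between_them). An internal
    subtree is ((child_1, branch_1), (child_2, branch_2)); a leaf is
    its label.
    """
    n = len(labels)
    nodes = {i: labels[i] for i in range(n)}  # active node id -> subtree
    d = {i: {j: float(matrix[i][j]) for j in range(n) if j != i}
         for i in range(n)}
    next_id = n
    while len(nodes) > 2:
        m = len(nodes)
        total = {i: sum(d[i].values()) for i in nodes}  # row sums
        # Q criterion: the smallest value marks the pair to join.
        i, j = min(
            ((a, b) for a in nodes for b in nodes if a < b),
            key=lambda p: (m - 2) * d[p[0]][p[1]] - total[p[0]] - total[p[1]],
        )
        d_ij = d[i][j]
        # Branch lengths from the joined taxa to the new internal node.
        len_i = 0.5 * d_ij + (total[i] - total[j]) / (2 * (m - 2))
        len_j = d_ij - len_i
        # Distances from the new node to every remaining node.
        d[next_id] = {}
        for k in nodes:
            if k in (i, j):
                continue
            d_uk = 0.5 * (d[i][k] + d[j][k] - d_ij)
            d[next_id][k] = d_uk
            d[k][next_id] = d_uk
        # Remove the joined pair from the reduced matrix.
        for k in nodes:
            d[k].pop(i, None)
            d[k].pop(j, None)
        del d[i], d[j]
        nodes[next_id] = ((nodes.pop(i), len_i), (nodes.pop(j), len_j))
        next_id += 1
    a, b = nodes  # the last two nodes are connected by a single branch
    return nodes[a], nodes[b], d[a][b]


# Hypothetical example distances for five taxa.
labels = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(neighbor_joining(labels, matrix))
```

Unlike the UPGMA case, the two branches leading to a joined pair generally have different lengths, which is how the method accommodates unequal rates. When several pairs tie on the criterion, this sketch simply takes the first one it encounters.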

**Generalized Neighbor Joining**

One of the disadvantages of the NJ method is that it generates only one tree and does not test other possible tree topologies. To overcome this limitation, a generalized NJ method has been developed, in which multiple NJ trees with different initial taxon groupings are generated. The best tree, the one that best fits the actual evolutionary distances, is then selected from this pool of regular NJ trees.
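The selection step can be illustrated with a small sketch. The text does not say how fit is measured, so the sum of squared differences between observed distances and path lengths in the tree is used here purely as an assumption. The function names, the edge-list tree format, and the two hand-written candidate trees are likewise hypothetical; in the generalized method the candidates would come from NJ runs with different initial groupings.

```python
from itertools import combinations


def patristic_distances(edges, leaves):
    """Path lengths between leaves of a tree given as (u, v, length) edges."""
    nodes = {x for u, v, _ in edges for x in (u, v)}
    inf = float("inf")
    dist = {a: {b: (0.0 if a == b else inf) for b in nodes} for a in nodes}
    for u, v, w in edges:
        dist[u][v] = dist[v][u] = w
    # Floyd-Warshall is enough for the small trees used here.
    for k in nodes:
        for a in nodes:
            for b in nodes:
                if dist[a][k] + dist[k][b] < dist[a][b]:
                    dist[a][b] = dist[a][k] + dist[k][b]
    return {(a, b): dist[a][b] for a, b in combinations(leaves, 2)}


def fit_error(edges, observed):
    """Sum of squared differences between observed and tree distances.

    observed: dict keyed by alphabetically ordered pairs, e.g. ("A", "B").
    """
    leaves = sorted({x for pair in observed for x in pair})
    tree_d = patristic_distances(edges, leaves)
    return sum((observed[p] - tree_d[p]) ** 2 for p in observed)


def best_tree(candidates, observed):
    """Pick the candidate tree with the smallest fit error."""
    return min(candidates, key=lambda edges: fit_error(edges, observed))


# Hypothetical observed distances and two candidate topologies.
observed = {("A", "B"): 2, ("C", "D"): 2,
            ("A", "C"): 3, ("A", "D"): 3, ("B", "C"): 3, ("B", "D"): 3}
tree_ab_cd = [("A", "x", 1), ("B", "x", 1), ("C", "y", 1), ("D", "y", 1),
              ("x", "y", 1)]
tree_ac_bd = [("A", "x", 1), ("C", "x", 1), ("B", "y", 1), ("D", "y", 1),
              ("x", "y", 1)]
print(best_tree([tree_ab_cd, tree_ac_bd], observed) is tree_ab_cd)
```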