Genome Assembly

BioCodeKb - Bioinformatics Knowledgebase

Genome assembly is the process of taking a large number of short DNA sequences and putting them back together to create a representation of the original chromosomes from which the DNA originated.

Genome assembly is the computational process of determining the sequence composition of the genetic material (DNA) within the cells of an organism.

A genome assembly is the genome sequence produced after chromosomes have been fragmented, those fragments have been sequenced, and the resulting sequences have been put back together. Each species in Ensembl has a reference genome assembly that is produced by an international genome consortium. Depending on the species, the reference assembly can be compiled from the DNA of one individual, a collection of individuals, a breed, or a strain. An assembly is updated when newly sequenced DNA allows gaps to be filled. It may also be updated when a new assembly algorithm is released. This work is done by external groups, who submit the updated assembly to the INSDC (International Nucleotide Sequence Database Collaboration).

Genome assembly from sequence reads is an automated, algorithm-driven process. DNA sequence assembly programs use the overlaps between reads to place the reads in the correct order. Sequence assembly can be done using one of three approaches:

  • Greedy approach

  • Overlap-layout-consensus (OLC), which looks for a Hamiltonian path through the overlap graph

  • De Bruijn graph, which looks for an Eulerian path through the k-mer graph
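To make the de Bruijn graph approach concrete, here is a minimal sketch in Python. It assumes error-free reads, a single strand, one edge per distinct k-mer, and that an Eulerian path exists. The function names (de_bruijn_edges, eulerian_path, assemble) and the toy reads are hypothetical and chosen only for illustration; real assemblers add error correction, coverage tracking, and repeat handling.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each distinct k-mer is one edge."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:  # assumption: use each distinct k-mer once
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm. Assumes the graph actually has an Eulerian path."""
    if not graph:
        return []
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    # Start at the node with one more outgoing than incoming edge, if there is one.
    start = next(
        (node for node, targets in graph.items() if len(targets) - in_deg[node] == 1),
        next(iter(graph)),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path."""
    path = eulerian_path(de_bruijn_edges(reads, k))
    if not path:
        return ""
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    toy_reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # made-up reads for illustration
    print(assemble(toy_reads, k=4))
```

The choice of k matters: a small k makes more (k-1)-mers repeat, which creates branches in the graph, while a large k requires longer error-free overlaps between reads.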

There are two basic steps of assembly:

  1. the creation of contiguous sequences (contigs) from the sequence overlaps between reads

  2. the end-to-end ordering and orientation of contigs, using the mate-pair information contained in the reads that make up each contig.
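The first step can be illustrated with a minimal greedy sketch in Python: repeatedly merge the pair of sequences with the longest suffix-prefix overlap until no overlap of at least min_len remains. It assumes error-free reads from a single strand. The names overlap and greedy_contigs, the min_len threshold, and the toy reads are hypothetical. The second step, ordering contigs with mate-pair information, is not covered here.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_contigs(reads, min_len=3):
    """Merge the best-overlapping pair until no pair overlaps by >= min_len."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [
            c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)
        ] + [merged]


if __name__ == "__main__":
    toy_reads = ["ATGGCG", "GGCGTG", "CGTGCA", "TTTTAA"]  # made-up reads
    print(greedy_contigs(toy_reads))
```

In this toy input the first three reads chain into one contig, while the fourth shares no sufficient overlap and remains a separate contig. The all-pairs comparison is quadratic in the number of reads, so this is only suitable as a teaching example.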

Applying this strategy to the sequencing and assembly of human DNA was a significant challenge, since more than 45% of the genome sequence is repetitive. Repeats in the human and other mammalian genomes make it difficult to determine accurate overlaps between reads and to place contigs in the correct order and orientation. One solution is to use clone insert libraries whose inserts are larger than the corresponding repeat regions, typically fosmid or BAC libraries, so that a repeat can be spanned by anchoring in the unique sequence on either side of it. A range of insert-size libraries is typically built into the experimental design when sequencing a new genome.
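The reasoning about insert size can be written as a small check: an insert can bridge a repeat only if it is long enough to cover the repeat plus some unique sequence on both sides. The sketch below is an illustration under stated assumptions. The library sizes are rough, commonly quoted figures rather than values from this article, and the anchor_len default and function names are hypothetical.

```python
# Approximate insert sizes in base pairs (assumed, illustrative values only).
LIBRARIES = {"short-insert": 3_000, "fosmid": 40_000, "BAC": 150_000}


def spans_repeat(insert_size, repeat_len, anchor_len=500):
    """True if the insert covers the repeat plus anchor_len of unique
    sequence on each side, so both ends land outside the repeat."""
    return insert_size >= repeat_len + 2 * anchor_len


def usable_libraries(repeat_len, libraries=LIBRARIES, anchor_len=500):
    """Names of the libraries whose inserts can bridge a repeat of this length."""
    return [
        name
        for name, size in libraries.items()
        if spans_repeat(size, repeat_len, anchor_len)
    ]


if __name__ == "__main__":
    for repeat_len in (1_000, 6_000, 100_000):
        print(repeat_len, usable_libraries(repeat_len))
```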

Genome assembly involves taking smaller fragments, called “reads,” and assembling them to form a cohesive unit, called the “sequence.” However, simply assembling all the reads into one contiguous sequence, a “contig,” is not enough. One has to ensure that the assembled sequence actually matches what is present in the cell. Common hurdles include low-coverage areas, false-positive read-read alignments, false-negative alignments, poor sequence quality, polymorphism, and repeated regions of the genome. An even more fundamental concern is the difficulty of determining which of the two DNA strands was reported by the sequencing procedure. Moreover, because many research domains draw conclusions from the sequence itself, a sequence that has not been reported accurately may compromise the resulting analyses.
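The strand problem is often handled by treating a sequence and its reverse complement as equivalent. The sketch below shows one common convention, assumed here rather than taken from this article: pick the lexicographically smaller of the two as the "canonical" form. The function names are hypothetical, and only the bases A, C, G, and T are handled.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq):
    """Strand-independent representative: the smaller of seq and its reverse complement."""
    return min(seq, reverse_complement(seq))


def same_molecule(a, b):
    """True if a and b are the same sequence, possibly read from opposite strands."""
    return canonical(a) == canonical(b)


if __name__ == "__main__":
    read = "ATGC"
    print(reverse_complement(read))
    print(same_molecule(read, reverse_complement(read)))
```

Comparing canonical forms lets an overlap or k-mer lookup match reads regardless of which strand they were sequenced from.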

