High throughput sequencing technologies have become essential in studies on genomics, epigenomics, and transcriptomics. While sequencing information has traditionally been elucidated using a low throughput technique called Sanger sequencing, high throughput sequencing (HTS) technologies are capable of sequencing multiple DNA molecules in parallel, enabling hundreds of millions of DNA molecules to be sequenced at a time. This advantage allows HTS to be used to create large data sets, generating more comprehensive insights into the cellular genomic and transcriptomic signatures of certain diseases and developmental stages.
High-throughput sequencing (HTS) is a more recent alternative to microarray technology. Although it remains more expensive than microarrays, it offers several advantages, including for measuring the factors that affect the regulation of gene expression.
Next-generation sequencing (NGS), also known as high-throughput sequencing, is the catch-all term for a number of modern sequencing technologies. These methods were developed in the new millennium and include Illumina, 454 and Ion Torrent sequencing.
Illumina (Solexa) sequencing
Illumina sequencing works by identifying each DNA base as it is added to a growing nucleic acid chain: each base type emits a distinct fluorescent signal when it is incorporated.
Roche 454 sequencing
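This method is based on pyrosequencing, a technique that detects the pyrophosphate released when polymerase incorporates a nucleotide into a new strand of DNA. The released pyrophosphate drives an enzymatic (luciferase) reaction that emits light, so the signal is chemiluminescent rather than fluorescent.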
Ion Torrent: Proton / PGM sequencing
Ion Torrent sequencing directly measures the H+ ions (protons) released when DNA polymerase incorporates individual bases. It therefore differs from the previous two methods in that it does not measure light.
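Flow-based platforms such as 454 and Ion Torrent offer one nucleotide type at a time, and the signal from each flow (light or released protons) grows with the number of bases incorporated. The following is a minimal sketch of how such signals could be turned into a sequence. The function name, the flow order "TACG" and the signal values are hypothetical and chosen only for illustration; real base callers apply far more elaborate signal correction.

```python
def decode_flows(flow_order, signals):
    """Turn per-flow signal intensities into a base sequence.

    Each flow offers one nucleotide type. A normalized signal close to n
    is read as n consecutive incorporations of that nucleotide.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        # Round to the nearest whole number of incorporated bases.
        count = max(0, int(signal + 0.5))
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical flow order and normalized signals, for illustration only.
sequence = decode_flows("TACG", [1.02, 0.05, 2.10, 0.96, 0.03, 1.08])
print(sequence)  # TCCGA
```

Because the call depends on rounding a continuous signal, long runs of the same base are the hardest case for this kind of chemistry: the difference between a signal of 7 and 8 is proportionally much smaller than between 1 and 2.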
HTS has a higher per-read error rate than Sanger sequencing. In addition, because HTS generates only sequence fragments, a template such as a known genome sequence is typically needed, against which the fragments are aligned.
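The idea of aligning fragments to a known sequence while tolerating sequencing errors can be sketched as follows. This is a deliberately naive, ungapped search that assumes a short reference and a small mismatch budget. The function names, the reference string and the reads are all made up for illustration, and production aligners rely on indexed data structures rather than a linear scan.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align_read(reference, read, max_mismatches=1):
    """Return (position, mismatches) of the best ungapped placement.

    Returns None when no placement stays within max_mismatches.
    Ties are resolved in favour of the leftmost position.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        mismatches = hamming(reference[pos:pos + len(read)], read)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


# Hypothetical reference and reads, for illustration only.
reference = "ACGTTAGCCATGGA"
for read in ["TAGCC", "CATCG", "GGGGG"]:
    print(read, align_read(reference, read))
# TAGCC (4, 0)
# CATCG (8, 1)
# GGGGG None
```

The second read maps despite one miscalled base, which is the behaviour an error-tolerant aligner needs; the third read has no acceptable placement and is reported as unaligned.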
Method of NGS
Sample collection
Patient tissue forms the backbone of personalized medicine research. Samples for analysis may originate from formalin-fixed, paraffin-embedded (FFPE) or fresh-frozen samples. With FFPE, sample quality can be compromised by RNA degradation, leading to HTS library construction failure. Microarray platforms have been developed to reliably quantify transcription from FFPE samples.
Sample heterogeneity
Once a sample has been taken from tissue, its composition can be affected by heterogeneity. In tumour samples, for example, signals may originate from multiple cell types, including stroma and immune compartments. This composition varies across samples and has implications for biomarker development, with the potential to confound results. At a bioinformatics level, in silico optimization and/or gene list-based approaches have been applied to separate the signals into their respective cell types, a process termed deconvolution. Once the data are stratified into separate cell-type components, standard downstream analyses can follow.
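As a rough illustration of optimization-based deconvolution, the sketch below estimates the fraction of one cell type in a mixture of two. It assumes that reference expression profiles for both cell types are already known and that the mixture is a linear blend of them. These are simplifying assumptions, and the function name and all values are hypothetical. Published methods handle many cell types and noisy references.

```python
def estimate_fraction(mixture, profile_a, profile_b):
    """Least-squares estimate of the fraction of cell type A.

    Assumed model, per gene g:
        mixture[g] ~ p * profile_a[g] + (1 - p) * profile_b[g]
    The closed-form solution for p is clipped to the range [0, 1].
    """
    numerator = sum(
        (m - b) * (a - b) for m, a, b in zip(mixture, profile_a, profile_b)
    )
    denominator = sum((a - b) ** 2 for a, b in zip(profile_a, profile_b))
    if denominator == 0:
        raise ValueError("reference profiles are identical; fraction is not identifiable")
    return min(1.0, max(0.0, numerator / denominator))


# Hypothetical expression values for four genes, for illustration only.
tumour = [10.0, 2.0, 8.0, 1.0]
stroma = [2.0, 9.0, 1.0, 6.0]
mixture = [0.7 * t + 0.3 * s for t, s in zip(tumour, stroma)]

p = estimate_fraction(mixture, tumour, stroma)
print(round(p, 3))  # 0.7
```

Genes whose expression differs most between the two reference profiles contribute most to the estimate, which is one reason gene list-based approaches focus on cell-type-specific marker genes.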
Platform choice
While sample type considerations may impact on platform choice, an overall assessment of an HTS platform’s abilities, relative strengths and weaknesses, from biological, clinical and bioinformatics perspectives, will facilitate the appropriate application of the resultant data. Recent platform examples include the Illumina MiSeq, Ion PGM (Personal Genome Machine), the PacBio RS II and Qiagen Gene Reader (Sequencing-By-Synthesis).
Library preparation
Once a suitable platform has been selected, the next step is library preparation: the conversion of nucleic acid material derived from tissue or other sources into a form suitable for sequencing input. This step is key but potentially challenging, with both biological and bioinformatics implications. Amplification of libraries by polymerase chain reaction (PCR) is prone to introducing bias; although PCR-free methods exist, these too are not challenge-free. Library preparation methods are crucial when only small amounts of DNA can be obtained from clinical samples.
Sequencing
There are different HTS approaches depending on the choice of platform, each of which uses bespoke protocols. As such, the output data from different HTS workflows and platforms can vary. Although this is primarily a bioinformatics issue, both biologists and clinicians need to be aware of how different protocols can impact results. Within a clinical diagnostic context, the accuracy, reproducibility and standardization of HTS results can be improved by focusing on the development of reference standards.
Data analysis and interpretation
As the size and complexity of HTS data increase, new analytical methods are required, with optimization for speed and memory usage being key. The recruitment of skilled bioinformaticians, who can develop and manage the most appropriate tools and work within a multidisciplinary context, is crucial. Training in the use of HTS technologies, and the standardization of that training, is therefore also key, as recognized by the NGS Trainer Consortium.
Analytical/computational challenges
HTS data sets are both high-dimensional and complex in structure. Integrating such data with other data sets, platforms or technologies, to obtain a complete disease profile, is therefore both algorithmically and computationally challenging. A comprehensive review of meta-omics (integration of independent data sets at the same omics level) and poly-omics (integration of different omics types) algorithmic approaches is presented in Ma and Zhang. Poly-omics projects such as TCGA have applied consensus-based methods to detect connecting patterns between different omics levels, e.g. Cluster of Cluster Assignments