In the area of DNA sequencing, the FASTQ file format has emerged as a de facto common format for data exchange between tools. It provides a simple extension to the FASTA format. It is a very minimal representation of a sequencing read: nothing about the relative levels of the four nucleotides is captured, nor does it attempt in any way to deal with flow or colour space data as in ABI SOLiD.
FASTQ has emerged as a common file format for sharing sequencing read data, combining both the sequence and an associated per-base quality score, despite lacking any formal definition to date and existing in at least three incompatible variants.
It was originally developed at the Wellcome Trust Sanger Institute to bundle a FASTA-formatted sequence and its quality data, but has recently become the de facto standard for storing the output of high-throughput sequencing instruments such as the Illumina Genome Analyzer. FASTQ files are mostly used to store short-read data from high-throughput sequencing experiments. They can contain millions of entries and can be many megabytes or gigabytes in size, which often makes them too large to open in a normal text editor.
The FASTQ format employs the standard IUB/IUPAC conventions for encoding protein or nucleic acid sequences as alphabetic characters.
FASTQ format is a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores. Each sequence letter and each quality score is encoded with a single ASCII character for brevity. The Phred quality score QPHRED of an individual nucleotide base expresses the estimated probability Pe that the base call is incorrect, on a logarithmic scale: QPHRED = -10 × log10(Pe).
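The standard Phred relationship, Q = -10 × log10(Pe), can be illustrated with a minimal Python sketch. The function names phred_to_error_prob and error_prob_to_phred are hypothetical and chosen only for this example.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Return the error probability Pe implied by a Phred score Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Return the Phred score Q for an error probability Pe (0 < Pe <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10 * math.log10(p)
```

Under this definition a score of 20 corresponds to roughly a 1 in 100 chance that the base call is wrong, and a score of 30 to roughly 1 in 1000.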
Phred scores are now a de facto standard for representing sequencing base qualities. The Phred software reads DNA sequencing trace files, calls bases and assigns a quality value to each.
On Illumina sequencing platforms, the quality scores are first written to binary base call (BCL) files, which are later converted to FASTQ files using the bcl2fastq tool. The quality score is an integer (Q), typically in the range 2 to 40, but higher and lower values are sometimes used.
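Because each integer score is stored as a single ASCII character, reading the scores back means subtracting a fixed offset from each character code. The sketch below assumes an offset of 33, as used by the Sanger variant; other variants use a different offset, so the offset is kept as a parameter. The function name decode_qualities is hypothetical.

```python
from typing import List


def decode_qualities(quality_string: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string into a list of integer scores.

    The default offset of 33 is an assumption (Sanger-style encoding);
    pass a different value for files written in another variant.
    """
    scores = [ord(ch) - offset for ch in quality_string]
    if any(score < 0 for score in scores):
        raise ValueError("character below the assumed offset; wrong variant?")
    return scores
```

With the assumed offset, the characters "!", "5" and "I" decode to the scores 0, 20 and 40.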
The original Sanger FASTQ files also allowed the sequence and quality strings to be wrapped (split over multiple lines), but this is generally discouraged as it can make parsing complicated due to the unfortunate choice of "@" and "+" as markers (these characters can also occur in the quality string).
A FASTQ file normally uses four lines per sequence.
Line 1 begins with a ‘@’ character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters (nucleotide bases A, T, G, C, or N for an uncalled base).
Line 3 begins with a ‘+’ character that is the sequence separator and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, using one character per base for space efficiency, and must contain the same number of symbols as there are letters in the sequence.
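The four-line layout described above can be read with a short Python sketch. It assumes unwrapped records (exactly four lines each) and is not a general parser for wrapped files; the function name read_fastq is hypothetical.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (title, sequence, quality) for each four-line FASTQ record.

    Assumes sequence and quality strings are not wrapped across lines.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError("record title line must start with '@'")
        if not separator.startswith("+"):
            raise ValueError("separator line must start with '+'")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


# Usage with an in-memory example; a real file would be opened with open().
example = io.StringIO("@read1 sample description\nACGTN\n+\nIIII!\n")
for title, sequence, quality in read_fastq(example):
    print(title, len(sequence), len(quality))
```

Reading by position rather than searching for ‘@’ or ‘+’ avoids the ambiguity that arises because both characters can also appear in the quality string.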
There is no standard file extension for a FASTQ file, but .fq and .fastq are commonly used.