Bioinformatics at COMAV

NGS file formats

There are lots of file formats related to NGS analyses. The most common ones are:

[Figures: ngs_map_read_file_formats.png, ngs_file_formats.png]

Sequence file formats

Sequence formats differ in the information they store about each sequence. The most common file formats in the NGS world are fastq and sff.


The SFF (Standard Flowgram Format) files are the 454 equivalent of the ABI chromatogram files. They hold information about:

  • the flowgram,
  • the called sequence,
  • the qualities,
  • and the recommended quality and adaptor clipping.

These recommended clippings are given by the 454 sequencer: the Roche software takes into account the quality and the adaptor sequence to recommend a clipping for each sequence. Like the ABI files, SFF files are binary and have to be opened with specialized programs. There are several tools to extract the sequences and convert them to a more usable format. Roche provides an executable for this with the 454 machine. Alternatively, we can use the sff_extract tool to obtain a fasta file.


The fasta format is a simple text format. Each sequence starts with a “>” followed by the sequence name, a space and, optionally, the description. The sequence itself follows on the next lines.

>seq_1 description
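As an illustration of the layout described above, here is a minimal Python sketch of a fasta reader. The function name read_fasta and the example sequences are invented for this sketch; it is not the parser used by seq_crumbs.

```python
import io


def read_fasta(handle):
    """Yield (name, description, sequence) tuples from a fasta text stream."""
    name, description, chunks = None, "", []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, description, "".join(chunks)
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            # The sequence may be wrapped over several lines.
            chunks.append(line)
    if name is not None:
        yield name, description, "".join(chunks)


# Made-up example content, only to show the call.
example = io.StringIO(">seq_1 description\nACGT\nTTGA\n>seq_2\nGGCC\n")
for record in read_fasta(example):
    print(record)
```

The name is everything up to the first space of the header line and the description is whatever follows it.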

If we have quality information, it is usually provided in a second fasta-like file. In these cases the sequence and the quality files should have the sequences in the same order.

>seq_1 description
54 57 54 57 48 48 48 48 57 57 57 47 47 41 42 41 47 57 57 57 57 47 44 44 44 44 50 50
54 57 57 46 43 37 44 43 57 37 37 37 57 57 57 57 52 52 52 52 57 50 47 47 52
52 47 52 52 50 50 50 50 50 57 57 54 57 57 57 57 57 57 57 46 46 57 57 57 57 57 57 57
57 57 57 57 57 57 57 57 57 57 57 57 29 29
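The "same order" requirement can be checked while reading both files. The following Python sketch assumes the quality file holds space-separated integers under fasta-like headers, as in the example above. The names read_qual and pair_seq_and_qual are hypothetical and the sequences are invented.

```python
import io


def read_qual(handle):
    """Yield (name, [scores]) from a fasta-like quality stream."""
    name, scores = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, scores
            parts = line[1:].split()
            name = parts[0] if parts else ""
            scores = []
        elif name is not None:
            # Scores may be wrapped over several lines.
            scores.extend(int(value) for value in line.split())
    if name is not None:
        yield name, scores


def pair_seq_and_qual(seqs, quals):
    """Join (name, sequence) and (name, scores) items that come in the same order.

    Note that zip() stops at the shorter input, so a difference in the
    number of records is not detected by this sketch.
    """
    for (seq_name, seq), (qual_name, scores) in zip(seqs, quals):
        if seq_name != qual_name:
            raise ValueError(f"order mismatch: {seq_name} vs {qual_name}")
        if len(seq) != len(scores):
            raise ValueError(f"length mismatch in {seq_name}")
        yield seq_name, seq, scores


quals = list(read_qual(io.StringIO(">seq_1 description\n54 57 54\n57 48\n")))
print(list(pair_seq_and_qual([("seq_1", "ACGTA")], quals)))
```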

sanger fastq

The fastq format was developed to provide a convenient way of storing the sequence and the quality scores in the same file. These are text files and they look like:


In this file every sequence has 4 lines. The first line holds the sequence name after the symbol “@” and, optionally, the description. The second line has the sequence, the third line starts with “+” and the fourth line has the quality scores encoded as letters, one letter per nucleotide.
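The four-line layout can be read with a few lines of Python. This is a minimal sketch that assumes unwrapped records (exactly four lines each); read_fastq is a hypothetical name and the record in the example is invented.

```python
import io


def read_fastq(handle):
    """Yield (name, description, sequence, quality) from a 4-line fastq stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        sequence = handle.readline().strip()
        plus = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed fastq record: {header}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence and quality differ in length: {header}")
        parts = header[1:].split(None, 1)
        name = parts[0] if parts else ""
        description = parts[1] if len(parts) > 1 else ""
        yield name, description, sequence, quality


# Made-up record, only to show the call.
example = io.StringIO("@seq_1 description\nACGT\n+\nIIII\n")
print(list(read_fastq(example)))
```

Checking that the sequence and the quality lines have the same length catches most truncated or corrupted files early.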

Quality coding (modified from wikipedia).

|                         |    |        |                              |                     |
33                        59   64       73                            104                   126

S - Sanger        Phred+33,  raw reads typically (0, 40)
X - Solexa        Solexa+64, raw reads typically (-5, 40)
I - Illumina      Phred+64,  raw reads typically (0, 40)
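The offsets in the table translate directly into code: the score of a nucleotide is the ASCII code of its quality character minus the offset (33 or 64). A small Python sketch, with function names chosen for this example:

```python
def decode_qualities(quality_string, offset=33):
    """Convert quality characters to integer scores (offset 33 = Sanger)."""
    return [ord(char) - offset for char in quality_string]


def encode_qualities(scores, offset=33):
    """Convert integer scores back to quality characters."""
    return "".join(chr(score + offset) for score in scores)


def phred_to_error_probability(score):
    """Error probability of a Phred score (not valid for Solexa scores)."""
    return 10 ** (-score / 10)


print(decode_qualities("!I"))             # Sanger, offset 33
print(decode_qualities("@h", offset=64))  # Illumina, offset 64
print(decode_qualities(";", offset=64))   # Solexa, can be negative
```

The same character therefore means a very different quality depending on the encoding: "I" is a score of 40 in a Sanger file but only 9 in an Illumina file.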

Illumina fastq

This file is almost identical to a sanger fastq file, but the encoding for the quality scores is different. When we deal with a fastq file we have to be sure which kind of file it is, an illumina fastq or a sanger fastq. Unfortunately they are not easy to tell apart. We also have to take into account that Solexa used to have a third fastq format, the solexa fastq, although this one is mostly obsolete. Recently Illumina has also decided to distribute its files as Sanger fastq, so the Illumina fastq will not be used any more.

One of the seq_crumbs utilities, guess_seq_format, is able to differentiate the Sanger from the Illumina version by looking for quality characters exclusive to the Sanger version.
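The idea of looking for characters exclusive to one encoding can be sketched in Python. This is only an illustration of the heuristic, not the actual guess_seq_format implementation, and the thresholds are taken from the quality coding table above (59 is the lowest Solexa character, 64 the lowest Illumina one).

```python
def guess_fastq_encoding(quality_strings):
    """Guess the quality encoding from the lowest quality character seen.

    Characters below ASCII 59 can only appear in Sanger files. The other
    two answers are guesses: a Sanger file with only high qualities would
    be misclassified, so many reads should be inspected.
    """
    lowest = None
    for quality in quality_strings:
        for char in quality:
            code = ord(char)
            if lowest is None or code < lowest:
                lowest = code
    if lowest is None:
        raise ValueError("no quality characters given")
    if lowest < 59:
        return "sanger"
    if lowest < 64:
        return "solexa"
    return "illumina"


print(guess_fastq_encoding(["#1=DDFFFHHHHHIJJJJJ"]))
```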


SRA is the file format in which all NCBI SRA content is provided. SRA files are binary files and we need specific tools to extract the information. NCBI has developed a toolkit (SRA Toolkit) to deal with these binary files.

Compressed files

Sometimes these sequence text files are compressed to save hard drive space. The most common compression formats are gzip and bgzip. bgzip is a gzip variant commonly used in genomics because, although its compression ratio is a little lower, it allows random access. Most software is becoming compatible with these formats.
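In our own scripts we can support compressed and plain files with the same code by looking at the first two bytes of the file. A minimal Python sketch follows; open_maybe_gzip is a hypothetical helper name. Since bgzip output is a series of regular gzip blocks, the standard gzip module should read it as well, although without the random access.

```python
import gzip


def open_maybe_gzip(path):
    """Open a text file for reading, decompressing it if it is gzipped."""
    with open(path, "rb") as raw:
        magic = raw.read(2)
    # Every gzip (and bgzip) file starts with the bytes 0x1f 0x8b.
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")
```

A call such as open_maybe_gzip("pairs_1.fastq.gz") returns a text handle that can be iterated line by line, exactly like the handle of the uncompressed file.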

Paired files

It is common to obtain two reads from a single molecule. Examples of these techniques are the Illumina paired-ends and mate-pairs. In these cases each read has a paired read. One common way to store those paired reads is to create two fastq files, one for the first read of each pair and another one for the second. In this case the files should hold the reads in exactly the same order.

Fastq file 1
@molecule_1 1st_read_from_pair
@molecule_2 1st_read_from_pair
@molecule_3 1st_read_from_pair

Fastq file 2
@molecule_1 2nd_read_from_pair
@molecule_2 2nd_read_from_pair
@molecule_3 2nd_read_from_pair

Another option is to interleave the reads in a single file alternating the first and second read for each pair.

Interleaved Fastq file
@molecule_1 1st_read_from_pair
@molecule_1 2nd_read_from_pair
@molecule_2 1st_read_from_pair
@molecule_2 2nd_read_from_pair
@molecule_3 1st_read_from_pair
@molecule_3 2nd_read_from_pair

Depending on the software that we want to use, we should use the interleaved or the two-file version. In seq_crumbs there are programs to convert between one option and the other.
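The conversion between both layouts is simple enough to sketch in Python. The functions below work on already parsed reads, represented here as (name, description) tuples; they only illustrate the logic and are not the seq_crumbs code.

```python
from itertools import zip_longest


def interleave(reads_1, reads_2, check_names=True):
    """Alternate the reads of two files that are in the same order."""
    for first, second in zip_longest(reads_1, reads_2):
        if first is None or second is None:
            raise ValueError("the two files hold a different number of reads")
        # The first item of each read is its name.
        if check_names and first[0] != second[0]:
            raise ValueError(f"reads are not paired: {first[0]} vs {second[0]}")
        yield first
        yield second


def deinterleave(reads):
    """Split interleaved reads into two lists, first and second reads."""
    reads_1, reads_2 = [], []
    iterator = iter(reads)
    for first in iterator:
        second = next(iterator, None)
        if second is None:
            raise ValueError("odd number of reads in the interleaved input")
        reads_1.append(first)
        reads_2.append(second)
    return reads_1, reads_2


file_1 = [("molecule_1", "1st_read_from_pair"), ("molecule_2", "1st_read_from_pair")]
file_2 = [("molecule_1", "2nd_read_from_pair"), ("molecule_2", "2nd_read_from_pair")]
for read in interleave(file_1, file_2):
    print(read)
```

The name check is what protects us from interleaving two files that are not in the same order.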

Practical tasks:

Transforming between formats

1.- SFF to fastq. We have received an SFF file from a sequencing provider and we need to extract the sequence and quality. Download the file and store it in a folder named practice. We could use sffinfo from the 454 software or sff_extract from seq_crumbs.

ngs_user@ngsmachine:~$ sff_extract -h
usage: sff_extract [-h] [-o OUTPUT] [-c] [--min_left_clip MIN_LEFT_CLIP]
                 [--max_percentage MAX_PERCENT] [--version]
                 [input [input ...]]

SFF binary file reads extractor.

positional arguments:
  input                 SFF input files to process

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        output file (default STDOUT)
  -c, --clip            Do recommended trims
  --min_left_clip MIN_LEFT_CLIP
                        Clip at least this number of nucleotides (default: 0)
  --max_percentage MAX_PERCENT
                        nucleotide abundance to consider a SFF file as
                        "strange" (default: 50.0)
  --version             show program's version number and exit

Now that we know that sff_extract is working, we can use it to extract the sequence and quality. Open a new terminal and go to the practice directory:

ngs_user@ngsmachine:~$ mkdir -p practice
ngs_user@ngsmachine:~$ cd practice/
ngs_user@ngsmachine:~/practice$ pwd
ngs_user@ngsmachine:~/practice$ ls

Now we can extract the reads:

ngs_user@ngsmachine:~/practice$ sff_extract -o 10_454_reads.fastq 10_454_reads.sff

WARNING: weird sequences in file 10_454_reads.sff

After applying left clips too many reads start with:

This does not look sane.

Countermeasures you *probably* must take:
1) Make your sequence provider aware of that problem and ask whether this can be
    corrected in the SFF.
2) If you decide that this is not normal and your sequence provider does not
    react, use the --min_left_clip of sff_extract.
    (Probably '--min_left_clip 5' but you should cross-check that)

At this point a fastq file should be located in the practice directory. Also sff_extract has warned us that too many reads start with the same sequence, in this case an “A”. We could remove an extra base to solve that problem:

ngs_user@ngsmachine:~/practice$ sff_extract -o 10_454_reads.fastq --min_left_clip 5 10_454_reads.sff

Now the sequences should have an extra base masked. If we want we could trim these bases instead of masking them:

ngs_user@ngsmachine:~/practice$ sff_extract -o 10_454_reads.fastq --min_left_clip 5 --clip 10_454_reads.sff
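The difference between masking and trimming can be shown with a small Python sketch. It assumes that masked nucleotides are written in lowercase, which is a common convention but should be checked against the real sff_extract output. The function name and the sequence are invented for the example.

```python
def apply_left_clip(sequence, quality, left_clip, trim=False):
    """Mask (lowercase) or trim the first left_clip nucleotides of a read."""
    if left_clip < 0:
        raise ValueError("left_clip must not be negative")
    if trim:
        # Trimming removes the nucleotides and their qualities.
        return sequence[left_clip:], quality[left_clip:]
    # Masking keeps the read length and only marks the clipped region.
    return sequence[:left_clip].lower() + sequence[left_clip:], quality


print(apply_left_clip("TCAGACGT", "IIIIIIII", 5))
print(apply_left_clip("TCAGACGT", "IIIIIIII", 5, trim=True))
```

Masking keeps all the information in the file, so the decision can still be changed later, while trimming produces reads that any downstream program can use directly.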

Working with paired reads

Download the paired reads file (pairs.tar.gz).

Uncompress the files:

ngs_user@ngsmachine:~/$ tar zxvf pairs.tar.gz

ngs_user@ngsmachine:~/$ cd pairs

ngs_user@ngsmachine:~/pairs$ gunzip pairs_1.fastq.gz
ngs_user@ngsmachine:~/pairs$ gunzip pairs_2.fastq.gz
ngs_user@ngsmachine:~/pairs$ ls
pairs_1.fastq  pairs_2.fastq

Imagine you need an interleaved pair file to start assembling your reads. In seq_crumbs we have a couple of binaries to deal with this: interleave_pairs and deinterleave_pairs:

ngs_user@ngsmachine:~/pairs$ interleave_pairs  -h
usage: interleave_pairs [-h] [-t IN_FORMAT] [-o OUTPUT] [--version]
                        [-z  | -Z  | -B ] [-s]
                        [input [input ...]] input input

positional arguments:
  input                 Sequence input files to process (default STDIN)
  input                 Sequence input files to process.

optional arguments:
  -h, --help            show this help message and exit
  -t IN_FORMAT, --in_format IN_FORMAT
                        Format of the input files (default: guess)
  -o OUTPUT, --outfile OUTPUT
                        Sequence output file (default: STDOUT)
  --version             show program's version number and exit
  -z , --gzip           Compress the output in gzip format
  -Z , --bgzf           Compress the output in bgzf format
  -B , --bzip2          Compress the output in bzip2 format
  -s, --skip_checks     Skip the pair read name correspondence checking.

ngs_user@ngsmachine:~/pairs$ interleave_pairs pairs_1.fastq pairs_2.fastq -o pairs.fastq

ngs_user@ngsmachine:~/pairs$ head -n 20 pairs.fastq
@HWI-ST1203:122:C130PACXX:4:1101:1845:1988 1:N:0:CAGATC
#1=DDFFFHHHHHIJJJJJBEHHGICIIDCEHHHEIFC?  @B@0:@578BD###################################################
@HWI-ST1203:122:C130PACXX:4:1101:1845:1988 2:N:0:CAGATC
#1=DDFFFHHHHHIJJJJJBEHHGICIIDCEHHHEIFC?  @B@0:@578BD###################################################
@HWI-ST1203:122:C130PACXX:4:1101:2114:1985 1:N:0:CAGATC
@HWI-ST1203:122:C130PACXX:4:1101:2114:1985 2:N:0:CAGATC

Now split the same file into two separate files:

ngs_user@ngsmachine:~/pairs$ deinterleave_pairs -o new_pairs_1.fastq new_pairs_2.fastq  pairs.fastq

Finally you can compress the paired files:

ngs_user@ngsmachine:~/pairs$ gzip new_pairs_1.fastq
ngs_user@ngsmachine:~/pairs$ gzip new_pairs_2.fastq

SRA toolkit

Download the sra file (file info) and use fastq-dump to convert it to fastq format.

First, you can run the program with the default options and look at the output file:

ngs_user@ngsmachine:~$ fastq-dump SRR2970642.sra

Is this useful? Taking into account that the reads are paired, can we improve the conversion?

ngs_user@ngsmachine:~$ fastq-dump --split-files --defline-qual "+" --defline-seq '@$sn/$ri' --gzip SRR2970642.sra