Bioinformatics at COMAV

Practical tasks

Determine the platformn and coding

Analyze the format and quality for each of the read files and determine the sequencing platform and version.


You might use guess_seq_format and calculate_stats from seq_crumbs or fastqc. Take into account that the solid reads might be double encoded.

Clean adapters and contaminants

We have some 454 and Illumina reads that might have adapters in them due to the cloning process. Do they really have adapters? These reads also come from a RNA-Seq experiment and might have ribosomic RNA that we should filter out before doing any further analysis. Try to assess and solve these problems with fastqc and seq_crumbs.


Evaluate the quality, look for overrepresented k-mers and take a look at the base composition of the first bases of the reads with fastqc or seq_crumbs.

To remove the adaptor you could use trim_blast_short (from seq_crumbs), cutadapt, scythe or fastx.

To trim the quality you could use trim_quality (from seq_crumbs), fastx, sequence cleaner, or sickle.

To filter the small reads you could use filter_by_length (from seq_crumbs) and to remove the ribosomic reads filter_by_blast (seq_crumbs).

Split 454 matepairs

We have received some 454 titanium mate pair reads. First we need to split both ends of the mate pair for each reads and then we have to trim using a quality criteria, filter out the small read segments (smaller than 40 base pairs). Finally we need to obtain two files with the interleaved forward and reverse fragments of the reads that still have both and another file with the orphaned segments.


In seq_crumbs there are scripts for all steps.

Mithocondrial assembly

We have recived 20000 human mithocondrial reads from a yoruba individual. The reads are Illumina paired end and we want to assemble them:

  1. Reads Quality Assessment.

  2. Clean the reads.

  3. Assemble the reads with velvet.

  4. Calculate statistics for the result.

  5. Take a look at the contigs.


We have already seen how to assemble a mithocondrial genome using velvet.

Map to the mithocondrial assembly

As the result of the mithocondrial assembly that we have just done with we obtained a fasta file with the scaffolds. Map the reads used to create the assembly using bwa to the scaffolds. Once you have the mapping note how many reads had been mapped and open the mapping IGV.

Human Mithocondrial SNP calling

We want to compare the mithocondrial DNA of a yoruba individual with the reference cambrigde mithocondrial sequence. We have 40000 Illumnia reads. Do two SNP callings one with bowtie2 and samtools and another with ngs_backbone and compare them.


Why ngs_backbone does not find any SNP at the first attempt? Modify the backbone.conf parameters related with the quality and look for the SNPs again (e.j. min_quality=21, min_reads_for_allele=10).

You can compare the SNP callings with bedtools or grep.

Take a look at the SNPs with IGV.

Obtain the SNP flanking sequence

Search for the common SNPs of the two previous SNP calls and create a fasta file with 100 pair bases around each SNP.


You can use the intersectBed, flankBed and fastaFromBed tools.

Cucurbita EST SNP calling

We have ordered two 454 and two Illumina reads from a Cucurbita transcriptome. The 454 sequences came from two individuals (indi1 and indi2) and the Illumina came from a mix of different individuals. The sequencing service has send the reads in sanger fastq format.

  1. Do a quality check of the reads provided by the sequencing service.

  2. Clean the reads and check the cleaning process.

  3. Map the reads.

  4. Call the SNPs.

There a walkthroughs for some of these tasks.