Procedural and OOP excersises
DNA sequence: procedural
Procedural approach
Create functions to:
- calculate sequence length
- get the fastq and fasta representations of a sequence
- reverse and complement a sequence.
The sequence should be a tuple with two (name, and sequence) or three (name, sequence, and quality) fields.
Here you have a test suite that should pass once you have implemented the functionality:
```{python}
import unittest
class SequenceTest(unittest.TestCase):
def test_seq_len(self):
seq = ("seq1", "ATGC")
assert calc_len(seq) == 4
seq = ("seq2", "ATGC", [10, 20, 30, 40])
assert calc_len(seq) == 4
seq = ("seq2", "ATGC", [10, 20, 30])
try:
assert calc_len(seq) == 4
self.fail("Should have raised a ValueError")
except ValueError:
pass
def test_to_fastq(self):
seq = ("seq1", "ATGC")
try:
assert to_fastq(seq)
self.fail("Should have raised a RuntimeError")
except RuntimeError:
pass
seq = ("seq2", "ATGC", [10, 20, 30, 40])
assert to_fastq(seq) == "@seq2\nATGC\n+\n+5?I\n"
def test_to_fasta(self):
seq = ("seq1", "ATGC")
assert to_fasta(seq) == ">seq1\nATGC\n"
seq = ("seq2", "ATGC", [10, 20, 30, 40])
assert to_fasta(seq) == ">seq2\nATGC\n"
def test_reverse(self):
seq = ("seq1", "aTGC")
seq = reverse(seq)
assert seq[0] == "seq1_rev"
assert seq[1] == "GCAt"
seq = ("seq2", "ATGC", [10, 20, 30, 40])
seq = reverse(seq)
assert seq[1] == "GCAT"
assert seq[2] == [40, 30, 20, 10]
if __name__ == "__main__":
unittest.main()
```
Object oriented approach
Create a Sequence class with the following properties:
- name
- sequence
- quality, this should be an array of integers from 0 to 60. Quality should be optional, it could be None.
The class should have the methods to calculate:
- the length of the sequence
- to create a reverse and complementary sequence
- to_fastq: it returns a str with the fastq representation of the sequence
- to_fasta: it returns a str with the fasta representation of the sequence
Here you have a test suite that should pass once you have implemented the functionality:
```{python}
import unittest
class SequenceTest(unittest.TestCase):
def test_seq_creation(self):
seq = Seq("seq1", "ATGC")
assert seq.name == "seq1"
assert seq.seq == "ATGC"
assert seq.qual is None
seq = Seq("seq2", "ATGC", [10, 20, 30, 40])
assert seq.name == "seq2"
assert seq.seq == "ATGC"
assert seq.qual == [10, 20, 30, 40]
try:
Seq("seq3", "ATGC", [10, 20, 30])
self.failed("Should have raised a ValueError")
except ValueError:
pass
try:
Seq("ATGC")
self.failed("Should have raised a TypeError")
except TypeError:
pass
def test_seq_len(self):
seq = Seq("seq1", "ATGC")
assert len(seq) == 4
seq = Seq("seq2", "ATGC", [10, 20, 30, 40])
assert len(seq) == 4
def test_to_fastq(self):
seq = Seq("seq1", "ATGC")
try:
assert seq.to_fastq()
self.fail("Should have raised a RuntimeError")
except RuntimeError:
pass
seq = Seq("seq2", "ATGC", [10, 20, 30, 40])
assert seq.to_fastq() == "@seq2\nATGC\n+\n+5?I\n"
def test_to_fasta(self):
seq = Seq("seq1", "ATGC")
assert seq.to_fasta() == ">seq1\nATGC\n"
seq = Seq("seq2", "ATGC", [10, 20, 30, 40])
assert seq.to_fasta() == ">seq2\nATGC\n"
def test_reverse(self):
seq = Seq("seq1", "aTGC")
seq = seq.reverse()
assert seq.name == "seq1_rev"
assert seq.seq == "GCAt"
seq = Seq("seq2", "ATGC", [10, 20, 30, 40])
seq = seq.reverse()
assert seq.seq == "GCAT"
assert seq.qual == [40, 30, 20, 10]
if __name__ == "__main__":
unittest.main()
```
Hints
You can use the following dictionary to complement the IUPAC nucleotides:
```{python}
COMPLEMENT_NUCL = {
"A": "T",
"T": "A",
"C": "G",
"G": "C",
"N": "N",
"Y": "R",
"R": "Y",
"S": "S",
"W": "W",
"K": "M",
"M": "K",
"B": "V",
"V": "B",
"D": "H",
"H": "D",
}
```
Solutions
You can get a working solution for the OOP and procedural approaches.
Functional programming excersises
Fastq statistics
Write a program capable of calculating statistics of a fastq file. The statistics to calculate should be:
- number of sequences
- sequence length distribution
- mean quality per sequence distribution
- %GC per sequence distribution
Take into account that the fastq files are usually very big, so write the program using a generator to read the sequences.
The program should be able to read:
The result should be written in a json file. You can download an example fastq file in here.
Hints
It is faster to read the file in binary mode than in text mode.
```{python}
if fpath == "-":
fastq_fhand = sys.stdin.buffer
else:
fpath = Path(fpath)
if not fpath.exists():
print(f"File {fpath} does not exist")
sys.exit(1)
fastq_fhand = open(fpath, "rb")
fastq_fhand = io.BufferedReader(fastq_fhand)
```
To know if a file is gzipped check its first two bytes, the magic number of a gzipped file is:
```{python}
gzip_magic_number = b"\x1f\x8b"
```
You can peek (read without advancing the pointer of the file) if you wrap the file handler in a io.BufferedReader.
```{python}
if not isinstance(fhand, io.BufferedReader):
fhand = io.BufferedReader(fhand)
two_bytes = fastq_fhand.peek(2)[:2]
```
To open gzipped files use the gzip module.
```{python}
fhand = gzip.open(fhand, "rb")
```
Python has a module in the standar library to read and write json files.
Test suite
```{python}
import unittest
class TestFastqStats(unittest.TestCase):
def test_calc_fastq_stats(self):
fastq = b"@seq1\nATGC\n+\nIIII\n"
fastq_fhand = io.BytesIO(fastq)
out_stats_fhand = io.StringIO()
calc_fastq_stats(fastq_fhand, out_stats_fhand)
out_stats = json.loads(out_stats_fhand.getvalue())
self.assertEqual(out_stats["num_seqs"], 1)
self.assertEqual(out_stats["seq_len_distrib"], {"4": 1})
self.assertEqual(out_stats["gc_distrib"], {"50": 1})
self.assertEqual(out_stats["mean_qual_distrib"], {"40": 1})
```
Solution
If you want, you could take a peak of a possible solution that uses a generator to read the sequences.
Count Kmers
Write a program that, given a fastq file, counts the kmers found in the DNA sequences.
It should write down two files:
- A json file with the kmer abundance distribution.
- A text file with every kmers
The text file should include every kmer with a count higher than 1. In each line there should be the kmer sequence and the number of times that the kmer has been found in the sequences. The kmer distribution abundance refers to the number of kmers that have been found 1, 2, 3 o n number of times. For instance, 3 kmers have been found to be present 42 times in the sequences, 30 kmers 5 times, etc. Here you have a test with the expected output:
```{python}
import unittest
class TestKmerCounting(unittest.TestCase):
def test_kmer_counting(self):
fhand = io.BytesIO(
b"@seq1\nACTGGATCTTGCC\n+\nIIIIIIIIIII\n@seq2\nACTGGATATGCG\n+\nIIIIIIIIIIII\n@seq3\nTTCATGC\n+\nIIIIIII\n"
)
kmers_fhand = io.StringIO()
stats_fhand = io.StringIO()
count_kmers(fhand, kmers_fhand, stats_fhand, 5)
assert kmers_fhand.getvalue() == "ACTGG\t2\nCTGGA\t2\nTGGAT\t2\n"
assert json.loads(stats_fhand.getvalue()) == {"2": 3, "1": 14}
```
You can download a gzipped fastq file with 1000 sequences.
Hints
Parse the fastq file creating a generator, otherwise you would fill out your computer memory when reading millions of sequences.
You can use a Counter object to count the kmers.
Remember the hints given in the calculate fastq statistics regarding opening gzipped and not compressed fastq files.
Solution
If you want, you could take a peak of a possible solution.