Bioinformatics at COMAV

BEDTools

The BEDTools allow a fast and flexible way of comparing large datasets of genomic features. The BEDtools utilities allow one to address common genomics tasks such as finding feature overlaps and computing coverage. Its name is due to an historical reason because nowadays they can process the most commonly used feature file formats like: BED, GFF, VCF, and SAM. The following are examples of common questions that one can address with BEDTools:

  • Which SNPs are in a coding region?
  • Which are the exonic and intronic coverages?
  • How many positions have a coverage greater than 8?
  • Which SNPs are shared by two predictions done by two different SNP callers?

The following notes a partially taken from the excellent BEDTools manual. It is important to read this manual before using BEDTools in a real question.

BEDTools usage

Some examples taken from the excellent BEDTools manual.

Report the base-pair overlap between the features in two BED files:

$intersectBed -a reads.bed -b genes.bed

Report those entries in A that overlap NO entries in B:

$ intersectBed -a reads.bed -b genes.bed –v

Read BED A from stdin. Useful for stringing together commands. For example, find genes that overlap LINEs but not SINEs:

$ intersectBed -a genes.bed -b LINES.bed | intersectBed -a stdin -b SINEs.bed –v

Find the closest ALU to each gene:

$ closestBed -a genes.bed -b ALUs.bed

Merge overlapping repetitive elements into a single entry, returning the number of entries merged:

$ mergeBed -i repeatMasker.bed -n

Merge nearby repetitive elements into a single entry, so long as they are within 1000 bp of one another:

$ mergeBed -i repeatMasker.bed -d 1000

Summary of the main BEDTools

intersectBed

This tools answers the question of which features of two sets overlap. It works with BED/GFF/VCF and BAM files.

intersectBed can report different overlaps. By default it will report the overlapping segment:

Chromosome  ================================================================
BED/BAM A       ================                 ===============
BED File B             ===============
Result                 =========

For example:

$ cat A.bed
chr1 100  200
chr1 300  400
$ cat B.bed
chr1 150  250
$ intersectBed –a A.bed -b B.bed
chr1 150  200

It can also report eh original A (-wa) feature or the original B feature (-wb).

Chromosome  ================================================================
BED/BAM A       ================                 ===============
BED File B             ===============
Result          ================
$ intersectBed –a A.bed –b B.bed -wa
chr1 100  200

If –wb is used, the overlapping portion of A will be reported followed by the original “B”:

$ intersectBed –a A.bed –b B.bed -wb
chr1 150  200  chr1   150  250

intersectBed can also count, with the -c parameter, the number of features that overlap in every region defined in the a file.

intersectBed has many other features and possibilities that are explained in its manual, like minimum fraction of overlap or strandedness.

windowBed

Similar to intersectBed, windowBed searches for overlapping features in A and B. However, windowBed adds a specified number (1000, by default) of base pairs upstream and downstream of each feature in A. In effect, this allows features in B that are “near” features in A to be detected. If an overlap is found in B, both the original A feature and the original B feature are reported.

Chromosome  ================================================================
BED File A          <----------=============---------->
BED File B     ========                                    ==========
Result         ========
$ cat A.bed
chr1 1000 1500
$ cat B.bed
chr1 100  300
chr1 2800 3200
$ windowBed –a A.bed –b B.bed
chr1 1000 1500  chr1 100  300

closestBed

Similar to intersectBed, closestBed searches for overlapping features in A and B. In the event that no feature in B overlaps the current feature in A, closestBed will report the closest (that is, least genomic distance from the start or end of A) feature in B. For example, one might want to find which is the closest gene to a significant GWAS polymorphism. Note that closestBed will report an overlapping feature as the closest. That is, it does not restrict to closest non-overlapping feature.

Chromosome   ================================================================
BED File A                                  ========
BED File B        ========                              ======
Result                                                  ======
$ cat A.bed
chr1 600  700
$ cat B.bed
chr1 100  200
chr1 900  1000
$ windowBed –a A.bed –b B.bed
chr1 600  700  chr1  900  1000

subtractBed

subtractBed searches for features in B that overlap A. If an overlapping feature is found in B, the overlapping portion is removed from A and the remaining portion of A is reported. If a feature in B overlaps all of a feature in A, the A feature will not be reported.

Chromosome    ================================================================
BED File A             ===========                        =======
BED File B       =========                            =============
Result                    ========
$ cat A.bed
chr1 10  30
chr1 80  90
$ cat B.bed
chr1 5   20
chr1 70  95
$ subtractBed –a A.bed –b B.bed
chr1 20  30

mergeBed

mergeBed combines overlapping or “book-ended” (that is, one base pair away) features in a feature file into a single feature which spans all of the combined features.

Chromosome    ================================================================
BED File A             ===========                             =======
                 =========                            =============   ====
Result           =================                    ====================
$ cat A.bed
chr1 100  200
chr1 180  250
chr1 700  800
chr1 600  750
chr1 800  850
$ mergeBed –i A.bed
chr1 100  250
chr1 600  850

coverageBed

coverageBed computes both the depth and breadth of coverage of features in file A across the features in file B. For example, coverageBed can compute the coverage of sequence alignments (file A) across 1 kilobase (arbitrary) windows (file B) tiling a genome of interest. One advantage that coverageBed offers is that it not only counts the number of features that overlap an interval in file B, it also computes the fraction of bases in B interval that were overlapped by one or more features. Thus, coverageBed also computes the breadth of coverage for each interval in B.

After each interval in B, coverageBed will report:
The number of features in A that overlapped (by at least one base pair) the B interval. The number of bases in B that had non-zero coverage from features in A. The length of the entry in B. The fraction of bases in B that had non-zero coverage from features in A.
Chromosome  ================================================================
BED File B   ===============    ================    ======    =============
BED File A   ==== ====             ==             =========    === == ====
BED File A     ========                                     ===== ===== ==
Result        [ N=3, 10/15 ]     [ N=1, 2/16 ]    [N=1,6/6]   [N=6, 11/12]
$ cat A.bed
chr1  10  20
chr1  20  30
chr1  30  40
chr1  100 200
$ cat B.bed
chr1  0   100
chr1  100 200
chr2  0   100
$ coverageBed –a A.bed –b B.bed
chr1  0   100  3  30   100  0.3000000
chr1  100 200  1  100  100  1.0000000
chr2  0   100  0  0    100  0.0000000

genomeCoverageBed

genomeCoverageBed computes a histogram of feature coverage (e.g., aligned sequences) for a given genome. Optionally, by using the –d option, it will report the depth of coverage at each base on each chromosome in the genome file (-g).

The default output format is as follows:
chromosome (or entire genome) depth of coverage from features in input file number of bases on chromosome (or genome) with depth equal to column 2. size of chromosome (or entire genome) in base pairs fraction of bases on chromosome (or entire genome) with depth equal to column 2.
$ cat A.bed
chr1  10  20
chr1  20  30
chr2  0   500
$ cat my.genome
chr1 1000
chr2 500
$ genomeCoverageBed –i A.bed –g my.genome
chr1 0  980  1000 0.98
chr1 1  20   1000 0.02
chr2 1  500  500  1
genome  0    980  1500 0.653333
genome  1    520  1500 0.346667

fastaFromBed

fastaFromBed extracts sequences from a FASTA file for each of the intervals defined in a BED file.

$ cat test.fa
>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG
$ cat test.bed
chr1 5  10
$ fastaFromBed -fi test.fa -bed test.bed -fo test.fa.out
$ cat test.fa.out
>chr1:5-10
AAACC

SortBed

sortBed sorts a feature file by chromosome and other criteria.

By default, sortBed sorts a BED file by chromosome and then by start position in ascending order.

$ cat A.bed
chr1  800 1000
chr1  80  180
chr1  1   10
chr1  750 10000
$ sortBed –i A.bed
chr1  1   10
chr1  80  180
chr1  750 10000
chr1  800 1000

It should be noted that sortBed is merely a convenience utility, as the UNIX sort utility will sort BED files more quickly while using less memory. For example, UNIX sort will sort a BED file by chromosome then by start position in the following manner:

$ sort -k 1,1 -k2,2 -n a.bed
chr1  1   10
chr1  80  180
chr1  750 10000
chr1  800 1000

Other files used by BEDTools

Some tools need the length of the genome being used. In those cases a tab-delimited genome file with the length of the molecules of the genome has to be provided. An example would be:

chrI  15072421
chrII 15279323
...
chrX  1771885
chrM  13794

Another format used by BEDTools is BEDPE. This format is used to represent disjoint features like the paired-end sequence alignments.

Notes regarding usage

All BEDTools load the “B” file into memory and process the “A” file one-by-one against the features in “B”. Therefore when possible, one should make set the smaller of the two files to be the “B” file. For example, you’ll discover that finding overlaps between a list of 30,000 genes and 100 million aligned sequences will work much more efficiently with the genes file set as BED file “B”.

Most of the BEDTools allow the “A” file to be passed via standard input for use in UNIX “streams” or “pipelines”. In order to do this, use “-a stdin”. For example:

$ cat reads.bed | intersectBed -a stdin -b genes.bed > readsToGenes.bed

Practical Task

Given the following genome calculate some statistics:

                      11111111112222222222333333333344444444445555555555
             12345678901234567890123456789012345678901234567890123456789
             00000000000000000000000000000000000000000000000000000000000
Chromosome   ===========================================================
genes           ===============>                ===============>
CDSs              =========>                      ========>
exons           ====  === =====>                === === =======>
reads           --->  ->   -->
                  ->  --> -->
                -->
SNPs          * *     *     *    *  *   *   *  * *       *

Download and extract the bed files required for the practice are in the file: bed.tar.gz.

~$ pwd
/home/ngs_user
~$ tar -xvzf bed.tar.gz
~$ cd bed
  1. Which are the exon regions in the CDSs?
~/bed$ intersectBed -a cds.bed -b exons.bed > cds.exon.bed
~/bed$ cat cds.exon.bed
chr   59      70      cds1
chr   99      120     cds1
chr   139     150     cds1
chr   379     380     cds2
chr   399     420     cds2
chr   439     460     cds2
  1. Store the UTR regions in a new bed file.
~/bed$ subtractBed -a exons.bed -b cds.bed > utrs.bed
~/bed$ cat utrs.bed
chr   39      59      exon11
chr   150     190     exon13
chr   359     379     exon21
chr   460     510     exon23
  1. How many SNPs are in every gene?
~/bed$ intersectBed -a genes.bed -b snps.bed -c
chr  39      190     gene1   3
chr  359     510     gene2   2
  1. And how many in the UTRs and in the CDSs?
~/bed$ intersectBed -a utrs.bed -b snps.bed -c
chr   39      59      exon11  1
chr   150     190     exon13  1
chr   359     379     exon21  1
chr   460     510     exon23  0
~/bed$ intersectBed -a cds.exon.bed -b snps.bed -c
chr   59      70      cds1    0
chr   99      120     cds1    1
chr   139     150     cds1    0
chr   379     380     cds2    0
chr   399     420     cds2    0
chr   439     460     cds2    1
  1. How many reads cover each gene?
~/bed$ coverageBed -a reads.bed -b genes.bed
chr   39      190     gene1   7       83      151     0.5496688
chr   359     510     gene2   0       0       151     0.0000000
  1. Which is the distribution of read coverages along the chromosome?
~/bed$ genomeCoverageBed -i reads.bed -g chr.genome
chr   0       507     590     0.859322
chr   1       30      590     0.0508475
chr   2       52      590     0.0881356
chr   3       1       590     0.00169492
genome        0       507     590     0.859322
genome        1       30      590     0.0508475
genome        2       52      590     0.0881356
genome        3       1       590     0.00169492
| | index