BED format and bedtools

2016-02-21 | category Bioinformatics | tag NGS

1. Introduction

As described on the UCSC Genome Browser website, the BED format is a concise and flexible way to represent genomic features and annotations. The format defines up to 12 columns, but only the first 3 are required by the UCSC browser, the Galaxy browser, and bedtools.

bedtools accepts the “BED12” format (that is, all 12 fields listed below). However, only intersectBed, coverageBed, genomeCoverageBed, and bamToBed honor the BED12 “blocks” when computing overlaps, and only when the “-split” option is given. All other tools ignore the last six columns in comparisons and instead use the entire span (start to end) of the BED12 entry. The last six columns are still reported in the output of all comparisons.

chrom - The name of the chromosome on which the genome feature exists. 
	For example, “chr1”, “contig1112.23”. This column is required.
start - The zero-based starting position of the feature in the chromosome. 
	The first base in a chromosome is numbered 0. This column is required.
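end - The one-based ending position of the feature in the chromosome.
	Counted from 0 like the start, the end is therefore not part of the feature, and the feature length is end - start. This column is required.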
name - Defines the name of the BED feature. This column is optional.
	For example, “LINE”, “Exon3”.
score - The UCSC definition requires that a BED score range from 0 to 1000, inclusive. 
	This column is optional.
strand - Defines the strand - either ‘+’ or ‘-’. This column is optional.
thickStart - The starting position at which the feature is drawn thickly.
	Allowed yet ignored by bedtools.
thickEnd - The ending position at which the feature is drawn thickly.
	Allowed yet ignored by bedtools.
itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0).
	Allowed yet ignored by bedtools.
blockCount - The number of blocks (exons) in the BED line.
	Allowed, but ignored by bedtools unless “-split” is used.
blockSizes - A comma-separated list of the block sizes.
	Allowed, but ignored by bedtools unless “-split” is used.
blockStarts - A comma-separated list of block starts.
	Allowed, but ignored by bedtools unless “-split” is used.
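
The mixed coordinate convention (zero-based start, one-based end) is easy to get wrong, so here is a minimal sketch of what it implies. The BED6 record is made up for illustration and is not taken from any real annotation; only awk is used, not bedtools.

```shell
# A hypothetical BED6 record (tab-separated): chrom, start, end, name, score, strand.
record=$(printf 'chr1\t100\t200\tExon3\t500\t+')

# start is zero-based and end is one-based, so the feature length is end - start.
length=$(printf '%s\n' "$record" | awk -F'\t' '{ print $3 - $2 }')

# The same feature written as a one-based, fully closed interval
# (the convention used by GTF/GFF): only the start shifts by one.
one_based=$(printf '%s\n' "$record" | awk -F'\t' '{ print $1 ":" ($2 + 1) "-" $3 }')

echo "length: $length"
echo "1-based interval: $one_based"
```

With this record, the length is 100 bases and the one-based interval is chr1:101-200.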

2. bedtools Examples

2.0 intersect command

cat a.bed

Chr3	11699949	11700000
Chr3	11699967	11700018
Chr3	11699972	11700023

cat b.bed

Chr3	11699950	11699990
Chr3	11699970	11700020
Chr4	11699972	11700023

-wa Write the original entry in A for each overlap.

bedtools intersect -wa -a a.bed -b b.bed

Chr3	11699949	11700000
Chr3	11699949	11700000
Chr3	11699967	11700018
Chr3	11699967	11700018
Chr3	11699972	11700023
Chr3	11699972	11700023

-wb Write the original entry in B for each overlap. Without -wa, the A columns show only the overlapping portion of each A entry.

bedtools intersect -wb -a a.bed -b b.bed

Chr3	11699950	11699990	Chr3	11699950	11699990
Chr3	11699970	11700000	Chr3	11699970	11700020
Chr3	11699967	11699990	Chr3	11699950	11699990
Chr3	11699970	11700018	Chr3	11699970	11700020
Chr3	11699972	11699990	Chr3	11699950	11699990
Chr3	11699972	11700020	Chr3	11699970	11700020

-wo Write the original A and B entries plus the number of base pairs of overlap between the two features.

bedtools intersect -wo -a a.bed -b b.bed

Chr3	11699949	11700000	Chr3	11699950	11699990	40
Chr3	11699949	11700000	Chr3	11699970	11700020	30
Chr3	11699967	11700018	Chr3	11699950	11699990	23
Chr3	11699967	11700018	Chr3	11699970	11700020	48
Chr3	11699972	11700023	Chr3	11699950	11699990	18
Chr3	11699972	11700023	Chr3	11699970	11700020	48
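
The last column of the -wo output can be checked by hand: for two features on the same chromosome, the overlap is min(end_A, end_B) - max(start_A, start_B) when that value is positive. The following is a rough sketch of that arithmetic using a naive all-pairs comparison in awk. It is only meant to verify the numbers above and says nothing about how bedtools is implemented; the temporary directory and variable names are assumptions made for this example.

```shell
# Recreate the two example files in a temporary directory.
tmpdir=$(mktemp -d)
printf 'Chr3\t11699949\t11700000\nChr3\t11699967\t11700018\nChr3\t11699972\t11700023\n' > "$tmpdir/a.bed"
printf 'Chr3\t11699950\t11699990\nChr3\t11699970\t11700020\nChr4\t11699972\t11700023\n' > "$tmpdir/b.bed"

# Load every B entry into memory, then compare each A entry against all of them.
overlaps=$(awk 'BEGIN { FS = OFS = "\t" }
NR == FNR { bchrom[NR] = $1; bstart[NR] = $2 + 0; bend[NR] = $3 + 0; n = NR; next }
{
    for (i = 1; i <= n; i++) {
        if ($1 != bchrom[i]) continue                   # different chromosome
        s = ($2 + 0 > bstart[i]) ? $2 + 0 : bstart[i]   # larger of the two starts
        e = ($3 + 0 < bend[i])   ? $3 + 0 : bend[i]     # smaller of the two ends
        if (e > s) print $1, $2, $3, bchrom[i], bstart[i], bend[i], e - s
    }
}' "$tmpdir/b.bed" "$tmpdir/a.bed")

printf '%s\n' "$overlaps"
rm -r "$tmpdir"
```

This prints the same six lines as the -wo example, with overlaps of 40, 30, 23, 48, 18 and 48 bases. The Chr4 entry in b.bed never matches because the chromosome names differ.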

2.1 How many overlaps between the BAM file and the GTF file are reported (one overlap per line)?

To read the input directly from the BAM file, we use the ‘-abam’ option. The ‘-bed’ option writes the output as BED when the input is BAM.

bedtools intersect -abam test.bam -b test.gtf -bed -wo > overlaps.bed

This will create a file with the following format:

Columns 1-12 : alignment information, converted to BED format
Columns 13-21 : annotation (exon) information, from the GTF file
Column 22 : length of the overlap

Alternatively, we could first convert the BAM file to BED format using ‘bedtools bamtobed’ and then use the resulting file in the ‘bedtools intersect’ command. To answer the question: with ‘-wo’, only entries in file A that overlap an entry in file B are reported, one line per overlap, so the number of overlaps is exactly the number of lines in the file:

wc -l overlaps.bed

2.2 How many alignments overlap the annotations?

Columns 1-12 define the alignments:

cut -f1-12 overlaps.bed | sort -u | wc -l

2.3 Conversely, how many exons have reads mapped to them?

Columns 13-21 define the exons:

cut -f13-21 overlaps.bed | sort -u | wc -l
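
To see why ‘sort -u’ is needed before counting, the three counts can be tried on a toy file. This is a sketch on made-up data: the helper make_line, the read and exon names, and all placeholder fields are hypothetical, and the file only mimics the 22-column layout described above.

```shell
tmpdir=$(mktemp -d)

# Build one toy 22-column line: 12 alignment fields, 9 exon fields, 1 overlap length.
# Only the read name ($1) and the exon name ($2) vary; other fields are placeholders.
make_line() {
    awk -v aln="$1" -v exon="$2" -v ovl="$3" 'BEGIN {
        line = aln
        for (i = 2; i <= 12; i++) line = line "\t" "a" i
        line = line "\t" exon
        for (i = 14; i <= 21; i++) line = line "\t" "g" i
        print line "\t" ovl
    }'
}

{
    make_line read1 exonA 50
    make_line read1 exonB 20   # read1 overlaps two exons, so it appears twice
    make_line read2 exonB 50
    make_line read3 exonB 50
} > "$tmpdir/overlaps.bed"

n_overlaps=$(wc -l < "$tmpdir/overlaps.bed" | tr -d ' ')
n_alignments=$(cut -f1-12 "$tmpdir/overlaps.bed" | sort -u | wc -l | tr -d ' ')
n_exons=$(cut -f13-21 "$tmpdir/overlaps.bed" | sort -u | wc -l | tr -d ' ')

echo "overlaps: $n_overlaps, alignments: $n_alignments, exons: $n_exons"
rm -r "$tmpdir"
```

On this toy input there are 4 overlaps, but only 3 distinct alignments and 2 distinct exons, because an alignment or exon involved in several overlaps appears on several lines.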
