1. Introduction
As described on the UCSC Genome Browser website (see link below), the BED format is a concise and flexible way to represent genomic features and annotations. The BED format description supports up to 12 columns, but only the first 3 are required for the UCSC browser, the Galaxy browser and for bedtools.
bedtools allows one to use the “BED12” format (that is, all 12 fields listed below). However, only intersectBed, coverageBed, genomeCoverageBed, and bamToBed will obey the BED12 “blocks” when computing overlaps, etc., via the “-split” option. For all other tools, the last six columns are not used for any comparisons by the bedtools. Instead, they will use the entire span (start to end) of the BED12 entry to perform any relevant feature comparisons. The last six columns will be reported in the output of all comparisons.
chrom - The name of the chromosome on which the genome feature exists.
For example, “chr1”, “contig1112.23”. This column is required.
start - The zero-based starting position of the feature in the chromosome.
The first base in a chromosome is numbered 0. This column is required.
end - The one-based ending position of the feature in the chromosome.
This column is required.
name - Defines the name of the BED feature. This column is optional.
For example, “LINE”, “Exon3”.
score - The UCSC definition requires that a BED score range from 0 to 1000, inclusive.
This column is optional.
strand - Defines the strand - either ‘+’ or ‘-‘. This column is optional.
thickStart - The starting position at which the feature is drawn thickly.
Allowed yet ignored by bedtools.
thickEnd - The ending position at which the feature is drawn thickly.
Allowed yet ignored by bedtools.
itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0).
Allowed yet ignored by bedtools.
blockCount - The number of blocks (exons) in the BED line.
Allowed yet ignored by bedtools.
blockSizes - A comma-separated list of the block sizes.
Allowed yet ignored by bedtools.
blockStarts - A comma-separated list of block starts.
Allowed yet ignored by bedtools.
2. bedtools Examples
2.0 intersect command
cat a.bed
Chr3 11699949 11700000
Chr3 11699967 11700018
Chr3 11699972 11700023
cat b.bed
Chr3 11699950 11699990
Chr3 11699970 11700020
Chr4 11699972 11700023
-wa Write the original entry in A for each overlap.
bedtools intersect -wa -a a.bed -b b.bed
Chr3 11699949 11700000
Chr3 11699949 11700000
Chr3 11699967 11700018
Chr3 11699967 11700018
Chr3 11699972 11700023
Chr3 11699972 11700023
-wb Write the original entry in B for each overlap.
bedtools intersect -wb -a a.bed -b b.bed
Chr3 11699950 11699990 Chr3 11699950 11699990
Chr3 11699970 11700000 Chr3 11699970 11700020
Chr3 11699967 11699990 Chr3 11699950 11699990
Chr3 11699970 11700018 Chr3 11699970 11700020
Chr3 11699972 11699990 Chr3 11699950 11699990
Chr3 11699972 11700020 Chr3 11699970 11700020
Write the original A and B entries plus the number of
base pairs of overlap between the two features.
bedtools intersect -wo -a a.bed -b b.bed
Chr3 11699949 11700000 Chr3 11699950 11699990 40
Chr3 11699949 11700000 Chr3 11699970 11700020 30
Chr3 11699967 11700018 Chr3 11699950 11699990 23
Chr3 11699967 11700018 Chr3 11699970 11700020 48
Chr3 11699972 11700023 Chr3 11699950 11699990 18
Chr3 11699972 11700023 Chr3 11699970 11700020 48
2.1 How many overlaps (each overlap is reported on one line) between the bam file and the gtf file are reported?
To allow the input to be read directly from the BAM file, we use the option ‘-abam’.
-bed If using BAM input, write output as BED.
bedtools intersect -abam test.bam -b test.gtf -bed -wo > overlaps.bed
This will create a file with the following format: Columns 1-12 : alignment information, converted to BED format Columns 13-21 : annotation (exon) information, from the GTF file Column 22 : length of the overlap
Alternatively, we could first convert the BAM file to BED format using ‘bedtools bamtobed’ then use the resulting file in the ‘bedtools intersect’ command. To answer the question, the number of overlaps reported is precisely the number of lines in the file (because only entries in the first file that have overlaps in file B are reported, according to the option ‘-wo’):
wc -l overlaps.bed
2.2 How many alignments overlap the annotations?
Columns 1-12 define the alignments:
cut -f1-12 overlaps.bed | sort -u | wc -l
2.3 Conversely, how many exons have reads mapped to them?
Columns 13-21 define the exons:
cut -f13-21 overlaps.bed | sort -u | wc -l