tophat + cufflinks

| category Bioinformatics  | tag RNA-Seq 

1. Data

two raw data files were provided as the starting point: * day8.fastq from the first biological condition * day16.fastq from the second biological condition * genome.fa the reference genome * genes.gtf the reference gene annotations

2. Create reference index

bowtie2-build bwtIndex/genome.fa bwtIndex/genome

3. Run tophat using all default parameters

tophat -o output/tophat/day8/ bwtIndex/genome day8.fastq 
tophat -o output/tophat/day16/ bwtIndex/genome day16.fastq

These will create the accepted_hits.bam files containing the alignments, the align_summary.txt files containing summary stats on the mapped reads, the unmapped.bam files containing the records of unmapped reads.

3.1 How many spliced alignments were reported for the ‘day8’ data set?

spliced alignments contain ‘N’ in cigar score.

samtools view output/tophat/day8/accepted_hits.bam | cut -f 6 | grep 'N' | wc -l

4. Run cufflinks

Run cufflinks using the specified labels as prefix for naming the assembled transcripts.

cufflinks -o output/cufflinks/day8 -L Day8 output/tophat/day8/accepted_hits.bam
cufflinks -o output/cufflinks/day16 -L Day16 output/tophat/day16/accepted_hits.bam

These will generate the files transcripts.gtf containing the assembled transcripts, as well as files *.fpkm_tracking containing expression (FPKM) estimates for genes and transcripts.

cut -f9 output/cufflinks/day8/transcripts.gtf | cut -d ' ' -f2 | sort -u | wc -l
cut -f9 output/cufflinks/day8/transcripts.gtf | cut -d ' ' -f4 | sort -u | wc -l
cut -f9 output/cufflinks/day8/transcripts.gtf | grep -v "exon_number" | cut -d ' ' -f2 | sort | uniq -c | awk '$1==1' | wc -l
cut -f9 output/cufflinks/day8/transcripts.gtf | grep "exon_number" | cut -d ' ' -f4 | sort | uniq -c | awk '$1==1' | wc -l
cut -f9 output/cufflinks/day8/transcripts.gtf | grep "exon_number" | cut -d ' ' -f4 | sort | uniq -c | awk '$1>1' | wc -l

5. Run cuffcompare

Run cuffcompare on the resulting cufflinks transcripts, using the reference gene annotations provided and selecting the option ‘-R’ to consider only the reference transcripts that overlap some input transfrag.

cuffcompare -r genes.gtf -R output/cufflinks/day8/transcripts.gtf
cuffcompare -r genes.gtf -R output/cufflinks/day16/transcripts.gtf

It compares the assembled transcripts against a set of reference gene annotations provided by the user, exon-by-exon, to determine which genes and transcripts in the sample are known, and which ones are likely novel. In the end, it assigns each predicted (cufflinks) transcript a ‘class’ code depending on how it relates to a reference transcript, for example: it is the same as a reference transcript (‘=’), it is only a portion of one (‘c’), a new splice variant of a reference gene (‘j’), etc. See details.

6. Run cuffmerge

echo output/cufflinks/day8/transcripts.gtf > gtf.txt
echo output/cufflinks/day16/transcripts.gtf >> gtf.txt
cuffmerge -g genes.gtf gtf.txt -o output/cuffmerge

7. Run cuffdiff

Run cuffdiff with the merged.gtf file as reference annotation, taking the two alignment files as input.

cuffdiff -o output/cuffdiffs/ output/cuffmerge/merged.gtf output/tophat/day8/accepted_hits.bam output/tophat/day16/accepted_hits.bam

This will create the file gene_exp.diff containing test scores and results for the gene-level differential expression analysis, other *.diff files, as well as tracking files for genes, transcripts, splicing, CDS, TSS, etc.

7.1 How many genes were detected as differentially expressed?

grep –c "yes"" output/cuffdiffs/gene_exp.diff

7.2 How many transcripts were differentially expressed between the two samples?

grep –c yes output/cuffdiffs/isoform_exp.diff

Previous     Next