1. Data
two raw data files were provided as the starting point: * day8.fastq from the first biological condition * day16.fastq from the second biological condition * genome.fa the reference genome * genes.gtf the reference gene annotations
2. Create reference index
bowtie2-build bwtIndex/genome.fa bwtIndex/genome
3. Run tophat using all default parameters
tophat -o output/tophat/day8/ bwtIndex/genome day8.fastq
tophat -o output/tophat/day16/ bwtIndex/genome day16.fastq
These will create the accepted_hits.bam
files containing the alignments, the align_summary.txt
files containing summary stats on the mapped reads, the unmapped.bam
files containing the records of unmapped reads.
3.1 How many spliced alignments were reported for the ‘day8’ data set?
spliced alignments contain ‘N’ in cigar score.
samtools view output/tophat/day8/accepted_hits.bam | cut -f 6 | grep 'N' | wc -l
4. Run cufflinks
Run cufflinks using the specified labels as prefix for naming the assembled transcripts.
cufflinks -o output/cufflinks/day8 -L Day8 output/tophat/day8/accepted_hits.bam
cufflinks -o output/cufflinks/day16 -L Day16 output/tophat/day16/accepted_hits.bam
These will generate the files transcripts.gtf
containing the assembled transcripts, as well as files *.fpkm_tracking
containing expression (FPKM) estimates for genes and
transcripts.
4.1 How many genes were generated by cufflinks for ‘day8’ data set?
cut -f9 output/cufflinks/day8/transcripts.gtf | cut -d ' ' -f2 | sort -u | wc -l
4.2 How many transcripts were generated by cufflinks for ‘day8’ data set?
cut -f9 output/cufflinks/day8/transcripts.gtf | cut -d ' ' -f4 | sort -u | wc -l
4.3 How many single transcript genes were generated by cufflinks for ‘day8’ data set?
cut -f9 output/cufflinks/day8/transcripts.gtf | grep -v "exon_number" | cut -d ' ' -f2 | sort | uniq -c | awk '$1==1' | wc -l
4.4 How many single-exon transcripts were generated by cufflinks for ‘day8’ data set?
cut -f9 output/cufflinks/day8/transcripts.gtf | grep "exon_number" | cut -d ' ' -f4 | sort | uniq -c | awk '$1==1' | wc -l
4.5 How many multi-exon transcripts were generated by cufflinks for ‘day8’ data set?
cut -f9 output/cufflinks/day8/transcripts.gtf | grep "exon_number" | cut -d ' ' -f4 | sort | uniq -c | awk '$1>1' | wc -l
5. Run cuffcompare
Run cuffcompare on the resulting cufflinks transcripts, using the reference gene annotations provided and selecting the option ‘-R’ to consider only the reference transcripts that overlap some input transfrag.
cuffcompare -r genes.gtf -R output/cufflinks/day8/transcripts.gtf
cuffcompare -r genes.gtf -R output/cufflinks/day16/transcripts.gtf
It compares the assembled transcripts against a set of reference gene annotations provided by the user, exon-by-exon, to determine which genes and transcripts in the sample are known, and which ones are likely novel. In the end, it assigns each predicted (cufflinks) transcript a ‘class’ code depending on how it relates to a reference transcript, for example: it is the same as a reference transcript (‘=’), it is only a portion of one (‘c’), a new splice variant of a reference gene (‘j’), etc. See details.
6. Run cuffmerge
echo output/cufflinks/day8/transcripts.gtf > gtf.txt
echo output/cufflinks/day16/transcripts.gtf >> gtf.txt
cuffmerge -g genes.gtf gtf.txt -o output/cuffmerge
7. Run cuffdiff
Run cuffdiff with the merged.gtf
file as reference annotation, taking the two alignment files as input.
cuffdiff -o output/cuffdiffs/ output/cuffmerge/merged.gtf output/tophat/day8/accepted_hits.bam output/tophat/day16/accepted_hits.bam
This will create the file gene_exp.diff
containing test scores and results for the gene-level differential expression analysis, other *.diff
files, as well as tracking files for genes, transcripts, splicing, CDS, TSS, etc.
7.1 How many genes were detected as differentially expressed?
grep –c "yes"" output/cuffdiffs/gene_exp.diff
7.2 How many transcripts were differentially expressed between the two samples?
grep –c yes output/cuffdiffs/isoform_exp.diff