03 June 2015

Background

RNA-seq short reads quantification, or simply, counting how many reads mapped to a set of genes, is the key step for differential expression analysis. Traditionally, reads were mapped to the reference genome and then were counted based on the mapping information. It normally takes days or weeks with multiple CPUs, just for the alignment. Recently, K-mer strategy was proposed as an alignment-free alternative, including Sailfish and kallisto. Strange names, are they? Anyway, with no doubt, these K-mer approaches run much faster comparing to alignment-based approaches. See below figure stole from Rob Patro’s presentation. Imgur

What is K-mer? Simply speaking, K-mer is a string of length K. For instance, “ATCG” is 4-mer sequence. For a given sequence of length N, it will extract N-k+1 k-mers. Imgur

The paradigm for a K-mer approach would be:
1. Build k-mer index of your target transcripts (e.g. your gene annotation).
2. Count k-mer in your RNA-seq reads.
3. Quantification.

Setup the experiment

To convince myself these approaches are as accurate as traditional ones, I setup an experiment to compare the results generated by both traditional alignment-based approach and K-mer approach. The analyzing codes were deposited in the github repo.

First, I downloaded some RNA-seq data from NCBI by using wget and obtained the reference genome from EnsemblePlants. I used GSNAP for short reads alignment. And some bioconductor packages, including GenomicFeatures and GenomicAlignments, for read count quantification. The detailed procedure can be found here for alignment and here for quantification. Or a shorter version here.

### get RNA-seq reads
wget get ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR117/SRR1174772/SRR1174772.sra
### get reference genome
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-25/fasta/oryza_indica/dna/Oryza_indica.ASM465v1.25.dna.chromosome.*.fa.gz
### alignment with gmap
gmap_build -D largedata/OS_indica/ -d ASM465v1.25_gsnap -g Oryza_indica.ASM465v1.25.dna.chromosome*fa.gz > gmapbuild.log

The kallisto approach

I used a recently released K-mer approach kallisto. It is on github and are well documented. The whole procedure takes only two steps for both single-end and paired-end sequencing as below.

### indexing
kallisto index -k 31 -i Osativa_204_v7_1transcript Osativa_204_v7.0.transcript_primaryTranscriptOnly
### quantification
kallisto quant -i index -o output pairA_1.fastq pairA_2.fastq pairB_1.fastq pairB_2.fastq

# Comparing of the results

The Pearson’s correlation coefficient (r) ranged from 0.8 to 0.94. That is an impressive correlation, indicating the kallisto is indeed as accurate as alignment-based approach. I looked into the outliers on the left sides of the plots and found those dots were from genes with repeat sequences. Note that with the alignment-based approach, I kept only uniquely mapped reads for quantification purpose. After removing these several outlier genes, the correlations become more significant.

Overall, I think kallisto is a simple and rapid RNA-seq quantification alternative for alignment-based approach. However, kallisto may over-estimate the read counts from genes containing repeats. For differential expression analysis, this may not be a critical problem since the over-estimation will not be biased.

Imgur



blog comments powered by Disqus