Will a change in truncation/trimming lead to a different taxonomic classification?

steffi · October 2, 2018, 4:52am

Dear All,
I've run QIIME 2 on my metagenomics samples. In the quality control step, I used the dada2 plugin. I've attached the interactive quality plot of my samples. My forward reads were fine, but the quality of the reverse reads was not good at the end.

When I used truncation lengths of 299 (forward) and 260 (reverse) with the following command:

qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f 299 --p-trunc-len-r 260 --o-representative-sequences rep-seqs.qza --o-table table.qza --o-denoising-stats stats.qza

I got the following taxonomic classification.

But when I changed the truncation lengths to 301 (forward) and 260 (reverse) with the following command:

qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f 301 --p-trunc-len-r 260 --o-representative-sequences rep-seqs.qza --o-table table.qza --o-denoising-stats stats.qza

I got an entirely different classification. How do I decide on my truncation length? Kindly help me.

Nicholas_Bokulich · October 2, 2018, 2:03pm

Hi @steffi,
Great question! Yes, different trimming parameters can impact taxonomic classification results, particularly for paired-end reads (such as you have).

This can happen in a few ways, and let's start with your case:

1. Truncating less (i.e., passing longer reads) may incorporate more noise into the reads (since quality usually decreases at the 3' ends of reads). On paired-end reads this could impact the alignment quality (further increasing the noise) but even on single-end reads incorrect base calls could impact classification.
2. Longer reads will potentially contain more species-discriminative information (as long as the reads are high-quality). So truncating/trimming may impact how taxonomically informative a read is. Below is a classic depiction of this, from this paper.

So I think #1 is probably what is going on in your case. It looks like quality really drops off badly on that last base, so yes, even truncating 2 nt earlier can influence quality, as your case has shown!
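
As a rough illustration of picking a truncation position from a quality plot, here is a minimal Python sketch. It assumes you have copied the per-position median quality scores out of the interactive plot into a list; the function name `suggest_trunc_len`, the threshold of 20, and the example scores are all made up for illustration and are not part of QIIME 2 or DADA2.

```python
from typing import Sequence


def suggest_trunc_len(median_quals: Sequence[float], min_quality: float = 20.0) -> int:
    """Return how many leading positions to keep.

    Truncates just before the first position whose median quality
    falls below min_quality (an arbitrary, assumed threshold).
    """
    for position, quality in enumerate(median_quals):
        if quality < min_quality:
            return position
    # Quality never dropped below the threshold: keep the full read.
    return len(median_quals)


# Hypothetical medians for a 301-nt forward read: good quality
# throughout, then a sharp drop on the last two bases.
forward_medians = [36.0] * 299 + [14.0, 8.0]

print(suggest_trunc_len(forward_medians))  # 299
```

The result is only a starting point: the value you pass to --p-trunc-len-f or --p-trunc-len-r should still be checked against the plot and against how many reads survive denoising.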

Go with the shorter sequences (the first example you gave) — those classifications look good. The second classifications look really bad... I think the noisy bases at the 3' end of the forward read are causing the paired-end read joining to mismatch, which is confusing the classifier (it may even be interpreting these reads in the wrong orientation because the alignment looks really bad).
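
When choosing truncation lengths for paired-end reads, it is also worth checking that the truncated forward and reverse reads still overlap enough to be joined. The sketch below does that arithmetic in Python. The amplicon length of 460 nt is a hypothetical placeholder (it is not taken from this dataset), and the minimum overlap of 12 nt is an assumption based on the default of DADA2's read merging in the R package, so check both for your own data and version.

```python
def merge_overlap(trunc_len_f: int, trunc_len_r: int, amplicon_len: int) -> int:
    """Expected overlap (in nt) between truncated forward and reverse reads.

    A negative value means the reads do not reach each other at all.
    """
    return trunc_len_f + trunc_len_r - amplicon_len


MIN_OVERLAP = 12    # assumed minimum overlap needed for joining
AMPLICON_LEN = 460  # hypothetical amplicon length; replace with your own

# The two settings from this thread, plus a deliberately short pair.
for trunc_f, trunc_r in [(299, 260), (301, 260), (240, 200)]:
    overlap = merge_overlap(trunc_f, trunc_r, AMPLICON_LEN)
    status = "ok" if overlap >= MIN_OVERLAP else "too short to join"
    print(f"f={trunc_f} r={trunc_r}: overlap={overlap} nt ({status})")
```

Under these assumptions both 299/260 and 301/260 leave plenty of overlap, which fits the explanation above: the issue with the longer setting is the quality of the extra bases, not the amount of overlap.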

Now for that cool figure showing the relationship between sequence length and classification accuracy (for RDP classifier, NOT any of the QIIME 2 classifiers, but the same general trends will hold and this is just a really neat figure):

steffi · October 3, 2018, 5:13am

Thank you for the clear explanation. I'll go with 299 (forward) and 260 (reverse).