Poor/inconsistent taxonomy classification after denoising
I'm working with 16S data from human microbiome samples. The samples arrived demultiplexed, and I removed the supplied Illumina primers using cutadapt. I then ran qiime demux summarize on the trimmed data to get the quality plots and make my decisions about truncation.
To me the data look pretty good in quality, and using a Phred quality of 20 as the dip/drop threshold combined with the size distributions, I chose truncation values of 270 (forward) and 265 (reverse). I ran the final denoising command as:
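(The command block itself didn't come through in the post; based on the truncation values above it was presumably something along these lines. The input/output file names here are placeholders, not from the original post:)

```shell
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-trimmed.qza \
  --p-trunc-len-f 270 \
  --p-trunc-len-r 265 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats stats.qza
```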
This is the resulting stats file: stats.qza (20.7 KB)
It's a little confusing what the truncation parameter is doing, as the documentation says:
Position at which reverse read sequences should be truncated due to decrease in quality. This truncates the 3' end of the input sequences, which will be the bases that were sequenced in the last cycles. Reads that are shorter than this value will be discarded. After this parameter is applied there must still be at least a 12 nucleotide overlap between the forward and reverse reads. If 0 is provided, no truncation or length filtering will be performed
This suggests that any reads shorter than the truncation parameter will be discarded. But this post and this post seem to suggest that the documentation is incorrect and that reads shorter than this length are not discarded. Some clarification here would be much appreciated.
I then used two different pre-computed classifiers (GTDB and GTDB human stool weighted) for taxonomic assignment. This is where it gets confusing. With the parameters above, both the GTDB and the GTDB-human classifiers showed no classification past the domain level. But if I do no truncation (set to 0), or truncate as above but increase the max-ee parameter to 5 and the min-fold-parent-over-abundance parameter to 5, I see hundreds of annotations at the genus/species level when using the GTDB-human classifier (but still no annotations past the domain level with the GTDB classifier).
So my questions are:
Do these data need truncation? If so, what would be likely parameters to use?
I understand that tweaking parameters (no truncation, increasing max-ee, etc.) can lead to better taxonomic assignment, but why would we see different effects between the GTDB and GTDB-human classifiers?
I know how important this step is to all downstream analysis, so I would really like to get this done as well as possible and would really appreciate any help/advice. Thanks!
Looks like you found Liz's posts on truncation. I approach this decision a little differently, so perhaps you would find my post on DADA2 trimming, truncation, and merging as a helpful comparison.
I always truncate data when running dada2!
When working with a new batch of samples or new cohort, I often run DADA2 multiple times to see what settings cause the most reads to join. Like, half a dozen times!
This is called 'parameter sweeping' or 'guess and check'!
You can decide what to run next!
What I recommend:
I would try shorter values for --p-trunc-len-r, as R2 looks pretty rough.
220 might be too short, but it would let a lot more reads pass the filter without resorting to raising the max-ee parameter to 5. Depending on how this goes, either truncate shorter to remove more of the low-quality R2, or truncate longer to get more overlap and thus more merges.
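The overlap math behind that trade-off can be sketched as below. This is just an illustrative model: the amplicon length of 420 bp is an assumption borrowed from the example later in this thread, not a measurement from this dataset, and the function names are made up for the sketch.

```python
def overlap(trunc_f: int, trunc_r: int, amplicon_len: int) -> int:
    """Overlap (nt) between forward and reverse reads after truncation."""
    return trunc_f + trunc_r - amplicon_len

def will_merge(trunc_f: int, trunc_r: int, amplicon_len: int,
               min_overlap: int = 12) -> bool:
    """True if DADA2's documented 12 nt minimum-overlap requirement is met."""
    return overlap(trunc_f, trunc_r, amplicon_len) >= min_overlap

# Assuming a ~420 bp amplicon (illustrative only):
print(overlap(270, 265, 420))    # 115 nt of overlap -> room to truncate R2 shorter
print(will_merge(270, 220, 420)) # True: 70 nt overlap still clears the 12 nt minimum
print(will_merge(255, 175, 420)) # False: only 10 nt overlap, reads won't merge
```

The point of the sketch: truncating R2 down to 220 only becomes a problem once the remaining overlap approaches the 12 nt floor.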
Thank you @colinbrislawn. It's not about discarding all reads in a dataset, but about what happens to, say, a single read which, after primer removal, is now shorter than the --p-trunc-len but still has good quality. Do we know whether that read is removed or kept?
If possible, I think throwing warnings about poor parameter selection is always useful.
Such a read is removed, according to the DADA2 docs. If you believe the DADA2 docs are incorrect, then the dada2 GitHub issue page is the best place to raise such a concern. For external packages such as this one, when we "wrap" them in QIIME 2 we generally copy the parameter descriptions over.
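For illustration, the documented truncLen rule (truncate longer reads, discard shorter ones, 0 disables both) can be modeled like this. DADA2's filterAndTrim is an R function, so this Python mock is only a model of the documented behaviour, not the real implementation:

```python
def apply_trunc_len(reads: list[str], trunc_len: int) -> list[str]:
    """Model of DADA2's truncLen rule: reads shorter than trunc_len are
    discarded; longer reads are truncated to exactly trunc_len.
    trunc_len == 0 disables both truncation and length filtering."""
    if trunc_len == 0:
        return reads
    return [r[:trunc_len] for r in reads if len(r) >= trunc_len]

reads = ["A" * 250, "A" * 270, "A" * 300]
kept = apply_trunc_len(reads, 265)
print([len(r) for r in kept])  # [265, 265] -- the 250 nt read is discarded
```

So a good-quality read that primer removal left at 250 nt would still be dropped by a truncation value of 265.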
@colinbrislawn thank you for the insights and for the introduction to the term 'parameter sweeping'; I had been calling it 'parameter tuning'! I had been running the denoising step multiple times and assessing the impact not only on the reads that survive denoising and merge, but also on taxonomic classification using both the GTDB and the GTDB human stool weighted classifiers.
By modifying the truncation parameters I can identify multiple sets of parameters that result in many reads passing the filtering and merging. However, what I then see is that with certain parameters the GTDB classifier cannot assign ASVs to taxonomy below the domain level, but the human-weighted GTDB classifier will assign taxonomy to the species level. If I modify the parameters, as per your suggestion, I again see many reads passing the denoising step, but the pattern of classification by the two classifiers is reversed!
So now I'm wondering: is it suitable to use different truncation parameters for different classifiers? Is the instability of taxonomic assignments across the denoising parameters I'm using indicating that something is wrong with the data, or with what I'm doing?
I've always wondered if there are others like me who run multiple iterations with gradually increasing/decreasing trimming parameters.
Basically, I run a bash loop to combine all possible trimming parameters in increments of 5. So something like forward reads truncated at (250, 255, 260, and so on), each combined with reverse reads truncated at (180, 185, 190, and so on; usually much shorter).
Then my loop exports the contents of the stats.qza file, grabs a few samples from the table, and merges them into one final table to compare retention rates. There are usually two very clear thresholds: a lower one, where retention drops to nearly 0% because overtrimming leaves too little overlap (the 12 bp overlap minimum from DADA2), and the sweet spot, where roughly 50-70% of merged reads are retained (the exact percentage depends on the sequencing quality of the run). Past the sweet spot, a slow, gradual reduction of the retention rate begins, because longer fragments mean lower average quality. For example: 65% (highest) at 260-180 (420 bp insert + 12 bp overlap), 64.1% at 265-175, 63.9% at 265-180, and 0.5% at 255-175 (too short, no overlap, so DADA2 throws those reads out).
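A minimal sketch of that kind of sweep, assuming the usual qiime dada2 denoise-paired interface (file names are placeholders; the qiime call is commented out here so the loop just lists the grid it would run):

```shell
# Sweep forward truncation 250..270 and reverse truncation 180..200 in steps of 5.
for f in $(seq 250 5 270); do
  for r in $(seq 180 5 200); do
    echo "would run: trunc-len-f=$f trunc-len-r=$r"
    # qiime dada2 denoise-paired \
    #   --i-demultiplexed-seqs demux-trimmed.qza \
    #   --p-trunc-len-f "$f" --p-trunc-len-r "$r" \
    #   --o-table "table-$f-$r.qza" \
    #   --o-representative-sequences "rep-seqs-$f-$r.qza" \
    #   --o-denoising-stats "stats-$f-$r.qza"
  done
done
```

With 5 forward and 5 reverse values this enumerates 25 runs, after which each stats-*.qza would be exported and the retention rates merged into one table.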
retained_summary.tsv (940 Bytes)
Here is the final tsv file from my last set of samples. Nothing fancy, no graph, but it's clearly visible. This is just for the first sample from the whole set.
Some notes:
going by increments of 5 can be used initially to pinpoint your overall sweet spot; after that (if you want perfection) you can narrow down the trunc parameters by 1.
the "best retention rates" trunc parameters may slightly vary for each sample, but once you find the sweet spot, you know you did a good job.
In my experience, because my original dataset still contained primers, after the trimming my command line was as follows (for example; I've dealt with many types of functional genes, and this is one of them):