Help/advice on truncation parameters

ryanbb · July 17, 2025, 9:46pm

Poor/inconsistent taxonomy classification after denoising

I'm working with 16S data from human microbiome samples. The samples arrived demultiplexed and I've removed the supplied illumina primers using cutadapt. I've run demux summarise on the trimmed data to get the quality plots and make my decisions about truncation.

The quality plots are:

The size distributions are:

To me the data looks pretty good quality and if we use 20 as a dip/drop in Phred quality combined with the size distributions I chose truncation values of (forward) 270 and (reverse) 265. I ran a final denoising command as:

qiime dada2 denoise-paired 
                --p-n-threads 10
                --i-demultiplexed-seqs trim-demux.qza
                --p-trunc-q 2
                --p-trim-left-f 0
                --p-trunc-len-f 270
                --p-trim-left-r 0
                --p-trunc-len-r 265
                --p-max-ee-r 2
                --p-max-ee-f 2
                --p-min-fold-parent-over-abundance 1
                --o-representative-sequences asv-seqs.qza 
                --o-table asv-table.qza 
                --o-denoising-stats stats.qza

This is the resulting stats file:
stats.qza (20.7 KB)

Its a little confusing what the truncation parameter is doing as the documentation says

Position at which reverse read sequences should be truncated due to decrease in quality. This truncates the 3' end of the of the input sequences, which will be the bases that were sequenced in the last cycles. Reads that are shorter than this value will be discarded. After this parameter is applied there must still be at least a 12 nucleotide overlap between the forward and reverse reads. If 0 is provided, no truncation or length filtering will be performed

Suggesting that any reads shorter than the truncation parameter will be discarded. But this post and this post seem to suggest the documentation is incorrect and that reads shorter than this length are not discarded. Some clarification here would be much appreciated.

I then used 2 different pre-computed classifiers (GTDB and GTDB human stool weighted) for taxonomic assignment. This is where it gets confusing. For these parameters both the GTDB and GTDB-human showed no classification past the domain level. But if I do no truncation (set to 0) or truncate as above but increase the max-ee parameter to 5 and the min-fold-parent-over-abundance parameter to 5 I see hundreds of annotations at genus/species level when using the GTDB-human classifier (but no annotations past the domain level using the GTDB classifier).

So my questions are:

Do these data need truncation? If so, what would be likely parameters to use?
I understand that tweeking parameters (no truncation, increasing max-ee etc) can lead to better taxonomic assignment but why would we see the different effects between the GTDB and GTDB-human classifiers?

I know how important this step is to all downstream analysis so I would really like ti get this done as well as possible and would really appreciate any help/advice. Thanks!

tripitakit · July 18, 2025, 11:01am

ryanbb:

Its a little confusing what the truncation parameter is doing as the documentation says

Position at which reverse read sequences should be truncated due to decrease in quality. This truncates the 3' end of the of the input sequences, which will be the bases that were sequenced in the last cycles. Reads that are shorter than this value will be discarded. After this parameter is applied there must still be at least a 12 nucleotide overlap between the forward and reverse reads. If 0 is provided, no truncation or length filtering will be performed

Suggesting that any reads shorter than the truncation parameter will be discarded. But this post and this post seem to suggest the documentation is incorrect and that reads shorter than this length are not discarded. Some clarification here would be much appreciated.

I second you with the same confusion about it. Thanks for giving it words.

colinbrislawn · July 21, 2025, 5:17pm

Good morning!

There is one exception:

I'm guessing this is a safety feature from DADA2, which we preserve in the Qiime2 plugin.

Surely, no one wants to remove 100% of their reads!

Perhaps it would be better to throw a warning and reprimand users who discard all their reads with a long trunc-len value. What do you think?

colinbrislawn · July 21, 2025, 5:28pm

Hello Ryan!

Welcome to the forums!

Let me return to your first question:

Looks like you found Liz's posts on truncation. I approach this decision a little differently, so perhaps you would find my post on DADA2 trimming, truncation, and merging as a helpful comparison.

I always truncate data when running dada2!

When working with a new batch of samples or new cohort, I often run DADA2 multiple times to see what settings cause the most reads to join. Like, half a dozen times!

This is called 'parameter sweeping' or 'guess and check'!

You can decide what to run next!

what I recommend:

I would try shorter values for --p-trunc-len-ras R2 looks pretty rough.

220 might be two short but it would cause a lot more reads to pass filter without resorting to raising max-ee parameter to 5. Depends on how this goes, either trunc shorter to remove more of the low quality R2, or trunc longer to get more overlap and thus more merges.

ryanbb · July 22, 2025, 9:56am

Thank you @colinbrislawn. It's not about discarding all reads in a dataset but what happens to say a single read, which by primer removal, is now shorter than the --p-trunc-len but still has good quality. Do we know whether that read is removed or kept?

If possible I think throwing warnings about poor paramater selection is always useful.

colinvwood · July 22, 2025, 5:48pm

Hello @ryanbb,

Such a single read is removed, according to the dada2 docs. If you believe the dada2 docs to be incorrect then the dada2 GitHub issue page is the best place to raise such a concern. For external packages such as this one, when we "wrap" them in qiime2 we generally copy the parameter descriptions over.

ryanbb · July 22, 2025, 8:05pm

@colinbrislawn thank you for the insights and for the introduction to the term 'parameter sweeping' I had been calling it 'parameter tuning'! I had been running the denoising step multiple times and assessing the impacts on not only reads that survive the denoising step and merge but their effects on taxonomic classification using both the GTDB and GTDB human stool weighted classifiers.

And by modifying the truncation parameters I can identifying multiple sets of parameters that result in many reads passing the filtering and merging. However, what I can then see is with certain parameters the GTDB classifier cannot assign ASVs to the taxonomy below the domain level but will assign taxonomy to the species level with the human weighted GTDB classifier. If I modify the parameters, as per your suggestion, I can again see many reads passing the denoising step but the pattern of classification by the two classifiers is reversed!

So now I'm wondering is it suitable to use different truncation parameters for different classifiers? Is the instability of taxonomic assignments between the denoising parameters I'm using indicating that something is wrong with the data or what I'm doing?

colinbrislawn · July 23, 2025, 4:56am

Hello Ryan,

Thank you for your detailed methods section. I'm not sure what's going on here, but I've got a few things to try next.

This make's sense to me:

And then parameters that lead to the most reads merging leads to the best taxonomy!

I'm surprised too!

Maybe fewer ASVs are making it to the taxonomy table but a greater percentage of them can be classified?

Maybe an LCA classifier like feature-classifier classify-consensus-vsearch would work better here?

Deyan_Donchev · July 23, 2025, 4:20pm

I've always wondered if there are others like me who run multiple iterations with gradiently increasing/decreasing trimming parameters.
Basically, I do a bash loop to combine all possible trimming parameters by an increment of 5. So something like forward reads trimmed (250, 255, 260, and so on) each combined with reverse reads trimmed (180, 185, 190, and so on, usually much shorter).
Then my loop exports the content of the stats.qza file and grabs a few samples from the table and merges them into one final table output to compare retention rates. There are usually two very clear thresholds: the lower one, where the retention drops to nearly 0%, due to overtrimming and not enough overlap (the 12bp overlap minimum from dada2) and the sweet spot, where nearly 50-70% of merged reads are retained (exact percentage depends on the sequencing quality of the run), after which a slow, gradual reduction of the retention rate begins, because longer fragments=lower avg. quality. For example, 65% (highest) at 260-180 (420bp insert + 12bp overlap), 64.1% at 265-175, or 63.9% at 265-180, and 0.5% at 255-175 (too short, no overlap, dada2 throws those out).

colinbrislawn · July 23, 2025, 6:21pm

Yes!

Do you have a graph of that to show people?

That's pretty much the pattern I see too, but I don't think we have a picture of it on the forums!

Deyan_Donchev · July 24, 2025, 12:31pm

retained_summary.tsv (940 Bytes)
Here is the final tsv file from my last set of samples. Nothing fancy, no graph, but its clearly visible. This is just for the first sample from the whole set.
Some notes:

going by increment of 5 could be used initially to pin-point your overall sweet spot, after that (if you want perfection) you can narrow down the trunc parameters by 1.
the "best retention rates" trunc parameters may slightly vary for each sample, but once you find the sweet spot, you know you did a good job.

yangyue · July 24, 2025, 6:25pm

According to my experience, becauce I've had original dataset with primers, so after the triming, my command line was (for example, I've dealed with many types of funcional genes, this is one of them):

time qiime dada2 denoise-paired \
   --i-demultiplexed-seqs trimmed-seqs.qza \
   --o-table table-dada2.qza \
   --o-representative-sequences rep-seqs-dada2.qza \
   --p-trim-left-f 0 \
   --p-trim-left-r 0 \
   --p-trunc-len-f 264 \
   --p-trunc-len-r 264 \
   --o-denoising-stats denoising-stats-dada2.qza
--verbose

So I've found no overlap, so I think I won't worry about other things, just take care of the whole seqs-qualities.

colinbrislawn · July 24, 2025, 6:31pm

I made the graph! (code here)

colinbrislawn · July 24, 2025, 6:47pm

There is a lot going on in this discussion! I would like to return to this original question:

I treat these two questions as separate.

What denoising settings maximize the % joining or % retention
What feature-classifier settings and databases maximize taxonomy assignment

I would optimize dada2 first, like this

Once that's finalized I would move on to taxonomy.
Otherwise, I could go in circles forever!

system · August 25, 2025, 12:47am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.