Duplicates causing issues, how to handle merging and clustering?


I am running qiime2 version 2024.2-amplicon.

I have single-end sequencing data of the ITS region. The data is from 11 separate runs, meaning I have 11 files of multiplexed sequences. There are replicates, because some samples performed poorly and were rerun in a later sequencing run.

The runs were imported, demultiplexed, and then run through DADA2 denoising separately. At this point I attempted to merge the separate table and rep-seq artifacts into one for the rest of the pipeline. The replicates understandably caused errors in the merge. I wrote a Python script to find duplicates (sample IDs present in multiple runs) and compare their read counts, keeping the sample with the highest read count. It generated per-run lists of samples to keep, which I used to filter the table artifacts. After filtering, the merge was successful.
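The duplicate selection described above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the function name `pick_best_runs`, the input layout (a dict of per-run read counts), and the example run and sample names are all assumptions made for the example.

```python
from collections import defaultdict


def pick_best_runs(read_counts):
    """Choose, for every sample ID, the run where it has the most reads.

    read_counts: {run_name: {sample_id: total_reads}} (hypothetical layout).
    Returns {run_name: sorted list of sample IDs to keep from that run}.
    On a tie, the run encountered first is kept.
    """
    best = {}  # sample_id -> (reads, run_name)
    for run, samples in read_counts.items():
        for sample_id, reads in samples.items():
            if sample_id not in best or reads > best[sample_id][0]:
                best[sample_id] = (reads, run)

    keep = defaultdict(list)
    for sample_id, (_, run) in best.items():
        keep[run].append(sample_id)
    return {run: sorted(ids) for run, ids in keep.items()}


# Made-up counts: S2 was poor in run01 and rerun in run02.
counts = {
    "run01": {"S1": 1200, "S2": 50},
    "run02": {"S2": 8000, "S3": 4300},
}
keep_lists = pick_best_runs(counts)
print(keep_lists)
```

Each per-run list could then be written out as a one-column sample metadata file and used to filter that run's table.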

The problem:
As I'm working with ITS sequences, I have understood that they benefit greatly from a clustering step before taxonomic classification. As DADA2 outputs ASVs, I attempted to run a de novo clustering step with vsearch, but it failed with an error. Features were missing from the table, and vsearch requires the features in the sequences and the table to be identical.

I think the filtering of the tables prior to the merge is the cause. ASVs found only in the dropped samples would, in my mind, explain why features present in the rep-seq artifact are not found in the merged table artifact. If this is the case, can I run the clustering per run, before filtering by sample ID for a successful merge? Or is there something I am missing?
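To make the suspected mismatch concrete, here is a small sketch using plain dictionaries in place of the real artifacts. The function name `orphaned_features`, the data layout, and the example IDs are hypothetical; the sketch only shows how dropping samples can leave sequences with no matching feature in the table.

```python
def orphaned_features(table, rep_seqs, dropped_samples):
    """List feature IDs that have a sequence but no counts left in the table.

    table: {feature_id: {sample_id: count}} (hypothetical layout).
    rep_seqs: {feature_id: sequence}.
    dropped_samples: set of sample IDs removed before the merge.
    """
    remaining = set()
    for feature_id, per_sample in table.items():
        # A feature survives only if a kept sample still has reads for it.
        if any(
            count > 0
            for sample_id, count in per_sample.items()
            if sample_id not in dropped_samples
        ):
            remaining.add(feature_id)
    return sorted(set(rep_seqs) - remaining)


# Made-up example: asv2 occurs only in the dropped poor replicate.
table = {
    "asv1": {"S1": 10, "S2_rerun": 3},
    "asv2": {"S2_poor": 7},
}
rep_seqs = {"asv1": "ACGT", "asv2": "TTGA"}
orphans = orphaned_features(table, rep_seqs, {"S2_poor"})
print(orphans)
```

If this is what is happening, subsetting the sequences to the features still present in the merged table should bring the two back in sync. QIIME 2 has `qiime feature-table filter-seqs`, which accepts a table for this purpose, though whether it resolves this particular error has not been tested here.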

Another potential solution I thought of: is it possible to import, demux, and export the sequences, manually drop the bad duplicates, and reimport the demuxed sequences into QIIME 2 to pick up from DADA2 denoising? My only worry is whether this would hide any batch effects, as DADA2 denoising would then be run on the combined sequences from all 11 sequencing runs rather than on each run individually, potentially affecting error rate learning.

Any assistance or guidance on how to proceed is welcome! I have previously worked only with 16S data, and this ITS project has proven to be less straightforward.

Thank you,

Hello Ravio,

Welcome to the forums! :qiime2:

This is a great question! We just spoke about ASVs vs OTUs for ITS amplicons in this thread:

This context may be helpful, as the consensus in this community is to do as much as possible at the ASV level. You will have to decide what setup works best for your data.

Good news: taxonomy classification works equally well on any sequences (OTUs or ASVs). The benefit of converting your ASVs into something else is extremely dubious from my perspective.

Correct. DADA2 was designed to be run once on each sequencing run.


Hey Colin,

Thank you for the link to the discussion. Very insightful. I will arm myself with the key points and take the discussion to my supervisors. If taxonomy classification is equally effective for ITS ASVs, I am confident that forgoing the OTU clustering step is a well-informed choice and will allow me to continue the analysis with ASVs.
