Hey,
I am running qiime2 version 2024.2-amplicon.
Background:
I have single-end sequencing data of the ITS region. The data is from 11 separate runs, meaning I have 11 files of multiplexed sequences. There are replicates, because some samples performed poorly and were rerun in a later sequencing run.
The runs have been imported, demultiplexed and then run through DADA2 denoising separately. At this point I attempted to merge the separate tables and rep-seq artifacts into one for the rest of the pipeline. The replicates understandably caused errors in the merge. I wrote a Python script to seek out duplicates (sample IDs found in multiple runs) and compare their read counts, keeping the sample with the highest read count. It generated lists of samples to keep, per run, which I used to filter the table artifacts. After filtering, the merge was successful.
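For reference, the duplicate-resolution logic is roughly the following. This is a simplified sketch rather than my actual script: the function name `pick_best_runs`, the run names, and the read counts are made up for illustration, and it assumes the per-sample read totals have already been pulled out of each run's table into plain dictionaries.

```python
from collections import defaultdict


def pick_best_runs(read_counts):
    """Decide, per run, which sample IDs to keep.

    read_counts: {run_name: {sample_id: total_reads}} (hypothetical layout).
    For a sample ID seen in several runs, only the run with the highest
    read count keeps it. Ties go to the run that sorts first by name.
    """
    best = {}  # sample_id -> (reads, run_name)
    for run in sorted(read_counts):
        for sample_id, reads in read_counts[run].items():
            if sample_id not in best or reads > best[sample_id][0]:
                best[sample_id] = (reads, run)

    keep = defaultdict(list)
    for sample_id, (_, run) in best.items():
        keep[run].append(sample_id)
    return {run: sorted(ids) for run, ids in keep.items()}


# Toy data: S2 was sequenced poorly in run01 and rerun in run02.
counts = {
    "run01": {"S1": 1200, "S2": 40},
    "run02": {"S2": 5300, "S3": 900},
}
keep_lists = pick_best_runs(counts)
```

Each per-run list is then written out and used as the set of sample IDs to keep when filtering that run's table. A run that loses every one of its samples would simply not appear in the result.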
The problem:
As I'm working with ITS sequences, I understand they benefit greatly from a clustering step before taxonomic classification. Since DADA2 outputs ASVs, I attempted to run a de novo clustering step with vsearch, but it failed with an error. Features were missing from the table, and vsearch requires the features in the sequences and the table to be identical.
I think the filtering of the tables prior to the merge is the cause. ASVs found only in the dropped samples would, in my mind, explain why features present in the rep-seq artifact cannot be found in the merged table artifact. If this is the case, can I run the clustering per run, before doing the ID-based filtering needed for a successful merge? Or is there something I am missing?
Another potential solution I thought of: is it possible to import, demux and export the sequences, manually drop the bad duplicates, and reimport the demuxed sequences into QIIME 2 to pick up from DADA2 denoising? My only worry here is whether this would hide any possible batch effects, as DADA2 denoising would then be run on the combined sequences across 11 sequencing runs and not on each sequencing run individually, potentially affecting error rate learning.
Any assistance or guidance on how to proceed is welcome! I have previously worked only with 16S data, and this ITS project has proven to be less straightforward.
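To make the suspected mismatch concrete, here is a toy sketch of what I believe is happening. It is not run against my real artifacts: the function name `orphaned_features`, the feature IDs, and the counts are invented, and it assumes the table and the sequences have been reduced to plain dictionaries keyed by feature ID.

```python
def orphaned_features(table_counts, rep_seqs):
    """Compare feature IDs between a table and its representative sequences.

    table_counts: {feature_id: {sample_id: count}} (hypothetical layout).
    rep_seqs:     {feature_id: sequence}.
    Returns (IDs with a sequence but no remaining counts in the table,
             IDs with counts in the table but no sequence).
    """
    in_table = {
        fid for fid, samples in table_counts.items()
        if sum(samples.values()) > 0
    }
    in_seqs = set(rep_seqs)
    return sorted(in_seqs - in_table), sorted(in_table - in_seqs)


# Toy data: asv3 was only ever observed in a sample that got dropped,
# so it is gone from the filtered table but still present in rep-seqs.
filtered_table = {
    "asv1": {"S1": 10, "S3": 4},
    "asv2": {"S3": 7},
}
rep_seqs = {"asv1": "ACGT", "asv2": "ACGA", "asv3": "TTGA"}

only_in_seqs, only_in_table = orphaned_features(filtered_table, rep_seqs)
```

If the same pattern shows up in the real data (IDs present in the rep-seq artifact only, none present in the table only), that would support the idea that the sample filtering left the sequences out of sync with the table.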
Thank you,
R.