Dereplicate sequences after DADA2

ChrisKeefe · June 3, 2021, 11:12pm

Hi @JadeS!
This is a little out of my league, but let's see what we can figure out together! First, would you mind opening separate topics for your SRA and literature questions? (SRA can go in User Support, lit should go in General Discussion). I'd like to keep things from getting too complicated here.

Am I understanding correctly that:

you ran DADA2 on your data, exported the representative sequences, truncated them with cutadapt, and then imported them back into QIIME 2
that imported file might now contain duplicates, because you cut off the v5 region in those sequences, and it's possible the v4 regions of some of them were identical
you want to remove any duplicated truncated sequences, so that you once again have only one representative sequence per ASV?

Before we go down that scary route , have you experimented with truncating the raw fastq/fasta files in cutadapt, importing them, and then using Deblur to denoise after truncating? Its static error model might play nicer with the data you've manipulated than DADA2's error model does, and if it works then

If that doesn't pan out, two other things you could try are:

vsearch, quality-filter, and OTU clustering might be more permissive with your use case, if generally not as awesome as denoising to ASV. I have zero experience here, but if you're running into dead ends with denoising, it may be an option.
You might be able to get away with the following hacky garbage (which does not mean you should! ). I suspect it will create more problems than it solves.

Here be
If you import your truncated rep-seqs as SampleData, q2-vsearch dereplicate-sequences could give you a dereplicated FeatureData[Sequence], with new Feature IDs based on the truncated sequences. I think this is what you're asking for above. It would also give you a table you'd have to throw out, which raises the problem with this approach - if the feature IDs in your rep-seqs (which you'll use to build a taxonomy) don't match the feature IDs in your table (which are probably hashes of the untruncated sequences), you're liable to get an error like this if you try to build a taxa barplot. I'm not sure what other methods for downstream analysis rely on both your rep-seqs and your feature table, but similar errors may pop up with any of them. Cleaning up your rep seqs won't clean up your feature table, which is why I'd recommend experimenting with other approaches first.