Dereplicate sequences after DADA2

JadeS · May 31, 2021, 6:27pm

Hello! I am attempting to analyze SRA data along with my own. I have a total of 7 16S rRNA sequencing runs, some of which have V4 amplicons and the rest have V4-V5. So, I am trying to cut off the V5 region at the 806r V4 primer site. When I used cutadapt on individual reads to remove the V5 region, a huge proportion of my reads were being filtered out by DADA2 when compared to DADA2 runs on the uncut reads. Since the reverse reads tapered off in quality quite quickly, I decided to proceed with DADA2 normally, then export the whole amplicon and use cutadapt in the command line, then import that back into QIIME2. This seemed to work much better, however, I now need to dereplicate my data but the representative sequences have been named using the old sequence names. How can I dereplicate the representative sequences after DADA2?

The second related issue I am having is that some of the SRA runs appear to be completely reverse complemented compared to my data. How can I reverse complement all of the representative sequences? Once I do that, they will still be named by their previous sequence so I have the same dereplication issue as above. Is there a way to combine the representative sequences and the table in order to export them as SampleData[Sequences] so that I can dereplicate them? Would that even matter since the dereplication is done by name?

Lastly, do you know of any literature or examples where people analyzed different hypervariable regions? I have been having trouble finding anything aside from huge meta-analyses on human microbiomes, which is not what I am going for.

Thank you so much. I really appreciate the support on this forum!

ChrisKeefe · June 3, 2021, 11:12pm

Hi @JadeS!
This is a little out of my league, but let's see what we can figure out together! First, would you mind opening separate topics for your SRA and literature questions? (SRA can go in User Support, lit should go in General Discussion). I'd like to keep things from getting too complicated here.

Am I understanding correctly that:

- you ran DADA2 on your data, exported the representative sequences, truncated them with cutadapt, and then imported them back into QIIME 2
- that imported file might now contain duplicates, because you cut off the V5 region in those sequences, and it's possible the V4 regions of some of them were identical
- you want to remove any duplicated truncated sequences, so that you once again have only one representative sequence per ASV?

Before we go down that scary route, have you experimented with truncating the raw fastq/fasta files in cutadapt, importing them, and then using Deblur to denoise after truncating? Its static error model might play nicer with the data you've manipulated than DADA2's error model does.

If that doesn't pan out, two other things you could try are:

- vsearch, quality-filter, and OTU clustering might be more permissive with your use case, if generally not as awesome as denoising to ASVs. I have zero experience here, but if you're running into dead ends with denoising, it may be an option.
- You might be able to get away with the following hacky garbage (which does not mean you should!). I suspect it will create more problems than it solves.

Here be dragons:
If you import your truncated rep-seqs as SampleData, q2-vsearch dereplicate-sequences could give you a dereplicated FeatureData[Sequence], with new Feature IDs based on the truncated sequences. I think this is what you're asking for above. It would also give you a table you'd have to throw out, which raises the problem with this approach - if the feature IDs in your rep-seqs (which you'll use to build a taxonomy) don't match the feature IDs in your table (which are probably hashes of the untruncated sequences), you're liable to get an error like this if you try to build a taxa barplot. I'm not sure what other methods for downstream analysis rely on both your rep-seqs and your feature table, but similar errors may pop up with any of them. Cleaning up your rep seqs won't clean up your feature table, which is why I'd recommend experimenting with other approaches first.
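To make the ID mismatch concrete, here is a minimal sketch in plain Python (not the QIIME 2 or vsearch API) of what a consistent dereplication would have to do: rename each truncated sequence by a hash of its new sequence, and sum the table counts of every old feature that collapses onto the same truncated sequence. The function names, the dict-based table layout, and the choice of MD5 are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def md5_id(seq):
    """Hypothetical helper: feature ID = MD5 hex digest of the sequence."""
    return hashlib.md5(seq.upper().encode("ascii")).hexdigest()


def dereplicate(truncated, table):
    """Collapse identical truncated sequences and re-key the table.

    truncated: {old_feature_id: truncated_sequence}
    table:     {sample_id: {old_feature_id: count}}
    Returns (new_seqs, new_table), both keyed by the new hash IDs.
    """
    # Map every old ID to the hash of its truncated sequence.
    old_to_new = {old: md5_id(seq) for old, seq in truncated.items()}
    # Identical truncated sequences share a hash, so duplicates collapse here.
    new_seqs = {old_to_new[old]: seq.upper() for old, seq in truncated.items()}

    new_table = {}
    for sample, counts in table.items():
        merged = defaultdict(int)
        for old, n in counts.items():
            # Features that were dropped during truncation are skipped.
            if old in old_to_new:
                merged[old_to_new[old]] += n
        new_table[sample] = dict(merged)
    return new_seqs, new_table


if __name__ == "__main__":
    seqs = {"asv1": "ACGT", "asv2": "ACGT", "asv3": "GGGG"}
    counts = {"sampleA": {"asv1": 3, "asv2": 2, "asv3": 1}}
    new_seqs, new_table = dereplicate(seqs, counts)
    print(len(new_seqs), new_table)
```

The point of the sketch is that the sequences and the table have to be re-keyed together; dereplicating only the rep-seqs leaves the table pointing at IDs that no longer exist.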

JadeS · June 4, 2021, 5:06pm

Thank you so much for your reply! Yes, I will definitely open up a different topic for the literature. I think you are right that it might be less messy if I trim/reverse complement the sequences before importing. I will try that out instead, although I am pretty unfamiliar with manipulating .fastq files. Do you know if there is any way to reverse complement .fastq files?

OTU clustering could be a good option as well! Does the OTU clustering go off of the sequence names (which would be inaccurate after trimming)? Would it be better to do OTU clustering on each run or after merging all of the OTUs? In this case, I would just skip DADA2, right?

Yeah, I know that the feature table wouldn't match. That is why I was thinking of somehow exporting as a fasta and then re-importing so that qiime2 has to rename the sequences. I was just not sure whether I can export as a .fasta once the information is split into the rep-seqs and the table.

Thanks again!

ChrisKeefe · June 4, 2021, 11:35pm

I don't know of one, but there may be a tool out there, and someone else here might know where to look. It's worth opening a new topic for! You can probably just edit your SRA paragraph above.
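For what it's worth, reverse complementing a FASTQ record is simple enough to sketch in plain Python. This is an illustration only, assuming unwrapped four-line records and single-end or already-merged reads; the function names are made up for the example. The one FASTQ-specific detail is that the quality string has to be reversed along with the sequence so each score stays attached to its base.

```python
# Complement table covering the IUPAC nucleotide codes, upper and lower case.
_COMPLEMENT = str.maketrans(
    "ACGTURYSWKMBDHVNacgturyswkmbdhvn",
    "TGCAAYRSWMKVHDBNtgcaayrswmkvhdbn",
)


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_complement_fastq(lines):
    """Yield (header, sequence, separator, quality) for each record.

    `lines` is any iterable of text lines, such as an open file handle.
    Assumes exactly four lines per record (no wrapped sequences).
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, plus, qual = record
            # Reverse the qualities too, so they still line up with the bases.
            yield header, reverse_complement(seq), plus, qual[::-1]
            record = []


if __name__ == "__main__":
    example = ["@read1", "AACGT", "+", "IIHG#"]
    for rec in reverse_complement_fastq(example):
        print("\n".join(rec))  # sequence becomes ACGTT, quality becomes #GHII
```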

Clustering will probably work, but will get you lower sequence resolution and may not provide the same types of quality control available with contemporary denoising methods. I'd try Deblur first, but YMMV.

Clustering is sequence-based, and I think you'll find that's also true for many sequence dereplication tools. It's worth mentioning that the feature IDs you see here, and the ones produced by vsearch dereplicate-sequences, are hashes of the sequences themselves. In this context, I'm not sure whether hashes guarantee uniqueness, but they will be unique to the sequence most of the time. (Hash collisions occur ~50% of the time with a pool of 5 billion sequences using vsearch's approach, and very rarely with the number of sequences you're likely to see in a normal study's representative sequences.) That's just a long way of saying you're probably pretty safe using hash-based IDs for dereplication, though I'm not sure whether that's the approach being used here.
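As a rough sanity check on collision odds, the standard birthday approximation can be sketched in a few lines of Python. Nothing here is specific to vsearch or QIIME 2: the function name is made up, and the digest width is a parameter because which hash a given tool uses is not something this sketch asserts. Under this approximation, a 50% collision chance at about 5 billion items corresponds to a 64-bit digest; wider digests such as MD5 (128 bits) or SHA-1 (160 bits) push that point far beyond any realistic number of sequences.

```python
import math


def collision_probability(n_items, hash_bits):
    """Birthday approximation for at least one collision among n_items.

    p is approximately 1 - exp(-n * (n - 1) / 2**(hash_bits + 1)),
    assuming the hash output is uniformly distributed.
    """
    exponent = -n_items * (n_items - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


if __name__ == "__main__":
    for bits in (64, 128, 160):
        for n in (100_000, 5_000_000_000):
            p = collision_probability(n, bits)
            print(f"{bits}-bit hash, {n} sequences: p = {p:.3g}")
```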

I haven't done any work with clustering myself, but there's a tutorial and probably a bunch of topics on this forum if you search around.

yep

The original sequence data is not preserved in FeatureTable artifacts, and they export as biom tables. You could probably construct a new fasta from a biom table, but that would require some pretty gross hacking. Considering these artifacts are both produced from fastq/fasta files, I suspect it will be easiest to do your trimming first. Again, no experience with this specific workflow, but this approach feels much more straightforward.
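To show roughly what "trimming first" means at the read level, here is a simplified Python sketch that cuts a read (and its quality string) just before the first match to a primer containing IUPAC degenerate bases. It is an illustration, not a substitute for cutadapt: it only finds exact matches, with no tolerance for mismatches or indels, and the function names and the toy primer in the example are hypothetical. For a real run you would pass the primer as it appears in the read's orientation.

```python
import re

# IUPAC degenerate bases expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a primer with degenerate bases into a regex."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def truncate_at_primer(seq, qual, primer):
    """Return (seq, qual) cut just before the first primer match.

    Returns None when the primer is not found, so the caller can decide
    whether to discard or keep untrimmed reads.
    """
    match = primer_regex(primer).search(seq.upper())
    if match is None:
        return None
    cut = match.start()
    return seq[:cut], qual[:cut]


if __name__ == "__main__":
    # "GGWC" is a toy primer used only for this example.
    print(truncate_at_primer("AAAAGGTCTTTT", "IIIIHHHHGGGG", "GGWC"))
```

Because every read is cut at the same biological position, reads from the V4 and V4-V5 runs end at the same site before denoising, which is what lets the denoiser assign consistent feature IDs across runs.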

system · July 6, 2021, 5:35am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.