dada2 or vsearch

yogurt · December 7, 2020, 9:18am

Hello, me again.

I got a lot of help from the forum, and now I have run alpha and beta diversity analyses and even a PERMANOVA test! I really appreciate the qiime developers.

All of a sudden, I had a question.
I have followed two different approaches using qiime2 to analyze my amplicon data (COI)

denoise with dada2-> “ASV table” -> diversity metrics
denoise with vsearch-> VSEARCH (97, 98, 99% thresholds) -> “OTU table” -> diversity metrics

Which approach would, in principle, be the most appropriate? I understand that the ASV concept is somewhat close to 100% OTU clustering, and that an ASV sequence is the exact sequence.
In contrast, OTU sequences are consensus sequences, so they might not be exact biological sequences. I have read a lot of papers saying that ASVs are more powerful than OTUs.

And my coworker asks: if ASVs are 100% OTUs, could I make an ASV table by
vsearch (100% threshold) -> OTU clustering -> diversity analysis?

In brief:

I think ASVs are more appropriate than OTUs. What would you recommend for making an ASV table: DADA2 or vsearch? (I chose DADA2, but as there are many options, my coworker prefers vsearch.)

Thanks again!!

llenzi · December 7, 2020, 9:52am

Hi @yogurt,

just a very quick answer!
vsearch will group your sequences into clusters, each defined as the set of sequences within a fixed number of differences from a given centroid sequence (this may be a simplified definition, but I hope you get the point; if not, please look at the vsearch documentation). vsearch will then output the centroid sequences, not consensus sequences, which is a totally different concept to me. The usual difficulty with clusters is that two sequences located at the extremities of a cluster are each within the defined difference from the centroid, but it is unclear how different these two are from each other (which may be up to twice the difference defined to set a cluster). Hope this makes sense so far.
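As a rough illustration of the point about cluster extremities, here is a toy Python sketch. It assumes equal-length sequences and counts mismatches with a made-up `hamming` helper; real vsearch computes identity from pairwise alignments, so this only mimics the idea, and the sequences are invented.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Invented toy sequences (not real COI data).
centroid = "ACGTACGTAC"
seq_a = "TCGTACGTAC"  # differs from the centroid at the first position
seq_b = "ACGTACGTAG"  # differs from the centroid at the last position

d_a = hamming(centroid, seq_a)   # distance of seq_a to the centroid
d_b = hamming(centroid, seq_b)   # distance of seq_b to the centroid
d_ab = hamming(seq_a, seq_b)     # distance between the two cluster members

print(d_a, d_b, d_ab)
```

Both members are one mismatch away from the centroid, yet they are two mismatches away from each other.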
Moreover, 100% threshold clustering is not the same as ASV creation. The main difference is that dada2 error-corrects the sequences, predicting the original amplicon sequences from which the reads were obtained.
Clustering is merely 'sorting sequences out' into groups. Sequencing errors will be reflected in many spurious clusters.
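A small Python sketch of why 100% clustering is not denoising. The reads and the two 'sequencing errors' are invented, and grouping identical strings with `Counter` is only a stand-in for 100% identity clustering, not what vsearch actually runs.

```python
from collections import Counter

# One invented "true" amplicon sequence.
true_seq = "ACGTACGTAC"

# Six error-free reads plus two reads that each carry one simulated error.
reads = [true_seq] * 6 + ["ACGTACGTAT", "ACGAACGTAC"]

# Exact-match grouping: every distinct string becomes its own cluster.
clusters = Counter(reads)

print(len(clusters))
for seq, count in clusters.most_common():
    print(seq, count)
```

One real sequence ends up as three clusters here, two of them singletons created purely by the errors. A denoiser tries to assign such reads back to the sequence they came from instead of keeping them as separate features.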
If you want to cluster, you should consider an error-correction step in any case. Some may choose to use dada2 as the denoiser and then apply vsearch to cluster the denoised sequences; an alternative may be to generate ZOTUs (zero-radius OTUs, basically denoising after clustering; see Generating OTUs and ZOTUs, currently not available in qiime2).

Being lazy, I just stick to dada2 (or deblur) only ...

Hope it helps

yogurt · December 7, 2020, 2:19pm

Thank you @llenzi ,

Oh, I misunderstood: I thought ASVs would be the same if I used vsearch and clustered with a 100% threshold, so I thought I could make an ASV table with either DADA2 or vsearch. But as you mentioned, if I use vsearch or deblur, I might not be able to produce an ASV table, right?

I am also trying to use DADA2. However, unlike vsearch, DADA2 has no --p-maxns option (--p-maxns: remove sequences with Ns), so there is no N-trimming option in DADA2.
This is the very thing my coworker was worried about.
If there are sequences with Ns in the dada2 output (rep-seq.qza), can I ignore them?

llenzi · December 7, 2020, 2:51pm

Hi @yogurt
With deblur you get a kind of ASVs, please look at:

In which Mehrod brilliantly discussed all about ASVs and clustering!

For the Ns, there are options to filter out input sequences with errors; please see '--p-max-ee-f'
and '--p-max-ee-r' in denoise-paired: Denoise and dereplicate paired-end sequences — QIIME 2 2020.8.0 documentation

Hope it helps

yogurt · December 7, 2020, 5:34pm

Thanks @llenzi ,
You are truly a lifesaver!

--p-max-ee-f NUMBER
Forward reads with number of expected errors higher than this value will be discarded.

This is the option you mentioned, and the default is 2.0.
So can you please tell me what this 2.0 means for me?

For example, if I set it to 10, would reads that have 10 Ns out of 100 bases be discarded? Does 'expected error' mean N?

Thanks again llenzi.
I hope one day I can help someone with qiime just like you.

So far, it really helped a lot!

llenzi · December 7, 2020, 6:29pm

Hi @yogurt ,
Let's see if I can help here too.
I suppose the best way to see what an error is in the context of dada2 is by reading its tutorial
https://benjjneb.github.io/dada2/tutorial.html
and:
https://www.bioconductor.org/packages/devel/bioc/vignettes/dada2/inst/doc/dada2-intro.html
I forgot that dada2 requires no Ns at all! In fact any Ns means the quality is so low that the Illumina pipeline could not call a base.

My understanding is that dada2 filters out any reads containing Ns, as well as any reads whose number of expected errors is above the threshold (I assume the options above expose the 'maxEE' setting in dada2).
In a normal situation, you most likely have Ns at the tail of the sequences, so the idea is to change the trimming parameters to exclude these Ns from the dada2 process. Then dada2 will retain any reads with an expected error count of no more than 2 (if you keep the default maxEE setting) and apply the error model to those.
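For what 'expected errors' means in numbers, here is a minimal Python sketch. It assumes the usual definition, EE = sum of 10^(-Q/10) over the Phred quality scores Q of a read; the `expected_errors` helper and the two quality profiles are made up for illustration, and this is not dada2's own code.

```python
def expected_errors(phred_scores):
    """Sum the per-base error probabilities implied by Phred quality scores."""
    return sum(10 ** (-q / 10) for q in phred_scores)


MAX_EE = 2.0  # same value as the default quoted above for --p-max-ee-f

# Hypothetical 100-base reads: mostly Q30, with a low-quality (Q10) tail.
good_read = [30] * 90 + [10] * 10   # short low-quality tail
bad_read = [30] * 70 + [10] * 30    # long low-quality tail

ee_good = expected_errors(good_read)
ee_bad = expected_errors(bad_read)

# A read is kept only if its expected errors do not exceed the threshold.
keep_good = ee_good <= MAX_EE
keep_bad = ee_bad <= MAX_EE

print(round(ee_good, 2), keep_good)
print(round(ee_bad, 2), keep_bad)
```

So the 2.0 is a limit on the summed error probabilities of a read, computed from its quality scores, and not a count of Ns. Truncating the low-quality tail lowers the expected errors of a read, which is why the trimming parameters matter.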

Hope it makes sense (last time I looked at dada2 manuals was a while ago!)
Cheers