I feel stuck choosing the best strategy for assigning taxonomy. I have quality-controlled and denoised my data with DADA2, and the next step is assigning sequences to taxa. Here are some options I have in mind:
Further clustering into OTUs at 99% identity using the closed-reference or open-reference method, then assigning taxonomy based on the clustered table. The question is whether clustering is still necessary after denoising, since the sequences are supposed to have been cleaned very well by DADA2. If we do need clustering, should we trim the reference reads to the same length (e.g., 150 bp) as the custom data before input? Any suggestions on using closed- versus open-reference clustering for human saliva samples?
No clustering, keeping each unique sequence. Then classifying the sequences with a classifier trained on a reference. Here the reference reads are usually trimmed to the same specified length according to the tutorial. Besides classification, some alignment-based methods (e.g., blast+ or vsearch) are also introduced in the plugin. Is there a comparison showing which of these assigns taxonomy more accurately?
Any suggestions on which strategy I should use for my data?
Further clustering should not be necessary. I would also discourage relying on the reference OTU taxonomy — this will be the closest match taxonomy, but not necessarily the correct one (i.e., other close hits could exist that have different taxonomic assignments).
I would recommend this approach.
That’s true but it’s not absolutely essential (in my experience it only provides a slight increase in accuracy). The full-length amplicon or 16S rRNA gene subdomain is usually sufficient. E.g., if you are using V4 reads, you can use the pre-trained V4 classifiers that we provide here without needing to train/trim your own. If you are using a different primer set or non-16S data, you should train your own classifier.
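To make this concrete, here is a sketch of both routes: classifying with a pre-trained classifier, and training your own from a reference trimmed to your primer region. All artifact filenames (rep-seqs.qza, ref-seqs.qza, ref-taxonomy.qza, etc.) are placeholders, and the 515F/806R primer sequences shown are one common V4 primer pair — substitute your own.

```shell
# Option A: use a pre-trained classifier (filename is a placeholder
# for whichever pre-trained V4 classifier you download):
qiime feature-classifier classify-sklearn \
  --i-classifier pretrained-v4-classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza

# Option B: train your own classifier on the amplified region.
# First extract the region matching your primers from the reference:
qiime feature-classifier extract-reads \
  --i-sequences ref-seqs.qza \
  --p-f-primer GTGYCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACNVGGGTWTCTAAT \
  --o-reads ref-seqs-v4.qza

# Then fit a naive Bayes classifier to the extracted reads:
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs-v4.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier-v4.qza
```

As noted above, Option B (extracting/trimming the reference) usually yields only a slight accuracy gain over a full-length classifier, so Option A is typically fine for standard 16S primer sets.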
You can check out this preprint for now. All are pretty similar with optimal parameter settings, though classify-sklearn is slightly better than the others for 16S rRNA data.
Thanks @Nicholas_Bokulich. Out of curiosity, I also want to compare the qiime2-suggested pipeline (denoising with DADA2) with the pipeline previously used in qiime1. But I am not familiar with the qiime1 pipeline, since I started with qiime2. Do you know of any work that has done this comparison?
Another thing I am confused about: there is a sequence-trimming step in the qiime2 tutorial but not in qiime1. Is there a reason for that? Or is there a way to trim sequences by quality score, as we do in qiime2, without adding another processing step?
I would strongly recommend sticking with denoising if you are just getting started in QIIME2 (as opposed to transitioning from QIIME1; many transitioning users are comfortable with OTU pipelines and/or need to process data for comparison with older results, but that does not sound like your case).
If you do want to perform OTU picking in QIIME2, several methods (mirroring those supported in qiime1) are available in the q2-vsearch plugin. These should be accompanied by rigorous sequence quality filtering (see below) and chimera checking.
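As a sketch of what that would look like, the q2-vsearch plugin exposes the OTU-picking workflows as commands like the following. Filenames are placeholders, and the de novo chimera-checking step shown is one of the chimera-filtering options available:

```shell
# Open-reference OTU picking at 99% identity on denoised/filtered
# features (rep-seqs.qza + table.qza) against a reference:
qiime vsearch cluster-features-open-reference \
  --i-sequences rep-seqs.qza \
  --i-table table.qza \
  --i-reference-sequences ref-seqs.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table table-or-99.qza \
  --o-clustered-sequences rep-seqs-or-99.qza \
  --o-new-reference-sequences new-ref-seqs-or-99.qza

# De novo chimera checking on the clustered features:
qiime vsearch uchime-denovo \
  --i-sequences rep-seqs-or-99.qza \
  --i-table table-or-99.qza \
  --o-chimeras chimeras.qza \
  --o-nonchimeras nonchimeras.qza \
  --o-stats uchime-stats.qza
```

A closed-reference variant (`cluster-features-closed-reference`) takes the same inputs but outputs unmatched sequences instead of new reference sequences.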
Sequences are usually trimmed in qiime1 during demultiplexing. DADA2 truncates sequences to a set length, but that length is set manually by the user. qiime1 actually does more of what you are describing: trimming each sequence programmatically wherever its quality drops below a defined level (hence sequences are of different lengths post-filtering).
qiime1-style quality filtering can be performed with the q2-quality-filter plugin. This step is not necessary if you are using DADA2, but it is needed if you are using deblur or OTU picking. If passing these inputs to OTU picking in particular, see here for recommended parameter settings for q2-quality-filter (the defaults are minimal and are better suited to following up with q2-deblur for additional quality filtering).
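For reference, a minimal invocation looks like this (filenames are placeholders, and the min-quality value is illustrative — the plugin's default is lower, which is why stricter settings are recommended before OTU picking):

```shell
# qiime1-style quality-score filtering of demultiplexed reads;
# reads are truncated where quality drops below the threshold:
qiime quality-filter q-score \
  --i-demux demux.qza \
  --p-min-quality 20 \
  --o-filtered-sequences demux-filtered.qza \
  --o-filter-stats filter-stats.qza
```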