I demultiplexed my sequences and then quality-filtered them using DADA2. I made the feature table and feature summaries as well. Now I have reached the point where I need to align them to a reference database. I have heard that SILVA works much better than Greengenes for 16S data. However, I am not sure where to start, since the QIIME 2 tutorial is based on open-reference and closed-reference seems to be very complicated. Can someone help me with that? Also, QIIME 1 had the option of doing both closed-reference and open-reference at the same time, so that if some reads did not align to the reference database, open-reference alignment would be run for them. Is that option available in QIIME 2?
I looked at the tutorial and links you sent. I want to use the open-reference option, but I imagine I first need to train the reference database (SILVA 99%). I downloaded the latest QIIME-compatible version from here. I unzipped the file and am looking at the taxonomy files and the sequences. There are many different ones. I imagine that for the sequences I need the rep-set-aligned 99% file, the one called 99_alignment.fna.zip. For the taxonomy file, should I use the one from the 16S folder called consensus_taxonomy_all_levels? There are raw-taxonomy and other files too.
To see the expected inputs and their descriptions. No taxonomy required.
You can also qiime tools export any of the example files in the tutorial to see what they look like on the inside, e.g., to see whether a file is an aligned or unaligned FASTA file. (You can also use qiime tools peek to see the semantic type, but that only helps if you know that there are aligned vs. unaligned types. See a short list here.)
That tutorial describes how to train a naive Bayes classifier for taxonomy classification. That is a distinct process from OTU clustering — the overview tutorial that I linked to above puts the overall process in perspective. Furthermore, training is only done for that taxonomy classifier — other taxonomy classification methods (e.g., classify-consensus-vsearch) do not have a training step.
The OTU clustering tutorial that I linked to describes everything you need to perform OTU clustering.
I think I confused clustering with alignment. I did the OTU clustering using DADA2, so I am done with that step. I am now at the step of making the tree, so I was getting ahead of myself. Sorry about that. QIIME 2 has so many more options than QIIME 1, and it will take a while to get used to the whole thing. Sorry for bothering you so much, but I will get the concept at some point.
Well, I am also a bit confused now. If I understand correctly, OTU clustering groups the sequences into OTUs based on identity, and taxonomy classification then labels the grouped OTUs, for example at the species level.
After we get the rep-table and rep-seq data using DADA2, should I perform OTU clustering at 0.97 (de novo, closed-reference, or open-reference) to get the grouped OTUs? After that, should I use the new rep-seq and rep-table for the alpha and beta diversity analysis and the taxonomic analysis? (That is like what is done in the “Moving Pictures” tutorial, except that the tutorial didn’t perform OTU clustering.)
So what is happening here is that when you filter using DADA2 and then get the feature table, you can make the tree using mafft, which is actually OTU clustering. Open-reference, closed-reference, and de novo are other approaches for doing the same thing. What we are doing at this stage is linking different OTUs together rather than aligning them to a reference database. We do this because we want to perform alpha and beta diversity analysis, and some alpha and beta diversity metrics (PD whole tree, for example) take the phylogenetic distance between the OTUs into consideration. This is different from taxonomy analysis, where we are actually aligning the reads to a reference database of our choice (Greengenes, SILVA, etc.) to find out what taxa we have in our sample.
Yes, these are separate processes. Alignment is used for many distinct things!
OTU clustering is a form of alignment, but specifically for dereplicating similar sequences (and in the process theoretically removing noisy sequences by clustering them into the centroid).
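To make the centroid idea concrete, here is a toy Python sketch of greedy centroid clustering. It is only an illustration under simplifying assumptions: the `identity` and `greedy_cluster` helpers and the example reads are made up for this post, identity is computed position by position on equal-length sequences rather than from a real pairwise alignment, and this is not how vsearch or QIIME 2 implement clustering.

```python
def identity(a, b):
    """Fraction of matching positions (toy version: equal-length reads only)."""
    if len(a) != len(b):
        raise ValueError("toy identity() expects equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_cluster(seq_counts, threshold=0.97):
    """Assign each sequence to the first centroid it matches at >= threshold.

    seq_counts maps sequence -> abundance. Sequences are visited from most
    to least abundant, so abundant sequences become centroids first.
    Returns a dict mapping centroid -> summed abundance of its cluster.
    """
    clusters = {}
    ordered = sorted(seq_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for seq, count in ordered:
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid] += count  # absorb into existing centroid
                break
        else:
            clusters[seq] = count  # no match: this sequence is a new centroid
    return clusters


# Hypothetical 10-nt reads; real amplicons are far longer.
reads = {
    "ACGTACGTAC": 50,  # abundant sequence, becomes a centroid
    "ACGTACGTAT": 3,   # one mismatch (90% identity), plausibly a noisy copy
    "TTGGCCAATT": 20,  # unrelated sequence, becomes a second centroid
}
otus = greedy_cluster(reads, threshold=0.9)
print(otus)
```

With the 0.9 threshold used here, the low-abundance one-mismatch read is absorbed into the abundant centroid, leaving two OTUs with counts 53 and 20. That absorption is the "rough denoising" side effect of clustering.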
Multiple sequence alignment is performed for the purposes of building a phylogeny.
Taxonomy classification can be based on alignment (e.g., see classify-consensus-vsearch) but the taxonomy step you mention (training a classifier) uses a naive Bayes classifier trained on kmer frequency information, not a full sequence alignment.
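As a rough illustration of what "kmer frequency information" means, the sketch below counts overlapping k-mers in a sequence. The `kmer_counts` helper, the value of k, and the example sequence are assumptions chosen for readability; this is not the feature extraction code used by the classifier itself.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in seq; no alignment is involved."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical short sequence; k=3 is only to keep the output readable.
profile = kmer_counts("ACGTACGT", k=3)
print(profile)
```

A classifier trained on profiles like this compares counts of short words between a query and the reference sequences, which is why it needs a training step but no full sequence alignment.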
dada2 is a denoising method, not an OTU clustering method. Read this for more details.
Yes! You have the right idea.
No. dada2 performs dereplication. Further clustering is not necessary, though I have heard from some users who do this.
Correct — because it is not necessary after denoising. See the paper linked to above. Denoising methods remove the noise so that OTU clustering needn’t be used for rough denoising.