Aligning my sequences to SILVA


I demultiplexed my sequences and then quality filtered them using DADA2. I made the feature table and feature summaries as well. Now I have got to the point where I need to align them to the reference database. I have heard that SILVA works much better than Greengenes for 16S data. However, I am not sure where to start, since the QIIME 2 tutorial is based on open-reference, and closed-reference seems to be very complicated. Can someone help me with that? Also, QIIME 1 had the option of doing both closed-reference and open-reference at the same time, so that if some reads did not align to the reference database, open-reference alignment would be run for them. Is that option available for QIIME 2?


Hi @Negin,
I recommend reading this tutorial to give you a better idea of the many different analysis options in QIIME 2. There is no one way to do anything in QIIME 2.

I am not sure where you got that idea — there are many tutorials for QIIME 2, most of which use denoising rather than clustering methods.

See this tutorial — you are looking for open-reference. (what you were calling open-reference sounds more like de novo — open-ref is closed-ref followed by de novo).

I hope that helps!

Oh yes, I was talking about de novo. I will look at the link. Thanks!

I looked at the tutorial and links you sent. I want to use the open-reference option, but I imagine I first need to train the reference database (SILVA 99%). I downloaded the latest QIIME-compatible version from here. I unzipped the file and am looking at the taxonomy files and the sequences. There are many different ones. I imagine for the sequences I need the rep-set-aligned 99% file. For the taxonomy file, should I use the one from the 16S folder called consensus_taxonomy_all_levels? There are raw-taxonomy and other files too.

No, you do not need to train anything. You do need to download the correct sequence file, but you do not need a taxonomy file.

No, use unaligned sequences.

For any QIIME 2 command, use the --help flag to see the help documentation for that command. E.g., use:

qiime vsearch cluster-features-open-reference --help

To see the expected inputs and their descriptions. No taxonomy required.
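As a sketch, a full open-reference clustering call might look like the following; the input filenames here (rep-seqs.qza, table.qza, silva-ref-seqs.qza) are placeholders for your own artifacts, and the 0.97 identity threshold is just the conventional example value:

```shell
# Open-reference clustering at 97% identity: reads are first clustered
# closed-reference against the reference sequences, then any unmatched
# reads are clustered de novo. Filenames are placeholders.
qiime vsearch cluster-features-open-reference \
  --i-sequences rep-seqs.qza \
  --i-table table.qza \
  --i-reference-sequences silva-ref-seqs.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table table-or-97.qza \
  --o-clustered-sequences rep-seqs-or-97.qza \
  --o-new-reference-sequences new-ref-seqs-or-97.qza
```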

You can also use qiime tools export on any of the example files in the tutorial to see what they look like on the inside — e.g., to check whether a fasta file is aligned or unaligned. (You can also use qiime tools peek to see the semantic type, but that only helps if you know that there are aligned vs. unaligned types; see a short list here.)
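For example, peeking and exporting an artifact could look like this; rep-seqs.qza is a placeholder filename, and note that the exact export flags have varied between QIIME 2 releases (older versions took a positional input path):

```shell
# Show the UUID, semantic type, and data format of an artifact.
qiime tools peek rep-seqs.qza

# Unpack the artifact's data files into a directory so you can
# inspect the fasta file directly.
qiime tools export \
  --input-path rep-seqs.qza \
  --output-path exported-rep-seqs
```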

Hmm, then when should we train, as described here?

That tutorial describes how to train a naive Bayes classifier for taxonomy classification. That is a distinct process from OTU clustering — the overview tutorial that I linked to above puts the overall process in perspective. Furthermore, training is only done for that taxonomy classifier — other taxonomy classification methods (e.g., classify-consensus-vsearch) do not have a training step.

The OTU clustering tutorial that I linked to describes everything you need to perform OTU clustering.

Good luck!

I think I confused clustering with alignment. I did the OTU clustering using DADA2, so I am done with that step. I am now at the step for making the tree, so I was getting ahead of myself. Sorry about that. QIIME 2 has so many more options than QIIME 1 that it will take a while to get used to the whole thing. Sorry for bothering you so much, but I will get the concept at some point :smile:

Well, I am also a bit confused now. If I understand correctly, OTU clustering tries to group the sequences into OTUs based on identity, while taxonomy classification tries to label the grouped OTUs, for example at the species level.
After we get the feature table and rep-seqs using DADA2, should I perform OTU clustering at 0.97 (de novo, closed-reference, or open-reference) to get the grouped OTUs? And after that, use the new rep-seqs and feature table for the alpha and beta diversity analyses and the taxonomic analysis? (Like what is done in the “Moving Pictures” tutorial, except that tutorial didn’t perform OTU clustering.)

Hi Xianzhe,

So what is happening here is that when you perform filtering using DADA2 and get the feature table, you can make the tree using mafft, which is actually OTU clustering. Open-reference, closed-reference, and de novo are other approaches for doing the same thing. What we are doing at this stage is linking different OTUs together rather than aligning them to a reference database. The reason we do this is that we want to perform alpha and beta diversity analysis, and some alpha and beta diversity metrics (PD whole tree, for example) take the phylogenetic distance between the OTUs into consideration; that is why we do OTU clustering. This is different from taxonomy analysis, where we actually align the reads to the reference database of our choice (Greengenes, SILVA, etc.) to find out what taxa we have in our sample.

So in short, you can go ahead and use mafft as in the Moving Pictures tutorial instead of open-reference for the OTU clustering, right after you make your feature table.

Yes, these are separate processes. Alignment is used for many distinct things!

OTU clustering is a form of alignment, but specifically for dereplicating similar sequences (and in the process theoretically removing noisy sequences by clustering them into the centroid).

Multiple sequence alignment is performed for the purposes of building a phylogeny.
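As a sketch, the alignment-and-phylogeny step can be run as a single pipeline; rep-seqs.qza is a placeholder for your own representative sequences:

```shell
# Build a phylogeny from representative sequences in one pipeline:
# mafft alignment -> masking of highly variable positions ->
# FastTree -> midpoint rooting. Filenames are placeholders.
qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza \
  --o-tree unrooted-tree.qza \
  --o-rooted-tree rooted-tree.qza
```

The rooted tree is the input that phylogenetic diversity metrics (e.g., Faith's PD, UniFrac) expect.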

Taxonomy classification can be based on alignment (e.g., see classify-consensus-vsearch) but the taxonomy step you mention (training a classifier) uses a naive Bayes classifier trained on kmer frequency information, not a full sequence alignment.
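For comparison, an alignment-based classification with classify-consensus-vsearch needs no training step — just the reference reads and taxonomy. A minimal sketch, with placeholder filenames (note that recent QIIME 2 releases also emit a search-results output from this action):

```shell
# Alignment-based taxonomy classification: each query is aligned
# against the reference reads with vsearch, and a consensus taxonomy
# is taken over the top hits. No classifier training required.
qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs.qza \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classification taxonomy-vsearch.qza
```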

dada2 is a denoising method, not an OTU clustering method. Read this for more details.

Yes! You have the right idea.

No. dada2 performs dereplication. Further clustering is not necessary, though I have heard from some users who do this.

Correct — because it is not necessary after denoising. See the paper linked to above. Denoising methods remove the noise so that OTU clustering needn’t be used for rough denoising.

No, these are unrelated, see my comment above.

Not correct.



Hi Nicholas,

What I ended up doing for the taxonomy analysis was to train on the SILVA sequences using:

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 85_otus.fasta \
  --output-path 85_otus.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path 85_otu_taxonomy.txt \
  --output-path ref-taxonomy.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier.qza

I skipped this code because I was not sure what to put for min and max length:

qiime feature-classifier extract-reads \
  --i-sequences 85_otus.qza \
  --p-trunc-len 120 \
  --p-min-length 100 \
  --p-max-length 400 \
  --o-reads ref-seqs.qza

Is that okay?

That’s fine. As explained in the tutorial you are following, that step is optional but does increase accuracy a bit (that tutorial also explains how to select the parameters).
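Once trained, the classifier is applied to your representative sequences; a minimal sketch, with placeholder filenames for your own artifacts:

```shell
# Classify representative sequences with the trained naive Bayes
# classifier. Filenames are placeholders.
qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza
```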

Do not use the files in that tutorial for your own real datasets — read the notes in the tutorial for more details!

Good luck!


I am not using those files. I am using SILVA 97%. I just pasted the generic commands here. Thanks!


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.