SILVA reference taxonomy (consensus, majority, 7 levels, all_levels) - file selection with unexpected test results


Dear Qiime team.

We are working on an 18S dataset using the SILVA database as the reference.
Several files are available in the download provided by Qiime and made available on the Silva webpage. To choose the most suitable one, we tested all the different SILVA 132 reference taxonomy files against exactly the same Qiime 2 preprocessed 18S amplicon data set and compared the results. We agreed on the following reference data set: all (instead of 18S only), 99%, consensus.
Then we tested the consensus_taxonomy_7_levels file against the consensus_taxonomy_all_levels file. Here we expected similar results (see the citation from the Silva notes below). Instead, the results differ as early as the third level, which, as far as I understand the Silva notes, should not happen.
One example: the 3rd-rank taxon Archaeplastida was observed with the consensus_taxonomy_7_levels reference taxonomy but was completely absent with the consensus_taxonomy_all_levels taxonomy.

What could explain this, and which file would you recommend for our eukaryote data set?

Thank you!!

P.S.: Sorry if I chose the wrong category for my question.

[Silva 132 release, notes-textfile]: “This has a consequence that the first 7 levels match domain through species for most Archaea, Bacteria, and many eukaryotes, but due to the extra levels present in many eukaryotes, one will have to look at deeper levels to get the species in many cases. When viewing taxonomy plots generated with these taxa strings, one will need to be aware that the expanded format may result in unmatched taxa levels (e.g. a species level for a bacterial taxon may be family level for a fungi taxon). The 7 level taxonomy uses 7 levels if they are present. If more than 7 levels are present, the first 3 and last 4 levels of taxonomy are used.”
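
To make the last two sentences of that note concrete, here is a minimal Python sketch of the rule as I read it: keep everything when there are 7 levels or fewer, otherwise keep the first 3 and last 4. The function name `collapse_to_seven` and the placeholder rank names are my own assumptions for illustration; this is not the script that was used to build the released files, and it ignores any prefixes the real taxonomy strings carry.

```python
def collapse_to_seven(lineage: str, sep: str = ";") -> str:
    """Apply the rule quoted from the release notes to one lineage string.

    Hypothetical helper: lineages with at most 7 levels are returned
    unchanged; deeper lineages keep their first 3 and last 4 levels.
    """
    levels = [lvl.strip() for lvl in lineage.split(sep) if lvl.strip()]
    if len(levels) <= 7:
        return sep.join(levels)
    # The middle levels are dropped, so a "level 4" label in the result
    # may come from much deeper in the original lineage.
    return sep.join(levels[:3] + levels[-4:])


# Placeholder 10-level lineage (made-up names, not real SILVA entries).
deep = ";".join(f"rank{i}" for i in range(10))
print(collapse_to_seven(deep))
```

With this toy input, levels rank3 to rank5 disappear, which is why the two files are only guaranteed to agree on the first three levels for deep lineages.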

(Nicholas Bokulich) #2

The “all levels” taxonomy can confuse taxonomy classifiers, since many consensus and confidence-based classifiers assume that taxonomic ranks are even across samples (e.g., the 3rd level is always class). So any differences between the “all levels” and “7 levels” results are probably due to this confusion.
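
One rough way to see how uneven the ranks are is to count how many levels each lineage in a taxonomy file has. The sketch below assumes a tab-separated file with an ID column and a semicolon-delimited lineage column; the helper name `depth_histogram` and the toy rows are hypothetical and only illustrate the idea.

```python
from collections import Counter


def depth_histogram(rows, sep=";"):
    """Count lineages by number of levels.

    `rows` is any iterable of "id<TAB>lineage" strings (assumed layout).
    """
    counts = Counter()
    for row in rows:
        row = row.rstrip("\n")
        if not row:
            continue
        _, lineage = row.split("\t", 1)
        depth = len([lvl for lvl in lineage.split(sep) if lvl.strip()])
        counts[depth] += 1
    return counts


# Toy rows with placeholder names, not real SILVA records.
toy_rows = [
    "seq1\td;p;c;o;f;g;s",
    "seq2\td;x1;x2;x3;p;c;o;f;g;s",
    "seq3\td;p;c;o;f;g;s",
]
hist = depth_histogram(toy_rows)
for depth, n in sorted(hist.items()):
    print(depth, n)
```

If a real file shows more than one depth, the same level index does not correspond to the same rank for every reference sequence, which is the situation that trips up the classifiers. To run it on a file, pass an open file handle instead of `toy_rows`.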

As for which file to use: one in which the taxonomic ranks are uniform, e.g., the 7-level taxonomy.

Good luck!