I have an issue with my classifier output. I searched around on the forum but could not find the answer (if it is here, sorry!). I have a marine 18S dataset that I am classifying using a subset of the silva-138-99-seqs.qza from QIIME data resources. The problem is that there is duplication in the taxonomic output, as seen below:
So from 'class' onwards it is somehow giving the same result (Trebouxiophyceae or Ulvophyceae in this example). I wonder what is going on here. I cannot really do any diversity analysis on this result, especially for the lowest hit, Cladophora vagabunda, where the genus is still showing the class Ulvophyceae.
Below are my (slightly simplified) steps to obtain the classifier and how I used it to get these results:
This is expected. By default, our RESCRIPt plugin retains each unique feature even when its taxonomy is identical to that of another feature, as multiple sequence variants can derive from the same taxonomic group.
When taxonomic information is missing for a rank, we propagate the taxonomic information downward from the last observed upper-level taxonomic rank. You can read more about why we do this in the RESCRIPt tutorial, specifically under the drop-down menu entitled Rank-Propagation.
In a nutshell, not all reference sequences within a given reference database contain complete taxonomic information. See the entry JX127171, for example. As you can see, this entry only contains taxonomic information down to "Trebouxiophyceae". Thus, this label is propagated downward to all lower ranks. Again, the paper and the tutorial linked above explain in far more detail why we do this. You can certainly disable rank-propagation if you'd like to make your own reference database with RESCRIPt.
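To make the idea concrete, here is a minimal Python sketch of rank propagation. It is not RESCRIPt's actual code: the function name `propagate_ranks`, the rank prefixes, and the example lineage are assumptions made up for illustration (the lineage is a stand-in for an entry that stops at class, not the real JX127171 record).

```python
# Hypothetical rank prefixes, from highest to lowest rank (an assumption,
# not necessarily the exact set of ranks RESCRIPt uses).
RANK_PREFIXES = ["d", "p", "c", "o", "f", "g"]


def propagate_ranks(lineage, propagate=True):
    """Build a taxonomy string from a {prefix: name} dict.

    When a rank has no name and propagate is True, the name of the last
    observed upper-level rank is copied down into it. When propagate is
    False, the missing rank is left empty.
    """
    labels = []
    last_seen = ""
    for prefix in RANK_PREFIXES:
        name = lineage.get(prefix, "")
        if name:
            last_seen = name   # remember the most recent annotated rank
        elif propagate:
            name = last_seen   # fill the gap from the rank above
        labels.append(f"{prefix}__{name}")
    return "; ".join(labels)


# Illustrative entry that is only annotated down to class.
example = {"d": "Eukaryota", "p": "Chlorophyta", "c": "Trebouxiophyceae"}

print(propagate_ranks(example))
print(propagate_ranks(example, propagate=False))
```

With propagation on, the order, family and genus slots all carry the class label "Trebouxiophyceae", which is the pattern in your output. With it off, those slots stay empty.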
I've read the Rank-Propagation drop-down menu, and that was very useful. I'm still struggling a bit with this though: I do see the added value of the extra taxonomic information this technique provides, but not of having it propagated down to genus (where it will erroneously be counted as an extra genus in e.g. alpha diversity).
Also, if I look at e.g. the lowermost hit in my screenshot (Cladophora vagabunda), all 21 accessions in SILVA have the correct taxonomy in place (Eukaryota, Archaeplastida, Chloroplastida, Chlorophyta, Ulvophyceae, Cladophorales, Cladophora, Cladophora_vagabunda), yet I still get Ulvophyceae propagated from class onwards, even in the genus?
@marcelpolling, I assume you are using the SILVA 138 version of the reference database from the Data resources page? If so, you may want to consider using RESCRIPt to generate your own database using the corrected 138.1 version (which is now the default download option for the get-silva-data command). The 138.1 version corrected many taxonomy issues like this one. Apparently, we forgot to update to 138.1 when generating the reference files for the latest release.
This is a great example of why we made RESCRIPt: many sequence-taxonomy reference databases are continually being updated. Users can simply fetch the version they'd like and continue.
Ignore the listed sequence counts. The 138 version from the Data resources page went through some QA/QC sequence filtering, whereas the 138.1 version is from the raw file I just downloaded, without QA/QC sequence filtering.