Weird Taxonomic classification 28S rRNA

SoilRotifer · February 12, 2021, 3:10pm

Hi @HugoEira, welcome to the forum!

I think you are one of the first to inform us about using RESCRIPt for SILVA LSU data!

If you classify your reads with your LSU classifier and the final classification shows the upper-level taxonomy propagated downward through all of the ranks, then you scored a hit to a specific reference sequence that had no taxonomic information other than 'Eukaryota'. When you construct the database, through either the get-silva-data pipeline or the parse-silva-taxonomy action, the default is to propagate taxonomy (--p-rank-propagation). This ensures there are no empty ranks in the output, which matters for use cases and tools that do not handle empty ranks well. You can disable it with --p-no-rank-propagation, in which case those sequences carry only the d__Eukaryota label.

In other words, if the returned taxonomy is not truncated (unlike the second case described below), the classifier is simply returning a hit (or hits) to one or more reference sequences that all carry the full d__Eukaryota; ... g__Eukaryota string. You may therefore want to remove from your reference database, before training the classifier, any sequences that lack at least phylum-level or other lower-rank taxonomy information, as they are not particularly helpful.
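To make the propagation behaviour concrete, here is a minimal Python sketch. It is not RESCRIPt's code: the six rank prefixes, the function names, and the "at least one rank below domain" rule are assumptions chosen for illustration only.

```python
# Illustrative only: NOT RESCRIPt's implementation.
# Assumed rank prefixes (domain through genus).
RANK_PREFIXES = ["d__", "p__", "c__", "o__", "f__", "g__"]


def propagate_ranks(known):
    """Fill empty ranks with the nearest annotated upper-level name.

    `known` maps a rank prefix (e.g. "d__") to a taxon name.
    """
    filled = []
    last_name = None
    for prefix in RANK_PREFIXES:
        name = known.get(prefix) or last_name
        if name is None:
            continue  # nothing above this rank to propagate from
        filled.append(prefix + name)
        last_name = name
    return "; ".join(filled)


def is_informative(known):
    """True if any rank below domain is annotated (hypothetical rule)."""
    return any(known.get(prefix) for prefix in RANK_PREFIXES[1:])


domain_only = {"d__": "Eukaryota"}
print(propagate_ranks(domain_only))
print(is_informative(domain_only))
```

A reference annotated only as Eukaryota comes out with every rank filled as Eukaryota, which is the pattern described above, and `is_informative` flags it as a candidate for removal before training.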

So, in these cases you are observing two different classification results. The first is the case discussed above. The second, where you only observe d__Eukaryota, arises because the classifier could not identify your query sequence beyond the domain level. For example, there could have been several equally good hits to different eukaryotes in the database (i.e., with different taxonomies), so only the lowest common ancestor taxonomy was returned. That is, all lower ranks are cropped, leaving only the ranks down to the lowest one assigned with reasonable confidence.
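The truncated case can be pictured with a small lowest-common-ancestor sketch in Python. This is a simplification and not the classifier's actual algorithm, which works from confidence estimates; the function name and the example taxonomy strings are made up for illustration.

```python
def lowest_common_ancestor(taxonomies):
    """Return the shared leading ranks of several taxonomy strings."""
    split = [[rank.strip() for rank in t.split(";")] for t in taxonomies]
    common = []
    # Walk the ranks in parallel and stop at the first disagreement.
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        common.append(ranks[0])
    return "; ".join(common)


# Hypothetical hits with equally good scores but different taxonomies.
hits = [
    "d__Eukaryota; p__Rotifera; c__Bdelloidea",
    "d__Eukaryota; p__Nematoda; c__Chromadorea",
]
print(lowest_common_ancestor(hits))
```

Because the two hits disagree from the phylum rank onward, only the domain survives, which matches the d__Eukaryota-only result.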

Hopefully this makes sense. Anyway, you are on the right track!

Here are some additional tips worth considering:

I would recommend playing around with the sequence lengths for this command. The numeric values were tailored to the 16S / 18S rRNA gene data. Since LSU is larger, you may want to consider increasing these. I've not benchmarked this, but it is something worth considering.
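As a rough way to think about the length thresholds, the Python sketch below filters references by a per-label minimum length. The function name, the record layout, and the threshold of 10 are placeholders for illustration, not recommended LSU values and not the plugin's implementation.

```python
def filter_by_length(records, min_lens, default_min=0):
    """Keep IDs of sequences that meet the minimum length for their label.

    `records` is an iterable of (seq_id, taxonomy, sequence) tuples and
    `min_lens` maps a taxonomy label (e.g. "Eukaryota") to a minimum length.
    """
    kept = []
    for seq_id, taxonomy, sequence in records:
        threshold = default_min
        for label, min_len in min_lens.items():
            if label in taxonomy:
                threshold = min_len
                break
        if len(sequence) >= threshold:
            kept.append(seq_id)
    return kept


# Toy records; the threshold below is a placeholder, not a benchmark.
records = [
    ("a", "d__Eukaryota", "A" * 12),
    ("b", "d__Eukaryota", "A" * 5),
    ("c", "d__Bacteria", "A" * 3),
]
print(filter_by_length(records, {"Eukaryota": 10}))
```

Raising the per-label minimum for LSU data simply drops more of the short references carrying that label; which values make sense would need to be checked against your own data.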

You can also pick and choose which ranks you'd like in your classifier too. For more details see the tutorial, and this thread:

Also, do not forget to read our warning about --p-include-species-labels in the "Species-labels: caveat emptor!" drop-down menu in the tutorial.

-Good luck and keep us posted!
-Mike