Trouble training sk classifier to identify particular protist

SoilRotifer · April 28, 2023, 2:14pm

Short answer... it could be all of the above.

You mention that you appended to the SILVA nr database. How did you prepare the database prior to training the classifier? Did you use an approach similar to what is outlined here, or something else?

I ask because longer / full-length sequences can often differentiate among taxa where the extracted amplicon region cannot. That is, the targeted amplicon region may contain identical sequence across disparate taxa, so you lose the taxonomic resolution needed to tell some taxa apart. I suspect that is what is happening here.
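To make the idea concrete, here is a rough Python sketch of how one could check whether extracted amplicons collide across taxa. The IDs, sequences, and taxonomy strings are made-up toy values, and `find_ambiguous_amplicons` is a hypothetical helper, not part of QIIME 2 or RESCRIPt.

```python
from collections import defaultdict

# Hypothetical toy data: reference ID -> extracted amplicon, and ID -> taxonomy.
amplicons = {
    "ref1": "ACGTACGTAC",
    "ref2": "ACGTACGTAC",  # identical amplicon to ref1
    "ref3": "TTGACCGGTA",
}
taxonomy = {
    "ref1": "d__Eukaryota; p__Cercozoa; g__GenusA",
    "ref2": "d__Eukaryota; p__Cercozoa; g__GenusB",
    "ref3": "d__Eukaryota; p__Ciliophora; g__GenusC",
}


def find_ambiguous_amplicons(amplicons, taxonomy):
    """Return {sequence: sorted taxonomies} for sequences carrying more than one taxonomy."""
    taxa_by_seq = defaultdict(set)
    for ref_id, seq in amplicons.items():
        taxa_by_seq[seq].add(taxonomy[ref_id])
    return {seq: sorted(taxa) for seq, taxa in taxa_by_seq.items() if len(taxa) > 1}


ambiguous = find_ambiguous_amplicons(amplicons, taxonomy)
for seq, taxa in ambiguous.items():
    print(seq, "->", taxa)
```

Any sequence reported here is one the classifier cannot resolve below the rank that those taxonomies share, no matter how well the full-length references separate.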

What happens when you run qiime rescript dereplicate ..., using the --p-mode lca option, on the extracted V4V5 region? Do you still observe the taxonomic groups of interest in the output? If not, then there are identical sequences with differing taxonomies in the reference database, and they were collapsed to their least common ancestor. Compare with the --p-mode uniq option, which keeps replicate sequences in the file only if their taxonomies differ.
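For intuition only, the difference between the two modes can be mimicked with a small Python toy. This is a simplified model under my own assumptions (semicolon-delimited ranks, exact sequence matches), not RESCRIPt's actual implementation; `lca`, `dereplicate`, and the records are hypothetical.

```python
from collections import defaultdict


def lca(taxonomies):
    """Return the longest run of leading ranks shared by all taxonomies."""
    split = [[rank.strip() for rank in tax.split(";")] for tax in taxonomies]
    shared = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return "; ".join(shared)


def dereplicate(records, mode):
    """Collapse (sequence, taxonomy) records that share an identical sequence."""
    taxa_by_seq = defaultdict(list)
    for seq, tax in records:
        if tax not in taxa_by_seq[seq]:  # drop exact sequence + taxonomy repeats
            taxa_by_seq[seq].append(tax)
    out = []
    for seq, taxa in taxa_by_seq.items():
        if mode == "lca":
            out.append((seq, lca(taxa)))  # one record, truncated taxonomy
        elif mode == "uniq":
            out.extend((seq, tax) for tax in taxa)  # one record per distinct taxonomy
        else:
            raise ValueError(f"unsupported mode in this sketch: {mode}")
    return out


# Hypothetical records: three identical amplicons, two distinct genera.
records = [
    ("ACGTACGTAC", "d__Eukaryota; p__Cercozoa; g__GenusA"),
    ("ACGTACGTAC", "d__Eukaryota; p__Cercozoa; g__GenusA"),
    ("ACGTACGTAC", "d__Eukaryota; p__Cercozoa; g__GenusB"),
]

lca_result = dereplicate(records, "lca")
uniq_result = dereplicate(records, "uniq")
print(lca_result)   # one record, labelled only down to p__Cercozoa
print(uniq_result)  # two records, one per genus
```

In this toy case the genus labels vanish in the lca output but survive in the uniq output, which is the pattern to look for when comparing the two real outputs.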