Inconsistent Taxonomic Assignments Between Two Datasets - Error with Classifier?

2017.3-taxa-bar-plots.qzv (349.9 KB) 2019.3-taxa-bar-plots.qzv (454.7 KB) merged3-taxa-bar-plots.qzv (415.5 KB)

I've been struggling with two 18S amplicon datasets that are being classified inconsistently. Both were prepared the same way; however, one dataset contains paired-end 2 x 250 reads, while the other contains paired-end 2 x 300 reads. I ran DADA2 on the two datasets separately and then merged them into one dataset before performing data analyses.

My lab produced a custom classifier for these datasets. I've looked over it pretty extensively and have not been able to find any errors in it. Even so, when I run "qiime feature-classifier classify-sklearn" with this classifier, one particular protist genus (Cthulu) that SHOULD be present in my samples doesn't show up in the merged dataset. This is strange because, when I produce taxa bar plots for the two datasets independently, Cthulu shows up in one dataset but not the other.

I also noticed that, in the dataset in which Cthulu is present, the taxa bar plot indicates that my samples were classified to taxonomic level 9. In the legend of the taxa bar plot, this appears as two empty levels (two trailing semicolons) after the species level on each species present in the samples (e.g., "k__Eukaryota;p__Parabasalia;c__Spriotrychonympa;o__Spirotrichonymphida;f__Holomastigatoididae;g__Holomastigatoides;s__tenuis;;"). In contrast, in the dataset in which Cthulu does not show up, my samples were classified to level 7. That part makes sense, because my classifier only classifies to level 7 (species level); what does not make sense is that Cthulu isn't in a single one of my samples from that dataset.

When I merge the two datasets and produce a taxa bar plot of all of the samples together, the results align with the latter dataset: the samples are classified to taxonomic level 7, and Cthulu does not show up in any of them. Strange!
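
To make the level 9 vs. level 7 observation concrete, here is a minimal Python sketch. It assumes (I have not confirmed this against the plugin source) that the level shown in the bar plot is simply the number of semicolon-separated fields in a taxonomy label, empty fields included. The function names are my own and are not part of QIIME 2.

```python
def rank_depth(taxonomy: str) -> int:
    """Count semicolon-separated fields, including empty trailing ones."""
    return len(taxonomy.split(";"))


def strip_empty_trailing_ranks(taxonomy: str) -> str:
    """Drop empty fields at the end of a label (e.g. a trailing ';;')."""
    fields = [field.strip() for field in taxonomy.split(";")]
    while fields and fields[-1] == "":
        fields.pop()
    return ";".join(fields)


# The label copied from the legend of my bar plot.
label = (
    "k__Eukaryota;p__Parabasalia;c__Spriotrychonympa;"
    "o__Spirotrichonymphida;f__Holomastigatoididae;"
    "g__Holomastigatoides;s__tenuis;;"
)

print(rank_depth(label))                              # fields with the trailing ';;'
print(rank_depth(strip_empty_trailing_ranks(label)))  # fields once they are removed
```

Under that assumption, the two trailing semicolons alone would account for the difference between level 9 and level 7, so it may be worth checking whether some labels in the reference taxonomy file end in extra semicolons.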

I attached the qzv files of the taxa bar plots to demonstrate the discrepancies. Has anyone encountered a similar problem? I'm not really sure what I'm doing wrong. I'm pretty new to QIIME 2, but here is what I've tried so far: trimming my reads to different lengths in DADA2 (including making sure that the two datasets were trimmed to the same length before merging them), changing the confidence level on my classifier from 0.7 to 0.6 and 0.5, and looking through the files used to make my classifier. I still can't figure out the issue!

Any help is appreciated. Thanks!

Welcome to the forum, @ncoots!

The naive Bayes classifier in q2-feature-classifier should not exhibit random behavior. The only time it shows inconsistencies like this is when reads (or reference sequences) are in mixed orientations. The classifier determines the orientation of the reads (relative to the reference database) by polling the first 100 or so sequences. So feeding the classifier a different set of sequences (e.g., unmerged vs. merged datasets) could lead to variable results if the sequences are in different orientations in one run (or both).
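
If you want to check for mixed orientations yourself, here is a rough Python sketch. To be clear, this is not how the classifier does it internally; it is just an illustration that compares shared k-mers between each read (and its reverse complement) and a set of reference sequences. The function names, the value of k, and the toy sequences are all made up for the example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int = 8) -> set:
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def guess_orientation(read: str, reference_kmers: set, k: int = 8) -> str:
    """Label a read by which strand shares more k-mers with the reference."""
    read = read.upper()
    forward_hits = len(kmers(read, k) & reference_kmers)
    reverse_hits = len(kmers(reverse_complement(read), k) & reference_kmers)
    if forward_hits > reverse_hits:
        return "forward"
    if reverse_hits > forward_hits:
        return "reverse"
    return "undetermined"


def orientation_counts(reads, references, k: int = 8) -> Counter:
    """Tally the guessed orientation of every read against the references."""
    reference_kmers = set()
    for ref in references:
        reference_kmers |= kmers(ref.upper(), k)
    return Counter(guess_orientation(r, reference_kmers, k) for r in reads)


# Toy data: one made-up reference, two reads in the same orientation as
# the reference, and one read that is its reverse complement.
references = ["AGGACAGCAAGGCAGACCAGAAGCAGGACAACAG"]
forward_read = references[0][3:27]
reads = [forward_read, reverse_complement(forward_read), forward_read]

print(orientation_counts(reads, references))
```

If a tally like this on your representative sequences shows a mix of "forward" and "reverse", that would be consistent with the orientation explanation above. Exact k-mer matching is crude, so treat reads labeled "undetermined" as a sign that the check is not informative for them rather than as evidence either way.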

The easiest fix may be to just use classify-consensus-vsearch. That method is not sensitive to read orientation, so it is worth running it and seeing what you get.

So with all that in mind, I must ask: do you expect to see Cthulu? I only see a little bit of Cthulu in your plots, so it warrants asking: if this pseudo-random behavior is caused by the reads being in reverse orientation relative to the reference database, which classification is correct? Is Cthulu really there? Maybe classify-consensus-vsearch can help you decide whether Cthulu really exists or not.

The truth is out there. Good luck!
