What are the differences between the SILVA database files used to train the classifier?

Nicholas_Bokulich · October 14, 2019, 10:56pm

The two files are different. Look for "Ambiguous_taxa" labels in the consensus taxonomy and compare them to taxonomy_7_levels. The ambiguous taxa occur in the consensus taxonomy because there is no consensus at that level, but the full label is listed in taxonomy_7_levels because (presumably) it is the label of the cluster's representative sequence.
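If you want to see this for yourself, here is a minimal sketch of that comparison. It assumes both taxonomy files are tab-separated "ID, then semicolon-delimited taxonomy" lines; the records and the function names `parse_taxonomy` and `ambiguous_differences` are made up for illustration and are not part of QIIME 2 or SILVA.

```python
def parse_taxonomy(text):
    """Parse 'id<TAB>rank1;rank2;...' lines into {id: [rank, ...]}."""
    table = {}
    for line in text.strip().splitlines():
        seq_id, taxonomy = line.split("\t", 1)
        table[seq_id] = [rank.strip() for rank in taxonomy.split(";")]
    return table


def ambiguous_differences(consensus, seven_levels):
    """List (id, rank index, 7-level label) wherever the consensus
    taxonomy says Ambiguous_taxa but the 7-level file has a label."""
    diffs = []
    for seq_id, ranks in consensus.items():
        other = seven_levels.get(seq_id, [])
        for i, label in enumerate(ranks):
            if label == "Ambiguous_taxa" and i < len(other) and other[i] != label:
                diffs.append((seq_id, i, other[i]))
    return diffs


# Hypothetical records, shortened to three ranks for readability.
consensus_text = (
    "seq1\tD_0__Bacteria;D_1__Firmicutes;Ambiguous_taxa\n"
    "seq2\tD_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria\n"
)
seven_levels_text = (
    "seq1\tD_0__Bacteria;D_1__Firmicutes;D_2__Bacilli\n"
    "seq2\tD_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria\n"
)

diffs = ambiguous_differences(
    parse_taxonomy(consensus_text), parse_taxonomy(seven_levels_text)
)
for seq_id, rank_index, label in diffs:
    print(seq_id, rank_index, label)
```

To run it on the real files, read each file's contents and pass them to `parse_taxonomy` in place of the inline strings.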

"all levels" presents problems for really any classifier, since the taxonomy becomes a knotty mess (and this is not an issue specific to q2-feature-classifier either). The BLAST and VSEARCH-based classifiers will have the same issue unless you use maxaccepts=1, which still just grabs the top hit (like classify-sklearn with confidence=-1) and so is sub-optimal.

The best solution is really to use the 7-level taxonomy if you can...

You need to import to QIIME 2 — see the feature classifier tutorial on qiime2.org for specific examples.

Dear oh dear — mixed orientation is bad news and not just for taxonomy analysis. dada2 is effectively going to duplicate all ASVs, because the reverse complement of any ASV is a new ASV. Make sense? That's bad news for all analyses, especially if the samples are stratified by orientation.
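To make the duplication concrete, here is a small sketch. It is not how dada2 works internally; it only illustrates that any method treating exact sequences as distinct features will count a read and its reverse complement separately. The reads and the helper names are invented for the example.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq):
    """Pick one orientation (the lexicographically smaller) as the key."""
    return min(seq, reverse_complement(seq))


# Toy reads: the second is the reverse complement of the first.
reads = ["ACGGTTA", "TAACCGT", "ACGGTTA"]

distinct_as_is = set(reads)                          # orientation-sensitive
distinct_canonical = {canonical(r) for r in reads}   # orientation-agnostic

print(len(distinct_as_is), len(distinct_canonical))
```

The same biological sequence shows up as two features when orientation is ignored, and as one when the reads are put in a common orientation first.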

Yes: classify-sklearn looks at the first 100 or so sequences to decide the orientation, and classifies based on that. Your mixed orientations leave it confused.

You have a few solutions. Fortunately, it sounds like your samples are stratified by orientation (e.g., sample 1 is all in forward orientation and sample 2 is all reverse). So you could:

1. [BEST] Reverse the orientation of any reads in the reverse orientation and proceed (starting with dada2).
2. Classify your samples in two sets, separated by read orientation.

But perhaps I misunderstand and all samples are in mixed orientations, in which case use classify-consensus-vsearch, which can already handle mixed-orientation reads.

Yes! VSEARCH comes pre-installed with QIIME 2 and has a method to reverse read orientations. This will only be useful if your read orientation is stratified by sample, not if all samples are in mixed orientations.
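In VSEARCH the relevant command is, as far as I know, `--fastx_revcomp` (check `vsearch --help` for your version). If it helps to see what the operation amounts to, here is a pure-Python sketch for the reverse-orientation samples. It assumes simple four-line FASTQ records with no blank lines; `revcomp_fastq` is a made-up helper, not part of VSEARCH or QIIME 2.

```python
import io

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp_fastq(handle_in, handle_out):
    """Reverse-complement every 4-line FASTQ record.

    The quality string is reversed too, so each score stays attached
    to its base. Returns the number of records written.
    """
    count = 0
    while True:
        header = handle_in.readline().rstrip("\n")
        if not header:  # end of input
            break
        seq = handle_in.readline().rstrip("\n")
        plus = handle_in.readline().rstrip("\n")
        qual = handle_in.readline().rstrip("\n")
        new_seq = seq.translate(_COMPLEMENT)[::-1]
        new_qual = qual[::-1]
        handle_out.write(f"{header}\n{new_seq}\n{plus}\n{new_qual}\n")
        count += 1
    return count


# Toy record; with real data, open the sample's FASTQ files instead.
example = io.StringIO("@read1\nAACGT\n+\nIIIH#\n")
out = io.StringIO()
n = revcomp_fastq(example, out)
print(out.getvalue(), end="")
```

Run this only on the samples that are in the reverse orientation, then re-import everything and start again from dada2.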