I have a mock community consisting of 24 representative sequences. I’ve classified these in QIIME 2 using the Naive Bayes classifier as well as the VSEARCH and BLAST classifiers. I have also classified the same set with a fourth classifier (the VSEARCH implementation of SINTAX). In principle, the same database served as input for all four classifiers. In practice, the reference sequences used for VSEARCH/BLAST alignment were first used to train a classifier with QIIME 2’s fit-classifier-naive-bayes, and that trained classifier was then used for classification, while for SINTAX I used the original fasta file that had been imported as a .qza object for VSEARCH/BLAST.
I’m concerned and confused about why the two alignment-based approaches generate so many Unassigned taxa, whereas the Naive Bayes and SINTAX classifiers assign taxonomic information to every representative sequence, often to the genus or species rank.
For example, the VSEARCH command:
qiime feature-classifier classify-consensus-vsearch \
--i-query path/to/mockIM4_seqs.qza \
--i-reference-reads path/to/boldCOI.derep.seqs.qza \
--i-reference-taxonomy path/to/boldCOI.derep.tax.qza \
--p-maxaccepts 1000 \
--p-perc-identity 0.97 \
--p-query-cov 0.94 \
--p-strand both \
--p-threads 12 \
--o-classification mockIM4.vsearch_out.qza
… generated this output:
Feature ID Taxon Confidence
MockIM10 k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Tortricidae;g__Choristoneura 1.0
MockIM15 Unassigned 1.0
MockIM16 Unassigned 1.0
MockIM20 k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Geometridae;g__Haematopis;s__Haematopis grataria 1.0
MockIM21 k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis 1.0
MockIM23 k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis 1.0
MockIM27 Unassigned 1.0
MockIM28 k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Erebidae;g__Hyphantria;s__Hyphantria cunea 1.0
MockIM29 Unassigned 1.0
MockIM3 k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Culicidae;g__Aedes;s__Aedes vexans 1.0
MockIM32 Unassigned 1.0
MockIM33 Unassigned 1.0
MockIM39 Unassigned 1.0
MockIM4 Unassigned 1.0
MockIM40 k__Animalia;p__Arthropoda;c__Insecta;o__Blattodea;f__Blattidae;g__Periplaneta;s__Periplaneta fuliginosa 1.0
MockIM42 k__Animalia;p__Arthropoda;c__Arachnida;o__Opiliones;f__Phalangiidae;g__Phalangium;s__Phalangium opilio 1.0
MockIM44 Unassigned 1.0
MockIM46 Unassigned 1.0
MockIM47 Unassigned 1.0
MockIM49 k__Animalia;p__Arthropoda;c__Insecta;o__Blattodea;f__Ectobiidae;g__Supella;s__Supella longipalpa 1.0
MockIM5 k__Animalia;p__Arthropoda;c__Insecta;o__Hemiptera;f__Aphididae;g__Aphis 1.0
MockIM52 Unassigned 1.0
MockIM53 Unassigned 1.0
MockIM7 Unassigned 1.0
I’ve tried adjusting the alignment parameters for VSEARCH and BLAST (lowering the required percent identity, lowering the percent query coverage, and changing the number of max accepts), but none of these changes makes a difference. The same mock taxa that are Unassigned remain Unassigned.
Compare that to a Naive Bayes output:
Feature ID Taxon Confidence
MockIM10 k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Tortricidae;g__Choristoneura 0.908172187013527
MockIM15 k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Bombyliidae;g__Lepidophora 0.9999162473399126
MockIM16 k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Crambidae;g__Elophila;s__Elophila obliteralis 0.723477092257559
MockIM20 k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Geometridae;g__Haematopis;s__Haematopis grataria 0.9998725917638673
MockIM21 k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis 0.9979024774608144
MockIM23 k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis 0.8907844724522537
MockIM27 k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Hydrophilidae;g__Cercyon;s__Cercyon praetextatus 0.999513579421321
MockIM28 k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Erebidae;g__Hyphantria;s__Hyphantria cunea 0.8642753632086654
MockIM29 k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Erebidae;g__Hypena 0.7329953217374743
MockIM3 k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Culicidae;g__Aedes;s__Aedes vexans 0.7940660229756092
MockIM32 k__Animalia;p__Arthropoda;c__Insecta;o__Ephemeroptera;f__Heptageniidae;g__Leucrocuta;s__Leucrocuta maculipennis 0.9999941323197901
MockIM33 k__Animalia;p__Arthropoda;c__Insecta;o__Neuroptera;f__Mantispidae;g__Dicromantispa;s__Dicromantispa sayi 0.9975133918473684
MockIM39 k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Chrysomelidae;g__Ambiguous_taxa;s__Ambiguous_taxa 0.9087409988686354
MockIM4 k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Noctuidae;g__Agrotis 0.877931324358032
MockIM40 k__Animalia;p__Arthropoda;c__Insecta;o__Blattodea;f__Blattidae;g__Periplaneta;s__Periplaneta fuliginosa 0.9999354221679166
MockIM42 k__Animalia;p__Arthropoda;c__Arachnida;o__Opiliones;f__Phalangiidae;g__Phalangium;s__Phalangium opilio 0.7515350497208974
MockIM44 k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Chironomidae;g__Procladius 0.9982240429814992
MockIM46 k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Scarabaeidae;g__Euphoria;s__Euphoria fulgida 0.9999004186249357
MockIM47 k__Animalia;p__Arthropoda;c__Insecta;o__Orthoptera;f__Tettigoniidae;g__Scudderia 0.9999961171698349
MockIM49 k__Animalia;p__Arthropoda;c__Insecta;o__Blattodea;f__Ectobiidae;g__Supella;s__Supella longipalpa 0.9999999925306753
MockIM5 k__Animalia;p__Arthropoda;c__Insecta;o__Hemiptera;f__Aphididae;g__Aphis 0.9686825818079887
MockIM52 k__Animalia;p__Arthropoda;c__Insecta;o__Orthoptera;f__Tettigoniidae;g__Conocephalus;s__Conocephalus strictus 0.999951202482104
MockIM53 k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Crambidae 0.8255706859147384
MockIM7 k__Animalia;p__Arthropoda;c__Insecta;o__Trichoptera;f__Leptoceridae;g__Ceraclea;s__Ceraclea maculata 0.9999935213355088
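One way to narrow this down is to list exactly which features VSEARCH leaves Unassigned but Naive Bayes classifies. The sketch below assumes both classifications have been exported to tab-separated files with the Feature ID, Taxon, and Confidence columns shown above (for instance via qiime tools export). The function name and the file paths in the usage comment are placeholders of mine, not part of QIIME 2.

```shell
# list_rescued FILE_A FILE_B
# Print "FeatureID<TAB>Taxon" for every feature that is Unassigned in FILE_A
# (e.g. the VSEARCH export) but has a taxon in FILE_B (e.g. the Naive Bayes export).
# Assumes tab-separated columns: Feature ID, Taxon, Confidence.
list_rescued() {
  awk -F'\t' '
    # First file: remember the IDs whose taxon is exactly "Unassigned".
    NR == FNR { if ($2 == "Unassigned") unassigned[$1] = 1; next }
    # Second file: report those IDs if they received a taxon here.
    ($1 in unassigned) && $2 != "Unassigned" { print $1 "\t" $2 }
  ' "$1" "$2"
}

# Hypothetical usage (paths are placeholders):
#   list_rescued vsearch_export/taxonomy.tsv nbayes_export/taxonomy.tsv
```

The resulting list gives a concrete set of query sequences, each with a candidate taxon, to check against the reference by hand.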
What’s strange is that I know the database I’m classifying against contains many of these species (I built the database myself). For example, the expected taxonomy for MockIM7 is k__Animalia;p__Arthropoda;c__Insecta;o__Trichoptera;f__Leptoceridae;g__Ceraclea;s__Ceraclea maculata, and indeed a search for that species in the reference database returns multiple distinct sequences:
4161825;Insecta;Trichoptera;Leptoceridae;Ceraclea;Ceraclea maculata
8277685;Insecta;Trichoptera;Leptoceridae;Ceraclea;Ceraclea maculata
2549710;Insecta;Trichoptera;Leptoceridae;Ceraclea;Ceraclea maculata
2549712;Insecta;Trichoptera;Leptoceridae;Ceraclea;Ceraclea maculata
3747815;Insecta;Trichoptera;Leptoceridae;Ceraclea;Ceraclea maculata
So it should be identified by an alignment approach, but for some reason there isn’t any match being detected.
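For reference, the species lookup can be reproduced roughly as follows. This is a minimal sketch that assumes the taxonomy is available as a semicolon-delimited text file laid out like the lines above (sequence ID first, species name last). The function name and the file name in the usage comment are hypothetical.

```shell
# ids_for_species SPECIES TAXONOMY_FILE
# Print the reference sequence IDs whose last semicolon-delimited field
# equals SPECIES exactly. Assumes lines shaped like "ID;Class;Order;Family;Genus;Species".
ids_for_species() {
  awk -F';' -v sp="$1" '$NF == sp { print $1 }' "$2"
}

# Hypothetical usage (file name is a placeholder):
#   ids_for_species "Ceraclea maculata" boldCOI.derep.tax.txt
```

The returned IDs identify which reference sequences to pull out and align directly against the MockIM7 query, outside of the consensus classifier.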
I’d appreciate any insights on how to troubleshoot this further.
Cheers,
Devon