why could a classifier fail to classify an expected taxon?

I have a mock community consisting of 24 representative sequences. I’ve classified these using the Naive Bayes, VSEARCH, and BLAST classifiers in QIIME 2, and I’ve also classified the same set using a fourth classifier (the SINTAX algorithm as implemented in VSEARCH here). In principle, the same database served as input for all four classifiers; in practice, the reference sequences and taxonomy used for the VSEARCH/BLAST alignments were first used to train a classifier with QIIME 2’s fit-classifier-naive-bayes, and that trained classifier performed the Naive Bayes classification, while for SINTAX I used the original fasta file that had been imported as the .qza artifact for VSEARCH/BLAST.
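For reference, the Naive Bayes training and classification steps went roughly like this (paths mirror the VSEARCH command below; the output names here are placeholders):

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads path/to/boldCOI.derep.seqs.qza \
  --i-reference-taxonomy path/to/boldCOI.derep.tax.qza \
  --o-classifier boldCOI.nb_classifier.qza

qiime feature-classifier classify-sklearn \
  --i-classifier boldCOI.nb_classifier.qza \
  --i-reads path/to/mockIM4_seqs.qza \
  --o-classification mockIM4.nb_out.qza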

I’m concerned and confused as to why the two alignment-based approaches are generating a lot of Unassigned taxa, whereas the Naive Bayes classifier and the SINTAX classifier are assigning taxonomic information to every representative sequence, often down to the Genus or Species rank.

For example, the VSEARCH command:

qiime feature-classifier classify-consensus-vsearch \
  --i-query path/to/mockIM4_seqs.qza \
  --i-reference-reads path/to/boldCOI.derep.seqs.qza \
  --i-reference-taxonomy path/to/boldCOI.derep.tax.qza \
  --p-maxaccepts 1000 \
  --p-perc-identity 0.97 \
  --p-query-cov 0.94 \
  --p-strand both \
  --p-threads 12 \
  --o-classification mockIM4.vsearch_out.qza

… generated this output:

Feature ID	Taxon	Confidence
MockIM10	k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Tortricidae;g__Choristoneura	1.0
MockIM15	Unassigned	1.0
MockIM16	Unassigned	1.0
MockIM20	k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Geometridae;g__Haematopis;s__Haematopis grataria	1.0
MockIM21	k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis	1.0
MockIM23	k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis	1.0
MockIM27	Unassigned	1.0
MockIM28	k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Erebidae;g__Hyphantria;s__Hyphantria cunea	1.0
MockIM29	Unassigned	1.0
MockIM3	k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Culicidae;g__Aedes;s__Aedes vexans	1.0
MockIM32	Unassigned	1.0
MockIM33	Unassigned	1.0
MockIM39	Unassigned	1.0
MockIM4	Unassigned	1.0
MockIM40	k__Animalia;p__Arthropoda;c__Insecta;o__Blattodea;f__Blattidae;g__Periplaneta;s__Periplaneta fuliginosa	1.0
MockIM42	k__Animalia;p__Arthropoda;c__Arachnida;o__Opiliones;f__Phalangiidae;g__Phalangium;s__Phalangium opilio	1.0
MockIM44	Unassigned	1.0
MockIM46	Unassigned	1.0
MockIM47	Unassigned	1.0
MockIM49	k__Animalia;p__Arthropoda;c__Insecta;o__Blattodea;f__Ectobiidae;g__Supella;s__Supella longipalpa	1.0
MockIM5	k__Animalia;p__Arthropoda;c__Insecta;o__Hemiptera;f__Aphididae;g__Aphis	1.0
MockIM52	Unassigned	1.0
MockIM53	Unassigned	1.0
MockIM7	Unassigned	1.0

I’ve played around with the alignment parameters for VSEARCH and BLAST, lowering the required percent identity, lowering the percent query coverage, and changing the number of max accepts, but my changes don’t seem to make a difference. The same mock taxa that are Unassigned remain Unassigned.

Compare that to a Naive Bayes output:

Feature ID	Taxon	Confidence
MockIM10	k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Tortricidae;g__Choristoneura	0.908172187013527
MockIM15	k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Bombyliidae;g__Lepidophora	0.9999162473399126
MockIM16	k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Crambidae;g__Elophila;s__Elophila obliteralis	0.723477092257559
MockIM20	k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Geometridae;g__Haematopis;s__Haematopis grataria	0.9998725917638673
MockIM21	k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis	0.9979024774608144
MockIM23	k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Coccinellidae;g__Harmonia;s__Harmonia axyridis	0.8907844724522537
MockIM27	k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Hydrophilidae;g__Cercyon;s__Cercyon praetextatus	0.999513579421321
MockIM28	k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Erebidae;g__Hyphantria;s__Hyphantria cunea	0.8642753632086654
MockIM29	k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Erebidae;g__Hypena	0.7329953217374743
MockIM3	k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Culicidae;g__Aedes;s__Aedes vexans	0.7940660229756092
MockIM32	k__Animalia;p__Arthropoda;c__Insecta;o__Ephemeroptera;f__Heptageniidae;g__Leucrocuta;s__Leucrocuta maculipennis	0.9999941323197901
MockIM33	k__Animalia;p__Arthropoda;c__Insecta;o__Neuroptera;f__Mantispidae;g__Dicromantispa;s__Dicromantispa sayi	0.9975133918473684
MockIM39	k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Chrysomelidae;g__Ambiguous_taxa;s__Ambiguous_taxa	0.9087409988686354
MockIM4	k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Noctuidae;g__Agrotis	0.877931324358032
MockIM40	k__Animalia;p__Arthropoda;c__Insecta;o__Blattodea;f__Blattidae;g__Periplaneta;s__Periplaneta fuliginosa	0.9999354221679166
MockIM42	k__Animalia;p__Arthropoda;c__Arachnida;o__Opiliones;f__Phalangiidae;g__Phalangium;s__Phalangium opilio	0.7515350497208974
MockIM44	k__Animalia;p__Arthropoda;c__Insecta;o__Diptera;f__Chironomidae;g__Procladius	0.9982240429814992
MockIM46	k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Scarabaeidae;g__Euphoria;s__Euphoria fulgida	0.9999004186249357
MockIM47	k__Animalia;p__Arthropoda;c__Insecta;o__Orthoptera;f__Tettigoniidae;g__Scudderia	0.9999961171698349
MockIM49	k__Animalia;p__Arthropoda;c__Insecta;o__Blattodea;f__Ectobiidae;g__Supella;s__Supella longipalpa	0.9999999925306753
MockIM5	k__Animalia;p__Arthropoda;c__Insecta;o__Hemiptera;f__Aphididae;g__Aphis	0.9686825818079887
MockIM52	k__Animalia;p__Arthropoda;c__Insecta;o__Orthoptera;f__Tettigoniidae;g__Conocephalus;s__Conocephalus strictus	0.999951202482104
MockIM53	k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Crambidae	0.8255706859147384
MockIM7	k__Animalia;p__Arthropoda;c__Insecta;o__Trichoptera;f__Leptoceridae;g__Ceraclea;s__Ceraclea maculata	0.9999935213355088

What’s strange is that I know that the database I’m classifying against contains many of these species (I built the database myself). For example, the expected taxonomy for MockIM7 is k__Animalia;p__Arthropoda;c__Insecta;o__Trichoptera;f__Leptoceridae;g__Ceraclea;s__Ceraclea maculata and indeed if we search for that species in the reference database being used, there are multiple distinct sequences identified:

4161825;Insecta;Trichoptera;Leptoceridae;Ceraclea;Ceraclea maculata
8277685;Insecta;Trichoptera;Leptoceridae;Ceraclea;Ceraclea maculata
2549710;Insecta;Trichoptera;Leptoceridae;Ceraclea;Ceraclea maculata
2549712;Insecta;Trichoptera;Leptoceridae;Ceraclea;Ceraclea maculata
3747815;Insecta;Trichoptera;Leptoceridae;Ceraclea;Ceraclea maculata
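
(That list came from a plain-text search of the taxonomy file along these lines; the flat-file name here is hypothetical:)

grep "Ceraclea maculata" boldCOI.derep.tax.txt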

So it should be identified by an alignment approach, but for some reason there isn’t any match being detected.

I’d appreciate any insights on how to troubleshoot this further.

Cheers,
Devon

:+1: That is where I would start too. Surprising that there is no impact. Note that Unassigned indicates that there was no hit at all, so maxaccepts does not matter; these sequences do not even make it past vsearch/blast.

Are you sure the sequences in your mock community resemble the reference sequences? You may want to check those reference sequences to make sure they are not too short; if the reference is shorter than the query, that likely spells trouble for the alignment-based classifiers but not for the k-mer-based ones.

Important question: are those classifications species that you expect?

Since you are getting the same problem with the SINTAX classifier, this is probably some discordance between your mock community and the reference sequences.

Please start there and let us know what you find, maybe share some pairwise alignment visualizations!

Thanks @Nicholas_Bokulich,
I’m wondering whether this has something to do with the QIIME 2 implementation of VSEARCH… I’m finding a discordance between standalone VSEARCH and QIIME 2’s version. Here’s what I did for just a single mock sequence:

I pulled out all the sequences in the reference database that matched the species name expected for mock community member IM7 (saved as $DB below). I then pulled out the mock sequence for IM7 (saved as $QRY below) and ran standalone VSEARCH with this command:

vsearch \
--usearch_global $QRY \
--db $DB \
--id 0.97 \
--blast6out cmac.vsearch.out.txt

… and I get the following result:

Query	target	pid	qlen	mmatch	gaps	qstart	qend	sstart	send	evalue	bits
MockIM7	2549701	100.0	204	0	0	1	228	1	658	-1	0

This makes me think that there is a good alignment here… am I misreading the output?

I then took the same $QRY sequence and the reference fasta/taxonomy information and imported it into QIIME 2, then ran this code:

qiime feature-classifier classify-consensus-vsearch \
  --i-query mockIM7_seqs.qza \
  --i-reference-reads cmacRefs_seqs.qza \
  --i-reference-taxonomy cmacRefs_taxa.qza \
  --p-maxaccepts 1000 \
  --p-perc-identity 0.97 \
  --p-query-cov 0.94 \
  --p-strand both \
  --p-threads 24 \
  --o-classification mockIM7.vsearch_out.qza  

And sure enough, I get an Unassigned value…

Feature ID      Taxon   Confidence
MockIM7 Unassigned      1.0

I think you were suggesting that there can be an issue in an alignment-based approach when the Subject reference is much shorter than the Query being aligned/classified. In this instance, my Subject is about 650 bp long and the Query is just over 200… I don’t think that’s the situation you were describing, right?


The k-mer-based classifiers, SINTAX and Naive Bayes, are spot on. They aren’t making any odd classifications: each of the 24 expected taxonomies is similar to what these classifiers infer. The distinctions are a matter of resolution: sometimes an expected sequence is only annotated to the Order or Family level while the classifiers assign a Genus or Species (but they match at the Order/Family levels), and in other cases it’s the reverse, where an expected sequence has a known Genus or Species name but the classifiers stop at Order or Family. In any event, at the ranks where information exists for both Observed (via classifier) and Expected, the information matches.


I’m going to next expand my search to all of the sequences that VSEARCH/BLAST left Unassigned for which I have at least Genus-level taxonomic information in the mock sample. I’ll then rerun standalone VSEARCH to see if I get similar output, and also rerun QIIME 2’s VSEARCH to see if there are disagreements again…

Indeed, a perfect match. This is very strange, since, again, Unassigned indicates that vsearch does not find any hits, not that QIIME 2 is mangling things somehow during consensus assignment.

When you run classify-consensus-vsearch with --verbose, a vsearch command should be printed to stdout. Could you please run that command (with your $QRY and $DB, not the temp files listed in that command) and send me the results? (Just to make sure the exact same parameters are being used.)
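
For comparison, a standalone run with the same parameters should look something like this (reconstructed by hand from the plugin options, not copied from the --verbose output; the output filename is arbitrary):

vsearch \
--usearch_global $QRY \
--db $DB \
--id 0.97 \
--query_cov 0.94 \
--maxaccepts 1000 \
--strand both \
--threads 24 \
--blast6out mockIM7.compare.out.txt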

:+1: let me know what you find!

I think I'm on to something, and it has to do with the query coverage parameter. Perhaps I wasn't being extreme enough in my settings...

I ran the entire 24-member mock community fasta through standalone VSEARCH. In my haze of confusion today, I neglected to specify any query coverage parameter in this code:

DB=path_to/referenceDB.fasta
QRY=path_to/mock_seqs.fasta
VSEARCH=path_to/vsearch-2.13.4/bin/vsearch

$VSEARCH \
--usearch_global $QRY \
--db $DB \
--id 0.97 \
--blast6out allMock.vsearch.out.txt

What's clear is that every one of my mock sequences has a strong match to at least one reference sequence when no query coverage parameter is included. I've added *** in the second field wherever a mock sequence was Unassigned by QIIME 2's VSEARCH:

MockAlias	UnAsn	pid	qlen	mmatch	gaps	qstart	qend	sstart	send
MockIM10		99.6	228	1	0	1	228	1	1534
MockIM15	***	99	205	1	1	1	229	1	655
MockIM16	***	99.5	203	1	0	1	228	1	582
MockIM20		99.6	228	1	0	1	228	1	969
MockIM23		98.1	215	4	0	1	228	1	683
MockIM27	***	99	204	2	0	1	228	1	658
MockIM28		99.1	228	1	1	1	227	1	1531
MockIM29	***	100	204	0	0	1	228	1	572
MockIM3		100	209	0	0	1	228	1	584
MockIM32	***	99.5	204	1	0	1	228	1	658
MockIM33	***	100	203	0	0	1	228	1	561
MockIM39	***	100	206	0	0	1	228	1	660
MockIM4	***	100	204	0	0	1	228	1	658
MockIM40		98.7	228	3	0	1	228	1	1536
MockIM42		99.5	204	1	0	1	228	1	658
MockIM44	***	100	204	0	0	1	228	1	543
MockIM46	***	100	204	0	0	1	228	1	658
MockIM47	***	99	205	1	1	1	227	1	584
MockIM49		99.6	223	1	0	1	228	1	638
MockIM5		100	205	0	0	1	228	1	563
MockIM52	***	99.5	204	1	0	1	228	1	621
MockIM53	***	100	204	0	0	1	228	1	571
MockIM7	***	100	204	0	0	1	228	1	658

I then realized I was missing that parameter and included it to see if there would be any difference, and yes, there was...
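
That is, the same standalone command as above with the coverage flag added (0.94, to mirror the QIIME 2 runs; the output name is just a new placeholder):

$VSEARCH \
--usearch_global $QRY \
--db $DB \
--id 0.97 \
--query_cov 0.94 \
--blast6out allMock.qcov.vsearch.out.txt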

Side note for @colinbrislawn or any other VSEARCH pros: I can't tell from the VSEARCH manual whether the query coverage parameter has some default value or is off completely by default. Any ideas?

In my original QIIME 2 BLAST/VSEARCH runs, 9 of the 24 sequences were assigned. In this standalone VSEARCH run where I did specify the query coverage, 8 of those 9 assigned sequences are identified again:

MockAlias	pid	qlen	mmatch	gaps	qstart	qend	sstart	send
MockIM3	99.5	215	1	0	1	228	1	681
MockIM21	99.5	215	1	0	1	228	1	683
MockIM23	98.1	215	4	0	1	228	1	683
MockIM20	99.6	228	1	0	1	228	1	969
MockIM49	99.6	223	1	0	1	228	1	638
MockIM28	99.1	228	1	1	1	227	1	1531
MockIM40	98.7	228	3	0	1	228	1	1536
MockIM10	99.6	228	1	0	1	228	1	1534
MockIM42	97.7	218	5	0	1	228	1	609

This seems to suggest that there is something different about the mock sequences that are classified relative to those that are not, and it must have something to do with how query coverage is estimated. VSEARCH's manual suggests that coverage is just ((matches + mismatches) / query sequence length), but there doesn't seem to be anything particularly different between the sequences getting the Unassigned designation and those that get classified.

I went back and ran the QIIME 2 version of VSEARCH once more, and used a query coverage of just --p-query-cov 0.5, and sure enough every one of the mock sequences is assigned some sort of taxonomy, but this clearly comes with the potential issue of having multiple hits with just half of their sequence aligning.
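
(That run was just the original command with the coverage parameter lowered; the output name is a placeholder:)

qiime feature-classifier classify-consensus-vsearch \
  --i-query path/to/mockIM4_seqs.qza \
  --i-reference-reads path/to/boldCOI.derep.seqs.qza \
  --i-reference-taxonomy path/to/boldCOI.derep.tax.qza \
  --p-maxaccepts 1000 \
  --p-perc-identity 0.97 \
  --p-query-cov 0.5 \
  --p-strand both \
  --p-threads 12 \
  --o-classification mockIM4.vsearch_qcov05.qza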

I then went a bit further and tried to figure out what the maximum query coverage parameter value would be that would classify all 24 mock sequences, and somewhere between 0.85 and 0.9 is the answer.

To bring this back to @Nicholas_Bokulich's original comment: I don't think these Unassigned results come from the reference sequences being very different from the mock queries I'm aligning; there must be something about the length of these alignments that determines whether or not they pass the VSEARCH test. Query coverage strikes me as an essential parameter for avoiding classification of partial fragments rather than the majority of a sequence, but it's not clear to me how to determine what the value should be. Even with a mock community where I expected complete overlap (and my reading of these alignments suggests the representative mock sequences do overlap almost completely), some pass the query coverage filter while others don't...

I wonder how best to address the coverage parameter here. All mock sequences are about the same length, and all seem to have very strong percent identities with just a handful (or zero) of mismatches/gaps... yet half of these sequences aren't aligning, likely because the query coverage parameter is too high. I would have thought the ones failing to align would have many more mismatches/gaps, but I don't think the data support that.

Thanks for any future pointers

Indeed. Notably, all mock seqs that are unclassified have a qend that is > qlen (even > qlen + mmatch), though some of the successfully classified seqs also have lower qlens, most importantly MockIM5, which appears to be unclassified once you apply query_cov. How is it possible to have qend > qlen? Perhaps I misunderstand what this output really indicates:

qhi: last nucleotide of the query aligned with the target. Always equal to the length of the pairwise alignment, 0 otherwise (see qihi to ignore terminal gaps).

But I suspect you may have just mislabeled your columns (it looks like you used BLAST-style labels, not vsearch's column labels), and the column labeled qlen is really alnlen:

alnlen: length of the query-target alignment (number of columns). The field is set to 0 if there is no alignment.

In which case this all makes sense: those sequences are in fact failing to classify because their qcov falls below query_cov. Take MockIM7, for example: 204 aligned columns (the mislabeled qlen) over a 228 bp query gives a coverage of 204/228 ≈ 0.89, just under your 0.94 cutoff, and right in line with your observation that a cutoff somewhere between 0.85 and 0.9 rescues all 24 sequences.

Aha, but it is: there is a small (but significant!) portion of these sequences that do not align.

that is what is occurring

I think a "reasonable" setting for this parameter is really context dependent. The fact that you have mock sequences that are failing to classify suggests either:

  1. COI has some sort of hypervariability that necessitates dropping coverage a bit to accommodate
  2. some of these sequences are actually chimeras, or some other sequence/prep artifact? (this assumes that these are actual amplicon metabarcoding reads and not the known sequences of the species you tossed into your mock community)

I'd cautiously assume #1 for now... but first maybe check out the portion of the sequences that does not align to rule out #2.

Got it - thanks for clarifying all of the above points!

I suspect what might be going on is that the primer used to generate these mock sequences sits slightly upstream of the primer used to generate the majority of the references.
Imagine the following scenario:

#-----------------------------------full COI fragment --------------------------------------------------#
#----^mock_seqs^--------------------------------------------------------------------------------------#
#-----------^reference_seqs____^------------------------------------------------------------------#

In this case, the mock primer amplifies an ~200 bp product, but it starts earlier than the primer used for most/all of the reference sequences in the database. Because of this, the actual overlap between a mock sequence and a reference is always a bit shorter than the mock's full length.

The author who developed the primers and mock communities made a map of where a bunch of these COI primers tend to fall, and this might be the case; see here.


The reference sequences certainly could be chimeras, but the query sequences I'm using here are amplicons that were Sanger sequenced from the individual plasmid vectors they were constructed in. The group started with a voucher arthropod, extracted DNA from that single individual, PCR amplified the COI fragment, cloned it into a vector, PCR amplified that vector (in triplicate), Sanger sequenced those vectors, and then confirmed whether or not the sequence was identical across replicates (they weren't always!).

Nevertheless, I'm just a guy who got an email with a fasta file, and I have to take their word that this is exactly what happened. My thinking is that these 24 representative sequences are about as close as you can get to a known commodity.

I never considered analyzing just how variable this section of the COI region is, and... well, it's pretty freaking divergent. A pairwise matrix of global alignment percent identities of the 24 mock samples (representing 10 unique arthropod Orders) suggests around 75-85% identity (see here), but these samples don't really have much in the way of gaps (there's just 1 in 228 bases from what I can count).
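
(In case it's useful to anyone else, a matrix like that can be generated with vsearch itself via all-vs-all global alignment; file names here are placeholders:)

vsearch --allpairs_global mock_seqs.fasta \
--acceptall \
--userout mock_pairwise.tsv \
--userfields query+target+id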

I guess this also means that a mock community is particularly valuable for informing these alignment query coverage parameters. I'm not sure what other pieces of evidence one would consider when evaluating an optimal query coverage value; it seems like a tradeoff between sensitivity and accuracy, and one that a k-mer-based classifier doesn't really have to deal with.

Well, you've solved the problem for today @Nicholas_Bokulich - so many thanks for your help!


I agree! Sounds like the ref seqs could be at fault (unless the amplicons also contain some piece of the plasmid vector???)

how much variation is there within a species?

:heart: :heart: :heart:

totally. Inter-species divergence would help decide.

or one that it sweeps under the rug, for better or for worse?

Glad I accomplished something today :wink:!

Hey guys. This is quite the thread. I just wanted to jump in and comment on vsearch defaults.

While I can’t find documentation of the defaults either, I did find this info about query coverage:

opt_query_cov = 0.0;

Looks like there is no lower limit on how much of the query has to align to the reference. My feeling is that in practice very low coverage hits would not be detected by the k-mer prefilter anyway, but no default is enforced here.

And this matches what you have observed in this thread.

Colin


While this could be possible, I mapped the forward and reverse primers I used to amplify these mock plasmids, and the 3' end is a match for the reverse complement of the reverse primer. Because I was using this same primer to generate amplicons directly from the guano too, I don't believe any portions of these proposed mock sequences contain plasmid tails.

Within a single species of a given arthropod? That sounds like a fun task to investigate, but I would imagine it varies by the type of bug. This must also be true for 16S, right? And within the particular V3/V4/etc. regions too, right?
I wonder how to programmatically attack that question... maybe (rough sketch after this list):

  1. pull out all database records with Species-level information,
  2. retain only records with at least 2 (or more?) records for the same Species,
  3. perform pairwise global alignment, then ...
  4. take the median? mean? of the percent identities? ... not sure how to approach quantifying here
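
A rough shell sketch of those four steps using vsearch's --allpairs_global plus awk (untested, and all-vs-all alignment will be very slow on a large database; this assumes the id;Class;Order;Family;Genus;Species header format shown earlier, and it reports the mean only, since a median would need sorting):

# all-vs-all global alignments, keeping full headers (species names contain spaces)
vsearch --allpairs_global boldCOI.derep.fasta \
--acceptall \
--notrunclabels \
--userout all_pairs.tsv \
--userfields query+target+id

# steps 1-4: keep pairs where both records have a non-empty, matching
# Species (the 6th ';' field), then report pair counts and mean identity
awk -F'\t' '{
  split($1, q, ";"); split($2, t, ";")
  if (q[6] != "" && q[6] == t[6]) { sum[q[6]] += $3; n[q[6]]++ }
} END {
  for (s in n) printf "%s\t%d pairs\tmean %.1f%%\n", s, n[s], sum[s] / n[s]
}' all_pairs.tsv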

Yes

definitely

definitely

Definitely!

sounds like a plan. Quantifying is the tough part... I'd say calculate each % identity and then calculate mean/median/stdev for different taxonomic groups (e.g., at phylum, class, and order level).

This has got to be a paper, but I haven't been able to find one!

I found these papers, but they are about variation on the Illumina platform, not biological variation within known species.

PLOS ONE 2017
Microbiome Quality Control, Nature

Can one of you point me to a paper that does this? Could 'The All-Species Living Tree' Project be a good resource to help test this?

If this paper has not been written, I would be interested in contributing to it. I guess I would start by dividing up silva by species, then report the mean pairwise identity within each species (full length and v4). Just like you said!

Sure... but the important thing is to statistically compare within-species variation with the mean nearest taxon index. If those numbers are not significantly different, that would demonstrate that amplicons don't represent the species concept. :scream_cat:

Paper time? Can I be on board for this? I'm interested.

Colin


This is the time where I need to remind @colinbrislawn that I’m trying to write papers about bat crap, and he keeps coming up with cool new paper ideas about microbes.

Stop it! :slight_smile:

Or maybe send a direct message offline? It sounds like something that should be addressed for lots of marker genes…

The more I think about this, the harder I think it would be to do…

The input data set should probably be full-length, raw, newest SILVA. But SILVA has variable taxonomic levels, so simply choosing what constitutes species/subspecies/strain is hard. And most reads don’t make it all the way down to the lowest taxonomic levels.

Maybe building a taxonomic benchmark is flawed because there isn’t a consensus on taxonomy.

We could pretty easily do this for those reads that do have full taxonomy, but then our results would be biased towards well-characterised microbes (likely model organisms and human pathogens).

This is harder than I thought.

Colin

Then start with animal COI maybe?
You should be able to filter records in BOLD by a few metrics that can indicate something that resembles vouchered status (photos, author records, etc.). You could look at only the full length COI records at ~650 bp (or nearly full at 500 bp).

Taxonomies are less complicated for macrofauna, right?
