Finding an incorrectly formatted needle in an apparently mostly properly formatted haystack

Good morning QIIME folks,
The problem
This question relates to looking at a barplot, realizing a taxonomic identifier is labeled incorrectly, and trying to figure out why. The incorrect label is rare, however, and I can’t find it in any of the files I’ve used.

The process
I’ve constructed a custom COI database with records from both the Barcode of Life Database (BOLD) and select GenBank records. Following the tutorials and posts in this forum, I’ve created a pair of .qza artifacts representing the sequences and taxonomy strings needed to run the various feature-classifier options to assign taxonomic identities (and confidences) to my sequences. The importing process was as follows:

qiime tools import --type 'FeatureData[Sequence]' --input-path my.qiimeCOI.fa --output-path ref_seqs.qza

qiime tools import --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat --input-path my.qiimeCOI.txt --output-path ref_taxonomy.qza

In addition, I imported an OTU table (.biom format) and a dereplicated, denoised, clustered fasta file containing my representative sequences:

qiime tools import --input-path my.OTUtable.biom --type 'FeatureTable[Frequency]' --output-path my_otutable.qza

qiime tools import --input-path my.unoise.cluster.otus.fa --type 'FeatureData[Sequence]' --output-path my_seq.qza

I then assigned taxonomic information with vsearch:

qiime feature-classifier classify-consensus-vsearch --i-query my_seq.qza --i-reference-reads ref_seqs.qza --i-reference-taxonomy ref_taxonomy.qza --p-maxaccepts 500 --p-perc-identity 0.7 --p-strand both --p-threads 4 --o-classification my_tax_Vsearch.qza

Finally, I created the barplot:
qiime taxa barplot --i-table my_otutable.qza --i-taxonomy my_tax_Vsearch.qza --m-metadata-file my-metadata.txt --o-visualization my_barplot.qzv

When I took a look at the resulting my_barplot.qzv file in view.qiime2.org, I noticed that there were four categorical variables assigned at Level 1: “k__animalia” (expected!), “Unassigned” (expected!), and two unexpected ones:

  • “k__GAATTGGGACAGCCAGGCGCTCTTTTGGGGGACGATCAGATTTATAACGTGATTGTAACTGCTCATGCGT”
  • “k__GAATTAGGCCAACCAGGGGCCCTACTCGGAGATGATCAGATTTATAATGTAATTGTCACCGCTCATGCAT”

If I zoom in all the way to Level 7, I get hundreds of potential hits (nice!), yet those two unexpected labels persist. That is to say, that weird labeling is applied twice, and only twice, across all taxonomic levels.

I’ve tried looking back into the original .txt and .fa files I used for import but can’t find those sequences anywhere. I’m wondering what else folks might suggest for debugging to ascertain why those two labels are popping up?

Many thanks


Hi @devonorourke,
Interesting problem! Thank you for the detailed info on the debugging you've done so far!

This is almost certainly related to the custom reference database you've generated — those sequences must be in there somewhere.
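
If you want to double-check on your end first, one quick trick (since a .qza is just a zip archive) is to grep the taxonomy table stored inside the artifact directly. This is only a sketch; adjust the artifact filename and the sequence fragment to match your data:

# print the taxonomy TSV stored inside the artifact and count lines containing the odd label
unzip -p ref_taxonomy.qza '*/data/taxonomy.tsv' | grep -c 'GAATTGGGACAGCCAGGC'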

Would you mind sharing those files? You can send to me directly in a DM if you don't want them publicly posted.

Please also share your barplot QZV... let's start there and then work backwards.

Thanks!

Hey @Nicholas_Bokulich,
Thanks for the reply.
Let me add two other observations after continuing to work on this problem:

  1. I ran the exact same dataset through the blast classifier, and the barplot doesn’t have those two weird labels - they’re gone, and no new weirdness has arisen.
  2. I ran the exact same dataset through the same Vsearch classifier but changed two parameters (the percent identity and the number of max-accepts), and the weirdness was gone in the barplot again too!

Given all this, could you specify which files you’d like me to share?
If you send me an email to connect with you I can add you to the GitHub repo where these data live.

Happy Thursday

Hi @devonorourke,

Very interesting... I did notice that you used rather unusual settings in your initial command. What settings did you use the second time?

Let's start with:

  1. the barplot
  2. the reference my.qiimeCOI.fa
  3. the reference my.qiimeCOI.txt

I will send you a DM about getting those data.

Thanks!

Hi again @Nicholas_Bokulich,
I didn't upload the .fa or .txt files to the repo because they're pretty big, but I did upload their .qza equivalents.
You should be able to see everything here - this repo isn't private at the moment:

In terms of the settings:
qiime feature-classifier classify-consensus-vsearch --i-query Mangan_seq.qza --i-reference-reads ref_seqs.qza --i-reference-taxonomy ref_taxonomy.qza --p-maxaccepts 50000 --p-perc-identity 0.9 --p-strand both --p-threads 4 --o-classification Mangan_tax_Vsearch_p90_m50000.qza

Hi @devonorourke,
The old and new vsearch barplots appear to have used different reference files — the reference artifacts in the repo are for the new barplot.

What I can tell so far, though, is that those bizarre assignments have annotations down to species level, e.g.:
k__GAATTAGGCCAACCAGGGGCCCTACTCGGAGATGATCAGATTTATAATGTAATTGTCACCGCTCATGCAT;p__;c__;o__;f__;g__;s__

That implies that this really is an entry in the reference taxonomy file that was used for that barplot, unless something really bizarre is going on.

Maybe we can just move on, though, because it looks like you do not get those strange classifications when using the new taxonomy file. I have some thoughts about your parameter settings:

The higher perc-identity that you are using here makes more sense than 0.7. I don't know about COI, though, so maybe 0.7 is appropriate if you expect many query sequences to match the reference db poorly (e.g., if the reference is not complete).

That maxaccepts setting is bewildering, though, and is sure to reduce the level of classification. This is what the classifier does:

  1. Each query sequence Q will align against each reference sequence R to generate N alignments
  2. Alignments are ranked by score, and only those at or above perc-identity are retained.
  3. The top maxaccepts alignments are retained (as you would expect; blast does not actually work this way, btw, and I mention that because perhaps you are aware of blast's behavior and are setting a very high maxaccepts to avoid blast-like behavior. vsearch maxaccepts works intuitively).
  4. Taxonomy is assigned based on the consensus taxonomy across all retained alignments — this is performed at each rank starting at the basal level, and assigned to whatever the majority taxonomy is, if the majority is ≥ min-consensus (0.51 by default).

So if maxaccepts = 50000 and min-consensus = 0.51, and assuming perc-identity does not reduce the number of alignments, you will need to have a minimum of 50000 * 0.51 = 25500 reference sequences that share the same taxonomy at level L to assign taxonomy at this level. That is more likely at basal ranks, but extremely unlikely at genus and species levels, which could be why you don't ever see any genus or species assignments.
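
To make the consensus step concrete, here is a purely hypothetical illustration; retained_hits.txt is a made-up file standing in for the taxonomy strings (k__...;p__...;...;s__...) of the alignments that survive steps 1-3 for a single query:

# tally the genus-level labels (6th ';'-delimited field) among the retained hits;
# a genus is assigned only if the top tally divided by the total number of hits is >= min-consensus (0.51 by default)
cut -d ';' -f 6 retained_hits.txt | sort | uniq -c | sort -rn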

So I would recommend dialing that setting back to something more reasonable.

This area is actually very ripe for optimization... it would be worth generating simulated reads or other appropriate test data to determine optimal classification settings. That would be possible using the same benchmarks we used to optimize these classifiers for 16S and ITS data; let me know if you are interested and I can point you in the right direction to get started.

I hope that helps!

Thanks @Nicholas_Bokulich,

I'm not positive which artifacts you are referring to as old and new - I'm assuming "old" was Mangan_bplot_vsearch.qzv, while "new" is Mangan_bplot_vsearch_p90m50k.qzv. The file with the p90m50k suffix is the "new" one to me, and the one with none of the weird labeling in the barplot. The "old" one is the one I see an issue with.

It's quite possible that I forgot to rerun my commands with the "old" file after noticing the issue in the barplot. Nevertheless, my first strategy when I noticed the odd label was to just run a grep search in the taxonomy and fasta text files used to create the reference and taxonomy .qza files. It didn't turn up anything.

...Odd. And yeah, I'm moving on from this for now. I'll let you know if it pops up again.

Regarding the parameter settings, thanks for your explanation. I'm really interested in trying to figure out why blast gives a different result from vsearch for a given OTU. Specifically, I want to know why a given OTU might be identified by one classifier as an insect and by the other as a mammal. That seems too far apart to me, and I figured I was doing something wrong.
For a bit of context, I work with bat guano. The same primer set can amplify both the arthropods that the bats eat and the bat DNA itself. The COI database I use consists of a custom mix of BOLD sequences and GenBank records. These sequence records have been dereplicated and truncated to a length defined by the primer sequences I used in generating the amplicons. In all, it's about 250,000 unique sequences, with probably only 100 or so of those being bats (it's basically a bug database with a hint of bat).
What's been frustrating me is that I've found OTUs being assigned by vsearch as an arthropod, but by blast as a mammal. Not just any mammal, but a bat. And not just any bat, but the exact bat species I know should be in our sample. The blast search is absolutely correct in this one OTU instance.

Okay, with that context, hopefully this explains my stupid parameters. I thought what I was doing with my vsearch perc-identity parameter was making it more stringent for a "hit" in the global alignment search to be retained. I think that's right.
It sounds like what QIIME is doing with maxaccepts is not what I had predicted given this post on the vsearch Github issues page. I was reading that post thinking that a value of --maxaccepts 1 would keep the first hit that the alignment deemed above my defined percent identity, then assign that taxonomic identity to the OTU/sequence in question, then move on to the next OTU. Looking specifically at one of the last comments from the developer in that thread:

You could also use --maxaccepts 0 --maxrejects 0, which will cause vsearch to consider all database sequences. It will take much longer as all the heuristics are bypassed.

That is, in part, why I'm a bit confused about the documentation in QIIME. In your current documentation for the vsearch plugin you mention that the --p-maxaccepts parameter has a range from 0 to infinity. So I initially entered 0, but it got rejected:

Error: Invalid value for "--p-maxaccepts": 0 is smaller than the minimum valid value 1.

So, I assumed that if I couldn't enter zero, I should enter a huge number like 50,000, thinking that at least then it would search through and retain half of my database before picking the best winners.

This is where I'm really confused about the math you've suggested. What I want is the most stringent terms possible, at least for one pass of the data, to determine which OTUs are absolutely bat-associated sequences.

... Does this make sense?

I think the thing I should mention in all of this is that I just want to get the OTU assignments as correct as possible. I didn't have any reason to suspect my earlier work with vsearch was doing anything wrong until I started comparing vsearch to blast. I do have a very good idea as to what bat species should be in the guano, and vsearch just seems to be wrong with respect to the OTUs assigned to bats. As far as the arthropods... bats are basically flying vacuums and have a huge diversity in diet - it seems to span spatial, temporal, and host boundaries (so far as I can tell from the thousands of pieces of crap I've sequenced).

Regarding benchmarking, I do conveniently have some mock communities that have been spiked into some of my datasets. These consist of about 20 known arthropod sequences, and I could likely use them to run the benchmarking tools your team has so wonderfully created. Feel free to follow up with me if there's more to know beyond what's provided in the link you sent.

Thanks for your reply. Oh, and if you happen to know Jeff Foster at NAU, give him a high five for me, as he's my once and former advisor.

Yes, I was using those same definitions.

Thanks! Do let me know.

Is this NCBI blast or classify-consensus-blast or standalone blastn?

Yes, that will definitely make it more stringent (though percent-id is better for toggling stringency)... but since taxonomy is based on consensus here that is also causing problems for you! Here's what is happening:

  1. 250k unique seqs (let's say 1% == 2.5k of these are bat, the rest are arthropod)
  2. You grab 50k of these
  3. 2.5k will be bat (presumably these should be the top hits!) and the remainder will be bug (provided they are above the similarity threshold)
  4. Your query will be classified as some type of arthropod because 47.5k/50k = 0.95 >> 0.51 (the default consensus threshold)

So this is really a quirk of your database and the characteristics of the sequences and their similarities... but a really cool "edge case" for testing the limits of this classifier! Also makes a great test for optimization since you have very distinct species that are evidently somewhat similar at this locus...

We use maxrejects=0, so the whole database is searched for matches before ranking the top hits, so using maxaccepts=1 will find you the true top hit (unless I am missing something!).

But that is a great question! And actually blast maxaccepts works in that non-intuitive way (it takes the first N hits that exceed percent id, then skips the rest, instead of finding the true top hits), so great sleuthing!!!

Oops, that's just a very minor bug in setting the parameter limits, and I have raised an issue to fix it here. You solved it with the correct workaround — setting a very high number for maxaccepts.

Presumably bat sequences will be very similar to the ref... so you can take a first pass with the classifier using a low maxaccepts and high percent-id to match bat seqs. Better yet, since you probably just want to remove the bat seqs (they are host DNA), use qiime quality-control exclude-seqs to strip them out before classification!

Then do the real round of classification. You should probably use parameters similar to the defaults... use a smaller maxaccepts, and use percent-id to limit the hits to relevant matches.
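
For what it's worth, a rough sketch of that two-pass idea is below. Here bat_refs.qza is a hypothetical artifact containing just your ~100 bat reference sequences (you would need to pull those out yourself), and the flag names are worth double-checking against qiime quality-control exclude-seqs --help:

# first pass: pull out the query sequences that closely match the bat references
qiime quality-control exclude-seqs --i-query-sequences my_seq.qza --i-reference-sequences bat_refs.qza --p-method vsearch --p-perc-identity 0.98 --o-sequence-hits bat_hits.qza --o-sequence-misses bug_seqs.qza

# second pass: classify the remaining (presumably arthropod) sequences with near-default settings
qiime feature-classifier classify-consensus-vsearch --i-query bug_seqs.qza --i-reference-reads ref_seqs.qza --i-reference-taxonomy ref_taxonomy.qza --p-maxaccepts 10 --p-perc-identity 0.9 --p-strand both --p-threads 4 --o-classification bug_tax_vsearch.qza

The bat_hits.qza output also gives you the list of sequences that matched the bat references, if you want to keep track of which OTUs are host-derived.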

percent-id will be the parameter to focus on for correctness. If you think that the samples contain many species that are not in the ref (or notice that empirically via high numbers of unassigned), then you need to dial back the percent-id and increase maxaccepts to get a broader consensus.

Awesome! That would be a great dataset to add to our mock community repository if you would like to share (probably after publishing).

This thread discusses "getting started" — tax-credit was a sprawling project and I've done my best to make it easy to navigate but there is still a lot there. Follow the instructions in the READMEs instead of trying to navigate the notebooks on your own!

Feel free to get in touch with me over the forum or directly via email if you have any questions or issues, I would be very glad to help you get this running through tax-credit, and am personally interested in the optimal settings for COI.

Good luck!

I'm starting to think this would have been better with a phone call!

Thanks @Nicholas_Bokulich for the detailed responses.


Is this NCBI blast or classify-consensus-blast or standalone blastn?

This phenomenon occurs with NCBI blast and classify-consensus-blast. I haven't tested blastn from a local database.


Also makes a great test for optimization since you have very distinct species that are evidently somewhat similar at this locus

I should add that the query length is just 180 bp, and the database I'm using is formatted to likewise fall within that range (the subject lengths in the database can vary because they are trimmed by primer site recognition, so an indel here and there can cause this to fluctuate a bit).


actually blast maxaccepts works in that non-intuitive way

Really glad you pointed this out - never would have known. This leads me to wonder two things:

  1. Is it possible for the documentation to be more explicit about the differences between these two classifiers? It's obviously incumbent upon the user to read as much as possible before using the tools, but given that QIIME harmonizes the terms between the vsearch and blast classifier plugins while those terms can mean very different things in outcome, perhaps a little **WARNING** or something would be helpful somewhere.
  2. It didn't occur to me to just search for the source code and check for myself what's going on under the hood. I found your GitHub repo right away, but couldn't find the script for this plugin specifically. It would be really helpful to have a link from the classifier documentation to that source code. Given that you've very smartly version controlled your documentation, I'm guessing linking it to the commit version on GitHub wouldn't be a huge burden (but what do I know!)

We use maxrejects=0, so the whole database is searched for matches before ranking the top hits, so using maxaccepts=1 will find you the true top hit (unless if I am missing something!).

I think we're talking about two different parameters here. Maybe it was just a typo on your end? Did you mean maxaccepts=0 instead of maxrejects=0? I thought I should be principally concerned with maxaccepts, not maxrejects. Maybe I have it backwards. The bug I referred to (and the documentation as it exists) for maxaccepts suggests that your default value for this is 10, not 0.

--p-maxaccepts INTEGER RANGE    Maximum number of hits to keep for each query. Must be in range [0, infinity]. [default: 10]

I don't see any --maxrejects available in the plugin documentation, but it clearly is an option in vsearch's own source code.

Thanks for clarifying once more what the source code says, both for the documented maxaccepts and the undocumented (or at least, hidden from my old eyes) maxrejects.

Assuming that I've picked up on how the vsearch plugin is working in QIIME, and assuming the default parameters set maxrejects and maxaccepts to 0, the expected workflow for me would be:

  1. Perform a search through my entire database and identify every sequence that passes above whatever percent-id boundary I set. I would imagine my first pass would be to set a very high stringency - something like 98 or 99% - and identify any bat-associated sequence variants.
  2. I'd then use qiime quality-control exclude-seqs to filter out those bat sequences, and perhaps dial back the percent-id to something like 96 or 97.

Finally, can you please clarify what the fourth column, Confidence, represents in these taxonomic assignment .qzv files (from qiime metadata tabulate)? How is it calculated exactly? Apologies if I missed the documentation referring to it.

Thanks very much

yep, we have an open issue to clarify the difference.

No, I meant maxrejects=0. That parameter is hardcoded under the hood, so if you look back at that vsearch issues quote you posted, you will see what I mean about its behavior.

No, maxrejects=0 by default and maxaccepts=10, so it searches the entire database and takes the top 10 alignments.
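
To make that concrete, the underlying vsearch call is roughly of this shape (this is a paraphrase, not the literal command the plugin builds; the exact flags live in the q2-feature-classifier source):

# query_seqs.fasta, ref_seqs.fasta, and hits.tsv are placeholders;
# --id and --maxaccepts mirror --p-perc-identity and --p-maxaccepts, while --maxrejects 0 is the hardcoded part that forces a full database scan
vsearch --usearch_global query_seqs.fasta --db ref_seqs.fasta --id 0.9 --maxaccepts 10 --maxrejects 0 --strand both --threads 4 --blast6out hits.tsv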

no, do this:

  1. use exclude-seqs to remove bat seqs... 97% or 98% should be fine
  2. then use classify-consensus-vsearch

In the consensus classifiers, Confidence, I believe, indicates the fraction of retained hits that agree on that taxonomy label; for example, if 8 of 10 retained hits agree at genus level, that genus is reported with a confidence of 0.8. So it's not a very clear term. We should probably raise an issue for that too :smile:

By the way, we are always open to contributions if you want to tackle fixing any of these issues and submit a PR! :wink: :pray:

12 off-topic replies have been split into a new topic: How to create a dereplicated sequence reference database for taxonomy classification: case of COI

Please keep replies on-topic in the future.

An off-topic reply has been merged into an existing topic: How to create a dereplicated sequence reference database for taxonomy classification: case of COI

Please keep replies on-topic in the future.

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.