Looks like the parameter is working as expected, and it very, very clearly matters. Here’s one instance for example:
Prior to implementing the query coverage length requirement, this ASV produced these top 10 matches:
Yep, that’s 10 unique taxa assignments, all above 97%. No wonder the consensus pushed it all the way back to just the Class level.
Once the query coverage length parameter was added, this sequence was reassigned as “Unassigned”. The stand-alone vsearch output shows only this line, indicating it found no subject sequence matching our query at the defined percent identity (97%) and query coverage length (80% and 94% both failed):
```
8aa059190885cb9440084a9821d998bd * 0.0 0 0 0 0 0 0 0 -1 0
```
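For reference, `--blast6out` is 12 tab-separated columns (query, target, %identity, alignment length, mismatches, gap opens, query start/end, target start/end, e-value, bit score), and vsearch writes `*` in the target column for no-hit records when `--output_no_hits` is enabled. A quick sketch (file name hypothetical) for pulling the unmatched queries out of that file:

```python
import csv

def no_hit_queries(blast6_path):
    """Return query IDs that vsearch reported with no accepted hit ('*')."""
    unassigned = []
    with open(blast6_path) as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if row and row[1] == "*":
                unassigned.append(row[0])
    return unassigned
```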
It also fails regardless of the classifier - BLAST and Vsearch both set this ASV as “Unassigned”.
I’m going to run a few other taxonomy assignment tests where I reduce the database subjects according to taxonomic completeness. Even with this query coverage parameter, it’s still possible to get reduced taxonomic information assigned because of competing taxonomic strings that are differentially annotated. That is, two subject sequences are equally good matches (for both alignment identity and length), but one of the two subject taxonomies is incomplete; this reduces the final taxonomic assignment of the query, possibly as a pure artifact of database incompleteness.
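To make that failure mode concrete, here’s a toy sketch (taxonomy strings invented, not the actual matches) of the rank-by-rank consensus behavior described: with only two equally good hits, any disagreement at a rank, including an empty field, truncates the assignment at that rank.

```python
def strict_lca(taxonomies):
    """Keep ranks only while every matched taxonomy string agrees."""
    split = [t.split(";") for t in taxonomies]
    consensus = []
    for ranks in zip(*split):
        if len(set(ranks)) == 1 and ranks[0]:
            consensus.append(ranks[0])
        else:
            break
    return ";".join(consensus)

# Two equally good matches, one with empty family/genus/species fields:
# the consensus falls all the way back to the Order level.
complete = "k__Animalia;p__Arthropoda;c__Insecta;o__Megaloptera;f__Corydalidae;g__Corydalus;s__cornutus"
partial = "k__Animalia;p__Arthropoda;c__Insecta;o__Megaloptera;f__;g__;s__"
```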
For example, using our newly modified Vsearch parameters (97% identity and 94% query coverage), Vsearch matches these possible subjects to a query ASV:
QIIME’s implementation would reduce this ASV to the Order level (o:Megaloptera), which is probably a mistake. I don’t know what you’d do with bacteria, but in my case, because we’re dealing with insects, I can go to BOLD’s database for this specific incompletely classified specimen and look at the actual voucher specimen the sequence was generated from.
Here’s the incompletely assigned one. Here’s the fully assigned one. Maybe this is a case of someone not entering all the info into the database; maybe they didn’t know what they were looking at. Either way, it’s pretty clear that there is one and only one species that Vsearch has identified, but because one of the two competing subject matches is incomplete in its taxonomic annotation, the final assignment is reduced to match that less complete record. Bummer.
I’m aware that we could use an alternative “most frequent record among matches” approach rather than the consensus approach, but that doesn’t feel right and could lead to more problems.
@Nicholas_Bokulich’s suggestion of just removing these cases where records are incomplete to, say, the taxonomic Order is appealing, but still feels dangerous. What if that’s the only representative match? I’d be throwing out taxonomic Order information for “Unassigned”…
I wondered, if I followed that advice, just how much I would be throwing out. So I did a little calculation for the reference database I’m using, which is modified from BOLD. I created this reference set by taking all of BOLD’s sequences with taxonomic information annotated as “Arthropod”, then filtering out records that weren’t specific to COI, duplicates, and sequences that were too short. That still leaves a lot of sequences - 2,240,513 in total - with varying degrees of taxonomic completeness. Here’s a count of how many records in that database match the following terms:
`c__;o__;f__;g__;s__` == 367 (0.016%)
`o__;f__;g__;s__` == 1770 (0.079%)
`f__;g__;s__` == 515695 (23.0%)
`g__;s__` == 961606 (42.9%)
`s__` == 1391456 (62.1%)
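For what it’s worth, this is roughly how those were tallied (assuming a headerless two-column TSV of sequence ID and taxonomy string, as in a QIIME 2 reference taxonomy). Note the counts are nested: every record matching `g__;s__` also matches `s__`, which is why the percentages climb.

```python
from collections import Counter

SUFFIXES = [
    "c__;o__;f__;g__;s__",
    "o__;f__;g__;s__",
    "f__;g__;s__",
    "g__;s__",
    "s__",
]

def completeness_counts(taxonomy_tsv):
    """Count records whose taxonomy string ends in each empty-rank suffix."""
    counts = Counter()
    total = 0
    with open(taxonomy_tsv) as handle:
        for line in handle:
            taxon = line.rstrip("\n").split("\t", 1)[1]
            total += 1
            for suffix in SUFFIXES:
                if taxon.endswith(suffix):
                    counts[suffix] += 1
    return total, counts
```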
So, only a tiny fraction is missing Class and Order information, but a huge number of sequences are missing Family, Genus, and/or Species info. How does something like this shake out in the microbial world with databases like SILVA or Greengenes?
I’m going to run a pair of taxonomic assignment tests today using two further-filtered databases: one requiring at least Family-level info (so I’ll be dropping the 23%), and one requiring at least Genus-level info (dropping the 42.9%). We’ll see how those perform compared to the existing database - I’m sure I’ll get a bunch more “Unassigned”, but I’m curious how many assigned ASVs end up with more or less taxonomic information because of the LCA algorithm that reduces an ASV’s assignment, as in the example I mentioned above.
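The filtering step itself is simple enough to sketch (rank prefixes like `f__`/`g__` assumed; the matching FASTA would need to be filtered to the same IDs separately):

```python
def has_rank(taxon, prefix):
    """True if the semicolon-delimited taxonomy has a non-empty field for `prefix` (e.g. 'f__')."""
    return any(field.startswith(prefix) and len(field) > len(prefix)
               for field in (f.strip() for f in taxon.split(";")))

def filter_by_rank(records, prefix="f__"):
    """Keep (seq_id, taxon) pairs that carry at least `prefix`-level info."""
    return [(sid, tax) for sid, tax in records if has_rank(tax, prefix)]
```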
Last thing - is there any internal QIIME LCA script I can use? I’d like to execute it manually to generate the final taxonomic assignment directly from the stand-alone Vsearch output (the `--blast6out` file). Seeing that file has been super helpful for determining why the final product I see in QIIME is what it is. There are a few LCA scripts out there, but I just want to make sure I’m using exactly what QIIME is using.
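While waiting on an answer, here’s a minimal reimplementation of a majority-vote consensus with a `min_consensus` threshold (QIIME’s documented default for the consensus classifiers is 0.51), which can be run over the taxonomy strings of the hits for one query from a `--blast6out` file. This is a sketch, not QIIME’s actual code, so edge cases may differ:

```python
from collections import Counter

def consensus_taxonomy(taxonomies, min_consensus=0.51):
    """Majority-vote consensus over semicolon-delimited taxonomy strings."""
    if not taxonomies:
        return "Unassigned"
    split = [t.split(";") for t in taxonomies]
    n = len(split)
    consensus = []
    for ranks in zip(*split):  # stops at the shortest string
        label, count = Counter(ranks).most_common(1)[0]
        if label and count / n >= min_consensus:
            consensus.append(label)
        else:
            break
    return ";".join(consensus) if consensus else "Unassigned"
```

With two equally good hits and a 0.51 threshold, a 1-of-2 split (0.5) fails the threshold, which reproduces the truncation behavior discussed above.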
More to follow - thanks for the help!