Looks like the parameter is working as expected, and it very, very clearly matters. Hereās one instance for example:
The ASV, prior to implementing the query coverage length requirement produces these top 10 matches:
|8aa059190885cb9440084a9821d998bd|SSWLB7316-13;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Diptera,f:Chloropidae,g:Diplotoxa|97.8|45|1|0|1|181|1|474|-1|0|
|---|---|---|---|---|---|---|---|---|---|---|---|
|8aa059190885cb9440084a9821d998bd|SSPAC11777-13;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Diptera,f:Stratiomyidae,g:Sargus,s:Sargus|97.8|45|1|0|1|181|1|436|-1|0|
|8aa059190885cb9440084a9821d998bd|GBMH1847-06;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Orthoptera,f:Tettigoniidae,g:Banza,s:Banza|97.6|42|1|0|1|181|1|721|-1|0|
|8aa059190885cb9440084a9821d998bd|SSKJC775-15;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Diptera,f:Empididae,g:Rhamphomyia|97.4|39|1|0|1|181|1|405|-1|0|
|8aa059190885cb9440084a9821d998bd|GBCMD4841-09;tax=k:Animalia,p:Arthropoda,c:Malacostraca,o:Decapoda,f:Ocypodidae,g:Minuca,s:Minuca|97.4|39|1|0|1|181|1|522|-1|0|
|8aa059190885cb9440084a9821d998bd|ANGEN067-15;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Odonata,f:Libellulidae,g:Pantala,s:Pantala|97.2|36|1|0|1|181|1|453|-1|0|
|8aa059190885cb9440084a9821d998bd|GBGL4445-07;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Lepidoptera,f:Geometridae,g:Eupithecia,s:Eupithecia|97.2|36|1|0|1|181|1|375|-1|0|
|8aa059190885cb9440084a9821d998bd|GBCM3389-17;tax=k:Animalia,p:Arthropoda,c:Malacostraca,o:Decapoda,f:Aeglidae,g:Aegla|97.1|34|1|0|1|181|1|642|-1|0|
|8aa059190885cb9440084a9821d998bd|SAUPA241-10;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Lepidoptera|97.1|34|1|0|1|181|1|511|-1|0|
|8aa059190885cb9440084a9821d998bd|SSWLA3921-13;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Diptera,f:Tabanidae|97.1|34|1|0|1|181|1|481|-1|0|
Yep, thatās 10 unique taxa assignments, all above 97%. No wonder the consensus pushed it all the way back to just the Class level.
Once the query coverage length parameter was added, this sequence was then updated as āUndefinedā. The local vsearch output just shows this, indicating it didnāt find any subject string to match our query at the defined percent identity (97%) and query coverage length (80% or 94% both failed):
8aa059190885cb9440084a9821d998bd * 0.0 0 0 0 0 0 0 0 -1 0
It also fails regardless of the classifier - BLAST and Vsearch both set this ASV as āUnassignedā.
Iām going to run a few other taxonomy assignment tests where I reduce the database subjects according to taxonomic completeness. Even with this query coverage parameter, itās still possible to get reduced taxonomic information assigned because of competing taxonomic strings which are differentially annotated. That is, both subject strings are equally good matches (for both alignment identity and length), but one of the two subject taxonomies are incomplete - this ends up reducing the final taxonomic assignment to the query, which is possibly just an artifact of database incompleteness.
For example, Vsearch matches these possible subjects to a query ASV using our newly modified Vsearch parameters covering 97% identity and 94% query coverage:
|6cb6696e89471d5c3d6fd77c2d3ce308|CNCSO100-14;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera|100.0|181|0|0|1|181|1|580|-1|0|
|6cb6696e89471d5c3d6fd77c2d3ce308|BBMEG015-09;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera|100.0|181|0|0|1|181|1|577|-1|0|
|6cb6696e89471d5c3d6fd77c2d3ce308|TTSOW130-09;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera,f:Corydalidae,g:Chauliodes,s:Chauliodes|100.0|181|0|0|1|181|1|658|-1|0|
|6cb6696e89471d5c3d6fd77c2d3ce308|TTSOW678-12;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera,f:Corydalidae,g:Chauliodes,s:Chauliodes|99.4|181|1|0|1|181|1|658|-1|0|
|6cb6696e89471d5c3d6fd77c2d3ce308|RRGCO799-15;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera,f:Corydalidae,g:Chauliodes,s:Chauliodes|99.4|180|1|0|1|181|1|576|-1|0|
QIIMEās implementation would reduce this ASV to the Order level (o:Megaloptera) which is probably a mistake. I donāt know what youād do with bacteria, but in my case, because weāre dealing with insects here, I can go to BOLDs database for this specific incompletely-classified specimen and look at the actual voucher specimen the sequence was generated from.
Hereās the incompletely assigned one. Hereās the fully assigned one. Maybe this is a case of someone not entering all the info in their database; maybe they didnāt know what they were looking at. Either way, itās pretty clear that there is one and only one species which Vsearch has identified, but because one of the two competing subject matches isnāt fully complete in itās taxonomic identity, the final product is reduce to that least incomplete record. Bummer.
Iām aware that we could use an alternative āmost frequent record among matchesā approach rather than the consensus approach, but that doesnāt feel right and could lead to more problems.
@Nicholas_Bokulichās suggestion of just removing these cases where records are incomplete to, say, the taxonomic Order is appealing, but still feels dangerous. What if thatās the only representative match? Iād be throwing out taxonomic Order information for āUnassignedāā¦
I wondered, following such advice, just how much would I be throwing out. So I did a little calculation for reference database Iām using which is modified from BOLD. I created this reference set by taking all of BOLDās sequences that contained taxonomic information annotated as āArthropodā, then filtered out records that werenāt specific to COI, then filtered out duplicates, and some sequences that were too short. There are still a lot of sequences - 2,240,513 in total - and these reference sequences have varying degrees of taxonomic completeness. Hereās a count of how many records in that database match the following terms:
`c__;o__;f__;g__;s__` == 367 (0.016%)
`o__;f__;g__;s__` == 1770 (0.079%)
`f__;g__;s__` == 515695 (23.0%)
`g__;s__` == 961606 (42.9%)
`s__` == 1391456 (62.1%)
So, there are a tiny fraction missing Class and Order information, but a huge number of sequences missing Family, Genus, and/or Species info. How does something like this flesh out in the microbe world and databases like SILVA or Greengenes?
Iām going to run a pair of taxonomic assignment tests today using two further filtered databases: one requiring at least Family-level info (so Iāll be dropping the 23%), or one containing the Genus-level information (dropping the 42.9%). Weāll see how that performs compared to the existing one - Iām sure Iāll get a bunch more āUnassignedā, but Iām curious about how many assigned samples have more or less taxonomic information because of the LCA algorithm that reduces an ASVās assignment like with the example I mentioned above.
Last thing - is there any internal QIIME LCA script I can use? Iād like to execute it manually to generate the final taxonomic assignment directly from the stand-alone Vsearch output (the --blast6out
file). Itās been super helpful seeing that file to determine why the final product I see in QIIME is what it is. There are a few LCA scripts out there, but I just want to make sure Iām using exactly what QIIME is using.
More to follow - thanks for the help!