Sir, there's no fly in my digital soup. what gives?

devonorourke · November 30, 2018, 5:45pm

Since classifying my sequences with vsearch, I've started to wonder how the completeness of my taxonomy assignment varied as a function of the frequency of an ASV being detected, as well as the number of reads associated with that ASV.
I had no idea what to expect. On one hand, if a sequence is found in many of my samples (high frequency) and amplifies really well (high number of reads), then it's pretty likely this COI sequence has been seen by someone else before. Because I'm dealing with bugs found in New Hampshire, not some extreme environment filled with unexplored archaea, my guess is that the stuff that is most detectable is likely going to have voucher specimens that were part of the database I started with to assign taxonomy.
On the other hand, maybe not!?! Maybe the stuff that generates the most reads is some weird chimera that slipped through the DADA2 pipeline.

So I started by taking the original dataset - then filtered to retain all my ASVs which were at least assigned some level of taxonomic information. That is, any ASV that was "Unassigned" was removed. There was still over 10,000 ASVs!
I then filtered the ASVs by completeness of taxonomic information required. So, to be in the Order dataset, the ASV had to at least have taxonomy information at the Class and Order level, but could be missing data at the Family, Genus, and Species levels. I didn't use QIIMEs filtering tools, but it functions just the same. There is also a Family dataset where the ASV had to at least be complete up to the Family level (so, could be included if missing Genus or _Species), as well as a Genus dataset (can be missing only Species info).

I then made a plot with each of those 4 datasets. The x-axis represents the number of times that ASV was detected, and the y-axis represents the number of total sequence reads attributed to that ASV across all my samples. The points themselves are thus a representation of a particular ASV for the entire dataset. The point here is that if you require certain amounts of taxonomic information, you'd notice that I lose quite a few of my key data points! The colored points represent the 5 most frequent taxa I detected. Only 1 of 5 contain mostly complete taxonomic information. You can see their Order | Genus names in the legend:

I recovered each of those sequences with BLAST searches; 4 of the 5 samples were somewhere between 99-100% alignment identity and length, and nothing else was close (the one sample which was only 90% identical is highlighted with an asterisk).

So why would these super frequent ASVs be missing so much taxonomic information?

They are present in the reference database. In some cases there are multiple unique sequences for the same taxonomy.
These sequences BLAST to the same taxonomy within NCBI. It doesn't appear they are incorrectly curated.

The one that is most confusing to me is the Megaloptera | Chauliodes taxonomy. One ASV is assigned by Vsearch with the appropriate information, but then another ASV is totally missing it. The two ASVs that NCBI think are both Chauliodes are almost identical - just 1bp difference between the 181bp amplicon.

Why would it behave so differently for nearly identical sequence?

I'm still investigating the other situations, but figured I'd post now in hopes of someone passing along some insights into the error of my ways.

Thanks!

Nicholas_Bokulich · November 30, 2018, 6:24pm

Hi @devonorourke,
I am not sure that this shows that there is a relationship between frequency and taxonomic completeness. Your top 5 most abundant features are clearly outliers so that the 2-5th most abundant drop out at family level is not really an indicator of any relationship. Plot taxonomic completeness vs. frequency to assess the relationship.

There's one detail about your analysis that is confusing me: in the plot, you have the top 5 most abundant ASVs colored and labeled by order and genus affiliation. Is that label coming from re-assigning these with top-hit NCBI BLAST? I assume so, since if they all have genus labels they should all pass your filters and appear in all plots.

In general, how are you doing vsearch and blast classification? I assume blast is NCBI BLAST against the default database — but pls clarify if you are using a custom db for that. What parameters are you using for vsearch?

I ask about vsearch parameters because in a previous post you were using very high maxaccepts parameters, which will reduce the risk of overclassification but only at the cost of high underclassification (the problem you are having with these top 5).

Bottom line: use vsearch outside of QIIME 2 to reconstruct the alignment process. Use the sequences for those 5 ASVs as input. Use the exact same db as a reference fasta. Use the exact same parameters, drop them into this command:

vsearch --usearch_global seqs_fp \
    --id perc_identity \
    --strand strand \
    --maxaccepts maxaccepts \
    --maxrejects 0 \
    --output_no_hits \
    --db ref_fp \
    --threads threads \
    --blast6out output-filepath

This will let you see what your sequences are aligning against with the given params, and adjust accordingly. Chances are your maxaccepts are too high, or you have incompletely annotated sequences in your reference.

devonorourke · November 30, 2018, 6:44pm

Thanks @Nicholas_Bokulich - indeed, the earlier threads in which you helped me through were instructive on not setting such a high --p-maxaccepts value and I have made sure that instruction was followed when I generated the data discussed in the current post.

These data were generated by first identifying and removing any host-associated COI reads. That filtered feature table was the object used in my vsearch classification. The database used for taxonomic assignment is comprised of arthropod sequences only (an earlier iteration included both arthropod and chordate). The parameter for vsearch in this instance were the defaults:

qiime feature-classifier classify-consensus-vsearch \
--i-query ../reads/arthseqs.qza \
--i-reference-reads "$REF"/ref_seqs.qza \
--i-reference-taxonomy "$REF"/ref_taxonomy.qza \
--p-perc-identity 0.97 \
--p-strand both \
--p-threads 24 \
--o-classification oro.v97.qza

Nothing special there, I don't suspect.

Regarding your questions about the plot and the labeling:

The only thing I was intending with this plot was to highlight the fact that the most frequently detected ASVs are missing most of their taxonomic information, and that strikes me as odd. It's not to say the plot shows anything about a correlation - that would require the plot you suggested, and I wouldn't be just highlighting those outliers in that case.
My intention with the plot was to suggest that of my most frequent ASVs, 1 contains information through to a Genus-level classification, only 2/5 contain at least Order level, and the other 3 contain no information other than to indicate they are an insect (and weren't deemed "Unclassified"). I haven't even plotted the Unclassified ones, which I need to do now that I think of it!
The taxonomic information is generated in a generic web-based BLAST through NCBI nr database.

Thanks for the advice on running vsearch. Perhaps I should also run the same taxonomy assignment with QIIME's BLAST classifier too.

Hope that clarifies your questions. Appreciate the quick reply!

Nicholas_Bokulich · November 30, 2018, 7:06pm

Thanks for clarifying! Let me know what you find — manual VSEARCH should help tease this apart. You will probably need to either reduce maxaccepts further and/or adjust perc-identity.

devonorourke · November 30, 2018, 7:21pm

Still unclear why two nearly identical sequences would have such different classifications.
I'll be in touch!

Nicholas_Bokulich · November 30, 2018, 7:31pm

That's what standalone VSEARCH will tell us. My guess is that that 1 nt difference pushes the one seq over the 97% similarity threshold to include several other hits that may belong to a different lineage (or more likely have no/incomplete/incorrect annotation).

devonorourke · December 1, 2018, 12:17pm

Thanks again @Nicholas_Bokulich for your insights.

The thing about that 1 nt difference pushing the similarity threshold to include several other hits... it seems crazy to think that it would result in shifting the taxonomic profile of that ASV from something that includes everything up to and including species to maybe only now including a order or class level because of shared matches with other taxa. How could such distantly related things be the two top matches (and good matches at that)?

And yet, you were crazy correct!

For the ASV labeled in the figure in this post as the orange-labeled point: in QIIME, this is only marked as "Insect", but with NCBI BLAST online it clearly appears to be Megaloptera | Chauliodes - it's a near perfect match.
In the stand alone Vsearch run last night, it's quite clear why this is happening - just like you said!:

Standalone VSEARCH results for this ASV

6cb6696e89471d5c3d6fd77c2d3ce308	CNCSO100-14;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera	100.0	181	0	0	1	181	1	580	-1	0
6cb6696e89471d5c3d6fd77c2d3ce308	BBMEG015-09;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera	100.0	181	0	0	1	181	1	577	-1	0
6cb6696e89471d5c3d6fd77c2d3ce308	TTSOW130-09;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera,f:Corydalidae,g:Chauliodes,s:Chauliodes	100.0	181	0	0	1	181	1	658	-1	0
6cb6696e89471d5c3d6fd77c2d3ce308	TTSOW678-12;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera,f:Corydalidae,g:Chauliodes,s:Chauliodes	99.4	181	1	0	1	181	1	658	-1	0
6cb6696e89471d5c3d6fd77c2d3ce308	RRGCO799-15;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera,f:Corydalidae,g:Chauliodes,s:Chauliodes	99.4	180	1	0	1	181	1	576	-1	0
6cb6696e89471d5c3d6fd77c2d3ce308	MABR079-17;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Diptera,f:Psychodidae,g:Lutzomyia,s:Lutzomyia	97.7	88	2	0	1	181	1	543

There were 2 flies in my digital soup, and they were taxonomically very distinct. Because s:Lutzomyia was above a 97% threshold, and because it's in a different order from Chauliodes,, the consensus taxa step pushed the profile all the way back to a common Class-level. Yikes.
That's super disappointing when there is that big a difference in accuracy - that Chauliodes is clearly the better match - but given the paramters, the program is doing exactly what it should do.

I could up the percent identity to 98%, and in this certain circumstance, that would certainly be one way to solve the problem, but then I'm possibly tossing all those other matches where the first/best hit is <= 97.9%.

You mentioned possibly lowering the --p-max-accepts argument and in my above case I can see why that might be helpful, however, that also seems like it's dependent upon the number of common entries for the same taxonomic profile. In the example I've used here, say I set the value to --p-maxaccepts 2. This would capture the species I think it really is.
But if there was only one database entry for Chauliods and one entry for s:Lutzomyia, then I'd be back to the same problem.

I'm guessing the value of the consensus taxonomy classification approach is an acknowledgement that with short sequences we shouldn't be blindly picking the highest value because you can get nearly as high a value with an evolutionary distant taxa. If I end up using a 'most frequent taxa assignment within match parameters' approach I am equally worried that I'll get the less accurate match which is just has more representation in the database I could try to curate the database such that every taxa string had one and only one representative sequence, but then that opens up a whole set of other problems (principally, how do you choose which sequence is representative of a species?)...

I can see why all of this becomes a frustratingly long series choices - and a lot of testing. It feels like the 16S/ITS worlds have been doing this for years, and I don't know of anyone who has touched on these questions in COI (at least for arthropods). Perhaps there are simple steps to take to manage my risk other than what you've already suggested which mirror what some of the earlier 16S/ITS development followed. I'm just not sure of the history of that testing/development.

Thanks as always,

Devon

Nicholas_Bokulich · December 1, 2018, 2:21pm

The thing I like about working with the alignment+consensus classifiers is that they are fairly "transparent". I.e., that 1nt would make such a difference is surprising, but the cause is obvious. That 1nt just pushes it over the threshold, probably supporting matching to the one additional sequence needed to reduce the consensus threshold.

So you have the right idea — there are a few ways to address this (you mentioned a few, I will list + add others) but by fixing it for this specific sequence you don't know how you will impact other sequence classifications, potentially leading to poorer classifications for those.

Adjust maxaccepts. In this case even maxaccepts 5 would do the trick. But yes this would impact other classifications, potentially in bad ways.
Adjust the percent identity — but then some other sequences may become unclassified.
Database curation: you mention eliminating replicates, but as you say this will open up a big old can of worms!!! Better yet would be to add additional sequences for sparsely represented taxa (if you knew the sample size for each taxon you could set very rational maxaccepts but the uneven representation makes this tricky). But good luck with that. A third option is to remove all incompletely classified taxa. E.g., the sequence labeled only as k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera is causing problems here, and how useful a sequence like that is is questionable. I have done this for other databases with success... it may be benchmarked in tax-credit if you want to take a look.
Develop a new consensus classifier. Obviously this is not a quick fix! But classifiers that do some sort of dynamic matching, e.g., only perform consensus matching on matches that tie for percent identity. This would actually be really easy to do using the existing vsearch classifier as a template... you would just create a new consensus assignment algorithm and peg it on the back.

Indeed! Benchmark what works best for arthropod COI. We have already discussed this — I made tax-credit expressly for this purpose, so that others can re-use for this same type of benchmarking.

That's the past few years of my life .

devonorourke · December 1, 2018, 7:09pm

Thank you for all the terrific options and recommendations.

I think I'm going to have to just suck it up and benchmark all of it. I've gone this far into the weeds; I might as well go all the way through it.

One quick thing about the vsearch results though: the example I was giving in the previous post where I had two species > 97% and one of them was a better match... I'm going to highlight just three of the results:

6cb6696e89471d5c3d6fd77c2d3ce308	CNCSO100-14;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera	100.0	181	0	0	1	181	1	580	-1	0
6cb6696e89471d5c3d6fd77c2d3ce308	TTSOW130-09;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera,f:Corydalidae,g:Chauliodes,s:Chauliodes	100.0	181	0	0	1	181	1	658	-1	0
6cb6696e89471d5c3d6fd77c2d3ce308	MABR079-17;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Diptera,f:Psychodidae,g:Lutzomyia,s:Lutzomyia	97.7	88	2	0	1	181	1	543

Am I reading vsearch's manual properly and interpreting that second numerical field (the one following percent identity) to represent the alignment length? In the first two matches, with bother were aligning at 100% identity, they were matching 181 bp long (which is the exact length of my amplicon). However, in the third result, which I'm confident is the wrong assignment, the percent identity is marginally lower at 97.7% but the alignment length is much lower at just 88 bp.

This seems like the first and perhaps most sensible place to start with my filtering. It doesn't appear that there is such an option exists in vsearch - that you can specify a percent identity minimum and a minimum length of alignment overlap. I'd love to know why.

I'm guessing I can whip up a little awk command to impose such a filter on the output of the standalone vsearch to perform such a trick, but usually when I have a filtering idea that isn't already in a program, it's usually because someone else thought of it and recognized it's a bad idea. This one seems pretty straightforward though - in my circumstance where I have a 181 bp amplicon, there is no good reason to ever keep a > 97% identity match that is less than, say 100 bp, right? Or even 150 bp, or maybe even 170.

That seems like where I'd start the benchmarking. At least I can put all those mock community samples to good use finally.

Thanks again for your thorough and thoughtful replies!

Nicholas_Bokulich · December 1, 2018, 10:05pm

Good eyes. We have an old open issue to get that parameter added (the issue mentions blast, but most features we add for blast we also add for vsearch). Now that two users have pointed out issues with this, I think it may be time to add that...

query_cov sounds like what you need

This would also be a very easy fix in q2-feature-classifier — just add that parameter to this line, expose the parameter as a function argument on this line and register the parameter following the example of others here. If you want to give it a whack and submit a pull request to get it added to q2-feature-classifier you would earn my undying respect. Otherwise I can do this some time in the next week if you don't mind waiting.

I agree. In the case of amplicon classification you would probably always want very high coverage. Either way it would be useful to expose this parameter in vsearch (and blast).

yes! please do! If you are interested, you could share those mock communities in mockrobiota (we don't have any COI mock communities) and add the pre-processed data to tax-credit to make it easier to build on your benchmarks later on.

Nicholas_Bokulich · December 1, 2018, 11:14pm

@devonorourke
Never mind — I went ahead and fixed this since I realized the fix would take me less time than writing another forum post.

To install the changes, you will need to clone my branch of the q2-feature-classifier repo. (run the following in a QIIME 2 environment — you may want to conda install a new development environment or clone your existing environment just in case).

git clone https://github.com/nbokulich/q2-feature-classifier.git
cd q2-feature-classifier
git checkout issue-100
pip install -e .

You should be able to confirm that the installation worked. Running qiime feature-classifier classify-consensus-vsearch --help | grep -A 2 'query-cov' should yield the following:

  --p-query-cov FLOAT             Reject match if query alignment coverage per
                                  high-scoring pair is lower. Must be in range
                                  [0.0, 1.0].  [default: 0.8]

That's the parameter you want. Please give that a try and let me know what you find! You have a good test case to make sure this is working as intended (and if you want to do me a favor also test out classify-consensus-blast to see if it's working the same way — query_cov works a bit differently there so it might not, but it would be nice to know how it works with your test data).

devonorourke · December 2, 2018, 12:13pm

5 different runs set up this morning:

Running stand alone Vsearch again but this time incorporating the --query_cov parameter you pointed out.

2 and 3. Running QIIME Vsearch classifier with your modification to incorporate the query coverage parameter. I've set two values for testing: default at 0.8, and a higher value at 0.94.

4 and 5. Running QIIME Blast classifier. Same two values being tested for query coverage of 0.8 and 0.94.

One thing to note: there isn't any documentation about the --p-query_cov parameter in the doc page for this, but a line describing the parameter does pop up if you type the qiime feature-classifier classify-consensus-blast command in the terminal.

Hopefully I'll have something for you tomorrow. Thanks for the huge help

Nicholas_Bokulich · December 2, 2018, 2:07pm

Yep — the online docs are only updated at each release. This is not only the development (non-release) version, it is also not even merged (i.e., you are using my own personal branch of this plugin). Thanks for testing!

devonorourke · December 2, 2018, 2:28pm

I'm talking about BLAST, not Vsearch.
Did you update the query coverage option to both in that git update, or just Vsearch?
Either way, thanks!

Nicholas_Bokulich · December 2, 2018, 2:52pm

I added that parameter for both yesterday, so you should see the same difference between online docs and local help docs.

devonorourke · December 3, 2018, 2:36pm

Looks like the parameter is working as expected, and it very, very clearly matters. Here's one instance for example:

The ASV, prior to implementing the query coverage length requirement produces these top 10 matches:

|8aa059190885cb9440084a9821d998bd|SSWLB7316-13;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Diptera,f:Chloropidae,g:Diplotoxa|97.8|45|1|0|1|181|1|474|-1|0|
|---|---|---|---|---|---|---|---|---|---|---|---|
|8aa059190885cb9440084a9821d998bd|SSPAC11777-13;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Diptera,f:Stratiomyidae,g:Sargus,s:Sargus|97.8|45|1|0|1|181|1|436|-1|0|
|8aa059190885cb9440084a9821d998bd|GBMH1847-06;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Orthoptera,f:Tettigoniidae,g:Banza,s:Banza|97.6|42|1|0|1|181|1|721|-1|0|
|8aa059190885cb9440084a9821d998bd|SSKJC775-15;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Diptera,f:Empididae,g:Rhamphomyia|97.4|39|1|0|1|181|1|405|-1|0|
|8aa059190885cb9440084a9821d998bd|GBCMD4841-09;tax=k:Animalia,p:Arthropoda,c:Malacostraca,o:Decapoda,f:Ocypodidae,g:Minuca,s:Minuca|97.4|39|1|0|1|181|1|522|-1|0|
|8aa059190885cb9440084a9821d998bd|ANGEN067-15;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Odonata,f:Libellulidae,g:Pantala,s:Pantala|97.2|36|1|0|1|181|1|453|-1|0|
|8aa059190885cb9440084a9821d998bd|GBGL4445-07;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Lepidoptera,f:Geometridae,g:Eupithecia,s:Eupithecia|97.2|36|1|0|1|181|1|375|-1|0|
|8aa059190885cb9440084a9821d998bd|GBCM3389-17;tax=k:Animalia,p:Arthropoda,c:Malacostraca,o:Decapoda,f:Aeglidae,g:Aegla|97.1|34|1|0|1|181|1|642|-1|0|
|8aa059190885cb9440084a9821d998bd|SAUPA241-10;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Lepidoptera|97.1|34|1|0|1|181|1|511|-1|0|
|8aa059190885cb9440084a9821d998bd|SSWLA3921-13;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Diptera,f:Tabanidae|97.1|34|1|0|1|181|1|481|-1|0|

Yep, that's 10 unique taxa assignments, all above 97%. No wonder the consensus pushed it all the way back to just the Class level.

Once the query coverage length parameter was added, this sequence was then updated as "Undefined". The local vsearch output just shows this, indicating it didn't find any subject string to match our query at the defined percent identity (97%) and query coverage length (80% or 94% both failed):

8aa059190885cb9440084a9821d998bd	*	0.0	0	0	0	0	0	0	0	-1	0

It also fails regardless of the classifier - BLAST and Vsearch both set this ASV as "Unassigned".

I'm going to run a few other taxonomy assignment tests where I reduce the database subjects according to taxonomic completeness. Even with this query coverage parameter, it's still possible to get reduced taxonomic information assigned because of competing taxonomic strings which are differentially annotated. That is, both subject strings are equally good matches (for both alignment identity and length), but one of the two subject taxonomies are incomplete - this ends up reducing the final taxonomic assignment to the query, which is possibly just an artifact of database incompleteness.

For example, Vsearch matches these possible subjects to a query ASV using our newly modified Vsearch parameters covering 97% identity and 94% query coverage:

|6cb6696e89471d5c3d6fd77c2d3ce308|CNCSO100-14;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera|100.0|181|0|0|1|181|1|580|-1|0|
|6cb6696e89471d5c3d6fd77c2d3ce308|BBMEG015-09;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera|100.0|181|0|0|1|181|1|577|-1|0|
|6cb6696e89471d5c3d6fd77c2d3ce308|TTSOW130-09;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera,f:Corydalidae,g:Chauliodes,s:Chauliodes|100.0|181|0|0|1|181|1|658|-1|0|
|6cb6696e89471d5c3d6fd77c2d3ce308|TTSOW678-12;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera,f:Corydalidae,g:Chauliodes,s:Chauliodes|99.4|181|1|0|1|181|1|658|-1|0|
|6cb6696e89471d5c3d6fd77c2d3ce308|RRGCO799-15;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera,f:Corydalidae,g:Chauliodes,s:Chauliodes|99.4|180|1|0|1|181|1|576|-1|0|

QIIME's implementation would reduce this ASV to the Order level (o:Megaloptera) which is probably a mistake. I don't know what you'd do with bacteria, but in my case, because we're dealing with insects here, I can go to BOLDs database for this specific incompletely-classified specimen and look at the actual voucher specimen the sequence was generated from.

Here's the incompletely assigned one. Here's the fully assigned one. Maybe this is a case of someone not entering all the info in their database; maybe they didn't know what they were looking at. Either way, it's pretty clear that there is one and only one species which Vsearch has identified, but because one of the two competing subject matches isn't fully complete in it's taxonomic identity, the final product is reduce to that least incomplete record. Bummer.
I'm aware that we could use an alternative "most frequent record among matches" approach rather than the consensus approach, but that doesn't feel right and could lead to more problems.

@Nicholas_Bokulich's suggestion of just removing these cases where records are incomplete to, say, the taxonomic Order is appealing, but still feels dangerous. What if that's the only representative match? I'd be throwing out taxonomic Order information for "Unassigned"...

I wondered, following such advice, just how much would I be throwing out. So I did a little calculation for reference database I'm using which is modified from BOLD. I created this reference set by taking all of BOLD's sequences that contained taxonomic information annotated as "Arthropod", then filtered out records that weren't specific to COI, then filtered out duplicates, and some sequences that were too short. There are still a lot of sequences - 2,240,513 in total - and these reference sequences have varying degrees of taxonomic completeness. Here's a count of how many records in that database match the following terms:

`c__;o__;f__;g__;s__`   == 367      (0.016%)
`o__;f__;g__;s__`       == 1770     (0.079%)
`f__;g__;s__`           == 515695   (23.0%)
`g__;s__`               == 961606   (42.9%)
`s__`                   == 1391456  (62.1%)

So, there are a tiny fraction missing Class and Order information, but a huge number of sequences missing Family, Genus, and/or Species info. How does something like this flesh out in the microbe world and databases like SILVA or Greengenes?

I'm going to run a pair of taxonomic assignment tests today using two further filtered databases: one requiring at least Family-level info (so I'll be dropping the 23%), or one containing the Genus-level information (dropping the 42.9%). We'll see how that performs compared to the existing one - I'm sure I'll get a bunch more "Unassigned", but I'm curious about how many assigned samples have more or less taxonomic information because of the LCA algorithm that reduces an ASV's assignment like with the example I mentioned above.

Last thing - is there any internal QIIME LCA script I can use? I'd like to execute it manually to generate the final taxonomic assignment directly from the stand-alone Vsearch output (the --blast6out file). It's been super helpful seeing that file to determine why the final product I see in QIIME is what it is. There are a few LCA scripts out there, but I just want to make sure I'm using exactly what QIIME is using.

More to follow - thanks for the help!

Nicholas_Bokulich · December 3, 2018, 2:59pm

Thanks for testing and confirming!

It sounds like this new parameter is particularly important for COI — to my knowledge we do not get so many partial matches with 16S... the top hits usually wind up somewhat more tidy and concordant.

That is where the consensus threshold comes into play — QIIME 2 would only reduce to order level if you set a higher consensus threshold. At default (0.51) this would classify to k:Animalia,p:Arthropoda,c:Insecta,o:Megaloptera,f:Corydalidae,g:Chauliodes,s:Chauliodes

But of course that is just for this particular case and we could imagine other likely scenarios where the ratios are flipped.

I do not know how this database was constructed but I would still be tempted to throw out anything that does not at least have order. Why is there such a high proportion of sequences missing family and genus annotation?

The situation is similar — in greengenes, if sequences from distinct taxa are clustered into the same OTU, annotation is only given to the LCA. So in that case those sequences do seem useful to keep because they are from real sequences that just cannot be resolved from distinct taxa. In greengenes, SILVA, and others you also have variations on uncultured/unclassified/unknown/ambiguous taxa that are kept in the database but given ambiguous annotations. If these were causing me the types of problems you are having with COI, I would not hesitate to try removing those, depending on my experimental goals — a questionable and incomplete reference annotation may be worse than just having query sequences "unassigned". It is worth playing around with this, though.

Sure — you could just pull out the python code that QIIME 2 uses for LCA. It would be pretty messy, but you can modify from this function.

devonorourke · December 4, 2018, 2:28pm

Thanks @Nicholas_Bokulich for the Python script, as well as the accompanying irrational confidence boost you provide by thinking I can interpret that script and figure out how to implement it... you're too kind!

I'll see whether I can hack it soon enough with it, but in the meantime, I ran the following pair of analyses using your modified QIIME Vsearch classifier. In these two cases, I've kept the same search parameters as above: 97% identity, 94% alignment length. However, unique to these cases are the database: one has been filtered to include only those references which contain at least Family-level information, and another which contains only Genus-level information.
So, the following comparison is looking at the same input query sequences, the same percent identity and alignment length parameters. The only thing that differs is the reference database the queries are aligned. There are three we can compare:

All == all references
F-plus == references must have at least Family-level taxonomic information
- (ex. k__Animalia;p__Arthropoda;c__Insecta;o__Megaloptera;f__Corydalidae;g__;s__ would be fine, but...
  k__Animalia;p__Arthropoda;c__Insecta;o__Megaloptera;f__;g__;s__ isn't included in this database
G-plus == references that must have at least Family-level info. So...
- k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Elateridae;g__Melanotus;s__ would be fine, but ...
- k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Elateridae;g__;s__ would not be included in this dataset

Recall from an earlier post in this thread that we have lots of references that have Class and Order level info, but quite a few missing Family and Genus level info:

`c__;o__;f__;g__;s__`   == 367      (0.016%)
`o__;f__;g__;s__`       == 1770     (0.079%)
`f__;g__;s__`           == 515695   (23.0%)
`g__;s__`               == 961606   (42.9%)
`s__`                   == 1391456  (62.1%)

One way of counting up the results is to compare the taxonomic completeness of the resulting query sequences. Here's how that breaks down:

tax class    |   All    |    F-plus   |    G-plus
Class        |      0   |       0     |      0
Order        |      3   |       0     |      0
Family       |   1099   |    1092     |      0
Genus        |   2386   |    2369     |   1365
Unclassified |   8973   |    9124     |   9710
Total        |  23932   |   23932     |  23932

I think that's pretty neat - despite discarding 23% of the initial All references by removing any of those missing at least Family-level information, there is only a marginal increase in the number of Unassigned sequences. I haven't yet compared the overlap between datasets (ie. how many of these assignments are the same and how many are different, for a particular ASV), but it's encouraging to see that this particular database curation step didn't result in most of my data being unassigned.

But it sure seems to have an impact on the Genus-level database, with more than 700 ASVs being discarded relative to the full database. That's not particularly surprising on the face of it - we discard almost 43% of all sequences from the full reference database here - but it still is peculiar to me because we discarded nearly a quarter of the references with the Family-level approach, and it didn't seem to make much of a difference.

Finally, this does appear to have an impact on how the classification works for the 5 ASVs I've been discussing in the context of this post (the five colored dots at the top of this page). Taxonomic assignment changed depending on the input database used to classify:

For F-plus

ac93631979895de136137d123d1e3f6a    k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Elateridae;g__Melanotus;s__Melanotus hyslopi
8aa059190885cb9440084a9821d998bd    Unassigned
c5098366c5ba2170bb0bdbb4ebb969cf    k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__;g__;s__
6cb6696e89471d5c3d6fd77c2d3ce308    k__Animalia;p__Arthropoda;c__Insecta;o__Megaloptera;f__Corydalidae;g__Chauliodes;s__Chauliodes pectinicornis 
fceb86b3055b944f94fb703f90f5aa08    k__Animalia;p__Arthropoda;c__Insecta;o__Megaloptera;f__Corydalidae;g__Chauliodes;s__Chauliodes pectinicornis

For G-plus

8aa059190885cb9440084a9821d998bd    Unassigned
c5098366c5ba2170bb0bdbb4ebb969cf    Unassigned
6cb6696e89471d5c3d6fd77c2d3ce308    k__Animalia;p__Arthropoda;c__Insecta;o__Megaloptera;f__Corydalidae;g__Chauliodes;s__Chauliodes pectinicornis
fceb86b3055b944f94fb703f90f5aa08    k__Animalia;p__Arthropoda;c__Insecta;o__Megaloptera;f__Corydalidae;g__Chauliodes;s__Chauliodes pectinicornis
ac93631979895de136137d123d1e3f6a    k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Elateridae;g__Melanotus;s__Melanotus hyslopi

Frustratingly, that one ASV which continues to be unassigned isn't in this particular BOLD database, but is clearly present as a voucher specimen in Genbank. I wonder what, if anything, can be done to combine across multiple databases.

Is this ever done with microbiome projects? the idea of combining across multiple databases? I'd be wary to include all of Genbank, except the database I'm taking my dataset from actually mines from Genbank too, so perhaps it's worth a shot...

Nicholas_Bokulich · December 4, 2018, 3:17pm

That might be a good sign (you are cleaning out dead wood, improving classifications) but you have no way to verify if these results are correct... I recommend using your mock community to verify that database cleaning is a good thing.

I'd say the most important factor behind database cleaning is knowing why those ambiguous classifications are there. In Greengenes, as I mentioned above, something like k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Elateridae;g__;s__ would indicate that there are multiple distinct genera in family Elateridae that share that same sequence (or rather collapse into that same OTU). So if BOLD is doing the same thing I'd recommend mulling whether or not cleaning out these ambiguous classifications is a good thing — since the ambiguous taxa are actually intentional and informative in that case (they tell you that that OTU cannot be resolved beyond family level). Where database cleaning definitely makes sense is where the annotations are just uninformative and represent uncultured, unknown, clearly ambiguous groups — e.g., "unassigned" is probably equally informative to "soil fungus" or "glacial fungus" (some unfortunate annotations I've come across in the past).

Yes, this is definitely done in some fields, particularly non-16S databases, as well as to add non-target DNA to the reference (so it gets classified as, e.g., host DNA rather than "unclassified"). This can defeat the purpose — e.g., if something like BOLD is mined from Genbank but is then curated to remove garbage, you would just be adding that garbage back in plus potential replicates — but is also a good way to fill in the gaps. In general, I would be cautious about doing this with curated reference databases. Selective additions are fine, though — so I'd say add this voucher specimen and others that you deem to be well curated enough to add.

devonorourke · December 4, 2018, 3:36pm

BOLD can work like Genbanks nr/nt database, and can also work like refseq; it just depends on where you pull your data from. Where I started was from their more nr/nt-like set, such that every record is a unique submission, and which may contain redundant sequence and/or taxonomic information.

My initial filters from that database were for removing sequences with no taxonomic information, sequences that contained ambiguous characters, sequences that were assigned to another non-COI gene, and super short sequences. I then removed duplicate sequences that remained.

Given my starting database and those filtering steps, it's my interpretation that any ambiguous classification is a product of the representative sequence only - not because of shared taxonomic information. There is no clustering on the front end with my database in this particular instance.

It turns out that BOLD does cluster though in a different database - see their BINS. I avoided using their BINs because they cluster all data at the same percent identity, I've found that there are many instances with my data in which multiple species are grouped into a single BIN. In circumstances where I have a query sequence aligning to a BIN database, I usually get just one match; in the case where I use my unclustered BOLD database and get multiple matches, I can possibly have the power to tease apart those species differences - that is, I can't undo their BIN clustering identity, but at least with QIIME's Vsearch plugin, I have the power to make the decisions about how to group my classified taxa with things like percent identity etc.

Given that my database is constructed as described, would you agree that the only reason why ambiguous classifications are present is that it's a product of the per-reference taxonomy being assigned? Not a product of shared taxonomic info produced by clustering a common set of sequences?

I'll get testing my mock communities next.
Thanks