Classify-consensus-vsearch - high number of unassigned features

steff1088 · December 20, 2017, 2:50pm

Hi all,

when I had a closer look at my classify-consensus-vsearch output I noticed a couple of things that were weird and that would most likely require some tweaking with parameters.

First, the fraction of unassigned sequences is huge:

How can I have more sequences being binned into distinct species?

Second, I know from qiime 1 analyses and from others that the fraction of Euryarchaeota (methanogens) should be way higher in my sample, but it is at the lower % and instead, Bathyarchaeota are very abundant. I can’t exclude a true biological reason for this but it seems more likely that something happened (related to the high unassigned sequences) which filtered against Euryarchaeota in some way.

Here the command I used:
qiime feature-classifier classify-consensus-vsearch \ --i-query seqs_filtered.qza \ --i-reference-reads 97_otus.qza \ --i-reference-taxonomy 16Sonly_consensus_taxonomy_7_levels.qza \ --p-perc-identity 0.97 \ --p-min-consensus 0.51 \ --p-maxaccepts 10 \ --p-threads 12 \ --o-classification taxonomy-vsearch-SILVA_128.qza

I already manipulated p-min-consensus (0.75) without a change. How about the reference file that I am using, could that show these effects? Should I change to majority_taxonomy? I would like to keep the 97% identity threshold if possible.

Thank you for any helpful comment.

Cheers,
steffen

Nicholas_Bokulich · December 20, 2017, 4:18pm

Hi @steff1088,

It seems like these unclassified seqs are probably the missing Euryarchaeota but are < 97% similar to the reference sequences so probably are not actually what they have been classified as previously. There are a couple things that I recommend, and the details are below:

Pull out some of the sequences that are not classifying, and run an NCBI blast search on them to see what these seqs could be. (many of the unclassified seqs could also be things like non-target DNA)
Fiddle with the vsearch parameters and/or try out the classify-sklearn method for comparison (that method tends to be a little more accurate than vsearch) to see if you get more sequences classifying.

With real samples you never "know" the true composition — so the differences you see here could just as likely be the "correct" one. Differences in OTU picking methods, etc, in qiime1 could also explain the differences you see here, and you should really be benchmarking against mock community samples if you want to figure out the more accurate approach... well, we have done that benchmarking (comparing taxonomy classifiers, not OTU pickers) and find that parameter optimization is key.

You are probably correct, that these unassigned sequences are the same that are otherwise being classified as Euryarchaeota. Given the parameters you are using, it also seems possible that the classifications you received previously are not correct (i.e., those sequences are < 97% similar to known reference sequences).

Yeah that shouldn't help — 0.51 is the most lenient so increasing this can only hurt (regarding getting unclassified seqs). Lowering maxaccepts probably would help, and can be set to a lower value when using high perc-identity values — try 3 or even 1.

That is a pretty high value — it is probably acceptable and good for many of your sequences (that belong to better characterized groups), but is probably the sole cause of these unclassified seqs (they don't match anything in the reference with > 97% similarity). Note that the best alignments will be chosen one way or another — so you are effectively setting this threshold to say "I want any seqs with < 97% similarity to the reference to be unclassified".

Probably not — though you could use the 99% reference sequences instead of the 97% (in general I'd recommend 99% over 97% for classification because it will be much more sensitive to fine differences though it will increase runtime). In any case, it's worth a try. You could also try SILVA instead.

If you can get your hands on some mock communities for these species of interest (or even just simulate sequencing reads though that might not be ideal since it sounds like you have species that are not in the reference database), you can compare classifier performance and tune your classifiers specifically to improve accurate taxonomic classification of these sequences.

I hope that helps! Please let us know if any of these work!

steff1088 · December 20, 2017, 4:24pm

Thanks for the explanation, this is great advice. Let me work on it and come back to you in any case.

Thank you @Nicholas_Bokulich !

steff1088 · January 3, 2018, 6:19am

Hi all,

I have been working on my problem, that was, too many unassigned sequences showed up in my taxonomy (see above). @Nicholas_Bokulich I followed your advice and tried out several things. Changing single parameters within the vsearch feature-classifier did not have any major effects. I had some success, though, using a lower identity cut-off. There seem to be relations with the open-reference clustering step. For instance, classification would fail if (1) the reference sequences/reads (from SILVA) were not the same cut-off (97_otus.qza in clustering not compatible with 94_otus.qza in the classification - the same level has to be used) or (2) the reference taxonomy file was not the same cut-off as the reference reads file. I am using the consensus_taxonomy_all_levels file from the SILVA package.

Here the 2 contrasting outputs when I clustered and classified my data at 94% and 97% respectively. (Both files generated by using all taxonomy files and ID value at one consistent percentage identity cut-off in clustering and classification)

The unassigned sequences revealed some of the hidden Euryarchaeota, Bathyarchaeota are interestingly still quite abundant. The fraction of unassigned reads shrunk significantly with 94% identity cut-off. However, the assigned taxa seem to me more shallow, which makes sense. Now, this seems to be kind of a dead end. I would like the taxonomy to go more in depth (7 levels at least), but also to keep the unassigned sequences low. Is it possible to tune something in the clustering step to achieve better results? Especially, I would like these Bathyarchaeota groups to be more teased apart, hopefully indicating deeper taxonomic levels..

Any idea or advice would be greatly appreciated.

Thank you,
steffen

Nicholas_Bokulich · January 3, 2018, 4:50pm

Hi @steff1088,
Thanks for the follow-up!

This seems a bit surprising — are you using closed-reference OTU picking? I would expect that many OTUs would still be within the % cutoff, e.g., to some 94% reference OTUs, though I suppose some "fallout" makes sense. At the very least, a lower % cutoff query should work with a higher cutoff reference (e.g., 97% OTUs should be classifiable with 99% reference seqs).

That makes sense — the reference taxonomy and reads must always match! 97% and 94%, for example, will have different representative OTUs (cluster centroids) and thus different sequence IDs.

For many parameters, this makes sense — but what about changing maxaccepts? This is at the very least diagnostic. Using maxaccepts=1 should result in the top hit being chosen, without any consensus assignment. Thus, if you are getting high numbers of unassigned seqs then this is an issue with your identity threshold, not with the consensus assignments (this is most likely the case but useful to determine anyhow). More on this below...

A few thoughts:

have you tried 99% reference sequences? This would provide the most sensitivity, though it probably would not help the unclassified seq problem.
have you tried the classify-sklearn method for comparison? This might provide a better balance between sensitivity and specificity here — it's worth a shot. You could also use this classifier just to reclassify the seqs that fail to classify with vsearch (due to the high perc-identity setting).

Classifying against the 99% reference OTUs but with a lower perc-identity (e.g., 94%) should do the trick. perc-identity is really just telling the classifier to toss out any results that have lower % similarity (leading to high numbers of unclassified seqs), and does not need to be tied to the % cutoff of the reference OTUs or query OTUs. So you should set this value lower, increasing the number of hits that pass — and then adjust maxaccepts and min-consensus to deal with consensus assignments that result in lower taxonomic specificity due to the higher number of hits.

I hope that helps!

steff1088 · January 3, 2018, 7:06pm

ok @Nicholas_Bokulich I will do exactly that, classifying against the 99% OTUs at a lower perc-identity (94%). Now, 2 follow-up questions: I have a vague idea but can you explain in more detail what the maxaccepts and min-consensus parameter really do? I kind of understand based on the vsearch documentation but your explanation would probably clarify it entirely. On the same note, is it advisable to choose (when using SILVA) the majority- or consensus-taxonomy file accordingly?

Great help, thanks!!

steffen

Nicholas_Bokulich · January 3, 2018, 7:31pm

maxaccepts defines the number of top hits to be selected for assigning consensus taxonomy, and min-consensus determines how many of these must match at each taxonomic level for that taxonomy label to be accepted.

For example, let's imagine maxaccepts = 5. So we get the following top 5 hits (if maxaccepts=1, the top hit would be supplied at the taxonomic assignment):

k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__fragilis
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__fragilis
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__fragilis
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__dorei
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__uniformis

If min-consensus = 0.51 (default), then any label that is observed > 50% of the time is chosen as the consensus label. So in this case the species assignment would be "fragilis" because that is the annotation of 3/5 of the seqs. If this value were increased, e.g., to 0.75, the minimum consensus cannot be achieved at species level and thus the consensus taxonomy would be:

k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides

(i.e., no species annotation).

I have not really done a direct comparison, nor am I aware of one, so cannot really comment conclusively. I prefer the majority, because this will "crowd out" some spurious and uncertain annotations (e.g., sequences with annotations like "uncultured bacterium"). But I see strengths in both approaches, and others may have very different opinions on this topic!

steff1088 · January 3, 2018, 7:49pm

Thank you @Nicholas_Bokulich that explanation clarified it once and for all!

I am making the same observations and would therefore prefer the majority as well.

Cheers,
steffen