Classify-consensus-vsearch - high number of unassigned features

Nicholas_Bokulich · December 20, 2017, 4:18pm

It seems like these unclassified seqs are probably the missing Euryarchaeota, but they are < 97% similar to the reference sequences, so they are probably not actually what they were classified as previously. There are a couple of things that I recommend, and the details are below:

1. Pull out some of the sequences that are not classifying and run an NCBI BLAST search on them to see what these seqs could be. (Many of the unclassified seqs could also be things like non-target DNA.)
2. Fiddle with the vsearch parameters and/or try out the classify-sklearn method for comparison (that method tends to be a little more accurate than vsearch) to see if you get more sequences classifying.
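For the first suggestion, here is a minimal sketch of how you might pull a handful of unassigned sequences out for a BLAST search. It assumes you have exported your taxonomy as a tab-separated table with `Feature ID` and `Taxon` columns, that unclassified features are labeled `Unassigned`, and that your representative sequences are in FASTA format. The function names and the tiny example data are made up for illustration, so adjust them to whatever your exported files actually look like.

```python
import csv
import io


def unassigned_ids(taxonomy_tsv):
    """Return feature IDs whose taxon label is exactly 'Unassigned'."""
    reader = csv.DictReader(io.StringIO(taxonomy_tsv), delimiter="\t")
    return [row["Feature ID"] for row in reader
            if row["Taxon"].strip() == "Unassigned"]


def parse_fasta(fasta_text):
    """Parse FASTA text into a dict of {sequence ID: sequence}."""
    seqs = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            seqs[current] = ""
        elif current is not None:
            seqs[current] += line
    return seqs


def subset_fasta(fasta_text, ids, limit=10):
    """Return FASTA text for at most `limit` of the requested IDs."""
    seqs = parse_fasta(fasta_text)
    picked = [i for i in ids if i in seqs][:limit]
    return "".join(f">{i}\n{seqs[i]}\n" for i in picked)


# Made-up example data (assumed column names, not real results).
taxonomy = (
    "Feature ID\tTaxon\tConsensus\n"
    "f1\tk__Bacteria; p__Firmicutes\t1.0\n"
    "f2\tUnassigned\t1.0\n"
    "f3\tUnassigned\t1.0\n"
)
fasta = ">f1\nACGT\n>f2\nGGCC\nTTAA\n>f3\nCCCC\n"

ids = unassigned_ids(taxonomy)
to_blast = subset_fasta(fasta, ids, limit=1)
print(to_blast)
```

The resulting FASTA text can be pasted into the NCBI BLAST web form. A small `limit` keeps the search manageable.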

With real samples you never "know" the true composition, so the results you see here could just as likely be the "correct" ones. Differences in OTU picking methods, etc., in QIIME 1 could also explain the differences you see here, and you should really be benchmarking against mock community samples if you want to figure out which approach is more accurate. We have done that benchmarking (comparing taxonomy classifiers, not OTU pickers) and find that parameter optimization is key.

You are probably correct that these unassigned sequences are the same ones that were otherwise being classified as Euryarchaeota. Given the parameters you are using, it also seems possible that the classifications you received previously are not correct (i.e., those sequences are < 97% similar to known reference sequences).

Yeah, that shouldn't help: 0.51 is the most lenient setting, so increasing it can only hurt (in terms of getting unclassified seqs). Lowering maxaccepts probably would help, and it can be set to a lower value when using high perc-identity values. Try 3 or even 1.
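To see why a lower maxaccepts can rescue sequences, here is a toy sketch of consensus assignment. It is not the actual q2-feature-classifier code, just an illustration, and it assumes the 0.51 value is the minimum fraction of accepted hits that must agree at a taxonomic level. The function name and the hit lists are hypothetical.

```python
from collections import Counter


def consensus_taxonomy(hit_taxa, maxaccepts=10, min_consensus=0.51):
    """Toy consensus over the top hits (best hit first).

    hit_taxa is a list of lineages, each a list of rank labels.
    Returns the deepest lineage that at least `min_consensus` of the
    accepted hits agree on, or 'Unassigned' if none qualifies.
    """
    top = hit_taxa[:maxaccepts]
    if not top:
        return "Unassigned"
    consensus = []
    depth = max(len(t) for t in top)
    for level in range(depth):
        # Count full lineage prefixes so a label only counts under the same parent.
        prefixes = Counter(tuple(t[:level + 1]) for t in top if len(t) > level)
        prefix, count = prefixes.most_common(1)[0]
        if count / len(top) < min_consensus:
            break
        consensus = list(prefix)
    return "; ".join(consensus) if consensus else "Unassigned"


# Hypothetical hits for one query, ordered best first.
hits = [
    ["Archaea", "Euryarchaeota"],
    ["Archaea", "Euryarchaeota"],
    ["Bacteria", "Firmicutes"],
    ["Eukaryota", "Fungi"],
    ["Bacteria", "Proteobacteria"],
]

with_five = consensus_taxonomy(hits, maxaccepts=5)
with_three = consensus_taxonomy(hits, maxaccepts=3)
stricter = consensus_taxonomy(hits, maxaccepts=3, min_consensus=0.9)
print(with_five, "|", with_three, "|", stricter)
```

With five accepted hits no lineage reaches 51% agreement, so the query ends up unassigned. With three, two of the three agree and the query is classified. Raising the consensus threshold to 0.9 makes it unassigned again.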

That is a pretty high value. It is probably acceptable and good for many of your sequences (those that belong to better-characterized groups), but it is probably the sole cause of these unclassified seqs (they don't match anything in the reference with > 97% similarity). Note that the best alignments will be chosen one way or another, so you are effectively setting this threshold to say "I want any seqs with < 97% similarity to the reference to be unclassified".
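A quick way to think about this threshold: if you know the identity of each query's best hit (for example from a BLAST or vsearch search of your own), you can estimate what fraction of queries would go unclassified at different cutoffs. This is only a sketch with made-up identity values and a hypothetical helper name, not output from your data.

```python
def fraction_unclassified(best_identities, perc_identity):
    """Fraction of queries whose best hit falls below the identity cutoff."""
    if not best_identities:
        return 0.0
    below = sum(1 for ident in best_identities if ident < perc_identity)
    return below / len(best_identities)


# Made-up best-hit identities for five queries (1.0 = 100% identical).
best_identities = [0.99, 0.98, 0.95, 0.93, 0.88]

at_97 = fraction_unclassified(best_identities, 0.97)
at_90 = fraction_unclassified(best_identities, 0.90)
print(at_97, at_90)
```

In this example three of the five queries drop out at 0.97 but only one at 0.90, which is the trade-off between strictness and the number of unclassified seqs.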

Probably not, though you could use the 99% reference sequences instead of the 97% (in general I'd recommend 99% over 97% for classification because it will be much more sensitive to fine differences, though it will increase runtime). In any case, it's worth a try. You could also try SILVA instead.

If you can get your hands on some mock communities for these species of interest (or even just simulate sequencing reads, though that might not be ideal since it sounds like you have species that are not in the reference database), you can compare classifier performance and tune your classifiers specifically to improve the accuracy of taxonomic classification for these sequences.
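As a rough sketch of what that comparison could look like, you can score each classifier's output against the known mock community composition using precision, recall, and F-measure at a chosen taxonomic level. The taxa and the two "classifier" results below are hypothetical placeholders, and this set-based scoring is a simplification rather than the exact procedure from our benchmarks.

```python
def precision_recall_f(expected, observed):
    """Score observed taxa against the expected mock community taxa."""
    expected, observed = set(expected), set(observed)
    true_pos = len(expected & observed)
    precision = true_pos / len(observed) if observed else 0.0
    recall = true_pos / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical genus-level mock community and two classifier outputs.
expected = {"Methanobrevibacter", "Methanosphaera", "Bacteroides", "Escherichia"}
lenient = {"Methanobrevibacter", "Bacteroides", "Escherichia", "Lactobacillus"}
strict = {"Bacteroides", "Escherichia"}

lenient_scores = precision_recall_f(expected, lenient)
strict_scores = precision_recall_f(expected, strict)
print(lenient_scores, strict_scores)
```

Here the strict settings make no false calls but miss half of the expected taxa, while the lenient settings recover more taxa at the cost of one false positive. Running the same scoring across a grid of parameter settings is one way to pick values for your own data.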

I hope that helps! Please let us know if any of these work!