Best Feature-Classifier?

Kenneth · June 12, 2018, 6:30pm

I see there are three methods for the q2 feature-classifier plugin:

classify-consensus-blast
classify-consensus-vsearch
classify-sklearn

My question is which one of these methods is the best?
If there isn't a best method, what are the pros/cons of each method? When should you use each one?

My objective is to minimize the number of unclassified features in my artifacts.
I would like to start with the best classifier, and if there are still a lot of unclassified features I will perform a BLAST search on each one.

I tried to look on the forum before posting but couldn't find any post that really answered this question.

Thank you for your time and q2 is great.

colinbrislawn · June 12, 2018, 6:40pm

Good morning Kenneth,

There's a paper for that. Paper. Also PeerJ preprint.

Running a BLAST search on each feature is pretty much what classify-consensus-blast does, along with inferring the Lowest Common Ancestor (LCA) from the BLAST hits. Vsearch is like BLAST, but much, much faster. These are both a lot like the default assign_taxonomy.py from QIIME 1. (We can compare and contrast these methods if you would like.)
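To make the consensus/LCA idea concrete, here is a minimal sketch in plain Python. It is not the plugin's code: the function name `consensus_taxonomy`, the `min_consensus` argument, and the example lineages are all made up for illustration. It assumes each hit comes with a semicolon-delimited lineage string and walks down the ranks for as long as a large enough fraction of hits still agree.

```python
from collections import Counter


def consensus_taxonomy(hit_taxonomies, min_consensus=0.51):
    """Return the deepest lineage shared by at least `min_consensus` of the hits.

    `hit_taxonomies` is a list of semicolon-delimited lineage strings.
    `min_consensus` is an illustrative threshold (an assumption of this
    sketch); setting it to 1.0 gives a strict lowest common ancestor.
    """
    if not hit_taxonomies:
        return "Unassigned"

    lineages = [[rank.strip() for rank in t.split(";")] for t in hit_taxonomies]
    consensus = []
    depth = 0
    while True:
        # Count the next rank only among hits that match the consensus so far.
        candidates = Counter(
            tuple(lin[: depth + 1])
            for lin in lineages
            if len(lin) > depth and lin[:depth] == consensus
        )
        if not candidates:
            break
        prefix, count = candidates.most_common(1)[0]
        # Stop descending once too few of all hits agree at this rank.
        if count / len(lineages) < min_consensus:
            break
        consensus = list(prefix)
        depth += 1

    return "; ".join(consensus) if consensus else "Unassigned"


# Hypothetical hits for one query sequence.
hits = [
    "k__Bacteria; p__Firmicutes; g__Lactobacillus; s__a",
    "k__Bacteria; p__Firmicutes; g__Lactobacillus; s__b",
    "k__Bacteria; p__Firmicutes; g__Lactobacillus; s__a",
]
strict = consensus_taxonomy(hits, min_consensus=1.0)
majority = consensus_taxonomy(hits, min_consensus=0.51)
```

With these made-up hits, the strict setting stops at genus because the hits disagree on species, while the majority setting keeps the species that two of the three hits share.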

classify-sklearn counts the k-mers inside a read, then uses a naive Bayes classifier to assign the read a taxonomy. It's similar to the RDP Wang classifier.
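For a rough feel of how k-mer counting plus naive Bayes works, here is a toy sketch in plain Python. It is only an illustration of the technique, not what classify-sklearn actually runs: the class name `KmerNaiveBayes`, the k-mer length, the smoothing value `alpha`, and the training sequences are all assumptions chosen to keep the example small.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return every overlapping k-mer in a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class KmerNaiveBayes:
    """Toy multinomial naive Bayes over k-mer counts (illustrative only)."""

    def __init__(self, k=3, alpha=1.0):
        self.k = k          # k-mer length (assumed value for this sketch)
        self.alpha = alpha  # additive smoothing (assumed value for this sketch)

    def fit(self, seqs, labels):
        self.kmer_counts = defaultdict(Counter)  # label -> k-mer counts
        self.class_counts = Counter()            # label -> number of sequences
        self.vocab = set()
        for seq, label in zip(seqs, labels):
            found = kmers(seq, self.k)
            self.kmer_counts[label].update(found)
            self.class_counts[label] += 1
            self.vocab.update(found)
        return self

    def predict(self, seq):
        n_seqs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counts in self.kmer_counts.items():
            total = sum(counts.values())
            # Log prior plus the smoothed log likelihood of each query k-mer.
            score = math.log(self.class_counts[label] / n_seqs)
            for km in kmers(seq, self.k):
                score += math.log(
                    (counts[km] + self.alpha) / (total + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up reference sequences and labels.
model = KmerNaiveBayes(k=3).fit(
    ["ACGTACGTACGT", "TTGGCCTTGGCC"],
    ["g__A", "g__B"],
)
first = model.predict("ACGTACG")
second = model.predict("GGCCTT")
```

The read is assigned to whichever reference label makes its k-mers most probable, which is why the choice of reference database matters so much for this kind of classifier.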

Let me know if that helps,
Colin

Nicholas_Bokulich · June 12, 2018, 6:47pm

There is no "best" — but classify-sklearn in general performs better out of the box, and is our general recommendation for 16S and ITS sequences. All are very accurate at genus level, however, and reasonably accurate at species level (which is to say the defaults are optimized to NOT classify at species level if a confident hit cannot be determined).

See this article for a comparison of these and other classification methods, and some discussion of pros/cons and how to fiddle with the parameters.

Check out that article — look at the "high-recall" classifiers listed in Table 2. Note that minimizing unclassified and underclassified sequences with a "high-recall" classifier means that you are essentially increasing the likelihood of getting false positive hits.

If you are getting unclassified sequences with any of these methods, chances are you are either not using a good reference database, or that sequence has no hits against the current database.

Good plan. BLASTing against the NCBI database can help determine if, e.g., you have non-target hits (e.g., host DNA) present in your reads.

Thanks

I hope that helps!

Kenneth · June 26, 2018, 11:51am

@Nicholas_Bokulich @colinbrislawn,

Thanks for the reference, guys! I wasn't aware it existed and it helped a lot. I ended up testing all the classifiers. BLAST and vsearch classified more of my sequences than classify-sklearn.

Nicholas_Bokulich · June 26, 2018, 11:57am

Could you clarify? Do you mean that those methods provided more species-level classifications than classify-sklearn?

Or do you mean that classify-sklearn resulted in more unclassified sequences?

I suspect you are describing the first. This is not to say that those species-level classifications are wrong, but you just don't know that they are right (unless you are testing on a mock community). With short sequence reads it can be quite difficult to obtain reliable species-level classifications; classify-sklearn is designed to handle these cautiously and only give a genus-level classification if it cannot confidently assign a species.
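The cautious behavior described above can be sketched roughly as follows. This is an assumption-laden illustration, not the plugin's implementation: it supposes you already have a probability for each full lineage, and the function name `truncate_by_confidence`, the `confidence` threshold, and the example probabilities are hypothetical. At each rank it sums the probability of all lineages sharing that prefix and stops descending once the best prefix falls below the threshold.

```python
from collections import defaultdict


def truncate_by_confidence(lineage_probs, confidence=0.7):
    """Return the deepest lineage prefix whose summed probability meets `confidence`.

    `lineage_probs` maps semicolon-delimited lineage strings to probabilities.
    The `confidence` value here is illustrative (an assumption of this sketch).
    """
    lineages = {
        tuple(rank.strip() for rank in name.split(";")): prob
        for name, prob in lineage_probs.items()
    }
    if not lineages:
        return "Unassigned"

    result = ()
    max_depth = max(len(lin) for lin in lineages)
    for depth in range(1, max_depth + 1):
        # Sum probability over lineages that extend the prefix accepted so far.
        totals = defaultdict(float)
        for lin, prob in lineages.items():
            if len(lin) >= depth and lin[: depth - 1] == result:
                totals[lin[:depth]] += prob
        if not totals:
            break
        prefix, support = max(totals.items(), key=lambda item: item[1])
        if support < confidence:
            break  # not confident at this rank, so keep the shallower label
        result = prefix

    return "; ".join(result) if result else "Unassigned"


# Hypothetical probabilities for one short read.
probs = {
    "g__Lactobacillus; s__a": 0.5,
    "g__Lactobacillus; s__b": 0.4,
    "g__Bacillus; s__c": 0.1,
}
label = truncate_by_confidence(probs, confidence=0.7)
```

In this made-up case the genus is well supported but neither species is, so the read gets a genus-level label rather than a shaky species call.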

At the end of the day, though, stick with whatever method you like the most. I'm not arguing for one over the other, just highlighting a philosophical point.