Finding incorrectly formatted needle in apparently mostly formatted properly haystack

Nicholas_Bokulich · October 19, 2018, 1:47pm

Yes, I was using those same definitions.

Thanks! Do let me know.

Is this NCBI blast or classify-consensus-blast or standalone blastn?

Yes, that will definitely make it more stringent (though percent-id is better for toggling stringency)... but since taxonomy is based on consensus here that is also causing problems for you! Here's what is happening:

1. The database holds 250k unique seqs (let's say 1% == 2.5k of these are bat, and the rest are arthropod).
2. You grab 50k of these as hits.
3. 2.5k will be bat (presumably these should be the top hits!) and the remainder will be bug (provided they are above the similarity threshold).
4. Your query will be classified as some type of arthropod because 47.5k/50k = 0.95 >> 0.51 (the default consensus threshold).
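
To make the arithmetic concrete, here is a minimal sketch of that consensus calculation in Python. The counts are the illustrative numbers from this post (not measured values), the variable names are made up for the example, and it treats "arthropod" as a single label at the rank where all the arthropod hits agree:

```python
# Illustrative numbers from the example above (assumptions, not real counts).
total_hits = 50_000                      # hits retained for the query
bat_hits = 2_500                         # the (presumably best) bat hits
arthropod_hits = total_hits - bat_hits   # everything else that passed the threshold
min_consensus = 0.51                     # default consensus threshold mentioned above

arthropod_fraction = arthropod_hits / total_hits
bat_fraction = bat_hits / total_hits

# The label wins only if its share of the retained hits reaches the threshold.
arthropod_wins = arthropod_fraction >= min_consensus
bat_wins = bat_fraction >= min_consensus

print(f"arthropod: {arthropod_fraction:.2f} -> consensus reached: {arthropod_wins}")
print(f"bat:       {bat_fraction:.2f} -> consensus reached: {bat_wins}")
```

Even if every bat hit outranks every arthropod hit, the bat label never gets close to the threshold once this many hits are retained.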

So this is really a quirk of your database and the characteristics of the sequences and their similarities... but a really cool "edge case" for testing the limits of this classifier! Also makes a great test for optimization since you have very distinct species that are evidently somewhat similar at this locus...

We use maxrejects=0, so the whole database is searched for matches before the top hits are ranked; using maxaccepts=1 will therefore find you the true top hit (unless I am missing something!).

But that is a great question, and blast maxaccepts actually works in that non-intuitive way (it takes the first N hits that exceed percent-id, then skips the rest, instead of finding the true top hits), so great sleuthing!!!
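
The difference between the two behaviours can be sketched with a toy model. This is not how BLAST or VSEARCH is implemented internally; the function names, the tiny database, and the identity values are all hypothetical, and the only point is the order in which hits are accepted:

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (reference label, identity to the query as a fraction)

def first_n_above_threshold(db: List[Hit], perc_identity: float, maxaccepts: int) -> List[Hit]:
    """Accept hits in database order and stop as soon as maxaccepts is reached."""
    accepted: List[Hit] = []
    for hit in db:
        if hit[1] >= perc_identity:
            accepted.append(hit)
            if len(accepted) == maxaccepts:
                break  # the rest of the database is never looked at
    return accepted

def true_top_n(db: List[Hit], perc_identity: float, maxaccepts: int) -> List[Hit]:
    """Scan the whole database, then keep the best maxaccepts hits by identity."""
    passing = [hit for hit in db if hit[1] >= perc_identity]
    return sorted(passing, key=lambda hit: hit[1], reverse=True)[:maxaccepts]

# Hypothetical database order: the best match is not the first passing one.
toy_db: List[Hit] = [
    ("arthropod_A", 0.82),
    ("arthropod_B", 0.85),
    ("bat_A", 0.99),
    ("arthropod_C", 0.81),
]

early = first_n_above_threshold(toy_db, perc_identity=0.80, maxaccepts=1)
best = true_top_n(toy_db, perc_identity=0.80, maxaccepts=1)
print("first passing hit:", early)
print("true top hit:     ", best)
```

With maxaccepts=1 the first strategy returns whichever passing sequence happens to come first, while the exhaustive strategy returns the bat sequence.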

Oops, that's just a very minor bug in setting the parameter limits, and I have raised an issue to fix it here. You solved it with the correct workaround — setting a very high number for maxaccepts.

Presumably bat sequences will be very similar to the ref... so you can take a first pass of the classifier with a low maxaccepts and high percent-id to match bat seqs. Better yet, since the bat seqs are host DNA and you probably just want to remove them, use qiime quality-control exclude-seqs to remove them before classification!

Then do the real round of classification. You should probably use parameters similar to the defaults: a smaller maxaccepts, with percent-id limiting hits to relevant matches.

percent-id will be the parameter to focus on for correctness. If you think that the samples contain many species that are not in the ref (or notice that empirically via high numbers of unassigned), then you need to dial back the percent-id and increase maxaccepts to get a broader consensus.
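
As a rough illustration of how these two knobs interact, here is a toy consensus function. It is only a sketch under simplifying assumptions: it votes on a single flat label rather than working rank by rank through a taxonomy, and the function name, hit list, and identity values are invented for the example:

```python
from collections import Counter
from typing import List, Tuple

Hit = Tuple[str, float]  # (taxon label, identity to the query as a fraction)

def toy_consensus(hits: List[Hit], perc_identity: float, maxaccepts: int,
                  min_consensus: float = 0.51) -> str:
    """Keep the best maxaccepts hits above perc_identity, then take a majority vote."""
    passing = sorted((h for h in hits if h[1] >= perc_identity),
                     key=lambda h: h[1], reverse=True)
    kept = passing[:maxaccepts]
    if not kept:
        return "Unassigned"  # nothing was similar enough
    label, count = Counter(taxon for taxon, _ in kept).most_common(1)[0]
    return label if count / len(kept) >= min_consensus else "Unassigned"

# Hypothetical hits for one bat query: 2 near-identical bat refs, 38 distant arthropod refs.
hits: List[Hit] = [("bat", 0.99), ("bat", 0.98)] + [("arthropod", 0.82)] * 38

strict = toy_consensus(hits, perc_identity=0.97, maxaccepts=10)
loose = toy_consensus(hits, perc_identity=0.80, maxaccepts=40)
too_strict = toy_consensus(hits, perc_identity=0.995, maxaccepts=10)
print(strict, loose, too_strict)
```

In this toy case a high percent-id keeps only the bat hits, a permissive percent-id with a large maxaccepts lets the arthropod hits swamp the vote, and a percent-id above anything in the ref leaves the query unassigned.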

Awesome! That would be a great dataset to add to our mock community repository if you would like to share (probably after publishing).

This thread discusses "getting started" — tax-credit was a sprawling project and I've done my best to make it easy to navigate but there is still a lot there. Follow the instructions in the READMEs instead of trying to navigate the notebooks on your own!

Feel free to get in touch with me over the forum or directly via email if you have any questions or issues. I would be very glad to help you get this running through tax-credit, and am personally interested in the optimal settings for COI.

Good luck!