Testing feature classifier accuracy

lautaro.rostoll · March 17, 2021, 1:53am

Thanks a lot!
I was able to make an amplicon-specific classifier for the V1-V3 region with primers 27f and 515r.
I evaluated the classifier against the 85_OTU.qza provided and got 100% classification accuracy.
Should I test it against a more complex dataset like the Greengenes 97% OTUs?
I would also like to know whether I can use the same reference taxonomy provided in this tutorial, or whether I need one specific to the Greengenes 97% OTUs.
Lastly, I am new to this world, and I don't know where to get those reference sequences. Could you point me to where I can get them?

Thanks a lot to you and to this wonderful community!

Nicholas_Bokulich · March 18, 2021, 6:35am

Hi @lautaro.rostoll ,
Thanks for trying out RESCRIPt!

Short answer: yes! The tutorial explains in more detail, but 85% OTUs are very low-information and only useful for testing things out in tutorials, not for real analysis.

You need to use the 97% OTUs taxonomy, found in the same place as the Greengenes sequences.

As that tutorial shows, for SILVA and NCBI you can use RESCRIPt to automatically download and format whatever you need. For Greengenes, go here:
https://docs.qiime2.org/2021.2/data-resources/

That site also has the preformatted sequences etc. for SILVA, so you can make an amplicon-specific classifier from those sequences without following the full RESCRIPt tutorial to get the data.
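
For anyone wondering what "amplicon-specific" means in practice, here is a minimal Python sketch of the idea: trim each reference sequence to the region between the two primers before training. It is only an illustration and not how QIIME 2 or RESCRIPt implement it. The function names, the toy primers, and the toy sequence are all made up for this example, and it assumes exact matching with degenerate (IUPAC) bases and no mismatch tolerance.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table, including degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a (possibly degenerate) DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a primer with IUPAC codes into a regex fragment."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_amplicon(seq, f_primer, r_primer):
    """Return the region between the primers (primers excluded), or None.

    The reverse primer is searched for as its reverse complement,
    because it anneals to the opposite strand.
    """
    pattern = re.compile(
        primer_regex(f_primer)
        + "(.*?)"
        + primer_regex(reverse_complement(r_primer))
    )
    match = pattern.search(seq.upper())
    return match.group(1) if match else None


# Toy data (hypothetical, far shorter than real primers or 16S sequences).
toy_reference = "TTACGTAGGGCCCAATCCTT"
print(extract_amplicon(toy_reference, "ACGTM", "GGATY"))
```

Reference sequences in which either primer site is not found return `None` here and would simply be dropped from the trimmed training set.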

Good luck!

lautaro.rostoll · March 20, 2021, 2:56pm

Thanks for your response!
I tried downloading the Greengenes 97% OTUs reference sequences from the data resources page, but the link for the Greengenes reference sequences is not working.
I used the SILVA 138 SSURef sequences and taxonomy to evaluate my amplicon-specific classifier, but the command is still running after 72 hours.
Should I wait until the run is done? Or is SILVA SSURef not appropriate for testing the classifier?
Thanks again!

Nicholas_Bokulich · March 20, 2021, 3:00pm

Strange... working for me. Could be a network connection or firewall issue on your end?

SILVA is very large, and running the evaluation command on it takes substantial computational resources... you could wait, but it will probably wind up crashing if you are running this locally on a laptop.

Unless you are running on a reasonably powerful computer, I recommend trying Greengenes for this.

lautaro.rostoll · March 20, 2021, 6:21pm

I just tried downloading Greengenes using Internet Explorer and it worked!
Thanks!

lautaro.rostoll · March 21, 2021, 1:00am

Hello Nicholas,

I was able to evaluate my classifier using RESCRIPt.
Looking at the volatility control chart, I can see that precision is 100% down to the phylum level, but it drops to 96% at the family level and 93% at the genus level.
The F-measure goes down to 94% at the family level and 89% at the genus level.
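
As a quick sanity check on these numbers, here is a small Python sketch that backs out the recall implied by each precision/F-measure pair. It assumes the F-measure reported in the plots is the usual harmonic mean of precision and recall, F = 2PR / (P + R); that is my assumption, and the function name is made up for the example.

```python
def implied_recall(precision, f_measure):
    """Solve F = 2PR / (P + R) for R, given P and F."""
    return f_measure * precision / (2 * precision - f_measure)


# (precision, F-measure) pairs as read off the charts in this post.
reported = {"family": (0.96, 0.94), "genus": (0.93, 0.89)}

for rank, (p, f) in reported.items():
    r = implied_recall(p, f)
    print(f"{rank}: precision={p:.2f}, F-measure={f:.2f}, implied recall={r:.3f}")
```

Under that assumption the implied recall is roughly 0.92 at the family level and 0.85 at the genus level, so the drop in F-measure comes more from recall than from precision.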

Would you accept this classifier and use it with your data? It was trained using primers 9F and 515R for the V1-V3 region, so I expect that it might not be able to classify sequences from other regions, which might be the case for the Greengenes reference database.

I am attaching the volatility charts here.

Thanks!

Nicholas_Bokulich · March 21, 2021, 5:35am

Yes, this looks quite good, and precision looks even higher. You can check the pre-print cited in the tutorial if you want some more ideas about interpreting these plots and about other evaluations to run.

Good luck!

lautaro.rostoll · March 21, 2021, 3:43pm

That's great to hear!
I will check the pre-print.
Thanks!