BLAST, VSEARCH or NB Classifier for 18S data with PR2 database?
Hi, I am new to the forum and to bioinformatics. I am currently working on sediment eDNA metabarcoding of 18S data, using the primer set Euk-1391f / EukBr and PR2 as the reference database. The goal is to characterise the benthic eukaryotic communities. The issue is that I am not sure which taxonomic assignment method to use. It seems that the NB classifier is ideal for 16S but not 18S, while BLAST in QIIME has the issue that the top N hits are not the globally best-ranked hits. So, is VSEARCH in QIIME with the global alignment approach the best option? Any comments?
How so? "Ideal" is a strong word, and quite distinct from "optimized for" when we are talking about bioinformatics methods. The default parameter settings for the classify-sklearn action were optimized for 16S and ITS amplicons, but the NB classifier is widely used for 18S data as well. The parameters could be re-optimized for 18S, but the current defaults probably work reasonably well there too. So classify-sklearn is not unsuitable for 18S; it is just that we have not performed a rigorous benchmark for 18S as we have for 16S (see below). I would not expect very different performance for 18S amplicons: the same default parameters work well for totally distinct targets like 16S and ITS, so performance probably depends less on the target than on the parameters.
This is a feature of blastn: the search stops once N hits that exceed the parameter settings have been found. You can adjust these parameters to extend the search and optimize the results. The consensus-based classifier is also a fairly good approach, though it requires careful adjustment.
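To illustrate why the order of hits matters, here is a toy sketch in Python. It is not how blastn is implemented internally; it only mimics the idea of stopping once N hits pass a threshold and compares that with ranking every passing hit. The function names, reference names, and identity values are all made up for illustration.

```python
# Toy model only: this is NOT blastn's actual search algorithm.
# Reference names and identity values are invented for illustration.

def first_n_passing(hits, min_identity, n):
    """Return the first n hits, in search order, that pass the threshold."""
    kept = []
    for name, identity in hits:
        if identity >= min_identity:
            kept.append((name, identity))
            if len(kept) == n:
                break  # stop early; later (possibly better) hits are never seen
    return kept


def best_n_passing(hits, min_identity, n):
    """Return the n highest-identity hits among all hits that pass the threshold."""
    passing = [hit for hit in hits if hit[1] >= min_identity]
    return sorted(passing, key=lambda hit: hit[1], reverse=True)[:n]


# Hypothetical hits, listed in the order the search happens to encounter them.
hits_in_search_order = [
    ("ref_A", 0.91),
    ("ref_B", 0.85),
    ("ref_C", 0.93),
    ("ref_D", 0.99),
    ("ref_E", 0.97),
]

early_stop = first_n_passing(hits_in_search_order, min_identity=0.90, n=2)
global_best = best_n_passing(hits_in_search_order, min_identity=0.90, n=2)

print("first 2 passing:", early_stop)
print("best 2 passing: ", global_best)
```

In this toy case the early-stopping version keeps ref_A and ref_C and never reaches ref_D or ref_E. Raising N or tightening the threshold changes which hits are kept, which is the kind of adjustment meant above.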
Not necessarily. VSEARCH is also a very good choice but, as with the other two methods, it may take some parameter adjustment if you want to optimize for 18S. The default parameters for all of these are probably "good enough", as they have been widely used for different markers (including 18S). If you really want to determine what is "best" for your target, you should test with an appropriate ground-truth dataset, e.g., mock communities and/or simulated data, as we have shown in these papers:
By the way, it looks like there are some pre-trained QIIME 2 classifiers for your target region available on Zenodo:
And if you want to build your own PR2 database, PR2 (and some other relevant 18S databases) is available for download via the RESCRIPt plugin.
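If you do want to compare methods against a mock community, the scoring itself can be quite simple. Here is a minimal Python sketch, not the procedure from any particular benchmark. It assumes you have already exported each method's assignments as semicolon-separated taxonomy strings keyed by sequence ID, and that the expected strings go at least as deep as the level you score. The scoring rules are my own assumptions: a match at the chosen depth is a true positive, a wrong assignment at that depth is a false positive, and a missing or shallower assignment is a false negative. The sequence IDs, method names, and taxonomy strings are illustrative and not taken from PR2.

```python
# Sketch with made-up data; the scoring rules are assumptions, not a standard.

def truncate(taxonomy, level):
    """Split a ';'-separated taxonomy string and keep the first `level` ranks."""
    ranks = [rank.strip() for rank in taxonomy.split(";") if rank.strip()]
    return ranks[:level]


def score(expected, observed, level):
    """Return (precision, recall, f_measure) at the given rank depth."""
    tp = fp = fn = 0
    for seq_id, expected_taxonomy in expected.items():
        exp = truncate(expected_taxonomy, level)
        obs = truncate(observed.get(seq_id, ""), level)
        if len(obs) < level:
            fn += 1  # unclassified, or not classified deeply enough
        elif obs == exp:
            tp += 1  # correct at this depth
        else:
            fp += 1  # classified to this depth, but wrongly
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical mock community: the known taxonomy of each sequence.
expected = {
    "seq1": "Eukaryota;Alveolata;Ciliophora",
    "seq2": "Eukaryota;Stramenopiles;Ochrophyta",
    "seq3": "Eukaryota;Rhizaria;Cercozoa",
    "seq4": "Eukaryota;Opisthokonta;Metazoa",
}

# Hypothetical assignments from two methods (or two parameter settings).
observed_by_method = {
    "method_1": {
        "seq1": "Eukaryota;Alveolata;Ciliophora",
        "seq2": "Eukaryota;Stramenopiles;Ochrophyta",
        "seq3": "Eukaryota;Rhizaria",  # stops one rank short
        "seq4": "Eukaryota;Opisthokonta;Metazoa",
    },
    "method_2": {
        "seq1": "Eukaryota;Alveolata;Ciliophora",
        "seq2": "Eukaryota;Alveolata;Dinoflagellata",  # wrong at rank 3
        "seq3": "Eukaryota;Rhizaria;Cercozoa",
        "seq4": "Eukaryota;Opisthokonta;Metazoa",
    },
}

for method, observed in observed_by_method.items():
    precision, recall, f_measure = score(expected, observed, level=3)
    print(f"{method}: precision={precision:.2f} recall={recall:.2f} F={f_measure:.2f}")
```

With these toy inputs the first method is conservative (high precision, lower recall) and the second is the reverse, which is the kind of trade-off you would be tuning when adjusting parameters for 18S.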