I’ve been researching taxonomy assignment (in terms of both the choice of database and the method of classification) on Q2’s forums and in the papers and official websites cited there. I am still not convinced to use sklearn, as I don’t find it straightforward. Furthermore, compared to BLAST+, I find it more challenging to decide on settings for sklearn, primarily due to my inexperience with it. This is in spite of the good explanations on the forums of the confidence parameter and of how to alter the k-mer settings for sklearn.
My question is: are there any resources available for SILVA or Greengenes to use with BLAST consensus classification on Q2 (i.e., separate reference sequence and associated taxonomy files)? I know there are pre-trained classifiers available for sklearn, but with such large databases it is difficult to prepare the reference database inputs for BLAST.
I hear your concern. Sometimes complexity is good, though. We designed and optimized the classify-sklearn method to work "out of the box": you can run it with the default settings and get reasonable results without needing to adjust anything or intimately understand how it works.
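To illustrate, the default run is a single command (the file names here are placeholders; the pre-trained classifier comes from the data resources page):

qiime feature-classifier classify-sklearn \
  --i-classifier silva-132-99-nb-classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza

The confidence parameter you mentioned is exposed as --p-confidence, if you ever do want to adjust it.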
Still, I get it if you'd rather use something you have better control over. In that case I recommend the classify-consensus-vsearch method, which performs much better than classify-consensus-blast and has more features that give you more control.
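A sketch of what that looks like (file names are placeholders, and the parameter values are illustrative; adjust them for your data):

qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs.qza \
  --i-reference-reads silva-ref-seqs.qza \
  --i-reference-taxonomy silva-ref-taxonomy.qza \
  --p-perc-identity 0.8 \
  --p-maxaccepts 10 \
  --p-min-consensus 0.51 \
  --p-threads 4 \
  --o-classification taxonomy-vsearch.qza

The reference reads and taxonomy inputs are exactly the "separated" files you asked about, so no classifier training is needed.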
See here for a fuller description of all of these methods, along with benchmarking results:
Yes, on the same data resources page where the pretrained classifiers are found.
But I recommend waiting until next week, when the next release of QIIME 2 comes out, and then taking a look at the 2020.6 release data resources. Those SILVA database files were built in a better way, using RESCRIPt.
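If you would rather build the SILVA files yourself, RESCRIPt can fetch and format them directly. A minimal sketch (the version and target values are examples, and the exact options may vary with the RESCRIPt release):

qiime rescript get-silva-data \
  --p-version '138' \
  --p-target 'SSURef_NR99' \
  --p-include-species-labels \
  --o-silva-sequences silva-138-ssu-nr99-rna-seqs.qza \
  --o-silva-taxonomy silva-138-ssu-nr99-tax.qza

Depending on the release, the downloaded sequences may be RNA, in which case qiime rescript reverse-transcribe converts them to DNA before use.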
Thanks for sharing that article and the resources, @Nicholas_Bokulich! I completed a comparison of my samples using vsearch and my own trained classifier, and didn’t observe any major differences in community composition. To train the classifier, I extracted reference reads with the same parameters I used in DADA2; that is, I kept --p-trunc-len and --p-trim-left the same for the extracted reference reads. Is that OK? Does changing these parameters have any significant impact on the trained classifier?
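For context, the extraction step looked roughly like this (the primer sequences and file names below are placeholders, not my exact values):

qiime feature-classifier extract-reads \
  --i-sequences silva-ref-seqs.qza \
  --p-f-primer GTGYCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACNVGGGTWTCTAAT \
  --p-trim-left 0 \
  --p-trunc-len 250 \
  --o-reads ref-seqs-extracted.qza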
It could... if the extracted reads are shorter than the query, it's bad news (it could lead to unclassified or misclassified reads). If they are longer, it doesn't make a big difference (e.g., if you extract to the primer regions but don't trim/truncate further).
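In other words, the safer pattern is to extract to the primer region only, skip truncation, and train on that. A sketch with placeholder file and primer names:

qiime feature-classifier extract-reads \
  --i-sequences silva-ref-seqs.qza \
  --p-f-primer GTGYCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACNVGGGTWTCTAAT \
  --o-reads ref-seqs-primer-region.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs-primer-region.qza \
  --i-reference-taxonomy silva-ref-taxonomy.qza \
  --o-classifier my-classifier.qza

A classifier trained this way tolerates query reads truncated anywhere within that amplicon.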