Hello all. This post is following up on an idea/suggestion forum here BLAST parser for assign_taxonomy.
I am constructing some classifiers for 18S amplicon data. I originally came here to get some support but figured things out via troubleshooting. Some of this may be useful to other folks using 18S. I still have some questions and thoughts for discussion so I figured I would post in ‘User support’. For the most part I am following https://docs.qiime2.org/2017.2/tutorials/feature-classifier/
Questions are at the bottom of my workflow if anyone is interested to discuss.
qiime tools import
qiime tools import
EXTRACT EMBP variable region.
qiime feature-classifier extract-reads
qiime feature-classifier fit-classifier-naive-bayes
What exactly is happening during the ‘slicing’ step. I chose 150 for an average read length. Does this mean I should train a new classifier for 250 bp reads? With 125 bp reads? Is this creating sequences kind of like kmers to try and capture all possible reads?
Noticing there are no degenerate nucleotide codes in this primer set, how much variability does this method allow when finding the primer sites (sorry if this is documented somewhere). I found that less than half of the sequences in the database have the exact forward primer (just a quick grep). I think these primers are in need of refurbishing considering the number sequences we have gained since 2006 (is that when these were made?).
How do the taxonomy viewers use the tabular format exactly. I know the ‘levels’ are separated by ‘;’. Do they also use the ‘D_9’ etc. or is that arbitrary and can be removed to clean up visualization. i.e. does the script solely use ‘;’ to divide the levels.
I would be happy to share these classifiers with the community when I polish them up. I have also completed some classifiers with the complete 18S sequences without primer trimming.
I have compared the results to taxonomy generated with alternative classifiers. They are different in several respects. Which is better, I do not know. I am continuing to test other classifier parameters and I have also written a taxonomy assignment with blast using custom scripts and importing into qiime2.
If anyone has any other databases let me know I would be happy to help format them and construct taxonomy.txt files and later classifiers for qiime2 import.