Hello all. This post is following up on an idea/suggestion forum here BLAST parser for assign_taxonomy.
I am constructing some classifiers for 18S amplicon data. I originally came here to get some support but figured things out via troubleshooting. Some of this may be useful to other folks using 18S. I still have some questions and thoughts for discussion so I figured I would post in ‘User support’. For the most part I am following https://docs.qiime2.org/2017.2/tutorials/feature-classifier/
Questions are at the bottom of my workflow if anyone is interested to discuss.
MY WORKFLOW:
DATABASE
https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_128_release.tgz
###OTUS
SILVA_128_QIIME_release/rep_set/rep_set_18S_only/99/99_otus_18S.fasta
###TAXONOMY
SILVA_128_QIIME_release/taxonomy/18S_only/99/majority_taxonomy_all_levels.txt
IMPORT OTUS
qiime tools import
–type FeatureData[Sequence]
–input-path SILVA_128_QIIME_release/rep_set/rep_set_18S_only/99/99_otus_18S.fasta
–output-path 18S_SILVA128_99_otus.qza
IMPORT TAXONOMY
qiime tools import
–type FeatureData[Taxonomy]
–input-path SILVA_128_QIIME_release/taxonomy/18S_only/99/majority_taxonomy_all_levels.txt
–output-path 18S_SILVA128_99_taxonomy.qza
EXTRACT EMBP variable region.
###http://www.earthmicrobiome.org/protocols-and-standards/18s/
qiime feature-classifier extract-reads
–i-sequences 18S_SILVA128_99_otus.qza
–p-f-primer GTACACACCGCCCGTC
–p-r-primer TGATCCTTCTGCAGGTTCACCTAC
–p-length 150
–o-reads ref-seqs_18S_99_SILVA128_RL150.qza
TRAIN CLASSIFIER
qiime feature-classifier fit-classifier-naive-bayes
–i-reference-reads ref-seqs_18S_99_SILVA128_RL150.qza
–i-reference-taxonomy 18S_SILVA128_99_taxonomy.qza
–o-classifier 18S_EMB_SILVA128_classifier.qza
#QUESTIONS/DISCUSSION
-
What exactly is happening during the ‘slicing’ step. I chose 150 for an average read length. Does this mean I should train a new classifier for 250 bp reads? With 125 bp reads? Is this creating sequences kind of like kmers to try and capture all possible reads?
-
Noticing there are no degenerate nucleotide codes in this primer set, how much variability does this method allow when finding the primer sites (sorry if this is documented somewhere). I found that less than half of the sequences in the database have the exact forward primer (just a quick grep). I think these primers are in need of refurbishing considering the number sequences we have gained since 2006 (is that when these were made?).
-
How do the taxonomy viewers use the tabular format exactly. I know the ‘levels’ are separated by ‘;’. Do they also use the ‘D_9’ etc. or is that arbitrary and can be removed to clean up visualization. i.e. does the script solely use ‘;’ to divide the levels.
-
I would be happy to share these classifiers with the community when I polish them up. I have also completed some classifiers with the complete 18S sequences without primer trimming.
-
I have compared the results to taxonomy generated with alternative classifiers. They are different in several respects. Which is better, I do not know. I am continuing to test other classifier parameters and I have also written a taxonomy assignment with blast using custom scripts and importing into qiime2.
-
If anyone has any other databases let me know I would be happy to help format them and construct taxonomy.txt files and later classifiers for qiime2 import.
Cheers,
Joseph Sevigny