Hi all!
I am working with an environmentally collected dataset sequenced single-end (SE) on an Illumina MiSeq with the 515F/926R primers. I am trying to build classifiers for it, but they give wildly different results, and I would like some clarification on a few things.
First, I had issues with Greengenes2 on my first pass through this dataset: my reads were kept at 250 bp, and gg2 doesn't have tips out that far, only to around 150 bp. So I went through my analysis with my old classifier, and that worked fine.
Now I want to make sure my old classifier didn't run into any issues, so I'm creating two new ones to compare my results against, BOTH at the 150 bp length (since gg2 can't go out further than that). Below is my process for creating the two classifiers.
Greengenes2 classifier
Files:
150bp_gg_classified_taxonomy.qza (412.6 KB)
visualized_150bpgg2_taxonomy.qzv (2.3 MB)
(unfortunately the classifier itself was too large to upload here)
qiime feature-classifier extract-reads \
  --i-sequences 2022.10.seqs.fna.qza \
  --p-f-primer GTGCCAGCMGCCGCGGTAA \
  --p-r-primer CCGYCAATTYMTTTRAGTTT \
  --p-min-length 0 \
  --p-max-length 0 \
  --p-trunc-len 150 \
  --p-n-jobs 8 \
  --o-reads gg2-ref-seqs
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads gg2-ref-seqs.qza \
  --i-reference-taxonomy reference-taxonomy.qza \
  --o-classifier TBJ_150bp_gg2-classifier
qiime feature-classifier classify-sklearn \
  --i-classifier TBJ_150bp_gg2-classifier.qza \
  --i-reads gg2-ref-seqs.qza \
  --o-classification test_classification
qiime feature-classifier classify-sklearn \
  --i-classifier taxonomy/TBJ_150bp_gg2-classifier.qza \
  --i-reads taxonomy/rep_seqs_deblur_150nt.qza \
  --p-n-jobs -1 \
  --o-classification taxonomy/150bp_gg_classified_taxonomy
qiime metadata tabulate \
  --m-input-file taxonomy/150bp_gg_classified_taxonomy.qza \
  --o-visualization visualized_150bpgg2_taxonomy
For brevity, here is the visualization of the assigned taxonomy for gg2:
Silva_138
Files:
TBJ_Silva_138_classifier.qza (124.5 KB)
visualized_silva138_actual_taxonomy.qzv (2.0 MB)
silva150bp_assigned_taxonomy.qza (360.4 KB)
qiime feature-classifier extract-reads \
  --i-sequences silva-138-99-seqs-515-806.qza \
  --p-f-primer GTGCCAGCMGCCGCGGTAA \
  --p-r-primer CCGYCAATTYMTTTRAGTTT \
  --p-min-length 50 \
  --p-max-length 250 \
  --p-n-jobs 10 \
  --o-reads extracted_silva_138_reads
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads extracted_silva_138_reads.qza \
  --i-reference-taxonomy silva-138-99-tax-515-806.qza \
  --p-classify--chunk-size 30000 \
  --o-classifier TBJ_Silva_138_classifier
qiime feature-classifier classify-sklearn \
  --i-classifier TBJ_Silva_138_classifier.qza \
  --i-reads extracted_silva_138_reads.qza \
  --p-n-jobs -1 \
  --o-classification test_classification
qiime metadata tabulate \
  --m-input-file test_classification.qza \
  --o-visualization test_taxonomyy_silva138_vis
Here is the visualization for the Silva_138 classifier:
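To put a number on the difference between the two classifications, the Unassigned features in each result could be counted from the exported taxonomy table. This is only a minimal sketch: count_unassigned is a hypothetical helper (not part of QIIME 2), the export directory name is made up, and it assumes the input is the taxonomy.tsv that qiime tools export writes for a FeatureData[Taxonomy] artifact (tab-separated, one header row, taxon string in the second column).

```shell
# Hypothetical helper: count features labelled exactly "Unassigned"
# in an exported taxonomy.tsv (columns: Feature ID, Taxon, Confidence).
count_unassigned() {
  awk -F'\t' '
    NR > 1 {                       # skip the header row
      total++
      if ($2 == "Unassigned") un++
    }
    END { printf "unassigned=%d total=%d\n", un, total }
  ' "$1"
}

# Example usage (directory name is an assumption):
#   qiime tools export \
#     --input-path test_classification.qza \
#     --output-path silva_taxonomy_export
#   count_unassigned silva_taxonomy_export/taxonomy.tsv
```

Running the same helper on the gg2 export would give two directly comparable counts.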
Of most obvious note are the different parameters used to create the classifiers themselves. I changed them because the computation time was so long originally, but I don't think that should make much of a difference, other than possibly the truncation length on gg2, correct? I can redo these to match exactly if needed; that is no issue. But I have a feeling something else isn't correct in what I've done. The only issue I can think of is that the original files I used to create these classifiers came from the QIIME 2 docs page, using the 515F/806R files that were available there.
I can remove the unassigned features in the Silva workflow in subsequent steps, but I want to confirm whether there is an error here before I move on. Thank you in advance!
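To check whether the pre-extracted 515F/806R files are the problem, one option is to see how many reference reads actually survived extract-reads and how long they are. The sketch below is an illustration only: fasta_summary is a hypothetical helper (not part of QIIME 2), the export directory name is made up, and it assumes the artifact has first been exported with qiime tools export, which writes dna-sequences.fasta for a FeatureData[Sequence] artifact.

```shell
# Hypothetical helper: report record count and min/max/mean sequence
# length for a FASTA file (multi-line records are handled).
fasta_summary() {
  awk '
    function tally() {
      if (!started) return          # nothing to record before the first header
      n++
      total += len
      if (n == 1 || len < min) min = len
      if (len > max) max = len
    }
    /^>/ { tally(); started = 1; len = 0; next }
    { gsub(/[ \t\r]/, ""); len += length($0) }
    END {
      tally()
      if (n == 0) { print "records=0"; exit }
      printf "records=%d min=%d max=%d mean=%.1f\n", n, min, max, total / n
    }
  ' "$1"
}

# Example usage (directory name is an assumption):
#   qiime tools export \
#     --input-path extracted_silva_138_reads.qza \
#     --output-path extracted_silva_138_reads_export
#   fasta_summary extracted_silva_138_reads_export/dna-sequences.fasta
```

A very low record count compared with the input reference would suggest that the primers were not found in most sequences.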
-UC