Issues with choosing classifiers on feature-classifier classify-sklearn

UnevenCuttlefish · January 18, 2024, 5:20pm

Hi all!
I am currently working with an environmentally collected dataset sequenced single-end (SE) on an Illumina MiSeq with 515F/926R primers. I am trying to create classifiers for it, but they are giving wildly different results, and I would like some clarification on a few things.

Firstly, I had issues with Greengenes2 on my first pass through this dataset: my samples were kept at 250 bp, and gg2 doesn't have tips out that far, only to around 150 bp. So I went ahead and ran my analysis with my old classifier, which worked just fine.

Now I'm trying to make sure that my old classifier didn't run into any issues, so I'm creating two new ones to compare my results against, BOTH at the 150 bp length (since gg2 can't go out further than that). Below is my process for creating the two classifiers.

Greengenes2 classifier
Files:
150bp_gg_classified_taxonomy.qza (412.6 KB)
visualized_150bpgg2_taxonomy.qzv (2.3 MB)
(unfortunately the classifier itself was too large to upload here)

qiime feature-classifier extract-reads \
  --i-sequences 2022.10.seqs.fna.qza \
  --p-f-primer GTGCCAGCMGCCGCGGTAA \
  --p-r-primer CCGYCAATTYMTTTRAGTTT \
  --p-min-length 0 \
  --p-max-length 0 \
  --p-trunc-len 150 \
  --p-n-jobs 8 \
  --o-reads gg2-ref-seqs

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads gg2-ref-seqs.qza \
  --i-reference-taxonomy reference-taxonomy.qza \
  --o-classifier TBJ_150bp_gg2-classifier

qiime feature-classifier classify-sklearn \
  --i-classifier TBJ_150bp_gg2-classifier.qza \
  --i-reads gg2-ref-seqs.qza \
  --o-classification test_classification

qiime feature-classifier classify-sklearn \
  --i-classifier taxonomy/TBJ_150bp_gg2-classifier.qza \
  --i-reads taxonomy/rep_seqs_deblur_150nt.qza \
  --p-n-jobs -1 \
  --o-classification taxonomy/150bp_gg_classified_taxonomy

qiime metadata tabulate \
  --m-input-file taxonomy/150bp_gg_classified_taxonomy.qza \
  --o-visualization visualized_150bpgg2_taxonomy

For brevity, here is how the assigned taxonomy visualizes for gg2:

Silva_138
Files:
TBJ_Silva_138_classifier.qza (124.5 KB)
visualized_silva138_actual_taxonomy.qzv (2.0 MB)
silva150bp_assigned_taxonomy.qza (360.4 KB)

qiime feature-classifier extract-reads \
  --i-sequences silva-138-99-seqs-515-806.qza \
  --p-f-primer GTGCCAGCMGCCGCGGTAA \
  --p-r-primer CCGYCAATTYMTTTRAGTTT \
  --p-min-length 50 \
  --p-max-length 250 \
  --p-n-jobs 10 \
  --o-reads extracted_silva_138_reads

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads extracted_silva_138_reads.qza \
  --i-reference-taxonomy silva-138-99-tax-515-806.qza \
  --p-classify--chunk-size 30000 \
  --o-classifier TBJ_Silva_138_classifier

qiime feature-classifier classify-sklearn \
  --i-classifier TBJ_Silva_138_classifier.qza \
  --i-reads extracted_silva_138_reads.qza \
  --p-n-jobs -1 \
  --o-classification test_classification

qiime metadata tabulate \
  --m-input-file test_classification.qza \
  --o-visualization test_taxonomyy_silva138_vis

Here is the visualization for the Silva_138 classifier:

The most obvious thing to note is the different parameters used to create the two classifiers. I changed them because the computation time was so long originally, but I don't think that should make much of a difference, other than possibly the truncation length on gg2, correct? I can redo these to match exactly if needed; that is no issue. But I have a feeling something else isn't correct in what I've done. The only other issue I can think of is that the original files I used to create these classifiers were the 515F/806R ones available on the QIIME 2 docs page.

I can remove the unassigned features in the Silva workflow in subsequent steps, but I want to confirm whether there is an error here before I move on. Thank you in advance!

-UC

Nicholas_Bokulich · January 18, 2024, 6:47pm

Hi @UnevenCuttlefish ,

The issue is coming from the extract-reads command in your SILVA workflow.

Why are you setting max-length to 250? This is shorter than the V4 amplicon amplified by the 515-806 primers in most species, so anything longer than 250 nt (i.e., most sequences) will be dropped.

So you are creating a tiny classifier with only a few sequences in it, hence why you are only hitting Streptococcus or Unassigned.

You did this with the SILVA classifier but not with the GG classifier, so you are sort of comparing apples and oranges.

Anyway, increasing the max-length to an appropriate value in that action (like > 300 nt) should do the trick.
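
To make the difference between a length cap and truncation concrete, here is a toy sketch in plain shell and awk. It is not QIIME 2 code. It assumes the behaviour described above and in the extract-reads help text: amplicons longer than max-length are discarded outright, whereas trunc-len keeps them and cuts them down to that length. The function names, sequence IDs, and lengths are all made up for illustration.

```shell
#!/usr/bin/env bash
# Toy illustration only (NOT QIIME 2 code): mimics, under the assumptions
# stated above, how a length cap differs from truncation.

# filter_amplicons MAX_LEN TRUNC_LEN MIN_LEN
# Reads "id<TAB>sequence" lines on stdin; a value of 0 disables that step.
filter_amplicons() {
  awk -F'\t' -v max="$1" -v trunc="$2" -v min="$3" '
    {
      seq = $2
      # 1. Length cap: amplicons longer than max are discarded outright.
      if (max > 0 && length(seq) > max) next
      # 2. Truncation: amplicons are kept but cut to at most trunc bases.
      if (trunc > 0) seq = substr(seq, 1, trunc)
      # 3. Minimum length: amplicons shorter than min are discarded.
      if (min > 0 && length(seq) < min) next
      print $1 "\t" length(seq)
    }'
}

# make_seq N: print a dummy sequence of N bases.
make_seq() { awk -v n="$1" 'BEGIN { while (n-- > 0) printf "A"; print "" }'; }

# Three hypothetical reference amplicons (IDs and lengths are invented).
toy_refs() {
  printf 'ref_short\t%s\n' "$(make_seq 240)"
  printf 'ref_typical\t%s\n' "$(make_seq 375)"
  printf 'ref_long\t%s\n' "$(make_seq 410)"
}

echo "cap at 250, no truncation:"
toy_refs | filter_amplicons 250 0 50

echo "no cap, truncate to 150:"
toy_refs | filter_amplicons 0 150 0
```

With the cap at 250, only ref_short survives. With truncation at 150 and no cap, all three are kept, each cut to 150 nt. That is the asymmetry between the two workflows in this thread.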

Good luck!

UnevenCuttlefish · January 18, 2024, 7:10pm

Thank you! I figured that's where the error was coming from. I had misread the max-length parameter when I had originally done it. I wanted to make sure that was the error before I did it over again.

wasade · January 18, 2024, 8:08pm

Hi @UnevenCuttlefish,

Using all ~20 million sequences in Greengenes2 2022.10, where most are fragments, for training the classifier could have unexpected effects. Why not use either the existing full length model, or train on just the full length backbone sequences?

While it's true we primarily placed 90, 100, and 150 bp sequences in Greengenes2 2022.10, the full length classifier isn't constrained to those lengths. The full length classifier can be obtained here. It's plausible the V4 classifier would just work too, though the reverse primer is a little different.

All the best,
Daniel
