Hello,
This is my first dive into microbiome work and QIIME 2.
Things have been going great until I got to the Training feature classifiers bit. I have tried to follow the tutorial on training, but I think I am missing something about reference sequences and the corresponding taxonomic classifications. The tutorial provides these files for training, but which data/files do I use for my own reference sequences and what would be the corresponding taxonomic classifications for my reference data? I downloaded the "silva-138-99-nb-classifier.qza" and tried to run:
Plugin error from feature-classifier:
The scikit-learn version (0.23.1) used to generate this artifact does not match the current version of
scikit-learn installed (0.24.1). Please retrain your classifier for your current deployment to prevent
data-corruption errors.
Debug info has been saved to /tmp/qiime2-q2cli-err-o7s8rvhl.log
This error led me to the training tutorial, where I am unclear on which reference data to use for my own data.
We are looking at the V3-V4 region of 16S for hundreds of patients from dozens of locations (hands, nose, throat, etc.).
Thank you in advance for any guidance.
Version of QIIME 2 - Conda native install (qiime2-2021.8) on Ubuntu 20,
Welcome to the forum! I'm re-classifying this as user support, since this isn't a technical problem with the software. It sounds like you're working on a super cool project!
You can find reference files for the Silva and Greengenes databases on our data resources page. You'll need a representative sequence file and taxonomy file. Then, you can follow the tutorial for training your own classifier (use your primers rather than the EMP 515-806).
You can also follow the RESCRIPt tutorial to download and format your own database.
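As a rough sketch of the training steps described above, something like the following could work. The reference file names (`silva-138-99-seqs.qza`, `silva-138-99-tax.qza`), the output names, and the primer sequences (a commonly used 341F/805R V3-V4 pair) are assumptions for illustration only, so swap in the files you downloaded and the primers you actually used. The `run` helper and the `DRY_RUN` variable are hypothetical conveniences, not part of QIIME 2: by default the script only prints the commands so you can review them first.

```shell
#!/bin/bash
set -euo pipefail

# Assumed inputs: replace with the reference files you downloaded.
REF_SEQS="silva-138-99-seqs.qza"
REF_TAX="silva-138-99-tax.qza"

# Placeholder V3-V4 primers (assumption): replace with your own.
F_PRIMER="CCTACGGGNGGCWGCAG"
R_PRIMER="GACTACHVGGGTATCTAATCC"

# Hypothetical output names.
TRIMMED_REFS="ref-seqs-v3v4.qza"
TRAINED_CLASSIFIER="silva-138-99-v3v4-classifier.qza"

# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf 'would run:'
    printf ' %q' "$@"
    printf '\n'
  else
    "$@"
  fi
}

# Step 1: trim the reference sequences to the region amplified by the primers.
run qiime feature-classifier extract-reads \
  --i-sequences "$REF_SEQS" \
  --p-f-primer "$F_PRIMER" \
  --p-r-primer "$R_PRIMER" \
  --o-reads "$TRIMMED_REFS"

# Step 2: train a naive Bayes classifier on the trimmed reads and their taxonomy.
run qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads "$TRIMMED_REFS" \
  --i-reference-taxonomy "$REF_TAX" \
  --o-classifier "$TRAINED_CLASSIFIER"
```

Running this inside the same QIIME 2 environment you will classify with means the classifier is built with the installed scikit-learn version, which is what the version-mismatch error is asking for.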
Thanks @crusher083. I think that suggestion might not work in this case, though: q2-feature-classifier 2021.8 has a hard pin on scikit-learn 0.24.1, so force-installing an older version of scikit-learn will cause conda to uninstall q2-feature-classifier.
Thank you all! I have been away from my servers for a few days, but will be connecting tomorrow and will follow up on the suggestions above. @jwdebelius, if I understand you correctly, I do not use my own data for training; I should use the provided reference sequences and corresponding taxonomic classifications? May I assume that since these are publicly available, the training is already done and there are up-to-date trained classifiers? Or is there some reason users need to do the training themselves? Probably dumb questions.
I ran the feature classifier using the downloaded "silva-138-99-nb-weighted-classifier.qza" and it worked.
I ran:
#!/bin/bash
qiime feature-classifier classify-sklearn \
  --i-reads deblur_output/representative_sequences.qza \
  --i-classifier taxa_classifiers/silva-138-99-nb-weighted-classifier.qza \
  --p-n-jobs 32 \
  --output-dir taxa
It completed in a reasonable amount of time. The tutorial says something about trusting the person who trained, but I am just starting so I do not really trust myself. I will also try to follow the training tutorial using my primers, but they are standard V3 and V4 primers.
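For anyone wanting to sanity-check the assignments from a run like the one above, a minimal sketch is to tabulate the taxonomy into a visualization. This assumes the classifier output landed in `taxa/classification.qza` (the name `classify-sklearn` gives its output inside `--output-dir`); the `.qzv` file name is hypothetical, and the `run`/`DRY_RUN` helper is only a convenience that prints the command unless `DRY_RUN=0` is set.

```shell
#!/bin/bash
set -euo pipefail

# Assumed location of the classify-sklearn output from the run above.
CLASSIFICATION="taxa/classification.qza"
# Hypothetical name for the visualization.
TAXONOMY_VIZ="taxa/classification.qzv"

# DRY_RUN=1 (default) prints the command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf 'would run:'
    printf ' %q' "$@"
    printf '\n'
  else
    "$@"
  fi
}

# Turn the per-feature taxonomy and confidence values into a browsable table.
run qiime metadata tabulate \
  --m-input-file "$CLASSIFICATION" \
  --o-visualization "$TAXONOMY_VIZ"
```

The resulting `.qzv` can be opened with `qiime tools view` or at view.qiime2.org to see whether the assignments and confidence values look plausible for the body sites sampled.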
Thanks again for all the amazing and quick support!