Taxonomic Assignment to V4V5 16S rRNA with SSU-SILVA (version 138)

CarlaB · May 20, 2021, 3:12pm

Hey everyone!

I have been trying to work out the best way to assign taxonomy to my 16S rRNA V4V5 sequences. I have found two options so far, bearing in mind that Naive Bayes classifiers trained on the region of the target sequences improve taxonomic classification accuracy, as said here:

1. Use RESCRIPt to build an amplicon-region-specific classifier for my data.
2. Use the full-length pre-formatted SILVA reference sequence and taxonomy files already processed with RESCRIPt (found in Data Resources). Then use the qiime feature-classifier extract-reads command to restrict them to V4V5 and re-train with the qiime feature-classifier fit-classifier-naive-bayes command. At this step I got the same warning as many other users:

packages/q2_feature_classifier/classifier.py:101: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.23.1. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
warnings.warn(warning, UserWarning)

So, should I worry? I don't really know the version of scikit-learn used to process the available files.

Does this second approach make sense? Do I need to dereplicate the extracted region before training the classifier? And what would be the difference between these two options?

I hope you can help me!!!

Best wishes,
Carla

SoilRotifer · May 20, 2021, 4:50pm

Hi @CarlaB, welcome to QIIME 2!

If you search the forum you'll find several threads regarding scikit-learn versions.

Essentially, you should download the files from the Data Resources page that match the version of QIIME 2 you are using. Note the drop-down menu in the upper left of the page. If you are indeed using the correct version, then something altered your environment and changed the version of scikit-learn.

If you look through the first linked thread above you'll see an example of how to determine this by looking through the provenance.
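For anyone who would rather check this with a script, here is a minimal sketch. It assumes a .qza file is a zip archive whose provenance steps are stored as action.yaml files listing package versions on a `scikit-learn:` line; that layout is an assumption here, and the function names are made up for illustration. It scans the text with a regular expression instead of parsing the YAML.

```python
import re
import zipfile
from importlib.metadata import PackageNotFoundError, version


def sklearn_versions_in_provenance(qza_path):
    """Map each provenance action.yaml in the archive to the scikit-learn version it records."""
    pattern = re.compile(r"^\s*scikit-learn:\s*['\"]?([^'\"\s]+)", re.MULTILINE)
    versions = {}
    with zipfile.ZipFile(qza_path) as archive:
        for name in archive.namelist():
            # Assumption: provenance steps live under .../provenance/.../action.yaml
            if "/provenance/" in name and name.endswith("action.yaml"):
                text = archive.read(name).decode("utf-8", errors="replace")
                match = pattern.search(text)
                if match:
                    versions[name] = match.group(1)
    return versions


def installed_sklearn_version():
    """Return the scikit-learn version in the current environment, or None if it is absent."""
    try:
        return version("scikit-learn")
    except PackageNotFoundError:
        return None
```

Comparing the versions found in the classifier's provenance with `installed_sklearn_version()` in the active QIIME 2 environment should show whether they match.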

Nicholas_Bokulich · May 20, 2021, 5:25pm

Hi @CarlaB ! Just answering your latter questions:

Yes, your second approach makes sense.

If you plan to use SILVA 138 or Greengenes 13_8, use the pre-formatted databases in the data resources page!

If you would rather use a different data source or a different version of these databases, use your option #1 (RESCRIPt)!

No, you don't need to dereplicate, but it will speed things up if you have many redundant sequences (i.e., sequences that are unique across the full-length 16S gene can be 100% identical within a 16S subregion).
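To illustrate why extracted regions can become redundant, here is a toy sketch. The sequences and IDs are invented, and a fixed slice stands in for primer-based extraction, so this is not how extract-reads or RESCRIPt work internally.

```python
from collections import defaultdict


def dereplicate(seqs):
    """Group sequence IDs by identical sequence."""
    groups = defaultdict(list)
    for seq_id, seq in seqs.items():
        groups[seq].append(seq_id)
    return dict(groups)


# Made-up "full-length" references: all three are distinct.
full_length = {
    "refA": "AAAACCGGTTGGAAAA",
    "refB": "TTTTCCGGTTGGTTTT",
    "refC": "AAAACCGGATGGAAAA",
}

# Stand-in for amplicon extraction: keep only positions 4..11.
region = slice(4, 12)
extracted = {seq_id: seq[region] for seq_id, seq in full_length.items()}

unique_full = dereplicate(full_length)
unique_regions = dereplicate(extracted)
print(len(unique_full), "unique full-length;", len(unique_regions), "unique after extraction")
```

Here refA and refB differ only outside the extracted window, so they collapse into one entry after extraction, and the classifier has fewer redundant sequences to train on.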

Option 1 gives you more flexibility if you want it (e.g., if you disagree with the default options used for processing these databases you can start from scratch and choose your own filtering options).

Option 2 saves you time!
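As a rough sketch of option 2, the snippet below only builds and prints the two commands mentioned above; it does not run them. The file names are placeholders, and the primers shown (515F-Y / 926R) are an assumption about the V4V5 primer pair, so substitute the ones used for your own amplicons.

```python
import shlex


def option2_commands(ref_seqs, ref_tax, f_primer, r_primer,
                     reads_out="ref-seqs-v4v5.qza",
                     classifier_out="classifier-v4v5.qza"):
    """Build the extract-reads and fit-classifier-naive-bayes command lines."""
    extract = [
        "qiime", "feature-classifier", "extract-reads",
        "--i-sequences", ref_seqs,
        "--p-f-primer", f_primer,
        "--p-r-primer", r_primer,
        "--o-reads", reads_out,
    ]
    train = [
        "qiime", "feature-classifier", "fit-classifier-naive-bayes",
        "--i-reference-reads", reads_out,
        "--i-reference-taxonomy", ref_tax,
        "--o-classifier", classifier_out,
    ]
    return [extract, train]


commands = option2_commands(
    ref_seqs="silva-138-99-seqs.qza",   # placeholder file name
    ref_tax="silva-138-99-tax.qza",     # placeholder file name
    f_primer="GTGYCAGCMGCCGCGGTAA",     # assumed forward primer (515F-Y)
    r_primer="CCGYCAATTYMTTTRAGTTT",    # assumed reverse primer (926R)
)
for cmd in commands:
    print(shlex.join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` inside an activated QIIME 2 environment. An optional dereplication step would sit between the two commands.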

Good luck!

CarlaB · May 20, 2021, 9:02pm

Hey Nicholas,

I think your answer solves all of my questions here. So thanks!!

Greetings
Carla

CarlaB · May 20, 2021, 9:02pm

Thank you for the quick and useful reply. I checked where you told me to, and the scikit-learn versions match.

Have a nice day!
Carla
