clarification/verification of taxonomy classifiers

Rob_DNA · May 20, 2022, 7:44am

Hello,

For the first time, I'm using the taxonomy classifiers in QIIME 2 to assign taxonomy to my fungal (18S) and bacterial (16S) MiSeq sequences. For now, I'd like to use the SILVA database. I've worked with metabarcoding before, but training a classifier is new to me.

If I get it right, I have two options:

1. Use the pre-trained "Silva 138 99% OTUs full-length sequences" file, found at the top of the "Data resources" page of the QIIME 2 2022.2.0 documentation.

This classifier is not trained on my specific primers. So to build it, all SILVA reference sequences were included in the training, instead of first extracting the amplicon region (in contrast to option 2 below)? This is a file that can be used for classifying all 16S/18S sequences, right? Are there any downsides to using this classifier for taxonomic assignment of fungal/bacterial 18S/16S sequences (apart from it not focusing on your region of interest, as option 2 below does)?

I've googled and searched the forum for more information on training a classifier, but I still do not fully understand it. I know that using a trained classifier improves performance, but why exactly is this? Why not 'just' use an untrained reference database such as the SILVA database?

2. I can extract the reads from the SILVA database with my particular primers and then train the classifier:

For that I would first download "Silva 138 SSURef NR99 full-length sequences" and "Silva 138 SSURef NR99 full-length taxonomy" from the "Data resources" page of the QIIME 2 2022.2.0 documentation, under "Marker gene reference databases".

Do I understand correctly that the downloads under "Marker gene reference databases" are "raw" databases, from which you can extract your own reads?

Then I would extract my region of interest:

qiime feature-classifier extract-reads \
  --i-sequences silva-138-99-seqs.qza \
  --p-f-primer GTGCCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACHVGGGTWTCTAAT \
  --p-identity 0.8 \
  --p-min-length 175 \
  --p-max-length 500 \
  --o-reads silva-138-99_ref-seqs_extracted.qza
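
As an aside, both primers above contain IUPAC degenerate bases (M, H, V, W), and the reverse primer is found on the reference as its reverse complement. The sketch below illustrates that idea with standard shell tools only. It is not how extract-reads is implemented, and the function names and the toy reference sequence are made up for the example.

```shell
# Reverse-complement a primer, including IUPAC degenerate codes.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTMRWSYKVHDBN' 'TGCAKYWSRMBDHVN'
}

# Expand IUPAC codes into a regular expression.
# Exact matching only: no mismatches are tolerated here,
# unlike with --p-identity 0.8 in the real command.
iupac_to_regex() {
  printf '%s\n' "$1" | sed \
    -e 's/M/[AC]/g' -e 's/R/[AG]/g' -e 's/W/[AT]/g' -e 's/S/[CG]/g' \
    -e 's/Y/[CT]/g' -e 's/K/[GT]/g' -e 's/V/[ACG]/g' -e 's/H/[ACT]/g' \
    -e 's/D/[AGT]/g' -e 's/B/[CGT]/g' -e 's/N/[ACGT]/g'
}

fwd=$(iupac_to_regex GTGCCAGCMGCCGCGGTAA)
rev_rc=$(iupac_to_regex "$(revcomp GGACTACHVGGGTWTCTAAT)")

# Hypothetical toy reference: forward primer site, short insert, reverse primer site.
toy="AAGTGCCAGCAGCCGCGGTAATACGTAGGGATTAGATACCCTGGTAGTCCTT"

# Print the region between the two primer sites, if both are found.
printf '%s\n' "$toy" | sed -n -E "s/.*${fwd}(.*)${rev_rc}.*/\1/p"
```

With this toy sequence the last command prints the 9-base insert TACGTAGGG. A real reference sequence would of course carry a much longer region between the primer sites.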

and then train the classifier:

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva-138-99_ref-seqs_extracted.qza \
  --i-reference-taxonomy silva-138-99-tax.qza \
  --o-classifier classifier.qza

Now I have a fully functional classifier trained on my specific region of interest, which I can use for taxonomic assignment of my reads.
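
The assignment step itself could then look roughly like the sketch below. The file names rep-seqs.qza and taxonomy.qza are placeholders that do not appear anywhere in this thread, and the command is wrapped in a hypothetical shell function (classify_reads) so that nothing runs until it is called.

```shell
# Assign taxonomy to query sequences with a trained classifier.
# Arguments (all placeholders): classifier, query sequences, output artifact.
classify_reads() {
  qiime feature-classifier classify-sklearn \
    --i-classifier "$1" \
    --i-reads "$2" \
    --o-classification "$3"
}
```

It would be called as: classify_reads classifier.qza rep-seqs.qza taxonomy.qza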

This is the correct way, right?

Thank you very much!

SoilRotifer · May 20, 2022, 1:55pm

Hi @Rob_DNA, let's see if we can get you sorted.

This depends on which classifier you are using. The pre-made SILVA classifiers were constructed from the NR99 SSU version of the database, after some additional quality control via RESCRIPt; you can check out the tutorial for details. You can use the RESCRIPt plugin to download the full, raw SILVA database instead of the NR99 version.

Correct. The main downside is a practical one: the file size and memory footprint of the full-length classifiers can be quite large, which prevents their use on machines that lack the appropriate resources. The amplicon-specific versions are much smaller. Depending on your taxa of interest, you might lose a tiny bit of classification accuracy compared to the amplicon-specific versions, but this has been minimal for the data sets that I've worked with, though your mileage may vary. You can always compare the outputs.
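
One rough way to compare the outputs is to count how many features receive the same label from both classifiers. The sketch below assumes each classification has already been exported (for example with qiime tools export) to a tab-separated table with Feature ID, Taxon and Confidence columns. The toy tables, file names and the compare_taxonomy function are invented for illustration.

```shell
workdir=$(mktemp -d)

# Toy stand-ins for two exported taxonomy tables (tab-separated, one header line).
printf '%s\t%s\t%s\n' \
  'Feature ID' 'Taxon' 'Confidence' \
  'f1' 'd__Bacteria; p__Firmicutes; c__Bacilli' '0.99' \
  'f2' 'd__Bacteria; p__Proteobacteria' '0.85' \
  'f3' 'd__Archaea' '0.91' > "$workdir/full_length.tsv"

printf '%s\t%s\t%s\n' \
  'Feature ID' 'Taxon' 'Confidence' \
  'f1' 'd__Bacteria; p__Firmicutes; c__Bacilli' '0.98' \
  'f2' 'd__Bacteria; p__Proteobacteria; c__Gammaproteobacteria' '0.90' \
  'f3' 'd__Archaea' '0.95' > "$workdir/amplicon.tsv"

# Count features whose Taxon string is identical in both tables.
compare_taxonomy() {
  awk -F '\t' '
    FNR == 1 { next }                    # skip the header line of each file
    NR == FNR { taxon[$1] = $2; next }   # first file: remember label per feature
    ($1 in taxon) {                      # second file: compare shared features
      shared++
      if (taxon[$1] == $2) same++
    }
    END { printf "%d of %d shared features agree\n", same, shared }
  ' "$1" "$2"
}

compare_taxonomy "$workdir/full_length.tsv" "$workdir/amplicon.tsv"
```

For the toy tables this reports that 2 of 3 shared features agree; the disagreement is f2, where one classifier resolves one rank deeper than the other. Exact string comparison is deliberately strict, so a difference in depth counts as a mismatch.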

Great questions! I refer you to these great papers. Several of these also discuss the benefit of constructing amplicon-specific classifiers. But in a nutshell, there are many cases in which several different organisms have identical DNA sequences over the amplicon region, making it hard to disambiguate between taxa. That is, BLAST, for example, might return all equivalent hits, which may not be helpful. We would prefer that a consensus taxonomy, or lowest common ancestor (LCA), be returned for our query sequence.
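
To make the LCA idea concrete, here is a small sketch that reduces several equally good hits to their lowest common ancestor. It only illustrates the concept and is not how the naive Bayes classifier works internally. The lineages are hypothetical placeholders and lca is a made-up helper name.

```shell
# Reduce "; "-separated lineages (one per line on stdin) to their
# lowest common ancestor, i.e. the longest shared prefix of ranks.
lca() {
  awk -F '; ' '
    NR == 1 { n = NF; for (i = 1; i <= NF; i++) rank[i] = $i; next }
    {
      # shrink n to the number of leading ranks still shared with this line
      for (i = 1; i <= n && i <= NF; i++) if ($i != rank[i]) break
      n = i - 1
    }
    END {
      out = ""
      for (i = 1; i <= n; i++) out = out (i > 1 ? "; " : "") rank[i]
      print out
    }
  '
}

# Three hypothetical hits that are identical over the amplicon region.
printf '%s\n' \
  'd__Bacteria; p__Firmicutes; c__Bacilli; g__GenusA; s__species_1' \
  'd__Bacteria; p__Firmicutes; c__Bacilli; g__GenusA; s__species_2' \
  'd__Bacteria; p__Firmicutes; c__Bacilli; g__GenusA; s__species_3' | lca
```

Because the three hits only differ at the species rank, the result stops at the genus: d__Bacteria; p__Firmicutes; c__Bacilli; g__GenusA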

Yes. Although we provide the raw files that were used to make the classifiers, and the classifiers themselves (for the full-length and V4 region of the SSU gene), you can use RESCRIPt (linked above) to choose among several versions of the SILVA database and curate as you'd like. Many on the forum have made their own V3V4 classifier for example.

Yep.

Finally, RESCRIPt provides some tools to help you compare the various reference databases and classifiers you generate.

-Cheers!
-Mike

Rob_DNA · May 20, 2022, 2:23pm

Hi Mike,

thank you very much for your elaborate response! It is really nice to experience that people at QIIME 2 are so helpful!

I have a follow-up question, based on your response:

Yes. Although we provide the raw files that were used to make the classifiers, and the classifiers themselves (for the full-length and V4 region of the SSU gene), you can use RESCRIPt (linked above) to choose among several versions of the SILVA database and curate as you'd like. Many on the forum have made their own V3V4 classifier for example.

Alright, so these files (see the 2 files selected in grey in the picture below, to make sure we are talking about the same files) are not actually the "raw" files, but have first been processed by RESCRIPt to make them more QIIME-compatible (as is stated on the page).

You mention that RESCRIPt can be used for your own curation of the SILVA database, etc. However, for "general usage" (however one would define this), the 2 files I use are suitable for creating a trained classifier based on reads extracted with my own primers, right?

Thank you very much!

SoilRotifer · May 20, 2022, 2:53pm

Correct. We made use of the processing steps as outlined in the tutorial.

This is explained, in detail, through the RESCRIPt SILVA tutorial I linked in my initial response.

This would be the quickest way to proceed, as much of the general processing has been performed for you. In effect, you would be continuing from this step of the tutorial.

Again, if you do not agree with the processing steps outlined in the tutorial, you can start from scratch and process as you'd like. Also, I think the pre-made files are using SILVA 138 and not the updated 138.1 version. Version 138.1 is now the default in RESCRIPt, i.e. rescript get-silva-data ... .

Rob_DNA · May 23, 2022, 6:57am

Thank you very much Mike!
