Taxonomy Classification Best Practices?

ashamarie · September 16, 2020, 3:53pm

Good morning! We have been training our feature classifier on our own data, as recommended in the qiime2 docs ("taxonomic classification accuracy of 16S rRNA gene sequences improves when a Naive Bayes classifier is trained on only the region of the target sequences that was sequenced (Werner et al., 2012)"). However, for the sake of comparison, we also performed taxonomic analysis with pre-trained classifiers, as shown below. You can see that the classification resolution varied dramatically, and we are unsure whether using a self-trained classifier is actually the most appropriate method.

We appreciate any guidance the qiime2 community can give us! Thanks.

SoilRotifer · September 16, 2020, 5:08pm

Hi @ashamarie,

I might need some additional details to help... Note, most are just sanity-check questions.

What amplicon region are you actually sequencing and trying to classify?
- Are you actually sequencing the V4 region with the 515F/806R primers?
What is "self-trained"?
- That is, what database were the reference sequences and taxonomy compiled from?
  - SILVA
  - GreenGenes
  - GenBank
  - GTDB
  - ... ?

In general, you'll likely see more diverse "hits" with SILVA compared to GreenGenes, as it is a much larger reference database.

We have just the tool for you, to help answer such questions. Check out RESCRIPt! Here is the current tl;dr install.

-Thanks!
-Mike

ashamarie · September 17, 2020, 2:08pm

Thanks, yes those are important questions!

Yes, we actually sequenced the V4 region using the 515F and 806R primers (the newer versions, Apprill and Parada).
To self-train, we used SILVA 138; I've pasted the relevant code below. We truncated the reference sequences to 300 bp so that we train only on the region of the target sequences that was sequenced in our samples.

I am using SILVA 138 pre-formatted by the qiime team, downloaded from the "Data resources" page of the QIIME 2 2020.6.0 documentation.

qiime feature-classifier extract-reads \
  --i-sequences silva-138-99-seqs-515-806.qza \
  --p-f-primer GTGYCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACNVGGGTWTCTAAT \
  --p-trunc-len 300 \
  --p-min-length 100 \
  --p-max-length 400 \
  --p-n-jobs 6 \
  --o-reads ref-seqs.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy silva-138-99-tax-515-806.qza \
  --o-classifier silva-138-99-tax-515-806_classifier.qza

qiime feature-classifier classify-sklearn \
  --i-classifier silva-138-99-tax-515-806_classifier.qza \
  --i-reads rep-seqs.qza \
  --p-n-jobs -2 \
  --o-classification taxonomy.qza

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

qiime taxa barplot \
  --i-table sample-table.qza \
  --i-taxonomy taxonomy.qza \
  --m-metadata-file Metadata.txt \
  --o-visualization taxa-bar-plots.qzv

A very small number of taxa were identified with this approach. Something is off.

SoilRotifer · September 18, 2020, 5:52pm

Hi @ashamarie

I am not sure why you are running the qiime feature-classifier extract-reads command.

The silva-138-99-seqs-515-806.qza file, as provided on the Data Resources page, already has these primer sequences removed; hence the file name. That is, this file is provided for the very reason you mentioned: so that the classifier is trained on only the region that was sequenced.

It is a very popular primer set. Re-running the primer extraction step on the reference data a second time may alter the sequences in a detrimental way and reduce your ability to classify your ESVs, which seems to be the case given your plots.
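
To illustrate why a second extraction pass can go wrong, here is a minimal Python sketch of in-silico primer extraction. It is only a toy, not how the plugin is implemented: it matches the primers exactly (after expanding IUPAC degenerate bases), whereas extract-reads matches primers approximately. The function names and the toy sequence are made up for this example; the two primers are the ones from the command above.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def primer_to_regex(primer):
    """Expand degenerate bases into regex character classes."""
    return "".join(
        base if len(IUPAC[base]) == 1 else "[" + IUPAC[base] + "]"
        for base in primer
    )


def reverse_complement(seq):
    """Reverse-complement a sequence, keeping IUPAC codes valid."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_amplicon(seq, f_primer, r_primer):
    """Return the region between the primer sites, or None if a site is missing.

    Hypothetical helper: exact matching only, primers are not kept.
    """
    fwd = re.search(primer_to_regex(f_primer), seq)
    if fwd is None:
        return None
    downstream = seq[fwd.end():]
    rev = re.search(primer_to_regex(reverse_complement(r_primer)), downstream)
    if rev is None:
        return None
    return downstream[:rev.start()]


F_PRIMER = "GTGYCAGCMGCCGCGGTAA"   # 515F, from the command above
R_PRIMER = "GGACTACNVGGGTWTCTAAT"  # 806R, from the command above

# Toy "full-length" reference: flank + 515F site + insert + 806R site + flank.
# All of these sequences are invented for illustration.
insert = "TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCG"
full_length = (
    "AGAGTTTGATCCTGGCTCAG"
    + "GTGCCAGCAGCCGCGGTAA"
    + insert
    + reverse_complement("GGACTACAAGGGTATCTAAT")
    + "AAGTCGTAACAAGGTAACC"
)

first_pass = extract_amplicon(full_length, F_PRIMER, R_PRIMER)
second_pass = extract_amplicon(first_pass, F_PRIMER, R_PRIMER)

print("first pass :", first_pass)
print("second pass:", second_pass)
```

On the toy full-length sequence, the first pass returns the region between the primers. The second pass finds no primer site in that already-trimmed region and returns nothing. The real plugin tolerates mismatches, so its behaviour on already-trimmed references may differ in detail, but the underlying problem is the same: the primer sites it is looking for are no longer there.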

Remember, you can investigate the provenance information provided within that file to see how it was constructed. You can do this by loading the file in QIIME 2 View and clicking on the "Provenance" tab in the upper right.
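
If you prefer to look from a script, a .qza file is a zip archive, and its provenance records are stored inside as action.yaml files. Below is a rough sketch using only the Python standard library. The helper names are made up for this example, and the path filter is an assumption based on the usual archive layout, so treat it as a starting point.

```python
import zipfile


def list_provenance_actions(qza_path):
    """Return the archive paths of all provenance action.yaml records."""
    with zipfile.ZipFile(qza_path) as archive:
        return sorted(
            name
            for name in archive.namelist()
            # Assumed layout: <uuid>/provenance/.../action/action.yaml
            if "/provenance/" in name and name.endswith("action/action.yaml")
        )


def read_member(qza_path, member):
    """Return one archive member decoded as text."""
    with zipfile.ZipFile(qza_path) as archive:
        return archive.read(member).decode("utf-8")


# Example use (file name taken from the thread):
#   for name in list_provenance_actions("silva-138-99-seqs-515-806.qza"):
#       print(name)
#       print(read_member("silva-138-99-seqs-515-806.qza", name))
```

Each action.yaml records the action that produced an artifact along with its parameters, so a primer extraction step applied upstream should show up there.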

-Hope this helps.