Doubts about naive bayes classifications on silva 138_99 and greengenes_2022

Hey guys,

I want to understand how naive Bayes classification with SILVA 138_99 and Greengenes2 should be performed.

Given the tutorial (Training feature classifiers with q2-feature-classifier — QIIME 2 2023.2.0 documentation), the commands should look like the ones I put below.

I saw in this other post (How to train the classifier for V3-V4 region with 99% identity using full length seuqnces from new relase of GreenGenes-2022?? - #7 by buzic) that for greengenes2 it should be like this:

qiime feature-classifier extract-reads \
  --i-sequences 2022.10.backbone.full-length.fna.qza \
  --p-f-primer GTGGTGGTGGTGGTGGTG \
  --p-r-primer GGACTGGACTGGACTGGA \
  --p-min-length 100 \
  --p-max-length 600 \
  --o-reads gg_12_10_ref_primer_region_seqs.qza

then use your newly trimmed sequence file along with the backbone taxonomy to train your classifier:

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads gg_12_10_ref_primer_region_seqs.qza \
  --i-reference-taxonomy 2022.10.backbone.tax.qza \
  --o-classifier gg_12_10_primer_region-classifier.qza

So my questions are these:

- Is there no need to import the reference datasets before extracting reads, as the tutorial does? Like below?

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 85_otus.fasta \
  --output-path 85_otus.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path 85_otu_taxonomy.txt \
  --output-path ref-taxonomy.qza

- How should it be done for SILVA? Like this? And where do I get the FASTA sequences for SILVA 138_99 to import as .qza?

Import data

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path silva-138-99-seqs.fasta \
  --output-path silva-138-99-seqs.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path silva_138_99_taxonomy.txt \
  --output-path ref-taxonomy.qza

Extract reads

qiime feature-classifier extract-reads \
  --i-sequences silva-138-99-seqs.qza \
  --p-f-primer GTGCCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACHVGGGTWTCTAAT \
  --o-reads ref-seqs.qza

Train classifier

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier.qza

And the last ones:

- Is it really necessary to use --p-trunc-len, --p-min-length and --p-max-length if I have already run DADA2?
- What is the difference between SILVA SSU with RESCRIPt and training with naive Bayes?

Thank you in advance.


Hi @Liviacmg :wave: :smiley:

Hopefully I can be of some help. I'll go through your questions one at a time!

On importing and where to get the SILVA files: the lovely people at QIIME 2 have done this for you! If you go to the QIIME 2 data resources page and scroll down to the section entitled Silva (16S/18S rRNA), you can see it says "We also provide pre-formatted SILVA reference sequence and taxonomy files here ...". Download the .qza files from there, and you can then extract reads and continue as you described, with no import step needed.
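A rough sketch of what that could look like, as a dry run that only prints the commands unless you set RUN=1. The download URL and file names are my assumption for the 2023.2 release, so please check them against the links on the resources page. The `run` helper is just a hypothetical convenience, and the primers are the ones from your post.

```shell
# Dry-run helper (hypothetical): prints each command; set RUN=1 to execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Assumed download location for the 2023.2 release; verify on the resources page.
base_url="https://data.qiime2.org/2023.2/common"

run wget "$base_url/silva-138-99-seqs.qza"
run wget "$base_url/silva-138-99-tax.qza"

# Check the artifact type before using it (it should be FeatureData[Sequence]).
run qiime tools peek silva-138-99-seqs.qza

# The downloaded .qza goes straight into extract-reads, without qiime tools import.
run qiime feature-classifier extract-reads \
  --i-sequences silva-138-99-seqs.qza \
  --p-f-primer GTGCCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACHVGGGTWTCTAAT \
  --o-reads ref-seqs.qza
```

The taxonomy .qza can then be passed directly as --i-reference-taxonomy when you train the classifier.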

On --p-trunc-len, --p-min-length and --p-max-length: I presume you mean for the read extraction step. You don't have to use them. However, the blue notes in the extract-reads section of the tutorial mention that "The --p-trunc-len parameter should only be used to trim reference sequences if query sequences are trimmed to this same length or shorter." They also explain that the min-length and max-length parameters are there "to exclude simulated amplicons that are far outside of the anticipated length distribution using those primers."

That page also has lots of advice about this kind of thing, and it is absolutely worth a read!
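To make the length window a bit more concrete, here is a toy illustration in plain shell and awk. It is not how QIIME 2 implements it, just the idea: simulated amplicons outside the [min, max] range are dropped. The three sequences and their IDs are made up, and the sketch assumes one sequence line per record.

```shell
# Helper (hypothetical): print a sequence of N "A" characters.
make_seq() { awk -v n="$1" 'BEGIN { while (n-- > 0) printf "A"; print "" }'; }

# Toy FASTA with three simulated amplicons of length 8, 253 and 700.
fasta=$(mktemp)
{
  echo ">too_short"; make_seq 8
  echo ">expected";  make_seq 253
  echo ">too_long";  make_seq 700
} > "$fasta"

# Keep only records whose length falls inside the window (here 100 to 600).
kept=$(awk -v min=100 -v max=600 '
  /^>/ { id = substr($0, 2); next }   # remember the record ID
  length($0) >= min && length($0) <= max { print id, length($0) }
' "$fasta")
rm -f "$fasta"

echo "$kept"   # prints: expected 253
```

Only the amplicon with a plausible length for the primer pair survives, which is the kind of clean-up the tutorial note is describing.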

On RESCRIPt versus naive Bayes: RESCRIPt is a tool to build, format and manage reference databases, so it produces the database (reference sequences plus taxonomy). The naive Bayes method is then trained on that database to create a classifier, which is what you use to assign taxonomy to your sequences. So they are two different steps rather than alternatives.
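As a rough sketch of how the two steps fit together, again as a dry run that only prints the commands unless RUN=1 is set. The flags are as I remember them from the RESCRIPt SILVA tutorial, so please confirm with `qiime rescript get-silva-data --help`. The output file names and the `run` helper are hypothetical, and the curation steps from that tutorial (culling, length filtering, dereplication) are left out for brevity.

```shell
# Dry-run helper (hypothetical): prints each command; set RUN=1 to execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Step 1: RESCRIPt builds the reference database (sequences plus taxonomy).
run qiime rescript get-silva-data \
  --p-version '138.1' \
  --p-target 'SSURef_NR99' \
  --o-silva-sequences silva-rna-seqs.qza \
  --o-silva-taxonomy silva-tax.qza

# SILVA sequences come as RNA, so convert them to DNA before training.
run qiime rescript reverse-transcribe \
  --i-rna-sequences silva-rna-seqs.qza \
  --o-dna-sequences silva-seqs.qza

# Step 2: q2-feature-classifier trains a naive Bayes classifier on that database.
run qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva-seqs.qza \
  --i-reference-taxonomy silva-tax.qza \
  --o-classifier silva-classifier.qza
```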

best,

Vic


Hi @buzic ,

Thank you so much for the quick response!!! :slight_smile:


Hello again @buzic ,

What about training naive Bayes for the V4 region? Should it be like what I put below? Do I just change --i-sequences "2022.10.backbone.full-length.fna.qza" to --i-sequences "2022.10.backbone.v4.fna.qza" in the extract-reads step? Is "full-length" for all regions of 16S and "v4" specific to V4?

qiime feature-classifier extract-reads \
  --i-sequences 2022.10.backbone.v4.fna.qza \
  --p-f-primer GTGGTGGTGGTGGTGGTG \
  --p-r-primer GGACTGGACTGGACTGGA \
  --p-min-length 100 \
  --p-max-length 600 \
  --o-reads gg_12_10_ref_primer_region_seqs.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads gg_12_10_ref_primer_region_seqs.qza \
  --i-reference-taxonomy 2022.10.backbone.tax.qza \
  --o-classifier gg_12_10_primer_region-classifier.qza

Hi @Liviacmg,

A pretrained NB model for V4 relative to Greengenes2 2022.10 is available at the resources link @buzic provided. If you wish to re-train, then it is necessary to extract from "full-length".
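If you do re-train, a sketch of the sequence could look like the following. It is a dry run that only prints the commands unless RUN=1 is set. The primers are the 515F/806R pair quoted earlier in this thread and should be replaced with the ones you actually sequenced with. The output names, `rep-seqs.qza` (standing in for your DADA2 representative sequences) and the `run` helper are hypothetical.

```shell
# Dry-run helper (hypothetical): prints each command; set RUN=1 to execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Primers quoted earlier in the thread; substitute your own.
f_primer=GTGCCAGCMGCCGCGGTAA
r_primer=GGACTACHVGGGTWTCTAAT

# Extract the amplicon region from the full-length backbone sequences.
run qiime feature-classifier extract-reads \
  --i-sequences 2022.10.backbone.full-length.fna.qza \
  --p-f-primer "$f_primer" \
  --p-r-primer "$r_primer" \
  --p-min-length 100 \
  --p-max-length 600 \
  --o-reads gg2-2022.10-v4-ref-seqs.qza

# Train on the extracted region together with the backbone taxonomy.
run qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads gg2-2022.10-v4-ref-seqs.qza \
  --i-reference-taxonomy 2022.10.backbone.tax.qza \
  --o-classifier gg2-2022.10-v4-classifier.qza

# Classify your own sequences with the trained classifier.
run qiime feature-classifier classify-sklearn \
  --i-classifier gg2-2022.10-v4-classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza
```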

Best,
Daniel


Hi @wasade ,

Thank you sooo much!!! :slight_smile: :slight_smile:
