feature-classifier extract-reads v3 v4

Hello :grinning:

I'm trying to train a Naive Bayes (NB) classifier with Greengenes2 for the V3-V4 region.

What values do you recommend for --p-min-length and --p-max-length?
Are these parameters mandatory?

Thank you in advance

Hi @Linda_Abenaim :slightly_smiling_face:

There is a lot of advice on the forum about training databases (see here and here). The QIIME 2 tutorial documents also have some useful information in their notes sections to get you thinking about your own sequences and how to make an informed decision.

As for which inputs are mandatory, the documentation marks each parameter as required or optional. For example, the feature-classifier extract-reads document describes the inputs (I've only pasted the top part here):

  --i-sequences ARTIFACT FeatureData[Sequence]   [required]
  --p-f-primer TEXT   forward primer sequence (5' -> 3').       [required]
  --p-r-primer TEXT   reverse primer sequence (5' -> 3'). 
                      Do not use reverse-complemented primer sequence.  [required]

These inputs are marked required, meaning they are mandatory. If you scroll further down that document page, you'll see the output is required as well (makes sense, you need to output the data :laughing:).
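To make the required/optional split concrete, here is a minimal sketch of an extract-reads call wrapped in a shell function. The primer pair (341F / 805R) and the length bounds are assumptions for illustration only, not recommendations, and the file names in the usage line are hypothetical. As far as I can tell from the help text, --p-min-length and --p-max-length have defaults, so they can be left out.

```shell
# Assumed V3-V4 primer pair (341F / 805R); replace with the primers you actually used.
F_PRIMER="CCTACGGGNGGCWGCAG"
R_PRIMER="GACTACHVGGGTATCTAATCC"
# Placeholder length bounds for illustration, not recommendations.
MIN_LEN=300
MAX_LEN=500

# Usage: extract_v3v4 <reference-seqs.qza> <output-reads.qza>
extract_v3v4() {
  qiime feature-classifier extract-reads \
    --i-sequences "$1" \
    --p-f-primer "$F_PRIMER" \
    --p-r-primer "$R_PRIMER" \
    --p-min-length "$MIN_LEN" \
    --p-max-length "$MAX_LEN" \
    --o-reads "$2"
}
```

You would call it as `extract_v3v4 ref-seqs.qza ref-seqs-v3v4.qza`. Note that the primers go in 5' -> 3' orientation and the reverse primer is not reverse-complemented, as the help text above says.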

So have a deeper look at those, and if you are still stuck afterwards, post again and someone can help :+1:




Just wanted to mention that Greengenes 2 is constructed a bit differently from the other reference databases, so in case you haven't already seen this, I would recommend starting here with the developer's recommendations for non-V4 data.


Thank you so much, I will try!
But can I still classify with NB?

Another question: I don't understand the differences between Silva and Greengenes2.
What do you suggest for my V3-V4 region?

Hello Linda,

Looks like we may be continuing the discussion started here.

Mehrbod and I both suggested the GreenGenes 2 database and qiime greengenes2 non-v4-16s, which does not require an 'extract-reads' step.

What's best for your data depends on what's in your data.

(If you included positive controls on your run with known composition, these can help you evaluate what methods work well.)

Try these out, see what works, and report back! :microscope: :telescope:


Thank you so much for your suggestion, Colin! So if I use Greengenes2 non-v4-16s, the extract-reads command is not necessary and I can go straight to qiime feature-classifier fit-classifier-naive-bayes, right?

Another question: what is the difference between the Greengenes fna.qza and nb.qza files? If I use Naive Bayes, do you suggest downloading nb.qza? And which of the taxonomy files do I have to download?
Thank you in advance for your help :pray:


.fna are the fasta sequences
.nb is the trained Naive Bayes classifier
.tax is the taxonomy file for this backbone

(QIIME 2 has some safeguards built in and will alert you to some common errors. Try it and see!)

(And you can inspect these files for yourself with https://view.qiime2.org )


Thank you so much for your suggestion!
I'm sorry, I'm trying to use greengenes2 non-v4-16s, and I don't understand whether I should put my own table and rep-seqs in place of icu.biom.qza and icu.fna.qza in the qiime greengenes2 non-v4-16s command. After non-v4-16s, can I jump straight to the "feature-classifier classify-sklearn" step and use 2022.10.backbone.full-length.nb.qza directly as the classifier?
Did I understand correctly?



Hi @Linda_Abenaim,

If you look at the example from the tutorial I linked above, it shows that the ICU table and representative sequence file are the user's own feature-table and rep-seqs. So, for example, if you ran DADA2, you would have both of these files at the end of that step.

With non-V4 data, based on the example in the tutorial:

$ qiime greengenes2 non-v4-16s \
    --i-table icu.biom.qza \ #your own biom table
    --i-sequences icu.fna.qza \ #your own rep-seqs file
    --i-backbone 2022.10.backbone.full-length.fna.qza \ #you download this
    --o-mapped-table icu.gg2.biom.qza \ #this is the new gg2-filtered feature-table you use downstream
    --o-representatives icu.gg2.fna.qza #this is the new gg2-filtered rep-seqs file used downstream.

In the background, this command is using OTU clustering on your V3-V4 reads and then it inserts those into a background tree.

Once you have these files, you can simply get your taxonomy file:

$ qiime greengenes2 taxonomy-from-table \
--i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza \ #download from gg2
--i-table icu.gg2.biom.qza \ #table from last step
--o-classification icu.gg2.taxonomy.qza #new taxonomy file to use downstream

So, technically you don't need to do any of the Naive Bayes classification steps, and in fact if you look at the bottom of the tutorial in this section it recommends against using it. But, if you did want to use it anyways for some reason, you can download the full-length pre-trained classifier they provide in that page and follow the usual steps you would for your NB workflow.
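If you do go the Naive Bayes route anyway, the usual step would look roughly like the sketch below. This is only an illustration under assumptions: the function name and the rep-seqs and output file names in the usage line are hypothetical, and the pre-trained classifier has to match the scikit-learn version in your QIIME 2 environment.

```shell
# Usage: classify_nb <pretrained-classifier.qza> <rep-seqs.qza> <output-taxonomy.qza>
classify_nb() {
  qiime feature-classifier classify-sklearn \
    --i-classifier "$1" \
    --i-reads "$2" \
    --o-classification "$3"
}
```

For example: `classify_nb 2022.10.backbone.full-length.nb.qza rep-seqs.qza nb-taxonomy.qza`.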

Hope this helps!


I don't understand why I have this error:

Usage: qiime greengenes2 taxonomy-from-table [OPTIONS]

Pull lineage information for each feature off thereference phylogey

--i-reference-taxonomy ARTIFACT
Phylogeny[Rooted] The reference taxonomy to derive from. Note that this
input corresponds to the .nwk reference artifact
--i-table ARTIFACT FeatureTable[Frequency]
The feature table to classify [required]
--o-classification ARTIFACT FeatureData[Taxonomy]
The resulting classifications [required]
--output-dir PATH Output unspecified results to a directory
--verbose / --quiet Display verbose output to stdout and/or stderr during
execution of this action. Or silence output if
execution is successful (silence is golden).
--example-data PATH Write example data and exit.
--citations Show citations and exit.
--help Show this message and exit.

              There were some problems with the command:                  

(1/2) Missing option '--o-classification'. ("--output-dir" may also be used)
(2/2) Got unexpected extra argument ( )
/var/spool/slurmd/job6216383/slurm_script: line 22: --output-dir: command not found

this is my command:
qiime greengenes2 taxonomy-from-table
--i-reference-taxonomy taxonomygg22.2/2022.10.taxonomy.asv.nwk.qza
--i-table taxonomygg22.2/gg2filtered-table.qza \
--o-classification taxonomygg22.2/gg2-taxonomy.qza


Hi @Linda_Abenaim,

I suspect the comments (i.e., what follows each #) are not working correctly with the line continuations (\). Can you remove the comments, and any whitespace that follows a backslash, and rerun?

All the best,

Thanks @wasade, now it works.
I followed the command that @Mehrbod_Estaki suggested, but I noticed that the Feature ID column of taxonomy.qzv doesn't contain the features that I find in my feature table.

What is the problem?
gg2-taxonomy.qzv (1.2 MB)

Also the taxa bar plot of gg2 taxonomy is very different from gg13_8.
taxa-bar-plots-gg2.qzv (357.9 KB)
taxa-bar-plots-bsf.qzv (377.9 KB)

Hi @Linda_Abenaim,

Sorry for the delay, I inadvertently missed the reply here. Do you have an example of a feature ID in your resulting taxonomy output, which was not in your input?
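One way to find such an example is to compare the ID columns of the two files. The sketch below assumes both artifacts have already been exported to tab-separated text (for instance with qiime tools export, plus biom convert for the table), that IDs are in the first column, and that header lines start with "#" or "Feature ID". The function name and file names are hypothetical.

```shell
# Print IDs that appear in the first column of file A but not of file B.
# Header lines beginning with "#" or "Feature ID" are skipped (assumed layout).
ids_only_in_a() {
  a_ids=$(mktemp)
  b_ids=$(mktemp)
  grep -v -e '^#' -e '^Feature ID' "$1" | cut -f1 | sort -u > "$a_ids"
  grep -v -e '^#' -e '^Feature ID' "$2" | cut -f1 | sort -u > "$b_ids"
  # comm -23 keeps only the lines unique to the first sorted file.
  comm -23 "$a_ids" "$b_ids"
  rm -f "$a_ids" "$b_ids"
}
```

Calling `ids_only_in_a taxonomy.tsv feature-table.tsv` lists IDs present in the taxonomy but missing from the table; swapping the arguments shows the reverse.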

There are significant differences in the taxonomy between 2022.10 and 13_8, in part because the nomenclature itself has changed quite a bit. I would anticipate, though, that unweighted and weighted UniFrac, using SEPP insertion via q2-fragment-placement for 13_8, would exhibit a high Mantel correlation with 2022.10.

