Introducing Greengenes2 2022.10

Hi @iptz1,

Thank you for the kind words!

The precise commands we use right now to fit the V4 classifier can be found here.

I think the missing piece for V3-V4 is what --i-sequences to specify with extract-reads, and that would be the backbone full-length sequences.

Best,
Daniel

Hello @wasade!

Thank you for posting this, it is very helpful!
I'm new to QIIME 2 and I have a very basic question:
I am trying to train my own classifier using Greengenes2. However, I'm having a hard time finding the files I need for this step (I believe one sequence file and one taxonomy file) on the website: Index of /greengenes_release/2022.10.
Are they 2022.10.backbone.full-length.fna.qza and 2022.10.backbone.tax.qza?

Thank you for your help!

Hi @Xiaojing_Liu,

Yes, I believe those are the files that you need.

Best,
Daniel

Hi!
I also have problems training a classifier with Greengenes2. I downloaded the same files as @Xiaojing_Liu, but I am completely lost on how to start with these new files. I am trying to train a classifier for my paired-end data, which cover the V3-V4 region (341F:805R). If you could guide me a little, I would be very grateful!

Hi @Melisa_Olivelli,

To train a region specific classifier, it is necessary to use extract-reads from q2-feature-classifier, followed by fit-classifier-naive-bayes. The exact commands used for the V4 classifier with Greengenes2 can be found here.

As an alternative, you could use the classifier trained on the full-length records, which should work "out of the box".
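
For the region-specific route, the two steps above can be sketched roughly as follows. This is not the exact set of commands linked above, only a minimal illustration in Python that assembles and prints the two command lines without running QIIME 2. The primer sequences shown for 341F/805R and the output file names are assumptions for illustration; please confirm the primers against your own protocol and check each flag with `--help` in your QIIME 2 version.

```python
import shlex

# Assumed primer sequences for 341F/805R; confirm against your own protocol.
F_PRIMER_341F = "CCTACGGGNGGCWGCAG"
R_PRIMER_805R = "GACTACHVGGGTATCTAATCC"


def build_training_commands(backbone_seqs, backbone_tax, f_primer, r_primer,
                            reads_out="backbone.v3v4.fna.qza",
                            classifier_out="gg2.v3v4.nb.qza"):
    """Return the two command lines as argument lists (nothing is executed).

    The output file names are hypothetical placeholders.
    """
    # Step 1: cut the amplified region out of the full-length backbone.
    extract = [
        "qiime", "feature-classifier", "extract-reads",
        "--i-sequences", backbone_seqs,
        "--p-f-primer", f_primer,
        "--p-r-primer", r_primer,
        "--o-reads", reads_out,
    ]
    # Step 2: fit the naive Bayes classifier on the extracted reads.
    fit = [
        "qiime", "feature-classifier", "fit-classifier-naive-bayes",
        "--i-reference-reads", reads_out,
        "--i-reference-taxonomy", backbone_tax,
        "--o-classifier", classifier_out,
    ]
    return extract, fit


extract_cmd, fit_cmd = build_training_commands(
    "2022.10.backbone.full-length.fna.qza",
    "2022.10.backbone.tax.qza",
    F_PRIMER_341F,
    R_PRIMER_805R,
)
print(shlex.join(extract_cmd))
print(shlex.join(fit_cmd))
```

The printed lines can be pasted into a shell where QIIME 2 is active. Note that the reads produced by the first step are what the second step trains on, while the taxonomy file is used unchanged.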

All the best,
Daniel

An off-topic reply has been split into a new topic: Feature classifers in python

Please keep replies on-topic in the future.

Hi @wasade

This is excellent! Thank you kindly for your efforts!

I was just wondering whether the command "greengenes2 non-v4-16s" requires a separate package that I need to download, as I receive the error "QIIME 2 has no plugin/command named 'greengenes2'". Any feedback would be greatly appreciated!

Kind regards,

Johann

Thanks @Johanndb!

To run qiime greengenes2 non-v4-16s, it is necessary to install the plugin. That can be done with pip install q2-greengenes2.

All the best,
Daniel

Hello @wasade,
thank you for your kind advice! Can I kindly ask you one more question? I now obtain several taxa names that include numbers or capital letters, such as "g__Blautia_A_141781;s__Blautia_A_141781 faecis" or "p__Firmicutes_A;c__Clostridia_258483;o__Lachnospirales;f__Lachnospiraceae;g__Mediterraneibacter_A_155507;s__Mediterraneibacter_A_155507 faecis". So the questions are:

  • what do the numbers (141781, 258483, 155507, ...) mean?
  • what do the letters (Firmicutes_A, Mediterraneibacter_A, ...) mean?

Thank you again!
Ilaria

Hi @iptz1,

Good questions! The _A labels are directly from GTDB (see here for why). The _<number> is used to represent a distinct node in the phylogeny. In this case, "g__Blautia_A" is supported by more than one node, so we have to differentiate them to ensure the taxonomy label is unique. You can find the three Blautia clades in the Greengenes2 website if you'd like to explore the taxonomy directly.
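
If it helps to separate these pieces programmatically, here is a small Python sketch that splits a label into its rank, base name, GTDB letter suffix, and numeric node suffix. The regular expression is a heuristic inferred only from the example labels in this thread, and the function names are made up for illustration; it is not part of Greengenes2 or QIIME 2 and may not cover every label in the release.

```python
import re

# Heuristic pattern: rank prefix, base name, optional GTDB letter suffix,
# optional numeric node suffix (e.g. g__Blautia_A_141781).
LABEL_RE = re.compile(
    r"^(?P<rank>[a-z])__(?P<name>.*?)(?:_(?P<gtdb>[A-Z]+))?(?:_(?P<node>\d+))?$"
)


def parse_label(label):
    """Split one rank label into rank, name, GTDB suffix, node id, and epithet."""
    # Species labels carry the epithet after a space, e.g. "... faecis".
    taxon, _, epithet = label.strip().partition(" ")
    match = LABEL_RE.match(taxon)
    if match is None:
        raise ValueError(f"unrecognized label: {label!r}")
    parts = match.groupdict()
    parts["epithet"] = epithet or None
    return parts


def parse_lineage(lineage):
    """Parse a semicolon-delimited lineage into a list of label dictionaries."""
    return [parse_label(p) for p in lineage.split(";") if p.strip()]


lineage = (
    "p__Firmicutes_A;c__Clostridia_258483;o__Lachnospirales;"
    "f__Lachnospiraceae;g__Mediterraneibacter_A_155507;"
    "s__Mediterraneibacter_A_155507 faecis"
)
for parts in parse_lineage(lineage):
    print(parts)
```

Keeping the node suffix as a separate field makes it possible to group by the GTDB name while still telling apart clades that share that name.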

All the best,
Daniel

Hello, I am also using QIIME 2 to analyze microbiome data for the first time, and I am trying to train a classifier for the V5-V7 region. Has this problem been solved? Could you share your process for this step? Thank you very much!

Hi @XIAOXI,

For V5-V7 data, I recommend using the non-v4-16s action, which will perform closed-reference OTU picking against the backbone. Alternatively, you could perform naive Bayes classification using the full-length model.
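
As a rough sketch of the first option, the Python snippet below assembles a `qiime greengenes2 non-v4-16s` call and prints it without running anything. The flag names are written from my recollection of the plugin's help text and should be checked with `qiime greengenes2 non-v4-16s --help`; the input and output file names are hypothetical placeholders.

```python
import shlex


def build_non_v4_16s_command(table, sequences, backbone,
                             mapped_table="gg2.mapped-table.qza",
                             representatives="gg2.mapped-seqs.qza"):
    """Assemble (but do not run) a `qiime greengenes2 non-v4-16s` call.

    Flag names are assumed from the plugin help; output names are placeholders.
    """
    return [
        "qiime", "greengenes2", "non-v4-16s",
        "--i-table", table,
        "--i-sequences", sequences,
        "--i-backbone", backbone,
        "--o-mapped-table", mapped_table,
        "--o-representatives", representatives,
    ]


cmd = build_non_v4_16s_command(
    table="table.qza",          # hypothetical: your V5-V7 feature table
    sequences="rep-seqs.qza",   # hypothetical: your representative sequences
    backbone="2022.10.backbone.full-length.fna.qza",
)
print(shlex.join(cmd))
```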

All the best,
Daniel

8 off-topic replies have been split into a new topic: Command not found with redbiom

An off-topic reply has been split into a new topic: Greengenes Taxonomic Naming Schema: Letters and Numbers

An off-topic reply has been split into a new topic: installing Greengenes2 in a minimal environment

Hello @wasade - Very helpful information! I have a couple of questions. I have paired-end human stool data (processed with DADA2) that is good quality through 250 nt.
Do you know if there is any advantage to using single-end versus paired-end data (i.e. the filter-features versus the non-v4-16s approach)? You also mention trimming to 150 nt (in the Deblur/DADA2 section): is that a recommendation for filter-features and/or non-v4-16s? Thanks!

Hi @m_s,

If the sequences were generated using 515F-806R EMP primers, then you could trim them to 150nt and filter-features. If you'd prefer to keep the full length, then you'd need to use non-v4-16s.
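
For the 150nt route, a filter-features call might look roughly like the one assembled by this Python sketch, which only prints the command and does not run it. The flag names are from my recollection of the plugin's help text, and the reference file name 2022.10.taxonomy.asv.nwk.qza and the table names are assumptions; please verify them against `qiime greengenes2 filter-features --help` and the release listing.

```python
import shlex


def build_filter_features_command(table, reference,
                                  filtered_table="table.gg2-filtered.qza"):
    """Assemble (but do not run) a `qiime greengenes2 filter-features` call.

    Flag names are assumed from the plugin help; the default output name is
    a placeholder.
    """
    return [
        "qiime", "greengenes2", "filter-features",
        "--i-feature-table", table,
        "--i-reference", reference,
        "--o-filtered-feature-table", filtered_table,
    ]


filter_cmd = build_filter_features_command(
    table="table.150nt.qza",                   # hypothetical: table of 150nt ASVs
    reference="2022.10.taxonomy.asv.nwk.qza",  # assumed reference file name
)
print(shlex.join(filter_cmd))
```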

I'm unaware of literature that has independently benchmarked the various read stitching strategies. In my own analyses, I only use the fwd read from the EMP primers. Most of the taxonomic and phylogenetic signal is proximal to 515F as well, which is why studies like Yatsunenko et al 2012 Nature, which used 90 cycles if I recall correctly, still were quite exciting and compelling. In fact, quite a few of the analyses in the Thompson et al 2017 EMP paper were at 90nt too.

Best,
Daniel

Hi Daniel, thank you for this resource! Can you provide brief instructions on how to use this database outside of QIIME? For instance, I'd prefer to use Kraken2, and I have both 16S and shotgun sequencing. I presume I need the 16S sequence database, the whole-genome sequence database, and the shared taxonomy, but I can't immediately tell which files these correspond to, since there are many files in the FTP repository with similar descriptions.

Hi @John_McElderry,

For shotgun, we recommend using the Woltka toolkit. The genome identifiers in the database are relative to the Web of Life version 2. It is possible Kraken2 will work although we haven't evaluated that. The exact commands we use are buried in here; as an alternative, I would encourage considering depositing data into Qiita as that resource will take care of the compute.

Best,
Daniel

An off-topic reply has been split into a new topic: Importance of using consistance qiime2 versions with classifiers
