Introducing Greengenes2 2022.10

Uni · February 27, 2024, 8:27am

I have a simple question!

Should I use sequences that are mapped to the greengenes2 backbone (using the non-v4-16s function) as the input sequences for a Naive Bayes classifier? This classifier is trained on the V3-V4 regions of greengenes2.

Though I've been able to acquire taxonomic information using unmapped sequences, it seems that there's no method to obtain corresponding (with matching ASV names) phylogenetic information.

wasade · February 27, 2024, 8:24pm

Hi @Uni,

You could just use the taxonomy of the backbone records. We didn't place V3-V4 ASVs in the phylogeny so there won't be existing coordinates for them. We are working on a way to do the placement for arbitrary fragments but it isn't available just yet.

Best,
Daniel

Uni · February 28, 2024, 7:56am

Thank you for your reply.

So, if I use the V3-V4 ASVs, should I not use the provided phylogeny file (2022.10.phylogeny.asv.nwk.qza) for phylogenetic analysis?

Are there alternative methods for obtaining phylogeny [roots] when using V3-V4 ASVs with the Greengenes 2 database?

SoilRotifer · March 6, 2024, 4:29pm

An off-topic reply has been split into a new topic: How can I specifically built a functional abundance table with picrust2?

Please keep replies on-topic in the future.

wasade · March 6, 2024, 4:10pm

Hi @uni,

For ASVs not based on 515f-806r, I would for now recommend using the non-v4-16s action, which performs closed-reference OTU picking against the backbone. This would allow use of the phylogeny.

Best,
Daniel

Alvaro_Lopez-Valinas · April 23, 2024, 4:18pm

Hey there,

First off, big thanks to everyone for your input, especially Daniel.

I've been diving into the analysis of my V3-V4 region with Greengenes2 and stumbled upon some divergent results depending on whether I classify directly using qiime greengenes2 filter-features or utilize the pre-trained classifier.

Here's the breakdown:

Using the command:

qiime greengenes2 non-v4-16s \
    --i-table dada2_files/filtered-table.qza \
    --i-sequences dada2_files/filtered-seqs.qza \
    --i-backbone /Volumes/Microbiota/MICROBIOTA_DATABASES/gg2/2022.10.backbone.full-length.fna.qza \
    --p-threads 7 \
    --o-mapped-table table-gg2.biom.qza \
    --o-representatives rep-seqs-gg2.fna.qza

qiime greengenes2 taxonomy-from-table \
    --i-reference-taxonomy /Volumes/Microbiota/MICROBIOTA_DATABASES/gg2/2022.10.taxonomy.asv.nwk.qza \
    --i-table table-gg2.biom.qza  \
    --o-classification taxonomy.qza

RESULT: 4644 entries (taxonomy.qzv).

Meanwhile, when I run:

qiime feature-classifier classify-sklearn \
  --i-classifier /Volumes/Microbiota/MICROBIOTA_DATABASES/gg2/2022.10.backbone.v4.nb.qza \
  --i-reads dada2_files/filtered-seqs.qza \
  --o-classification taxonomy.qza \
  --p-n-jobs -1

qiime tools export \
  --input-path taxonomy.qza \
  --output-path exported_taxonomy

RESULT: 26618 entries (taxonomy.qzv). This outcome aligns more closely with results I get when using the SILVA database.

How can there be such a discrepancy when we're supposedly employing the same database for taxonomic classification? Is there something I'm misunderstanding?

What's even more curious is that when I generate relative frequency tables at level 7, I get 2616 entries in the first case, compared to 1908 in the second case.

Thank you so much in advance.

wasade · April 23, 2024, 8:27pm

Hey @Alvaro_Lopez-Valinas,

Thanks for the kind words, and thanks for reaching out!

In the non-v4-16s case, it's performing closed reference recruitment with q2-vsearch behind the scenes. Different ASVs could recruit to the same backbone, and some ASVs may not recruit at all. In the classify-sklearn example, the ASVs themselves are being classified, where similar (but different) ASVs may each get a label, and it can potentially characterize ASVs which are too divergent to recruit to the backbone.
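A toy sketch of why the two routes can report different feature counts. The ASV names, backbone IDs, and labels below are made up for illustration; the code does not call q2-vsearch or the classifier, it only mimics the bookkeeping described above.

```python
from collections import defaultdict

# Hypothetical closed-reference hits: ASV -> backbone ID (None = did not recruit).
hits = {
    "asv1": "bb_A",
    "asv2": "bb_A",   # a second ASV recruiting to the same backbone record
    "asv3": "bb_B",
    "asv4": None,     # too divergent to recruit, so absent from the mapped table
}

# Hypothetical per-ASV classifier output: every ASV keeps its own entry.
nb_labels = {
    "asv1": "g__X; s__X a",
    "asv2": "g__X; s__X b",
    "asv3": "g__Y; s__Y c",
    "asv4": "g__Z; s__",
}


def closed_reference_features(hits):
    """Group ASVs by the backbone record they recruited to."""
    grouped = defaultdict(list)
    for asv, backbone in hits.items():
        if backbone is not None:
            grouped[backbone].append(asv)
    return dict(grouped)


mapped = closed_reference_features(hits)
print(len(mapped), "features after closed-reference recruitment")
print(len(nb_labels), "features after per-ASV classification")
```

In this toy case four ASVs become two backbone features in the first route, while the second route keeps all four entries.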

Does that make sense?

Best,
Daniel

Brintha · May 1, 2024, 3:16pm

Hi @John_McElderry,

I am also looking for a similar thing using Kraken2. For now I have extracted the V3-V4 region from the Greengenes2 sequences and tried to build the database using Kraken2. Though the database build succeeds, none of the sequences are classified. Any thoughts on this? Did you face a similar issue?

wasade · May 1, 2024, 3:34pm

Hi @Brintha

Does the comment here resolve your questions? I see the issue opened on the Github tracker as well -- if there are further questions unrelated to QIIME 2, it would help to have them directed to Github.

All the best,
Daniel

Brintha · May 2, 2024, 4:24am

Thank you @wasade. I will look into Qiita.

iptz1 · June 21, 2024, 5:22pm

Hello again. Sorry for a new question on the same topic, but I just want to be 100% sure I am doing things right! Is the following code correct for training a classifier on V3-V4 paired-end reads? Or do I have to use the pre-trained classifier on the full-length data?

qiime feature-classifier extract-reads \
--i-sequences 2022.10.backbone.full-length.fna.qza \
--p-f-primer CCTACGGGNBGCASCAG \
--p-r-primer GACTACNVGGGTATCTAATCC \
--p-read-orientation both \
--p-min-length 399 \
--p-max-length 430 \
--o-reads ref-seqs-classifier_GG2.qza \
--p-n-jobs 16

qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ref-seqs-classifier_GG2.qza \
--i-reference-taxonomy 2022.10.backbone.tax.qza \
--o-classifier V3V4_GG2.qza

And also: if using the non-v4-16s action, can I give the table and sequences obtained from qiime dada2 denoise-paired as input to qiime greengenes2 non-v4-16s? Would this method be preferable or not?

Thank you again for your time and kind suggestions!

wasade · June 25, 2024, 4:37pm

Hi @iptz1,

I'm not familiar with the --p-min-length and --p-max-length options, but from looking at the help text, I would probably relax them further. The coordinates for your primers are probably relative to E. coli, and it is plausible there may be variation in length in that region, but that is just a guess. That said, on the surface the commands seem reasonable. The commands used to construct the V4 classifier can be found here.
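One rough way to see the effect of a length window is to check how many extracted amplicons it would keep. This is only a sketch: the lengths are invented rather than taken from Greengenes2, the helper name is hypothetical, the bounds are assumed to be inclusive, and the 350-500 window is an arbitrary illustration, not a recommendation.

```python
def fraction_retained(lengths, min_length, max_length):
    """Fraction of amplicon lengths inside [min_length, max_length] (inclusive)."""
    if not lengths:
        return 0.0
    kept = sum(1 for n in lengths if min_length <= n <= max_length)
    return kept / len(lengths)


# Hypothetical lengths of extracted V3-V4 reference amplicons (not real data).
lengths = [380, 395, 402, 405, 410, 415, 428, 431, 440, 455]

strict = fraction_retained(lengths, 399, 430)
relaxed = fraction_retained(lengths, 350, 500)
print(f"399-430: {strict:.0%} kept; 350-500: {relaxed:.0%} kept")
```

With real data, the lengths could come from the extracted reference reads, and a window that discards a large share of them would be a hint that it is too tight.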

The non-v4-16s action is a thin wrapper around q2-vsearch's closed-reference clustering action, so providing stitched reads from DADA2 would work.

All the best,
Daniel

arwqiime · August 6, 2024, 2:30pm

Hello @wasade
Thank you for the GG2 release last year.
I have used the pretrained 515F-806R classifier for 16S-V4 amplicons from lichen samples. The primers target prokaryotic organisms in this microhabitat, including 16S regions of the chloroplast genome.

The SILVA 138.1 classifier reports these features as (only one example given):
f577632a80f935428b6c9117d8075eb3 d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast; f__Chloroplast; g__Chloroplast

while the gg2 classifier reports it as (same feature as above):
f577632a80f935428b6c9117d8075eb3 d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__Cyanobacteriales

The overall classification into the p__Cyanobacteria phylum is very similar with the two classifiers (76 features using silva, 72 features using gg2).
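For a side-by-side comparison of the two labels, a small sketch that finds the deepest rank on which the lineages agree. The two strings are the ones quoted above; the helper names are hypothetical, and the parsing only assumes ';'-separated ranks with optional spaces.

```python
def parse_lineage(lineage):
    """Split a lineage string on ';' and strip whitespace around each rank."""
    return [rank.strip() for rank in lineage.split(";") if rank.strip()]


def shared_prefix(a, b):
    """Return the leading ranks on which two parsed lineages agree."""
    shared = []
    for x, y in zip(a, b):
        if x != y:
            break
        shared.append(x)
    return shared


silva = ("d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; "
         "o__Chloroplast; f__Chloroplast; g__Chloroplast")
gg2 = "d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__Cyanobacteriales"

common = shared_prefix(parse_lineage(silva), parse_lineage(gg2))
print("deepest shared rank:", common[-1])
```

For this feature the two classifiers agree through class and diverge at the order level.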

As the majority of these features are classified as g__Chloroplast using silva classifier, we assume that they originate from algal plastids present in these samples (there is always a very dominating feature originating most likely from the main algae partner of the lichens).

How should I interpret the gg2 taxonomy of these features? I have not seen any 'chloroplast' annotation in the gg2 taxonomy. Are these sequences not present, or were they assigned using a more sophisticated taxonomy rule or naming scheme?

Thank you for your comments!
Best,

wasade · August 26, 2024, 11:27pm

Hi @arwqiime,

We included the SILVA set of chloroplast sequences in Greengenes2 during DEPP placement, but not in the topology update step with uDance, out of concern that the sequences (which differ appreciably from bacterial/archaeal 16S) would have detrimental effects. The taxonomy decoration phase was based on the records in the backbone topology and those used for topology updates, but not the records used for placement. As a result, the taxonomy decoration did not explicitly include the records sourced from SILVA, leading to a deficient taxonomy for chloroplast. I'm actively working on an update for this, and I apologize for any inconvenience.

Best,
Daniel

Ruitao_Liu · September 23, 2024, 5:50pm

Hi,
When I ran greengenes2 on the output of Woltka, where the feature IDs look like "G000005825", I got an error saying there are no matching labels for this format. How can I use the output of Woltka in greengenes2?

Thanks!

wasade · September 24, 2024, 3:23pm

Hi @Ruitao_Liu,

That feature is present. Using a woltka OGU table mapped against WoLr2 is essentially qiime greengenes2 filter-features ...

Best,
Daniel

arwqiime · October 2, 2024, 8:31am

Hi @wasade
Thank you for working on an update (sorry for the delayed reply, I was out of country).
Will you post a short notice here if you are done with it?
Best,

wasade · October 8, 2024, 8:09pm

@arwqiime, yup! Please see the announcement.

Alyssa_Kaganer · October 11, 2024, 8:10pm

Hello!

I'm reaching out to see if FeatureMap functionality is now available in q2-types to address the challenge raised above re: seeking gg2-- label_ Map.qza for relabeling? I've searched through the forum & am struggling to find instructions on how to apply this fix.

Many thanks for any help/clarification you can share!

Cheers,
A

wasade · October 11, 2024, 10:01pm

Hi @Alyssa_Kaganer,

That issue in q2-vsearch appears to still be open.

Best,
Daniel