Introducing Greengenes2 2022.10

I have a simple question!

Should I use sequences that are mapped to the Greengenes2 backbone (using the non-v4-16s action) as the input sequences for a Naive Bayes classifier? This classifier is trained on the V3-V4 region of Greengenes2.

Though I've been able to acquire taxonomic information using unmapped sequences, it seems that there's no method to obtain corresponding (with matching ASV names) phylogenetic information.

Hi @Uni,

You could just use the taxonomy of the backbone records. We didn't place V3-V4 ASVs in the phylogeny so there won't be existing coordinates for them. We are working on a way to do the placement for arbitrary fragments but it isn't available just yet.

Best,
Daniel

Thank you for your reply.

So, if I use the V3-V4 ASVs, should I not use the provided phylogeny file (2022.10.phylogeny.asv.nwk.qza) for phylogenetic analysis?

Are there alternative methods for obtaining phylogeny [roots] when using V3-V4 ASVs with the Greengenes 2 database?

Hi @uni,

For ASVs not based on 515f-806r, I would recommend, for now, using the non-v4-16s action, which performs closed-reference OTU picking against the backbone. This would allow use of the phylogeny.

Best,
Daniel

Hey there,

First off, big thanks to everyone for your input, especially Daniel.

I've been diving into the analysis of my V3-V4 region with Greengenes2 and stumbled upon some divergent results depending on whether I classify directly using qiime greengenes2 filter-features or utilize the pre-trained classifier.

Here's the breakdown:

Using the command:

qiime greengenes2 non-v4-16s \
    --i-table dada2_files/filtered-table.qza \
    --i-sequences dada2_files/filtered-seqs.qza \
    --i-backbone /Volumes/Microbiota/MICROBIOTA_DATABASES/gg2/2022.10.backbone.full-length.fna.qza \
    --p-threads 7 \
    --o-mapped-table table-gg2.biom.qza \
    --o-representatives rep-seqs-gg2.fna.qza

qiime greengenes2 taxonomy-from-table \
    --i-reference-taxonomy /Volumes/Microbiota/MICROBIOTA_DATABASES/gg2/2022.10.taxonomy.asv.nwk.qza \
    --i-table table-gg2.biom.qza \
    --o-classification taxonomy.qza

RESULT: 4644 entries (taxonomy.qzv).

Meanwhile, when I run:

qiime feature-classifier classify-sklearn \
  --i-classifier /Volumes/Microbiota/MICROBIOTA_DATABASES/gg2/2022.10.backbone.v4.nb.qza \
  --i-reads dada2_files/filtered-seqs.qza \
  --o-classification taxonomy.qza \
  --p-n-jobs -1

qiime tools export \
  --input-path taxonomy.qza \
  --output-path exported_taxonomy

RESULT: 26618 entries (taxonomy.qzv). This outcome aligns more closely with results I get when using the SILVA database.

How can there be such a discrepancy when we're supposedly employing the same database for taxonomic classification? Is there something I'm misunderstanding?

What's even more curious is that when I generate relative frequency tables at level 7, I get 2616 entries in the first case, compared to 1908 in the second case.

Thank you so much in advance.

Hey @Alvaro_Lopez-Valinas,

Thanks for the kind words, and thanks for reaching out!

In the non-v4-16s case, it's performing closed reference recruitment with q2-vsearch behind the scenes. Different ASVs could recruit to the same backbone, and some ASVs may not recruit at all. In the classify-sklearn example, the ASVs themselves are being classified, where similar (but different) ASVs may each get a label, and it can potentially characterize ASVs which are too divergent to recruit to the backbone.
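To make the difference concrete, here is a toy Python sketch. It is not the plugin's code, and the ASV names, backbone IDs, and hit assignments are all made up for illustration. It only shows why recruitment tends to give fewer features: ASVs that hit the same backbone record are summed into one feature, and ASVs without a hit are dropped, whereas per-ASV classification keeps one row per ASV.

```python
from collections import defaultdict

# Hypothetical per-ASV read counts (made-up names and numbers).
asv_counts = {"asv1": 10, "asv2": 5, "asv3": 7, "asv4": 2}

# Hypothetical closed-reference hits: ASV -> backbone record ID.
# None means the ASV was too divergent to recruit to any backbone record.
hits = {"asv1": "bb_A", "asv2": "bb_A", "asv3": "bb_B", "asv4": None}


def closed_reference_collapse(counts, hits):
    """Sum counts of ASVs that hit the same backbone record; drop misses."""
    collapsed = defaultdict(int)
    for asv, count in counts.items():
        target = hits.get(asv)
        if target is not None:
            collapsed[target] += count
    return dict(collapsed)


mapped = closed_reference_collapse(asv_counts, hits)
print(len(asv_counts), "ASVs when each ASV is classified individually")
print(len(mapped), "backbone features after recruitment:", mapped)
```

In this toy case four ASVs become two backbone features, which is the same kind of gap as the one between the two taxonomy.qzv entry counts above.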

Does that make sense?

Best,
Daniel

Hi @John_McElderry,

I am also looking for something similar using Kraken2. For now I have extracted the V3-V4 region from the Greengenes2 sequences and tried to build the database using Kraken2. Although the database build succeeds, none of the sequences are classified. Any thoughts on this? Did you face a similar issue?

Hi @Brintha

Does the comment here resolve your questions? I see the issue opened on the GitHub tracker as well -- if there are further questions unrelated to QIIME 2, it would help to have them directed to GitHub.

All the best,
Daniel

Thank you @wasade. I will look into Qiita.

Hello again. Sorry for a new question on the same topic, but I just want to be 100% sure I'm doing things right! Is the following code correct for training a classifier on V3-V4 paired-end reads? Or do I have to use the pre-trained classifier on the full-length data?

qiime feature-classifier extract-reads \
--i-sequences 2022.10.backbone.full-length.fna.qza \
--p-f-primer CCTACGGGNBGCASCAG \
--p-r-primer GACTACNVGGGTATCTAATCC \
--p-read-orientation both \
--p-min-length 399 \
--p-max-length 430 \
--o-reads ref-seqs-classifier_GG2.qza \
--p-n-jobs 16

qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ref-seqs-classifier_GG2.qza \
--i-reference-taxonomy 2022.10.backbone.tax.qza \
--o-classifier V3V4_GG2.qza

And also: if using the non-v4-16s action, can I give the table and sequences obtained from qiime dada2 denoise-paired as input to qiime greengenes2 non-v4-16s? Would this method be preferable or not?

Thank you again for your time and kind suggestions!

Hi @iptz1,

I'm not familiar with the --p-min-length and --p-max-length options, but from looking at the help text, I probably would relax them further. The coordinates for your primers are probably relative to E. coli and it is plausible there may be variation in length in that region, but that is just a guess. That said, on the surface the commands seem reasonable. The commands used to construct the V4 classifier can be found here.
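One way to choose less restrictive bounds is to look at the amplicon lengths before filtering on them. Below is a minimal Python sketch, assuming the extracted reads have been exported to a FASTA file; the function names and the toy records are hypothetical and only illustrate the check.

```python
from collections import Counter
import io


def read_fasta_lengths(handle):
    """Return the length of every sequence in a FASTA-formatted handle."""
    lengths = []
    current = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts; store the previous record's length.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def fraction_kept(lengths, min_length, max_length):
    """Fraction of sequences with min_length <= length <= max_length."""
    if not lengths:
        return 0.0
    kept = sum(min_length <= n <= max_length for n in lengths)
    return kept / len(lengths)


# Toy records standing in for exported amplicons (made-up sequences).
toy = io.StringIO(">r1\nACGTACGT\n>r2\nACGT\nAC\n>r3\nACGTACGTACGT\n")
lengths = read_fasta_lengths(toy)
print(Counter(lengths))
print(fraction_kept(lengths, 6, 8))
```

With real data, open the exported FASTA file instead of the toy handle and compare the fraction kept at 399-430 against wider bounds.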

The non-v4-16s action is a thin wrapper around q2-vsearch's cluster closed reference action -- providing stitched reads from DADA2 would work.

All the best,
Daniel

Hello @wasade
Thank you for the GG2 release last year.
I have used the pretrained 515F-806R classifier for 16S-V4 amplicons from lichen samples. The primers target prokaryotic organisms in this microhabitat, including 16S regions of the chloroplast genome.

The SILVA 138.1 classifier reports these features as (only one example given):
f577632a80f935428b6c9117d8075eb3 d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast; f__Chloroplast; g__Chloroplast

while the gg2 classifier reports it as (same feature as above):
f577632a80f935428b6c9117d8075eb3 d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__Cyanobacteriales

The overall classification into the p__Cyanobacteria phylum is very similar with the two classifiers (76 features using SILVA, 72 features using gg2).

As the majority of these features are classified as g__Chloroplast using the SILVA classifier, we assume that they originate from algal plastids present in these samples (there is always one strongly dominant feature, most likely originating from the main algal partner of the lichens).

How should I interpret the gg2 taxonomy of these features? I have not seen any 'chloroplast' annotation in the gg2 taxonomy. Are these sequences not present, or were they assigned using a more sophisticated taxonomy rule or naming scheme?

Thank you for your comments!
Best,

Hi @arwqiime,

We included the SILVA set of chloroplast sequences in Greengenes2 during DEPP placement, but not in the topology update step with uDance, out of concern that the sequences (which differ appreciably from bacterial/archaeal 16S) would have detrimental effects. The taxonomy decoration phase was based on the records in the backbone topology, and those used for topology updates, but not the records used for placement. As a result, the taxonomy decoration did not explicitly include the records sourced from SILVA, leading to a deficient taxonomy for chloroplast. I'm actively working on an update for this, and I apologize for any inconvenience.

Best,
Daniel

Hi,
When I used Greengenes2 with the output of Woltka, whose feature IDs look like "G000005825", I got an error saying there are no matching labels for this format. How can I use the output of Woltka with Greengenes2?

Thanks!

Hi @Ruitao_Liu,

That feature is present. Using a woltka OGU table mapped against WoLr2 is essentially qiime greengenes2 filter-features ...
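If a "no matching labels" style error shows up, a quick generic check is to compare the feature IDs in the table against the IDs in the reference. The Python sketch below is only a diagnostic illustration and not part of the plugin: the function name is hypothetical, "G000005825" is the ID quoted above, and the other IDs are made-up placeholders.

```python
def id_overlap(table_ids, reference_ids):
    """Split table feature IDs into those found in / missing from the reference."""
    reference = set(reference_ids)
    unique_ids = set(table_ids)
    found = sorted(i for i in unique_ids if i in reference)
    missing = sorted(i for i in unique_ids if i not in reference)
    return found, missing


# "G000005825" comes from the question above; the rest are placeholders.
table_ids = ["G000005825", "G999999999"]
reference_ids = ["G000005825", "G888888888"]

found, missing = id_overlap(table_ids, reference_ids)
print("found in reference:", found)
print("missing from reference:", missing)
```

If most IDs come back as missing, the table and the reference are probably using different identifier formats or different reference versions.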

Best,
Daniel

Hi @wasade
Thank you for working on an update (sorry for the delayed reply, I was out of the country).
Will you post a short notice here once you are done with it?
Best,