Introducing Greengenes2 2022.10

An off-topic reply has been split into a new topic: installing Greengenes2 in a minimal environment

Please keep replies on-topic in the future.

Hello @wasade - Very helpful information! I have a couple of questions. I have paired-end human stool data (processed with dada2) that is good quality through 250nt.
Do you know if there is any advantage to using the single-end versus paired-end data (i.e. the filter-features versus non-v4-16s approach)? And you mention trimming to 150nt (in the Deblur/Dada2 section) - is that a recommendation for filter-features and/or non-v4-16s? Thanks!

2 Likes

Hi @m_s,

If the sequences were generated using 515F-806R EMP primers, then you could trim them to 150nt and filter-features. If you'd prefer to keep the full length, then you'd need to use non-v4-16s.
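For reference, the two invocations look roughly like the following (a sketch assuming the 2022.10 release filenames; `<your_table>` and `<your_seqs>` are placeholders):

```shell
# V4 EMP data trimmed to 150nt: keep only ASVs already placed in Greengenes2
qiime greengenes2 filter-features \
    --i-feature-table <your_table> \
    --i-reference 2022.10.taxonomy.asv.nwk.qza \
    --o-filtered-feature-table table.gg2.qza

# Full-length or non-V4 data: closed-reference recruitment against the backbone
qiime greengenes2 non-v4-16s \
    --i-table <your_table> \
    --i-sequences <your_seqs> \
    --i-backbone 2022.10.backbone.full-length.fna.qza \
    --o-mapped-table table.gg2.qza \
    --o-representatives rep-seqs.gg2.qza
```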

I'm unaware of literature that has independently benchmarked the various read stitching strategies. In my own analyses, I only use the fwd read from the EMP primers. Most of the taxonomic and phylogenetic signal is proximal to 515F as well, which is why studies like Yatsunenko et al 2012 Nature, which used 90 cycles if I recall correctly, still were quite exciting and compelling. In fact, quite a few of the analyses in the Thompson et al 2017 EMP paper were at 90nt too.
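For readers wanting to follow the forward-read-only approach, denoising just the fwd reads at 150nt can be sketched as below (filenames are placeholders; the truncation length is the only setting specific to this discussion):

```shell
qiime dada2 denoise-single \
    --i-demultiplexed-seqs demux.qza \
    --p-trim-left 0 \
    --p-trunc-len 150 \
    --o-table table.qza \
    --o-representative-sequences rep-seqs.qza \
    --o-denoising-stats stats.qza
```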

Best,
Daniel

3 Likes

Hi Daniel, thank you for this resource! Can you provide a brief instruction on how to use this database outside of QIIME? For instance, I'd prefer to use Kraken2 and I have both 16s and shotgun sequencing. I presume I need the 16s sequence database, the whole-genome sequence database, and the shared taxonomy, but I can't immediately tell which files these correspond to since there are many files in the FTP repository with similar descriptions.

2 Likes

Hi @John_McElderry,

For shotgun, we recommend using the Woltka toolkit. The genome identifiers in the database are relative to the Web of Life version 2. It is possible Kraken2 will work although we haven't evaluated that. The exact commands we use are buried in here; as an alternative, I would encourage considering depositing data into Qiita as that resource will take care of the compute.
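As a rough sketch of the Woltka route (the directory name is hypothetical, and the exact flags should be checked against the Woltka documentation), a per-genome feature table is built from alignments against the Web of Life reference with something like:

```shell
# Build a feature table from shotgun alignments against WoL genomes
woltka classify \
    --input alignments/ \
    --output ogu_table.biom
```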

Best,
Daniel

An off-topic reply has been split into a new topic: Importance of using consistent qiime2 versions with classifiers

Please keep replies on-topic in the future.

@wasade
Hi!
Where can we obtain this file gg2-<version>-label_map.qza?

qiime greengenes2 relabel \
    --i-feature-table <your_feature_table> \
    --i-reference-label-map gg2-<version>-label_map.qza \
    --p-as-md5 \
    --o-relabeled-table <the_relabeled_table>
2 Likes

Hi @liang_zhou,

That action was created in preparation for an artifact type which I accidentally didn't export! Could you create an issue?

For the near term, what data do you need to relabel?

Best,
Daniel

1 Like

@wasade Thanks!

Here are my commands and the corresponding input/output files.

qiime greengenes2 non-v4-16s \
    --i-table table_filtered.qza \
    --i-sequences rep_seqs_filtered.qza \
    --i-backbone ../2022.10.backbone.full-length.fna.qza \
    --o-mapped-table table_filtered.gg2.biom.qza \
    --o-representatives rep_seqs_filtered.gg2.fna.qza

qiime greengenes2 taxonomy-from-table \
    --i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza \
    --i-table table_filtered.gg2.biom.qza \
    --o-classification table_filtered_gg2.taxonomy.qza

qiime greengenes2 taxonomy-from-features \
    --i-reference-taxonomy ../2022.10.taxonomy.asv.nwk.qza \
    --i-reads rep_seqs_filtered.gg2.fna.qza \
    --o-classification rep_seqs_filtered_gg2.taxonomy.qza

table_filtered.qza (52.2 KB)
rep_seqs_filtered.qza (423.8 KB)
table_filtered.gg2.biom.qza (45.3 KB)
rep_seqs_filtered_gg2.taxonomy.qza (48.8 KB)
table_filtered_gg2.taxonomy.qza (49.1 KB)
table_filtered.zip (24.0 KB)
table_filtered.gg2.zip (4.3 KB)

My question is:

When I use Greengenes2 to taxonomically characterize my feature tables, the ASV IDs in my feature tables become the record IDs of Greengenes2.

Can I convert the record IDs back to ASVs through the "--p-as-asv" parameter of qiime greengenes2 relabel? If so, how can I do it?

1 Like

Hi @liang_zhou,

Thank you for the detail on the commands run!

When using non-v4-16s, the plugin executes the q2-vsearch closed reference OTU picking pipeline against the backbone sequences. The features expressed in the resulting table will be a subset of the backbone. The exact code applied is here.
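A minimal sketch of the underlying action (the parameter values shown, including the identity threshold, are assumptions rather than the plugin's exact settings):

```shell
qiime vsearch cluster-features-closed-reference \
    --i-table table.qza \
    --i-sequences rep-seqs.qza \
    --i-reference-sequences 2022.10.backbone.full-length.fna.qza \
    --p-perc-identity 0.99 \
    --o-clustered-table table.gg2.qza \
    --o-clustered-sequences rep-seqs.gg2.qza \
    --o-unmatched-sequences unmatched.qza
```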

With 16S V4 ASVs, if using filter-features, the resulting features will be a subset of the V4 ASVs that have previously been placed into Greengenes2.

I'm not aware of a means to do that through QIIME 2 right now. The q2-vsearch's cluster_features_closed_reference action does not appear to return the UC mapping that describes the query / subject relationship. That mapping is necessary to determine what ASVs recruit to which backbone records.
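For readers working outside QIIME 2, raw vsearch can emit that mapping via its --uc output, and the file is simple to parse. A self-contained sketch, using a fabricated two-line .uc file with made-up identifiers:

```shell
# Fabricate a tiny .uc file: 'H' rows are hits (column 9 = query, column 10 = subject),
# 'N' rows are queries with no hit.
printf 'H\t0\t150\t99.3\t+\t0\t0\t150M\tASV_a\tG000006605\nN\t1\t150\t*\t*\t*\t*\t*\tASV_b\t*\n' > demo.uc

# Emit an ASV -> backbone-record map from the hit rows
awk -F'\t' '$1 == "H" { print $9 "\t" $10 }' demo.uc
```

which prints `ASV_a	G000006605`, i.e. the query/subject relationship the relabeling would need.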

@gregcaporaso I don't think there is a mechanism right now with q2-vsearch to obtain the mapping detail. Would that be valuable? If so I'll open an issue to track it

Best,
Daniel

@wasade, you're right, there isn't right now. We have a type, FeatureMap, in q2-types-genomics intended for this type of mapping. We should move that over to q2-types so it's more generally accessible, and then use that for this purpose.

3 Likes

Thanks, @gregcaporaso! Created an issue on q2-types and q2-vsearch related to FeatureMap

2 Likes

@wasade @gregcaporaso Your help was very much appreciated.

2 Likes

Hello Daniel,

I am currently utilizing the "non-v4-16s" workflow and have encountered several challenges:

Unclassified Sequences: I've identified a number of sequences that remain unclassified within the workflow. If the IDs of these sequences are altered, how might I trace the unclassified species?

Feature Loss: I'm analyzing multiple regions and have observed a significant reduction in features, especially in regions like V1-V3 and V5-V7. While regions V3-V4 and V4 have a considerable number of classified ASVs/OTUs, they also appear to suffer from a loss of ASVs. The "non-v4-16s" workflow tends to classify at higher taxonomic levels. Could this be related to the feature loss I'm witnessing?

In terms of classifiers, would you advocate for the deployment of the full-length greengenes2 pre-constructed classifier when assessing features across diverse regions? Or would the extract-sequences approach with appropriate primers be more judicious? From the article, I've discerned that the Naive Bayes classifier exhibits performance analogous to the phylogenetic classifier up to the genus level. Have you compared the classification performance of the greengenes2 pre-constructed classifiers against pre-constructed classifiers from other databases, such as the SILVA db?

I deeply appreciate your guidance on these matters.

Warm regards,
Benjamin Andres.

1 Like

Hi Benjamin,

Would you be able to share some of the sequences that remain unclassified, and the associated primer regions used for them? And what environment is being examined?

With loss, I would not assume that all variable regions are on an equal basis as the extent of conservation and variability can differ significantly. Additionally, this is through the lens of the exact primers used, and the intrinsic biases they introduce. non-v4-16s is just a wrapper around the cluster-features-closed-reference action of q2-vsearch, so implications there extend to Greengenes2. My guess is the backbone records picked are previously uncharacterized 16S from the operon datasets.

We do provide trained NB models on the full length data, which can be found on the Data Resources page. I'm unaware of any existing work that has compared integrating region specific data, but it is possible that using a fixed model could reduce technical differences.
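The region-specific alternative mentioned above can be sketched like this (the primer sequences shown are the EMP 515F/806R pair from the earlier discussion; the filenames and the taxonomy artifact are placeholders, and fit-classifier-naive-bayes expects a FeatureData[Taxonomy] input):

```shell
# Trim the full-length backbone to the amplified region
qiime feature-classifier extract-reads \
    --i-sequences 2022.10.backbone.full-length.fna.qza \
    --p-f-primer GTGYCAGCMGCCGCGGTAA \
    --p-r-primer GGACTACNVGGGTWTCTAAT \
    --o-reads backbone.region.qza

# Train a Naive Bayes classifier on the extracted region
qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads backbone.region.qza \
    --i-reference-taxonomy <taxonomy_artifact> \
    --o-classifier region.nb.qza
```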

In our NBT article, we did compare against the pretrained SILVA QIIME 2 NB model (see fig2). Note that we omitted polyphyletic labels for SILVA in the comparison as SILVA does not have them. Extending the comparison further is complex as comparing taxonomies themselves is difficult.

Best,
Daniel

Hi @wasade,
Using Greengenes2 data in two different ways (for V3-V4) results in large differences in predicted ASVs. Can you suggest what is happening and which one might be better?

Method 1: using the gg2 full-length nb classifier directly:

qiime feature-classifier classify-sklearn \
    --i-classifier 2022.10.backbone.v4.nb.qza \
    --i-reads rep-seqs-dada2.qza \
    --o-classification taxonomy.qza

Method 2: using the gg2 plugin:

qiime greengenes2 non-v4-16s \
    --i-table table-dada2.qza \
    --i-sequences rep-seqs-dada2.qza \
    --i-backbone 2022.10.backbone.full-length.fna.qza \
    --o-mapped-table table_filtered.gg2.biom.qza \
    --o-representatives rep_seqs_filtered.gg2.fna.qza

qiime greengenes2 taxonomy-from-table \
    --i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza \
    --i-table table_filtered.gg2.biom.qza \
    --o-classification table_filtered_gg2.taxonomy.qza

1 Like

Hi @Hitesh_Tikariha,

Thank you for the question! In method 1, the ASVs themselves are being classified. In method 2, closed-reference OTU clustering against the backbone is performed through vsearch, which reduces the ASVs to observed members of the backbone. I'm assuming the differences observed are in the count of features remaining, and this would explain that.

We do not have data right now, that I'm aware of, to suggest whether one approach for your particular use is better than the other.

All the best,
Daniel

Does it mean that method 1 generates ASVs while method 2 generates OTUs?

2 Likes

Hi @Hitesh_Tikariha,

Method 1 retains the ASV feature space, whereas method 2 effectively expresses the data as OTUs.

Best,
Daniel

Hi!

Is there any way I can retrieve the number of sequences that have been correctly assigned to non-v4 regions with the greengenes2 non-v4-16s tool? I want to optimize my sequencing protocol so that I don't sequence redundant information all over again.

Thanks.