Introducing Greengenes2 2022.10

Hello Daniel,

I am currently utilizing the "non-v4-16s" workflow and have encountered several challenges:

Unclassified Sequences: I've identified a number of sequences that remain unclassified within the workflow. Should there be any ID alterations for these sequences, how might I trace the unclassified species?

Feature Loss: I'm analyzing multiple regions and have observed a significant reduction in features, especially in regions like V1-V3 and V5-V7. While regions V3-V4 and V4 have a considerable number of classified ASVs/OTUs, they also appear to suffer from a loss of ASVs. The "non-v4-16s" workflow tends to classify at higher taxonomic levels. Could this be related to the feature loss I'm seeing?

In terms of classifiers, would you recommend using the full-length Greengenes2 pre-constructed classifier when assessing features across diverse regions? Or would the extract-sequences approach with appropriate primers be more appropriate? From the article, I understand that the Naive Bayes classifier performs comparably to the phylogenetic classifier up to the genus level. Have you compared the classification performance of the Greengenes2 pre-constructed classifiers against pre-constructed classifiers from other databases, such as SILVA?

I deeply appreciate your guidance on these matters.

Warm regards,
Benjamin Andres.

Hi Benjamin,

Would you be able to share some of the sequences that remain unclassified, and the associated primer regions used for them? And what environment is being examined?

Regarding feature loss, I would not assume that all variable regions are on an equal footing, as the extent of conservation and variability can differ significantly between them. Additionally, this is through the lens of the exact primers used and the intrinsic biases they introduce. The non-v4-16s action is just a wrapper around the cluster-features-closed-reference action of q2-vsearch, so the implications of closed-reference clustering extend to Greengenes2. My guess is that the backbone records picked are previously uncharacterized 16S from the operon datasets.

We do provide trained NB models on the full length data, which can be found on the Data Resources page. I'm unaware of any existing work that has compared integrating region specific data, but it is possible that using a fixed model could reduce technical differences.

In our NBT article, we did compare against the pretrained SILVA QIIME 2 NB model (see Fig. 2). Note that we omitted polyphyletic labels for SILVA in the comparison, as SILVA does not have them. Extending the comparison further is complex, as comparing taxonomies themselves is difficult.

Best,
Daniel

Hi @wasade,
Using Greengenes2 data in two different ways (for V3-V4) results in large differences in the predicted ASVs. Can you suggest what is happening and which approach might be better?

Method 1: using the gg2 full-length NB classifier directly:
qiime feature-classifier classify-sklearn \
  --i-classifier 2022.10.backbone.v4.nb.qza \
  --i-reads rep-seqs-dada2.qza \
  --o-classification taxonomy.qza

Method 2: using the gg2 plugin:
qiime greengenes2 non-v4-16s \
  --i-table table-dada2.qza \
  --i-sequences rep-seqs-dada2.qza \
  --i-backbone 2022.10.backbone.full-length.fna.qza \
  --o-mapped-table table_filtered.gg2.biom.qza \
  --o-representatives rep_seqs_filtered.gg2.fna.qza
qiime greengenes2 taxonomy-from-table \
  --i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza \
  --i-table table_filtered.gg2.biom.qza \
  --o-classification table_filtered_gg2.taxonomy.qza

Hi @Hitesh_Tikariha,

Thank you for the question! In method 1, the ASVs themselves are being classified. In method 2, closed-reference OTU clustering is performed through vsearch, which reduces the ASVs to observed members of the backbone. I'm assuming the differences observed are in the count of features remaining, and this would explain that.
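
To make the difference concrete, here is a minimal sketch of what closed-reference clustering does to a feature table. It is not how vsearch works internally: the `identity` function is a toy position-by-position comparison standing in for real pairwise alignment, and all sequences, IDs, and counts are made up for illustration.

```python
from collections import defaultdict


def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions for equal-length strings.

    vsearch uses real pairwise alignment; this is only a stand-in.
    """
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(asv_counts, asv_seqs, backbone, perc_identity=0.99):
    """Collapse ASV counts onto the best-matching backbone record."""
    otu_counts = defaultdict(int)
    for asv_id, count in asv_counts.items():
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in backbone.items():
            score = identity(asv_seqs[asv_id], ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= perc_identity:
            # Several ASVs can land on the same backbone record.
            otu_counts[best_id] += count
        # ASVs with no hit at the threshold are dropped from the table.
    return dict(otu_counts)


# Hypothetical toy data.
backbone = {"ref1": "ACGTACGTAC", "ref2": "TTTTGGGGCC"}
asv_seqs = {"asv1": "ACGTACGTAC", "asv2": "ACGTACGTAA", "asv3": "GGGGGGGGGG"}
asv_counts = {"asv1": 10, "asv2": 5, "asv3": 7}

collapsed = closed_reference(asv_counts, asv_seqs, backbone, perc_identity=0.9)
print(collapsed)
```

In this toy case three ASVs become one feature: two collapse onto the same backbone record and the third has no hit and is discarded, which is the kind of drop in feature count described above.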

We do not have data right now, that I'm aware of, to suggest whether one approach for your particular use is better than the other.

All the best,
Daniel

Does this mean that method 1 generates ASVs while method 2 generates OTUs?

Hi @Hitesh_Tikariha,

Method 1 retains the ASV feature space, whereas method 2 effectively expresses the data as OTUs.

Best,
Daniel

Hi!

Is there any way to retrieve the number of sequences that have been correctly assigned to non-V4 regions with the greengenes2 non-v4-16s tool? I want to optimize my sequencing protocol so that I don't sequence redundant information all over again.

Thanks.

Hi @sgalera,

The underlying action from q2-vsearch does not yet expose the FeatureMap, see the open issue for further detail.

All the best,
Daniel

Hi everyone!

If I use the qiime greengenes2 taxonomy-from-table command, it uses DEPP to place each sequence into the tree and returns the taxonomy information. Am I right?

Hi @liangyong19491001,

The 2022.10 release has around 20 million V4 ASVs already placed with DEPP. We do not yet have an exposed pipeline for running DEPP independently.

Best,
Daniel

Hi @wasade, I tested the tool using the following commands:

qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path all_manifest.txt \
  --output-path ./all_seqs.qza \
  --input-format SingleEndFastqManifestPhred33

qiime dada2 denoise-single \
  --i-demultiplexed-seqs ./all_seqs.qza \
  --p-trim-left 30 \
  --p-trunc-len 150 \
  --o-representative-sequences all_rep_seqs.qza \
  --o-table all_table.qza \
  --o-denoising-stats all_denoising_stats.qza \
  --p-hashed-feature-ids False \
  --p-n-threads 40

qiime greengenes2 filter-features \
  --i-feature-table all_table.qza \
  --i-reference ../2022.10.taxonomy.asv.nwk.qza \
  --o-filtered-feature-table all_gg2.biom.qza

qiime greengenes2 taxonomy-from-table \
  --i-reference-taxonomy ../2022.10.taxonomy.asv.nwk.qza \
  --i-table all_gg2.biom.qza \
  --o-classification all_icu_gg2.taxonomy.qza

When I checked the result of filter-features, there were no sequences left. Is there anything wrong with my code? The data are 16S V4; I use the forward reads and trim them using --p-trim-left 30 --p-trunc-len 150.

Thank you @wasade!

So, does the command essentially perform a lookup? I input a 16S V4 sequence, it searches for that sequence among the 20M V4 ASVs, and then it returns the taxonomy information? If so, what happens if the input sequence does not exist among the 20M ASVs?

Hi @liangyong19491001,

Most 515F V4 ASVs will start with TAC. If you are following the EMP protocol, the forward primer will not be present. If you are following a variant of the EMP protocol, the forward primer may still be present. If your ASVs do not tend to start with TAC, then the left trim may be incorrect. The filter-features command performs an exact match, so if the ASVs are off by even a single nucleotide, they will not be found.
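
A quick sanity check along these lines can be sketched as below. This is only an illustration under assumptions: the helper names are hypothetical, the sequences are invented and far shorter than real V4 ASVs, and the set lookup merely mimics the exact-match idea rather than reproducing what filter-features does internally.

```python
def fraction_starting_with(seqs, prefix="TAC"):
    """Fraction of sequences that begin with the given prefix."""
    seqs = [s.upper() for s in seqs]
    if not seqs:
        return 0.0
    return sum(s.startswith(prefix) for s in seqs) / len(seqs)


def exact_hits(asvs, reference):
    """Keep only ASVs whose full sequence is present in the reference set."""
    reference = set(reference)
    return [s for s in asvs if s in reference]


# Hypothetical toy sequences, far shorter than real V4 ASVs.
reference_asvs = {"TACGGAGGAT", "TACGTAGGGT"}
my_asvs = [
    "TACGGAGGAT",  # identical to a reference ASV
    "TACGGAGGAA",  # one nucleotide off, so no exact match
    "GTGCCAGCAG",  # does not start with TAC, hinting at a trimming problem
]

tac_fraction = fraction_starting_with(my_asvs)
hits = exact_hits(my_asvs, reference_asvs)
print(f"{tac_fraction:.2f} of ASVs start with TAC; {len(hits)} exact hit(s)")
```

If the TAC fraction is low, the trim settings are the first thing to revisit; if it is high but few ASVs match, differences in read length or truncation are worth checking, since an exact match requires the whole sequence to agree.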

Best,
Daniel

Hi @liangyong19491001,

In practice for EMP V4 data, we've seen only a small number of ASVs not hit and they tend to be singletons. The ASV set represented here spans > 300,000 public and private microbiome samples from a large number of environments.

We are working on adapting DEPP so users can place their own fragments, but that is not available yet.

All the best,
Daniel

Many thanks for your detailed answer! In my previous test, the ASVs started with the 515F primer, which starts with GTG, because I had seen your answer in Greengenes2 taxonomy-from-table error - #9 by cannon.320, which I understood to say that the 515F primer should be kept. Thanks again; I will test it tomorrow.

Hi @liangyong19491001,

You do not want to keep the 515F primer. I apologize if that prior thread suggested otherwise, but I don't immediately see where it does. The 515F primer does start with GTG: GTGYCAGCMGCCGCGGTAA
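
For checking whether reads still carry the primer, a rough sketch is below. It expands the IUPAC degenerate codes in the primer (Y = C or T, M = A or C) into a regular expression and strips a leading match. The function names are hypothetical, the example read is invented, and no mismatches are tolerated, so a dedicated trimmer such as cutadapt would be the more robust choice on real data.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

PRIMER_515F = "GTGYCAGCMGCCGCGGTAA"


def primer_regex(primer):
    """Compile a regex that matches the primer at the start of a read."""
    return re.compile("^" + "".join(IUPAC[base] for base in primer))


def has_primer(seq, primer=PRIMER_515F):
    """True if the read begins with an exact instance of the primer."""
    return primer_regex(primer).match(seq.upper()) is not None


def strip_primer(seq, primer=PRIMER_515F):
    """Remove a leading primer if present; otherwise return the read as is."""
    match = primer_regex(primer).match(seq.upper())
    return seq[match.end():] if match else seq


# Hypothetical read: one concrete instance of the primer plus a short insert.
read_with_primer = "GTGCCAGCAGCCGCGGTAA" + "TACGGAGG"
print(has_primer(read_with_primer), strip_primer(read_with_primer))
```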

Best,
Daniel

Thank you @wasade. The prior thread says not to trim the 5’ end, and I misunderstood that as keeping the 515F primer :grin:

Hi!
Thank you for the tutorial. I have long nanopore reads (>1200 bp), so I used these commands to classify them:

qiime greengenes2 non-v4-16s \
  --i-table icu.biom.qza \
  --i-sequences icu.fna.qza \
  --i-backbone 2022.10.backbone.full-length.fna.qza \
  --o-mapped-table icu.gg2.biom.qza \
  --o-representatives icu.gg2.fna.qza

qiime greengenes2 taxonomy-from-table \
  --i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza \
  --i-table icu.gg2.biom.qza \
  --o-classification icu.gg2.taxonomy.qza

I used my own input files (the backbone is the same).

By default, --p-perc-identity is set to 0.99. I ran the same command with adjusted values of 0.85, 0.9, and 0.97 to account for the high error rate in nanopore data. Of course, I am getting the most "beautiful" results with 0.85, but I am afraid that they contain too many false positives. Could you advise which of the thresholds is best for working with nanopore data? Unfortunately, I don't have standards in the dataset to compare against.

Hi @timanix,

I'm not sure what type of data may exist to inform what a reasonable similarity threshold is for this type of data. If you have positive controls, it may be feasible to estimate a similarity threshold by aligning against the known 16S. If you do not have positive controls, then I wonder if it could be estimated by examining divergence relative to invariant positions in 16S, although we don't have that type of detail readily available within Greengenes2 right now.
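
As a rough illustration of the positive-control idea, the sketch below scores reads against a known sequence and summarizes the observed identities. It assumes an edit-distance-based identity, which is only an approximation and not the identity definition vsearch uses; the sequences are invented and much shorter than real 16S.

```python
import statistics


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def approx_identity(read, reference):
    """Approximate identity as 1 - edit distance / longer sequence length."""
    longest = max(len(read), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(read, reference) / longest


# Hypothetical control: one known sequence and three reads derived from it.
known_16s = "ACGTACGTACGTACGTACGT"
control_reads = [
    "ACGTACGTACGTACGTACGT",  # error free
    "ACGTACGAACGTACGTACGT",  # one substitution
    "ACGTACGTACGTACTACGT",   # one deletion
]

identities = [approx_identity(r, known_16s) for r in control_reads]
print("median identity:", statistics.median(identities))
print("minimum identity:", min(identities))
```

The distribution of identities observed for control reads could then suggest a threshold low enough to recover most true reads without being more permissive than the error rate requires.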

If you vary the similarity threshold, say from 0.85 to 0.99, do the biological conclusions derived or sample-sample relationships change?
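
One way to look at this, sketched below under assumptions, is to compute Bray-Curtis distances between samples for the tables produced at two thresholds and then correlate the two sets of distances (a Mantel-style comparison without the permutation test). The tables here are invented toy counts keyed by sample and feature; in practice they would come from the mapped tables at each threshold.

```python
from itertools import combinations
from math import sqrt


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two samples given as count dicts."""
    features = set(u) | set(v)
    num = sum(abs(u.get(f, 0) - v.get(f, 0)) for f in features)
    den = sum(u.values()) + sum(v.values())
    return num / den if den else 0.0


def pairwise_distances(table):
    """Distances for all sample pairs, in a fixed (sorted) sample order."""
    samples = sorted(table)
    return [bray_curtis(table[a], table[b]) for a, b in combinations(samples, 2)]


def pearson(x, y):
    """Pearson correlation; returns NaN if either vector has no variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else float("nan")


# Hypothetical toy tables: sample -> {feature: count} at two thresholds.
table_99 = {"s1": {"A": 8, "B": 2}, "s2": {"A": 7, "B": 3}, "s3": {"A": 1, "B": 9}}
table_85 = {
    "s1": {"A": 16, "B": 4, "C": 1},
    "s2": {"A": 14, "B": 6},
    "s3": {"A": 2, "B": 18, "C": 2},
}

d99 = pairwise_distances(table_99)
d85 = pairwise_distances(table_85)
r = pearson(d99, d85)
print(f"correlation between distance sets: {r:.3f}")
```

A correlation close to 1 would suggest the sample-sample relationships are robust to the threshold, even if the number of retained features differs.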

Best,
Daniel

Hi @wasade
Thank you for the reply!
Unfortunately, the dataset I am working with has no positive controls or standards sequenced. I saw in the literature that 85% identity is used with VSEARCH for taxonomy annotation of nanopore data, but I decided to check here in case there are some recommendations.

I have results for 85%, 90%, and 97%. There are differences in the number of sequences retained/annotated after running the gg2 plugin. I can share them if you are interested and would like a better overview of the differences.
