Introducing Greengenes2 2022.10

Hi @sgalera,

The underlying action from q2-vsearch does not yet expose the FeatureMap; see the open issue for further detail.

All the best,
Daniel

Hi everyone!

If I use the qiime greengenes2 taxonomy-from-table command, it uses DEPP to place each sequence into the tree and returns the taxonomy information. Am I right?

Hi @liangyong19491001,

The 2022.10 release has around 20 million V4 ASVs already placed with DEPP. We do not yet have an exposed pipeline for running DEPP independently.

Best,
Daniel

Hi @wasade, I tested the tool using the following commands:

qiime tools import \
--type 'SampleData[SequencesWithQuality]' \
--input-path all_manifest.txt \
--output-path ./all_seqs.qza \
--input-format SingleEndFastqManifestPhred33

qiime dada2 denoise-single \
  --i-demultiplexed-seqs ./all_seqs.qza \
  --p-trim-left 30 \
  --p-trunc-len 150 \
  --o-representative-sequences all_rep_seqs.qza \
  --o-table all_table.qza \
  --o-denoising-stats all_denoising_stats.qza \
  --p-hashed-feature-ids False \
  --p-n-threads 40

qiime greengenes2 filter-features \
     --i-feature-table all_table.qza \
     --i-reference ../2022.10.taxonomy.asv.nwk.qza \
     --o-filtered-feature-table all_gg2.biom.qza

qiime greengenes2 taxonomy-from-table \
     --i-reference-taxonomy ../2022.10.taxonomy.asv.nwk.qza \
     --i-table all_gg2.biom.qza \
     --o-classification all_icu_gg2.taxonomy.qza

When I checked the result of filter-features, there were no sequences left. Is there anything wrong with my commands? The data are 16S V4; I used the forward reads and trimmed them with --p-trim-left 30 --p-trunc-len 150.

Thank you @wasade!

So, is the command effectively running a query? I input a 16S V4 sequence, it searches that sequence against the 20M V4 ASVs, and then it returns the taxonomy information? If so, what happens if the input sequence does not exist among the 20M ASVs?

Hi @liangyong19491001,

Most 515F V4 ASVs will start with TAC... If following the EMP protocol, the fwd primer will not be present in the reads. If following a variant of the EMP protocol, the fwd primer may still be present. If your ASVs do not tend to start with TAC, then the left trim may be incorrect. The filter-features command performs an exact match, so if the ASVs are off by even a single nucleotide, they will not be found.
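One way to sanity-check the left trim is to tabulate how the ASVs begin. The following is a minimal Python sketch, not part of q2-greengenes2: the function names and the example sequences are hypothetical, and it assumes the ASV sequences are already available as plain strings (for example, the feature IDs of a table denoised without hashed feature IDs).

```python
from collections import Counter


def prefix_counts(asvs, k=3):
    """Count the first k nucleotides across a collection of ASV sequences."""
    return Counter(seq[:k].upper() for seq in asvs if len(seq) >= k)


def fraction_with_prefix(asvs, prefix="TAC"):
    """Return the fraction of ASVs that start with the given prefix."""
    asvs = list(asvs)
    if not asvs:
        return 0.0
    hits = sum(1 for seq in asvs if seq.upper().startswith(prefix))
    return hits / len(asvs)


# Hypothetical sequences, for illustration only.
example_asvs = [
    "TACGGAGGGTGCAAGCGTTAATCGG",
    "TACGTAGGTGGCAAGCGTTGTCCGG",
    "GTGCCAGCAGCCGCGGTAATACGGA",
]

print(prefix_counts(example_asvs).most_common())
print(fraction_with_prefix(example_asvs))
```

If most ASVs begin with something other than TAC, the trim position is worth revisiting before running filter-features.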

Best,
Daniel

Hi @liangyong19491001,

In practice for EMP V4 data, we've seen only a small number of ASVs not hit and they tend to be singletons. The ASV set represented here spans > 300,000 public and private microbiome samples from a large number of environments.
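Whether the ASVs that are not hit in your own data are mostly singletons can be checked with something like the sketch below. It is an illustration in plain Python rather than the plugin's implementation: the function names are hypothetical, `reference_asvs` stands in for the set of ASV sequences present in the reference, and "singleton" is assumed here to mean an ASV with a total count of 1 across all samples.

```python
def split_by_reference(table_counts, reference_asvs):
    """Split ASVs into those found verbatim in the reference and those not.

    table_counts: dict mapping ASV sequence -> total count across samples.
    reference_asvs: iterable of reference ASV sequences.
    """
    reference = set(reference_asvs)
    hit = {s: n for s, n in table_counts.items() if s in reference}
    missed = {s: n for s, n in table_counts.items() if s not in reference}
    return hit, missed


def summarize_missed(table_counts, reference_asvs):
    """Report how many ASVs were missed and how many of those are singletons."""
    hit, missed = split_by_reference(table_counts, reference_asvs)
    total = len(table_counts)
    return {
        "n_hit": len(hit),
        "n_missed": len(missed),
        "fraction_missed": len(missed) / total if total else 0.0,
        # Assumption: a singleton is an ASV whose total count is exactly 1.
        "missed_singletons": sum(1 for n in missed.values() if n == 1),
    }


# Hypothetical counts and reference, for illustration only.
counts = {"TACGGAGG": 120, "TACGTAGG": 45, "TACGAAGG": 1}
reference = {"TACGGAGG", "TACGTAGG"}
print(summarize_missed(counts, reference))
```

Because membership is tested on the full string, a sequence that differs by one nucleotide or by one extra leading base counts as missed.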

We are working on adapting DEPP so users can place their own fragments, but that is not available yet.

All the best,
Daniel

Many thanks for your detailed answer! In my previous test, the ASVs started with the 515F primer (which starts with GTG), because I had read your answer in Greengenes2 taxonomy-from-table error - #9 by cannon.320 as saying the 515F primer should be kept. Thanks again, I will test it tomorrow.

Hi @liangyong19491001,

You do not want to keep the 515F primer. I apologize if that prior thread suggested that, but I don't immediately see where. The 515F primer does start with GTG: GTGYCAGCMGCCGCGGTAA
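Because the primer contains degenerate bases (Y and M), a plain string comparison will not detect it reliably. Below is a minimal Python sketch of an IUPAC-aware check for a leading 515F primer. The helper names are hypothetical and the example read is made up; in practice a dedicated primer-trimming tool is the usual route, and this is only meant to show the matching logic.

```python
# Standard IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

PRIMER_515F = "GTGYCAGCMGCCGCGGTAA"


def starts_with_primer(seq, primer=PRIMER_515F):
    """Return True if seq begins with the primer, honoring degenerate codes."""
    seq = seq.upper()
    if len(seq) < len(primer):
        return False
    return all(base in IUPAC[code] for base, code in zip(seq, primer))


def strip_primer(seq, primer=PRIMER_515F):
    """Remove a leading primer if present; otherwise return seq unchanged."""
    return seq[len(primer):] if starts_with_primer(seq, primer) else seq


# Hypothetical read that still carries the primer.
read = "GTGCCAGCAGCCGCGGTAATACGGAGG"
print(starts_with_primer(read))
print(strip_primer(read))
```

This sketch only handles an exact, ungapped primer at the very start of the sequence; it does not tolerate sequencing errors within the primer.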

Best,
Daniel

Thank you @wasade, the prior thread said not to trim the 5’ end, and I misunderstood that as keeping the 515F primer :grin:

Hi!
Thank you for the tutorial. I have long nanopore reads (>1200). So I used this command to classify the reads:

$ qiime greengenes2 non-v4-16s \
>    --i-table icu.biom.qza \
>    --i-sequences icu.fna.qza \
>    --i-backbone 2022.10.backbone.full-length.fna.qza \
>    --o-mapped-table icu.gg2.biom.qza \
>    --o-representatives icu.gg2.fna.qza


$ qiime greengenes2 taxonomy-from-table \
>     --i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza \
>     --i-table icu.gg2.biom.qza \
>     --o-classification icu.gg2.taxonomy.qza

I used my own input files (the backbone is the same).

By default, --p-perc-identity is set to 0.99. I ran the same command with adjusted values of 0.85, 0.9 and 0.97 to account for the high error rate in nanopore data. Of course, I get the most "beautiful" results with 0.85, but I am afraid that they contain too many false positives. Could you advise which threshold is best for working with nanopore data? Unfortunately, I don't have standards in the dataset to compare against.

Hi @timanix,

I'm not sure what data may exist to inform a reasonable similarity threshold for this type of data. If you have positive controls, it may be feasible to estimate a similarity threshold by aligning against the known 16S. If you do not have positive controls, then I wonder if it could be estimated by examining divergence relative to invariant positions in 16S, although we don't have that type of detail readily available within Greengenes2 right now.

If you vary the similarity threshold, say from 0.85 to 0.99, do the biological conclusions derived or sample-sample relationships change?

Best,
Daniel

Hi @wasade
Thank you for the reply!
Unfortunately, the dataset I am working with has no positive controls or standards sequenced. I saw in the literature that 85% identity is used with VSEARCH for taxonomy annotation of nanopore data, but I decided to check here in case there are other recommendations.

I have results for 85%, 90% and 97%. There are differences in the number of sequences retained/annotated after running the gg2 plugin. I can share them if you are interested and would like a better overview of the differences.

Hi @timanix,

Sorry for the delay in replying. With the results, what I'm curious about specifically is whether, e.g., PERMANOVA statistics for variables of interest differ depending on the threshold used. If they do not, then it suggests the biological signal being tested is robust to this threshold. Does that make sense?
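As a rough companion to that comparison, the sketch below checks whether sample-sample relationships are preserved between two thresholds. To be clear about the substitution: this is not PERMANOVA. It computes Bray-Curtis distances for each table and then the Pearson correlation between the two sets of pairwise distances (the statistic used in a Mantel test, without the permutation step). The function names and the two tiny tables are hypothetical, and tables are assumed to be nested dicts of sample -> feature -> count.

```python
from itertools import combinations
from math import sqrt


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two samples given as feature->count dicts."""
    features = set(u) | set(v)
    num = sum(abs(u.get(f, 0) - v.get(f, 0)) for f in features)
    den = sum(u.get(f, 0) + v.get(f, 0) for f in features)
    return num / den if den else 0.0


def pairwise_distances(table):
    """Return {(sample_a, sample_b): distance} over all sorted sample pairs."""
    return {(a, b): bray_curtis(table[a], table[b])
            for a, b in combinations(sorted(table), 2)}


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def distance_agreement(table_a, table_b):
    """Correlate pairwise distances over the samples shared by both tables."""
    shared = sorted(set(table_a) & set(table_b))
    da = pairwise_distances({s: table_a[s] for s in shared})
    db = pairwise_distances({s: table_b[s] for s in shared})
    pairs = sorted(da)
    return pearson([da[p] for p in pairs], [db[p] for p in pairs])


# Hypothetical species-level tables at two identity thresholds.
table_085 = {"s1": {"a": 10, "b": 0}, "s2": {"a": 5, "b": 5}, "s3": {"a": 0, "b": 10}}
table_090 = {"s1": {"a": 8, "b": 0}, "s2": {"a": 4, "b": 5}, "s3": {"a": 0, "b": 9}}
print(distance_agreement(table_085, table_090))
```

A correlation close to 1 would be consistent with the threshold having little effect on sample-sample relationships, but it does not replace testing the variables of interest directly.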

Best,
Daniel

Hi @wasade
There were almost no differences between 85% and 90% in the PCoA plots and in PERMANOVA of important variables, while 97% was quite different (not surprising, with less than 10% of sequences retained). Note that I tested this with features collapsed to species level, since the sequences are not the originals but have been replaced with sequences from the database. We collapsed to species for core-metrics, and for the rest we will go no higher than genus level. We will proceed with 90% for now, but in the future we will test it with standards.

Thank you for the follow up!

Best,
Daniel

If a genus has multiple taxonomic labels, such as g__Blautia_A_141781 and g__Blautia_A_141770, should I treat them as separate genera or species when conducting differential analysis? Alternatively, should I remove node identifiers like _141770 from the taxonomic labels and merge them into the same genus for the differential analysis?

Hi @bylam,

The taxonomy reflects the phylogeny, and collapsing as proposed would disrupt that relationship. These types of decisions ultimately depend on what question you are asking with a particular analysis, and importantly, whether they matter to the type of interpretations you can draw from the result.
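Before deciding, it can help to see exactly which labels the proposed collapsing would merge. The sketch below is a hypothetical helper, not part of Greengenes2. It assumes the part to be removed is a trailing underscore followed by digits, as in the labels quoted in the question, and it only reports the groups rather than modifying any table.

```python
import re
from collections import defaultdict

# Assumption: the removable part is a trailing "_<digits>" on the label.
_NODE_SUFFIX = re.compile(r"^(?P<base>.+)_(?P<node>\d+)$")


def split_label(label):
    """Split a label into (base, numeric suffix); suffix is None if absent."""
    match = _NODE_SUFFIX.match(label)
    if match:
        return match.group("base"), match.group("node")
    return label, None


def groups_if_collapsed(labels):
    """Return {base: [labels]} for bases that would absorb more than one label."""
    groups = defaultdict(set)
    for label in labels:
        base, _ = split_label(label)
        groups[base].add(label)
    return {base: sorted(members)
            for base, members in groups.items() if len(members) > 1}


labels = ["g__Blautia_A_141781", "g__Blautia_A_141770", "g__Roseburia"]
print(groups_if_collapsed(labels))
```

Each reported group is a set of lineages that the phylogeny keeps distinct, so the listing makes the cost of merging them explicit for the question being asked.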

Best,
Daniel