What type of database to use for q2-fragment-insertion

aghudson · August 5, 2023, 5:37am

Hi,

I would like to use the q2-fragment-insertion plugin to create a more robust phylogenetic tree for alpha and beta diversity analysis of my 16S V4 data.

Having read through the information at https://library.qiime2.org/plugins/q2-fragment-insertion/16/, I am a bit confused as to what I should use as my reference database. Should I use the reference database that I used to annotate my ASVs that I created using RESCRIPt (based on Silva 138.1) or should I use a specially formatted version of this database for SEPP? If it is the latter, where do I find this file?

Thanks in advance for any advice,

Alan

crusher083 · August 5, 2023, 1:56pm

Hello Alan,

First of all, I would advise using Greengenes2 instead of Silva: it is a more comprehensive database for V4. There is a tutorial here: Introducing Greengenes2 2022.10.
If you still want to use SEPP, go to Data resources — QIIME 2 2023.5.1 documentation. Note that the tree for SEPP hasn't been updated since 2016, though.

Cheers,
V

aghudson · August 5, 2023, 10:43pm

Hi Valentyn,

Thanks for the very interesting response. I did not know that Greengenes had been updated. I will read through the publication.

If I were to use this database, would you recommend going back and repeating my taxonomic annotation with the new Greengenes database? I previously used RESCRIPt for the taxonomic annotation; would that be redundant now?

For phylogenetic reconstruction, does it allow you to integrate your own ASVs into the Greengenes phylogeny? Is that how it works?

Thanks again,

Alan

crusher083 · August 6, 2023, 7:59am

Re 1: Yes, you will need to repeat the taxonomic classification with GG2. RESCRIPt is not a taxonomic assignment tool in itself, but a tool for preprocessing sequence databases. For bacterial reads, people usually use SILVA preprocessed with RESCRIPt, and SILVA is incorporated into GG2.
Re 2: GG2 ships a huge phylogenetic tree, which is then used as the reference for phylogenetic analysis (GG2 is basically a huge tree of all ASVs sequenced before).

Cheers
V

wasade · August 7, 2023, 3:38pm

Hi Valentyn and Alan,

We are examining right now what is necessary to facilitate users placing their own fragments, but we do not have an ETA yet. That said, the current V4 representation in Greengenes2 at 90, 100, and 150 nucleotides in length is appreciable.

Best,
Daniel

aghudson · August 11, 2023, 7:36am

Hi @wasade,

Thanks for your input.

The commands were pretty simple to run through the QIIME 2 plugin. After using Greengenes2 for taxonomic annotation, I ended up with far fewer ASVs than with the previous Silva-RESCRIPt workflow. Do you know why this might be, and whether it could be a problem? I am working with fish gut microbiomes.

Another issue that I have is that following gg2 annotation, all the annotated ASVs are automatically renamed. While this is fine, it means I lose the representative sequences for each ASV and I was wondering if there was a way of conserving this link between annotated ASV and its representative sequence.

Best wishes,

Alan

wasade · August 11, 2023, 3:22pm

Hi @aghudson,

Thank you for the update!

It seems surprising that so many ASVs are dropping. Are the per-sample read counts dropping much? I ask because in other scenarios I've seen the number of ASVs drop a fair amount while more than 99% of the sequence data is still retained, suggesting that most of the lost ASVs were singletons.
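A minimal sketch of that per-sample check, assuming both feature tables have already been exported into plain Python dictionaries of the form `{sample_id: {feature_id: count}}`. The function name `retained_fraction` and the toy counts are hypothetical and only illustrate the arithmetic; this is not part of any QIIME 2 plugin.

```python
# Illustrative only: tables are plain dicts {sample_id: {feature_id: count}}.
def retained_fraction(before, after):
    """Return {sample_id: fraction of reads kept} for every sample in `before`."""
    fractions = {}
    for sample, counts in before.items():
        total_before = sum(counts.values())
        total_after = sum(after.get(sample, {}).values())
        # Guard against empty samples to avoid dividing by zero.
        fractions[sample] = total_after / total_before if total_before else 0.0
    return fractions


# Hypothetical toy data: sample "s1" loses two rare ASVs after re-annotation.
before = {
    "s1": {"asv1": 990, "asv2": 5, "asv3": 5},
    "s2": {"asv1": 500, "asv4": 500},
}
after = {
    "s1": {"ref_a": 990},
    "s2": {"ref_a": 500, "ref_b": 500},
}

fractions = retained_fraction(before, after)
for sample, frac in sorted(fractions.items()):
    print(f"{sample}: {frac:.1%} of reads retained")
```

If most samples keep nearly all of their reads even though the ASV count falls sharply, the lost features were probably rare ones.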

The mapping for non-v4-16s is a known issue. The underlying plugin, q2-vsearch, does not emit the mapping information, so at the moment we cannot readily expose it. We've recently opened an issue about it, and the seed of that discussion can be found here (Introducing Greengenes2 2022.10 - #34 by wasade).

Best,
Daniel

aghudson · August 26, 2023, 8:55pm

I need to check the per-sample read counts, but I have some (hopefully) quick questions about how the Greengenes2 annotation works. As I understand it, the annotation works at 99% sequence similarity. So, my query 16S V4 sequences will be assigned to a certain taxon if they are within 99% nucleotide sequence similarity of that taxon in the database, is that correct? Are sequences that fall outside this 99% similarity then excluded from the annotated data set? Does this mean that the annotated sequences are strictly OTUs rather than ASVs, and thus that several ASVs from the unannotated sequence list could be combined under a single OTU in the resulting annotated dataset?

Thanks again,

Alan

aghudson · August 31, 2023, 1:46am

Hi @wasade,

I noticed that the Firmicutes are now split into two phyla, Firmicutes "A" and Firmicutes "D". Is this widely accepted taxonomic nomenclature? I have not seen this naming convention elsewhere.

Best wishes,

Alan

wasade · August 31, 2023, 3:11pm

Hi @aghudson,

The primary splits (e.g., _A) are derived from GTDB; see their note. The secondary splits (e.g., _1234) are based on the de novo Greengenes2 phylogeny. Greengenes2 expresses a phylogenetically supported taxonomy.
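For anyone who wants to compare these names against other taxonomies, here is a rough sketch of separating the two kinds of suffix. It assumes names follow the pattern of a rank prefix, a base name, an optional uppercase-letter suffix, and an optional numeric suffix, as in the `_A` and `_1234` examples; the regular expression, the function name `split_taxon_name`, and the sample names are illustrative assumptions rather than an official Greengenes2 parser.

```python
import re

# Assumed name layout: rank prefix, base name, optional _LETTERS, optional _DIGITS.
_NAME = re.compile(
    r"^(?P<rank>[a-z]__)(?P<base>.*?)"
    r"(?:_(?P<primary>[A-Z]+))?"
    r"(?:_(?P<secondary>\d+))?$"
)


def split_taxon_name(name):
    """Return (base name, letter suffix or None, numeric suffix or None)."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised taxon name: {name!r}")
    base = match.group("rank") + match.group("base")
    return base, match.group("primary"), match.group("secondary")


# Hypothetical names used only to exercise the pattern.
for name in ["p__Firmicutes_A", "p__Firmicutes_D_1234", "p__Proteobacteria"]:
    print(name, "->", split_taxon_name(name))
```

Stripping the suffixes this way discards the phylogenetic information they encode, so it is probably best kept for display or cross-referencing rather than for analysis.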

All the best,
Daniel

aghudson · September 8, 2023, 7:44am

Thanks @wasade

Sorry to bother you again but do you have an answer to my previous question as to whether the annotated sequences are strictly OTUs rather than ASVs and thus whether several ASVs from the unannotated sequence list could be combined under a single OTU in the resulting annotated dataset?

aghudson · September 8, 2023, 8:21am

Apologies, reading the forum questions on Greengenes2, it appears that they are indeed OTUs rather than ASVs.

wasade · September 8, 2023, 3:37pm

Hi @aghudson,

Sorry for missing that! Not sure what happened.

The non-v4-16s action performs closed-reference clustering, so it would collapse multiple ASVs.
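A toy illustration of what that collapsing means for a feature table, assuming a mapping from ASV IDs to matched reference IDs is available. That mapping is hypothetical here (as noted earlier in the thread, q2-vsearch does not currently emit it), and `collapse_by_reference` is an illustrative helper, not a plugin function.

```python
from collections import defaultdict


def collapse_by_reference(asv_counts, asv_to_ref):
    """Sum ASV counts per matched reference ID; ASVs without a hit are set aside."""
    collapsed = defaultdict(int)
    unmatched = []
    for asv, count in asv_counts.items():
        ref = asv_to_ref.get(asv)
        if ref is None:
            # Closed-reference clustering discards sequences with no reference match.
            unmatched.append(asv)
            continue
        collapsed[ref] += count
    return dict(collapsed), unmatched


# Hypothetical counts for one sample, plus a made-up ASV-to-reference mapping.
asv_counts = {"asv1": 120, "asv2": 30, "asv3": 7, "asv4": 1}
asv_to_ref = {"asv1": "ref_X", "asv2": "ref_X", "asv3": "ref_Y"}

collapsed, unmatched = collapse_by_reference(asv_counts, asv_to_ref)
print(collapsed)
print(unmatched)
```

Here four ASVs become two reference features: two ASVs share one reference and the unmatched ASV is dropped, which is one way the feature count can fall while most reads are kept.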

For non-V4 data, you can stay in ASV space by using the full-length Naive Bayes classifier.

Best,
Daniel
