I would like to use the q2-fragment-insertion plugin to create a more robust phylogenetic tree for alpha and beta diversity analysis of my 16S V4 data.
Having read through the information on the QIIME 2 Library, I am a bit confused as to what I should use as my reference database. Should I use the reference database that I built with RESCRIPt (based on Silva 138.1) and used to annotate my ASVs, or should I use a specially formatted version of this database for SEPP? If it is the latter, where do I find this file?
Thanks for the very interesting response. I did not know that Greengenes had been updated. I will read through the publication.
If I were to use this database, would you recommend going back and repeating my taxonomic annotation using the new Greengenes database? I previously used RESCRIPt to do the taxonomic annotation. Would this be redundant now?
For phylogenetic reconstruction, does it allow you to integrate your own ASVs into the Greengenes phylogeny? Is that how it works?
Re1: Yes, you will need to repeat all taxonomic classification with GG2. RESCRIPt is not a taxonomic assignment tool in itself, but a tool for sequence database preprocessing. For bacterial reads, the usual reference is SILVA preprocessed with RESCRIPt, and SILVA is incorporated into GG2.
Re2: GG2 ships a huge phylogenetic tree, which is then referenced for phylogenetic analysis (GG2 is basically a huge tree of all ASVs sequenced before).
We are examining right now what is necessary to facilitate users placing their own fragments, but we do not have an ETA yet. That said, the current V4 representation in Greengenes2 at 90, 100, and 150 nucleotides in length is appreciable.
The commands were pretty simple to implement through the qiime plugin. After using Greengenes2 for taxonomic annotation, I ended up with far fewer ASVs than with the previous Silva-RESCRIPt workflow. Do you know why this might be and whether this could be a problem? I am working with fish gut microbiomes.
Another issue that I have is that following gg2 annotation, all the annotated ASVs are automatically renamed. While this is fine, it means I lose the representative sequences for each ASV and I was wondering if there was a way of conserving this link between annotated ASV and its representative sequence.
It seems surprising that so many ASVs are dropping. Are the per-sample read counts dropping much? I ask because in other scenarios, I've seen the number of ASVs drop a fair amount while the retained sequence data is still > 99%, suggesting most of those lost were singletons.
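One way to run that check is to compare, per sample, how many features and how many reads survive the annotation step. The following is only a minimal sketch: the nested dictionaries stand in for feature tables exported from QIIME 2, and the sample name, feature IDs, and counts are made up for illustration.

```python
def retention_summary(before, after):
    """Compare two feature tables given as {sample: {feature_id: count}}.

    Returns, per sample, the number of non-zero features before and after,
    and the fraction of reads retained.
    """
    summary = {}
    for sample, counts in before.items():
        kept = after.get(sample, {})
        reads_before = sum(counts.values())
        reads_after = sum(kept.values())
        summary[sample] = {
            "features_before": sum(1 for c in counts.values() if c > 0),
            "features_after": sum(1 for c in kept.values() if c > 0),
            "reads_retained": reads_after / reads_before if reads_before else 0.0,
        }
    return summary


# Hypothetical toy data: five ASVs before, two features after.
before = {"fish_gut_1": {"asv1": 500, "asv2": 495, "asv3": 1, "asv4": 1, "asv5": 3}}
after = {"fish_gut_1": {"ref_a": 500, "ref_b": 495}}

summary = retention_summary(before, after)
for sample, stats in summary.items():
    print(sample, stats)
```

In this toy case the feature count falls from five to two while 99.5% of the reads are kept, which is the pattern described above: most of what is lost is very low abundance.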
The mapping for non-v4-16s is a known issue. The underlying plugin, q2-vsearch, does not emit the mapping information, so at the moment we cannot readily expose it. We've recently opened an issue about it, and the seed of that discussion can be found here (Introducing Greengenes2 2022.10 - #34 by wasade).
I need to check the per-sample read counts, but I have some (hopefully) quick questions about how the Greengenes2 annotation works. As I understand it, the annotation works at 99% sequence similarity. So, my query 16S V4 sequences will be assigned to a certain taxon if they are within 99% nucleotide sequence similarity to that taxon in the database, is that correct? Are sequences that fall outside this 99% similarity then excluded from the annotated data set? Does this mean that the annotated sequences are strictly OTUs rather than ASVs, and thus several ASVs from the unannotated sequence list could be combined under a single OTU in the resulting annotated dataset?
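To make the question concrete, here is a toy sketch of the behaviour being asked about: if several ASVs matched the same reference record, their counts would be summed under that record, and ASVs with no match would drop out. This illustrates closed-reference style collapsing in general, not a confirmed description of what the plugin does. The ASV-to-reference mapping is hypothetical, since (as noted above) the mapping is not currently emitted.

```python
from collections import defaultdict


def collapse_by_reference(asv_counts, asv_to_ref):
    """Sum ASV counts under the reference ID each ASV maps to.

    ASVs absent from asv_to_ref are returned separately as unmatched.
    """
    collapsed = defaultdict(int)
    unmatched = {}
    for asv, count in asv_counts.items():
        ref = asv_to_ref.get(asv)
        if ref is None:
            unmatched[asv] = count
        else:
            collapsed[ref] += count
    return dict(collapsed), unmatched


# Hypothetical counts for one sample and a hypothetical 99% mapping.
asv_counts = {"asv1": 120, "asv2": 30, "asv3": 50, "asv4": 2}
asv_to_ref = {"asv1": "ref_X", "asv2": "ref_X", "asv3": "ref_Y"}

collapsed, unmatched = collapse_by_reference(asv_counts, asv_to_ref)
print(collapsed)
print(unmatched)
```

Here asv1 and asv2 end up as a single feature (ref_X), and asv4 is excluded because it has no reference hit.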
I noticed that the Firmicutes are now split into two phyla, Firmicutes "A" and Firmicutes "D". Is this widely accepted taxonomic nomenclature? I have not seen this naming convention elsewhere.
The primary splits (e.g., _A) are derived from GTDB, see their note. The secondary splits (e.g., _1234) are based on the de novo Greengenes2 phylogeny. Greengenes2 expresses a phylogenetically supported taxonomy.
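If you need to compare these labels against a taxonomy that does not use the suffixes, one option is to separate the base name from the two kinds of split. The sketch below assumes a primary split is an underscore followed by uppercase letters and a secondary split is an underscore followed by digits, as in the examples above. The regular expression and the label "Example_A_1234" are illustrative assumptions, not part of Greengenes2 or GTDB tooling.

```python
import re

# base name, optional primary split (_A), optional secondary split (_1234)
LABEL = re.compile(r"(?P<base>.+?)(?:_(?P<primary>[A-Z]+))?(?:_(?P<secondary>\d+))?")


def split_label(label):
    """Return (base, primary_suffix, secondary_suffix) for a taxon label."""
    match = LABEL.fullmatch(label)
    if match is None:
        return label, None, None
    return match.group("base"), match.group("primary"), match.group("secondary")


parsed = {name: split_label(name)
          for name in ["Firmicutes", "Firmicutes_A", "Firmicutes_D", "Example_A_1234"]}
for name, parts in parsed.items():
    print(name, "->", parts)
```

Grouping on the base name alone would merge Firmicutes_A and Firmicutes_D back into "Firmicutes", which discards the phylogenetic distinction the suffixes encode, so treat that as a convenience for cross-referencing only.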
Sorry to bother you again but do you have an answer to my previous question as to whether the annotated sequences are strictly OTUs rather than ASVs and thus whether several ASVs from the unannotated sequence list could be combined under a single OTU in the resulting annotated dataset?