Should I use sequences that are mapped to the greengenes2 backbone (using the non-v4-16s function) as the input sequences for a Naive Bayes classifier? This classifier is trained on the V3-V4 regions of greengenes2.
Though I've been able to acquire taxonomic information using unmapped sequences, it seems that there's no method to obtain corresponding (with matching ASV names) phylogenetic information.
You could just use the taxonomy of the backbone records. We didn't place V3-V4 ASVs in the phylogeny so there won't be existing coordinates for them. We are working on a way to do the placement for arbitrary fragments but it isn't available just yet.
For ASVs not based on 515f-806r, I would recommend using the non-v4-16s action for now, which performs closed reference OTU picking against the backbone. This would allow use of the phylogeny.
First off, big thanks to everyone for your input, especially Daniel.
I've been diving into the analysis of my V3-V4 region with Greengenes2 and stumbled upon some divergent results depending on whether I classify directly using qiime greengenes2 filter-features or utilize the pre-trained classifier.
RESULT: 26618 entries (taxonomy.qzv). This outcome aligns more closely with results I get when using the SILVA database.
How can there be such a discrepancy when we're supposedly employing the same database for taxonomic classification? Is there something I'm misunderstanding?
What's even more curious is that when I generate relative frequency tables at level 7, I get 2616 entries in the first case, compared to 1908 in the second case.
Thanks for the kind words, and thanks for reaching out!
In the non-v4-16s case, it's performing closed reference recruitment with q2-vsearch behind the scenes. Different ASVs could recruit to the same backbone, and some ASVs may not recruit at all. In the classify-sklearn example, the ASVs themselves are being classified, where similar (but different) ASVs may each get a label, and it can potentially characterize ASVs which are too divergent to recruit to the backbone.
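To illustrate why the two routes can report different numbers of features, here is a minimal toy sketch of closed reference recruitment. It is not what q2-vsearch does internally: the position-wise `identity` function, the 0.9 threshold, and all sequences and IDs are hypothetical stand-ins chosen for illustration (vsearch uses alignment-based identity).

```python
from collections import defaultdict


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("toy example expects equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(asvs, backbone, threshold=0.9):
    """Map each ASV to its best backbone record; ASVs below the threshold are dropped."""
    recruited = defaultdict(list)
    unrecruited = []
    for asv_id, seq in asvs.items():
        best_id, best_score = max(
            ((ref_id, identity(seq, ref)) for ref_id, ref in backbone.items()),
            key=lambda pair: pair[1],
        )
        if best_score >= threshold:
            recruited[best_id].append(asv_id)
        else:
            unrecruited.append(asv_id)
    return dict(recruited), unrecruited


# Hypothetical backbone records and ASVs (assumed data, not from Greengenes2).
backbone = {
    "ref_A": "ACGTACGTAC",
    "ref_B": "TTGGCCAATT",
}
asvs = {
    "asv_1": "ACGTACGTAC",  # identical to ref_A
    "asv_2": "ACGTACGTAA",  # one mismatch from ref_A
    "asv_3": "TTGGCCAATT",  # identical to ref_B
    "asv_4": "GGGGGGGGGG",  # too divergent to recruit to either record
}

recruited, unrecruited = closed_reference(asvs, backbone)
print(len(asvs), "ASVs in;", len(recruited), "backbone features out;",
      len(unrecruited), "unrecruited")
```

In this toy case four ASVs collapse to two backbone features and one ASV is lost, whereas a per-ASV classifier would still assign each of the four its own label.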
I am also looking for a similar thing using Kraken2. For now I have extracted the V3V4 region from the greengenes2 sequences and tried to build the database using Kraken2. Though database building is successful, none of the sequences are classified. Any thoughts on this? Did you face a similar issue?
Does the comment here resolve your questions? I see the issue opened on the Github tracker as well -- if there are further questions unrelated to QIIME 2, it would help to have them directed to Github.
Hello again. Sorry for a new question on the same topic, but I just want to be 100% sure of doing the things right! Is the following code correct to train a classifier on v3-v4 paired end reads? Or do I have to use the pre-trained classifier on the full-length data?
And also: if using the non-v4-16s action, can I give as input to qiime greengenes2 non-v4-16s the table and sequences obtained from qiime dada2 denoise-paired? Would this method be preferable or not?
Thank you again for your time and kind suggestions!
I'm not familiar with the --p-min-length and --p-max-length options, but from looking at the help text, I probably would relax them further. The coordinates for your primers are probably relative to E coli and it is plausible there may be variation in length in that region, but that is just a guess. That said, on the surface the commands seem reasonable. The commands used to construct the V4 classifier can be found here.
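As a rough way to see what relaxing the length bounds would change, the sketch below counts how many amplicons survive a given pair of bounds. The lengths, the candidate bounds, and the helper name `fraction_kept` are all hypothetical, and treating the bounds as inclusive is an assumption of this sketch rather than documented behaviour of the QIIME 2 options.

```python
def fraction_kept(lengths, min_length, max_length):
    """Fraction of amplicons whose length falls inside [min_length, max_length]."""
    if not lengths:
        raise ValueError("no amplicon lengths given")
    kept = sum(min_length <= n <= max_length for n in lengths)
    return kept / len(lengths)


# Hypothetical lengths of extracted V3-V4 amplicons; real values would come
# from the reads extracted from the reference sequences.
lengths = [400, 402, 405, 410, 421, 425, 428, 440, 455, 470]

# Compare a tight window against a relaxed one.
for lo, hi in [(400, 430), (380, 480)]:
    print(f"min={lo} max={hi} keeps {fraction_kept(lengths, lo, hi):.0%}")
```

If a tight window discards a noticeable share of the reference amplicons, that is an argument for widening it before training.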
The non-v4-16s action is a thin wrapper around q2-vsearch's cluster closed reference action -- providing stitched reads from DADA2 would work.
Hello @wasade
Thank you for the GG2 release last year.
I have used the pretrained 515F-806R classifier for 16S-V4 amplicons from lichen samples. The primers target prokaryotic organisms in this microhabitat, including 16S regions of the chloroplast genome.
The silva 138.1 classifier reports these features as (only one example given):
f577632a80f935428b6c9117d8075eb3 d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast; f__Chloroplast; g__Chloroplast
while the gg2 classifier reports it as (same feature as above):
f577632a80f935428b6c9117d8075eb3 d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__Cyanobacteriales
The overall classification into the p__Cyanobacteria phylum is very similar with the two classifiers (76 features using silva, 72 features using gg2).
As the majority of these features are classified as g__Chloroplast using silva classifier, we assume that they originate from algal plastids present in these samples (there is always a very dominating feature originating most likely from the main algae partner of the lichens).
How do I have to interpret the gg2 taxonomy of these features? I have not seen any 'chloroplast' annotation in gg2 taxonomy. Are these sequences not present, or were assigned using a more sophisticated taxonomy rule or naming scheme?
We included the SILVA set of chloroplast sequences in Greengenes2 during DEPP placement, but not in the topology update step with uDance, out of concern that the sequences (which differ appreciably from bacterial/archaeal 16S) would have detrimental effects. The taxonomy decoration phase was based on the records in the backbone topology and those used for topology updates, but not the records used for placement. As a result, the taxonomy decoration did not explicitly include the records sourced from SILVA, leading to a deficient taxonomy for chloroplast. I'm actively working on an update for this, and I apologize for any inconvenience.
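To see where the two taxonomies part ways for the feature quoted above, a small sketch can compare the lineage strings rank by rank. The helper names are hypothetical; the two lineages are copied from the post, and the parser only assumes the `rank__Name` convention separated by `;` with optional spaces.

```python
def parse_lineage(lineage: str) -> dict:
    """Split a 'd__X; p__Y' style lineage into {rank_prefix: name}."""
    ranks = {}
    for part in lineage.split(";"):
        part = part.strip()  # tolerate both ';' and '; ' delimiters
        if not part:
            continue
        prefix, _, name = part.partition("__")
        ranks[prefix] = name
    return ranks


def deepest_rank(lineage: str) -> str:
    """Return the prefix of the deepest rank that carries a non-empty name."""
    named = [prefix for prefix, name in parse_lineage(lineage).items() if name]
    return named[-1] if named else ""


# Lineages for feature f577632a80f935428b6c9117d8075eb3, as quoted in the thread.
silva = ("d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; "
         "o__Chloroplast; f__Chloroplast; g__Chloroplast")
gg2 = "d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__Cyanobacteriales"

silva_ranks = parse_lineage(silva)
gg2_ranks = parse_lineage(gg2)

# Ranks where both classifiers give the same name.
shared = [r for r in gg2_ranks if silva_ranks.get(r) == gg2_ranks[r]]
is_silva_chloroplast = "Chloroplast" in silva_ranks.values()

print("ranks in agreement:", shared)
print("SILVA calls chloroplast:", is_silva_chloroplast)
print("GG2 deepest rank:", deepest_rank(gg2))
```

Applied across a whole table, the same comparison could flag features that SILVA labels as chloroplast while the gg2 label stops at order level, which is the pattern described in this thread.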
Hi,
When I processed the output of Woltka with greengenes2, where the feature IDs look like "G000005825", I got an error that there are no matching labels for this format. How can I use the output of Woltka in greengenes2?
Hi @wasade
Thank you for working on an update (sorry for the delayed reply, I was out of the country).
Will you post a short notice here if you are done with it?
Best,