Greengenes IDs in Feature Table

Linnie · January 26, 2018, 4:24pm

Hi Qiime2 team!

I am wondering if there is any way in which to add the Greengenes IDs to a FeatureTable.

Imagine that we run a QIIME analysis on the microbiome of an organism across developmental stages and then we run a separate analysis on its microbiome when feeding on different diets. In the first analysis, we might get a feature with a particular taxonomy. In the second analysis, we might also get a feature with the same taxonomy. However, they might be different features as multiple features have the same taxonomy in the feature table. So, if we just used the feature table we would consider them the same, but if we had the Greengenes ID we would know that they are different.

In the table that we get after producing the taxa plots, all features with the same taxonomy assignment are clustered into one. Is there any way to keep them separated (as they would be in the table.qza) and assign them a Greengenes ID? We would end up having a table with the ASV´s, abundance data, Greengenes ID´s and maybe also taxonomy.

Thank you so much for your help!

Nicholas_Bokulich · January 26, 2018, 5:10pm

Hi @Linnie,
This is a very interesting suggestion/question.

In general, if we are interested in separating feature-level information (e.g., unique sequence variants), we would continue analysis with the feature table. If we are interested in taxonomic-level information, we use taxa collapse to collapse at that taxonomic level and continue analysis.

However, it sounds like you are interested in, e.g., collapsing into taxa but keeping taxa separate that have different greengenes IDs (essentially retain "strain"-level information) for plotting purposes. Is that correct?

One big problem with the approach that you describe is that many of the taxonomy classifiers are not simply finding the most similar greengenes ID, they are finding consensus among N top hits or determining the confidence of a given assignment based on its similarity to other potential hits. Hence, many taxonomy assignments are not species level (or whatever the maximum taxonomic level is in the reference database), and there will never be a greengenes ID associated with that particular assignment. So the main fault with this approach (and all methods I describe below) is that the greengenes ID is not a reliable taxonomic assignment.

But there are a few ways to approach this (but all require a bit of leg work).

A practical approach (if you are only interested in a specific taxonomic group — I assume you are not trying to split hairs on every single taxonomic group in your data) would be to filter to a specific taxon and then look at the feature-level information (instead of collapsing into taxonomic groups). You can do this by:

filtering your feature table to only contain the taxa of interest
use feature-table heatmap to view the abundance of those features (sequence variants) in each sample as a heatmap. Or export your data and use another method to view this feature-level information.

The main advantages of this approach are that you keep each feature separate (and hence recognize separate sequence variants in your analysis) but also use the actual taxonomy assignment, which would be more reliable than a top-hit taxonomy assignment.

If that doesn't cut it (e.g., you are interested in many or all taxonomic groups), I can think of a couple approaches that you might like (the first is better and easier and by far my favorite approach of all, even the "practical" approach above, but requires programming skills; the second does not necessarily):

You could modify the reference taxonomy to contain reference ID as an additional taxonomic level. This would require some degree of bash or other programming skills. Then you can assign taxonomy using that reference database — you will need to use classify-consensus-blast or classify-consensus-vsearch with the maxaccepts parameter set to 1 (to switch off consensus assignments and just retain the top hit). Then collapse on taxonomy and use heatmap, barplots, or any other methods to view and analyze.
Since it sounds like you are sort of interested in finding the most similar hit in the greengenes database, you could use q2-quality-control's evaluate-seqs instead of a taxonomy assignment. Taxonomy assignment is not what that method is really intended for, but it does report the top blast hit (including reference database ID but not taxonomic information) for each query sequence. This would not be a QZA artifact so you would not be able to use it for downstream analyses in QIIME2, and only the database ID, not the taxonomy, would be listed, so you would need to dig around a bit to determine the actual taxonomy.

I hope one of these solutions help!

Linnie · January 26, 2018, 7:34pm

Hi @Nicholas_Bokulich,

thank you so, so much for your thorough answer! That is exactly what I needed.

I will give your suggestions a try.

Best,

L.

system · February 27, 2018, 1:34am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.