Hi @Linnie,
This is a very interesting suggestion/question.
In general, if we are interested in separating feature-level information (e.g., unique sequence variants), we would continue analysis with the feature table. If we are interested in taxonomic-level information, we use taxa collapse
to collapse at that taxonomic level and continue analysis.
However, it sounds like you are interested in, e.g., collapsing into taxa but keeping taxa separate that have different greengenes IDs (essentially retain “strain”-level information) for plotting purposes. Is that correct?
One big problem with the approach that you describe is that many of the taxonomy classifiers are not simply finding the most similar greengenes ID, they are finding consensus among N top hits or determining the confidence of a given assignment based on its similarity to other potential hits. Hence, many taxonomy assignments are not species level (or whatever the maximum taxonomic level is in the reference database), and there will never be a greengenes ID associated with that particular assignment. So the main fault with this approach (and all methods I describe below) is that the greengenes ID is not a reliable taxonomic assignment.
But there are a few ways to approach this (but all require a bit of leg work).
A practical approach (if you are only interested in a specific taxonomic group — I assume you are not trying to split hairs on every single taxonomic group in your data) would be to filter to a specific taxon and then look at the feature-level information (instead of collapsing into taxonomic groups). You can do this by:
- filtering your feature table to only contain the taxa of interest
- use
feature-table heatmap
to view the abundance of those features (sequence variants) in each sample as a heatmap. Or export your data and use another method to view this feature-level information.
The main advantages of this approach are that you keep each feature separate (and hence recognize separate sequence variants in your analysis) but also use the actual taxonomy assignment, which would be more reliable than a top-hit taxonomy assignment.
If that doesn’t cut it (e.g., you are interested in many or all taxonomic groups), I can think of a couple approaches that you might like (the first is better and easier and by far my favorite approach of all, even the “practical” approach above, but requires programming skills; the second does not necessarily):
- You could modify the reference taxonomy to contain reference ID as an additional taxonomic level. This would require some degree of bash or other programming skills. Then you can assign taxonomy using that reference database — you will need to use
classify-consensus-blast
or classify-consensus-vsearch
with the maxaccepts
parameter set to 1 (to switch off consensus assignments and just retain the top hit). Then collapse on taxonomy and use heatmap, barplots, or any other methods to view and analyze.
- Since it sounds like you are sort of interested in finding the most similar hit in the greengenes database, you could use
q2-quality-control
's evaluate-seqs instead of a taxonomy assignment. Taxonomy assignment is not what that method is really intended for, but it does report the top blast hit (including reference database ID but not taxonomic information) for each query sequence. This would not be a QZA artifact so you would not be able to use it for downstream analyses in QIIME2, and only the database ID, not the taxonomy, would be listed, so you would need to dig around a bit to determine the actual taxonomy.
I hope one of these solutions help!