Hello, I have run qiime picrust2 custom-tree-pipeline on my data using --p-hsp-method mp.
My question is about the pathway abundance and coverage tables. How are the coverage and abundance of each MetaCyc pathway calculated? Are they reported over the whole community (the predicted metagenome for each sample)? Are they corrected by each predicted genome? If so, how? And how does the calculation take into account the number of reactions in each pathway?
I got this from the FAQ section of the PICRUSt2 wiki, but I find it a bit confusing. I don't get what "the group being assessed" stands for.
"The coverage is based on the harmonic mean of confidence scores for each reaction in a pathway. The scores will differ depending on what the median reaction abundance is for the group being assessed (i.e. different reaction scores will be inferred if the reaction is within one predicted genome compared to across an entire sample). Since the median reaction abundance will be higher across an entire sample it’s harder to be confident in rare reactions within a sample based on this approach."
Thanks very much in advance for any help.
Pathway abundances and coverages are calculated using the same approach used by HUMAnN2. Predicted EC numbers are first regrouped into MetaCyc reactions, which can be linked to MetaCyc pathways.
The pathways that are present are identified by first running MinPath, which finds the minimum set of pathways needed to explain the observed reactions. The abundances of these pathways are essentially calculated by taking the harmonic mean of the abundances of the MetaCyc reactions within the pathway.
I say essentially because it’s a little more complicated (see explanation copied from HUMAnN2 wiki):
…the abundance for each pathway is a recursive computation of abundances of sub-pathways with paths resolved to abundances based on the relationships and abundances of the reactions contained in each. Each path, the smallest portion of a pathway or sub-pathway which can’t be broken down into sub-pathways, has an abundance that is the max or harmonic mean of the reaction abundances depending on the relationships of these reactions. Optional reactions are only added to the overall abundance if their abundance is greater than the harmonic mean of the required reactions.
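To make the harmonic mean part concrete, here is a minimal Python sketch for a single path. It is not the HUMAnN2 or PICRUSt2 code: the function name `path_abundance` is made up, the recursion over sub-pathways and the "max" case are left out, and the way optional reactions are folded in (included in the harmonic mean only when they exceed the harmonic mean of the required reactions) is my assumption about what "added" means in the quote above.

```python
from statistics import harmonic_mean


def path_abundance(required, optional=()):
    """Simplified, non-recursive abundance of one path (illustrative only).

    required: abundances of the reactions the path needs.
    optional: abundances of optional reactions.
    """
    required = list(required)
    # A path with no required reactions, or with a missing one, gets zero.
    if not required or any(a <= 0 for a in required):
        return 0.0
    base = harmonic_mean(required)
    # Assumption: optional reactions count only if they exceed the
    # harmonic mean of the required reactions.
    kept = [a for a in optional if a > base]
    return harmonic_mean(required + kept)


# Evenly covered reactions give back the shared abundance.
even = path_abundance([10, 10, 10])
# One rare reaction pulls the pathway abundance down sharply.
uneven = path_abundance([10, 10, 1])
# A missing required reaction zeroes the path.
missing = path_abundance([10, 0, 10])
print(even, uneven, missing)
```

The point of using a harmonic mean is visible in the second call: a single low-abundance reaction dominates the result, so a pathway is only called abundant when all of its required reactions are.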
Pathway coverages are calculated using the same approach except that the reaction abundances are transformed into reaction confidence scores.
In the QIIME2 version of PICRUSt2 the pathway abundances and coverages are based on the whole metagenome per sample. There is no correction for individual contributing sequences. Note that you can get the breakdown by sequence with the standalone version of PICRUSt2 as well, but not with the QIIME2 version currently.
Thank you very much for your quick reply.
And thanks for developing the plugin.
Good to know that there is a choice for correction with the standalone version.
I just ran picrust2 in QIIME2, and I am also confused about matching the names of the MetaCyc pathway to the actual metabolisms… many of these pathways are numbers and I cannot find the mapping file for the metabolisms that they correspond to. I ran the default full pipeline, and I am looking at the exported (.tsv) “pathway_abundance” file.
Also, I am very interested in obtaining per-ASV predicted metagenomes (stratified by which sequence is contributing each function). I know that this is currently not available in the QIIME2 version of this plugin, but it would be really, really helpful if it could be added in the future.
I know it has been over 3 years since this post was active and I am late in joining the discussion, but can someone please help me out with Brooke's question posted above? I am struggling at the same step too. I have done some downstream analyses on the pathways_abun_exported.tsv file that we get from the q2-PICRUSt2 pipeline, but I am having a hard time getting the complete names of the pathways that are in the output file. Where can I get the complete names of the pathways that come up in the q2-PICRUSt2 output? I'd greatly appreciate any leads to move ahead with this.
The map of pathway ids to full names can be found here: picrust2/metacyc_pathways_info.txt.gz at master · picrust/picrust2 · GitHub
You can also try using the add_descriptions.py script, which would add the descriptions as an extra column to your file, although it expects files in the format output by the PICRUSt2 standalone version: Add descriptions · picrust/picrust2 Wiki · GitHub
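If the script does not accept your exported table, the join is easy to do by hand. Below is a rough Python sketch using only the standard library. It assumes the mapping file is a two-column, tab-separated list of pathway id and description with no header, and that the exported table is a TSV whose header line starts with "#". The function names, the ids and the descriptions in the example are placeholders I made up, not rows from the real files.

```python
import csv
import io


def load_descriptions(handle):
    """Read a two-column, tab-separated map of pathway id -> description."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def add_descriptions(table_handle, descriptions, missing="NA"):
    """Return the table rows with a description inserted after the id column."""
    annotated = []
    reader = csv.reader(table_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        if row[0].startswith("#") and len(row) == 1:
            annotated.append(row)  # single-field comment line, kept as-is
        elif row[0].startswith("#"):
            annotated.append([row[0], "description"] + row[1:])  # header line
        else:
            name = descriptions.get(row[0], missing)
            annotated.append([row[0], name] + row[1:])
    return annotated


# Made-up stand-ins for the real files (ids and names are placeholders).
mapping = io.StringIO("PWY-AAAA\texample pathway A\nPWY-BBBB\texample pathway B\n")
table = io.StringIO(
    "# Constructed from biom file\n"
    "#OTU ID\tsample1\tsample2\n"
    "PWY-AAAA\t12.5\t0.0\n"
    "PWY-CCCC\t3.0\t7.5\n"
)
annotated = add_descriptions(table, load_descriptions(mapping))
for row in annotated:
    print("\t".join(row))
```

For the real files you would pass open file handles instead of the `io.StringIO` objects, for example `gzip.open(path, "rt")` for the gzipped mapping file. Ids that are not in the mapping come out as "NA", which is a quick way to spot mismatches between the mapping file and your table.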
Hopefully that helps!
Thank you for your response with the links. They were certainly helpful.
A couple of follow-up questions:
- How often is the text file on the GitHub page updated with new pathways? In other words, would it be fine if I used the same text file for a future analysis on another project?
- Can the add_descriptions.py script be used in the q2-PICRUSt2 plugin as well?
Thank you very much,
Regarding both questions: I'm no longer working on PICRUSt2 and so I can't promise to make updates like that unfortunately due to time constraints. Sorry!
More specifically though:
It is not regularly updated, so it matches the pathway ids used in the default database, but if updated mapping files are used with new and updated MetaCyc pathways then there could be mismatches / missing info.
Currently the q2-PICRUSt2 plugin outputs BIOM files, which I know from experience are a little buggy when you add metadata to them (or at least they were a few years ago!). That's why I avoided merging metadata into these files.
Great! Thanks for your response.
Another clarification I had on my mind - the numbers that appear in the output files of q2-PICRUSt2: what do they represent? What do I infer from those numbers? Can they be understood as quantified gene expression/pathway expression in each individual sample?
They represent the community-wide abundances of the genes/pathways in each sample. This is calculated as the abundance of each taxon (based on 16S, normalized by the predicted number of 16S copies per taxon) multiplied by the predicted gene/pathway copy number, with this product then summed across all taxa. This means you would need to account for the difference in total sum across samples (e.g., use a tool that can deal with compositional data, or alternatively convert the abundances per sample to relative abundances).
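As a toy illustration of that calculation for one sample, here is a small Python sketch. It is not PICRUSt2 code: the function names, the ASV and gene labels, and all the numbers are invented for the example.

```python
def community_abundance(asv_counts, copies_16s, gene_copies):
    """Sum over taxa of (16S count / predicted 16S copies) * predicted gene copies."""
    totals = {}
    for asv, count in asv_counts.items():
        normalized = count / copies_16s[asv]  # correct for 16S copy number
        for gene, n_copies in gene_copies[asv].items():
            totals[gene] = totals.get(gene, 0.0) + normalized * n_copies
    return totals


def to_relative(abundances):
    """Convert one sample's abundances to proportions that sum to 1."""
    total = sum(abundances.values())
    if total == 0:
        return dict(abundances)
    return {key: value / total for key, value in abundances.items()}


# Invented inputs for a single sample.
asv_counts = {"ASV1": 100, "ASV2": 50}        # 16S read counts
copies_16s = {"ASV1": 2, "ASV2": 5}           # predicted 16S copies per genome
gene_copies = {                                # predicted gene copies per genome
    "ASV1": {"geneA": 1, "geneB": 2},
    "ASV2": {"geneA": 3},
}

abundance = community_abundance(asv_counts, copies_16s, gene_copies)
relative = to_relative(abundance)
print(abundance)
print(relative)
```

Here ASV1 contributes 100 / 2 = 50 normalized units and ASV2 contributes 50 / 5 = 10, so geneA ends up at 50 * 1 + 10 * 3 = 80 and geneB at 50 * 2 = 100. These are predicted gene abundances in the community, not expression levels, and the raw totals depend on sequencing depth, which is why the relative abundance step (or a compositional method) is needed before comparing samples.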