Confusion on when to agglomerate (or if it is needed) before plotting?

Hi! I'm so sorry this is long but I don't know what's relevant for you to know and what isn't.

I'm sorry if this is a dumb question, but I am very lost and hoping for some guidance from more experienced bioinformatics folks. I'm doing my first 16S amplicon microbiome data analysis from start to finish - I've previously only worked on the wet-lab side. My primary question is when to agglomerate with respect to handling my NAs.

My second question is broader: I'm struggling to get a sense of the order of steps for appropriately wrangling my 16S data in R using the following packages: phyloseq (DESeq2 isn't available for my version of R, so I haven't used it), vegan, microbiome, and microViz. I have coding experience with ggplot2, dplyr, the tidyverse, and plotly-type (data viz) packages, which I was hoping to bring into the alpha diversity, relative abundance, and beta diversity plots, heatmaps, and any relevant plots after ANCOM and mixed-models analysis.

Background on data: I have successfully performed all the denoising in Qiime2 with Deblur, generated an OTU feature table (which I did not convert to relative abundance before exporting for R, because I figured it was better to clean up my data in R and then transform from absolute counts to relative abundance), generated my .qza files for taxonomy and my phylogenetic tree, and imported everything along with my metadata into RStudio. (I manually verified that the order of the samples in my metadata matched the columns of my OTU table in the Excel files before importing.)

I moved the OTU IDs from a column into row names for both my OTU table and my taxonomy table, converted each into a matrix, and created a phyloseq object that includes OTU, Tax, metadata, and tree. From that initial object, I created two further objects. In the first, I dropped all unclassified data at every rank level (i.e., k__ through s__ placeholders, NA, unknowns). For the second, I used tax_fix to rename anything unknown based on the next closest known classification, without dropping it, because I would like to see whether there's anything worth noting in the unclassified organisms between my control group's samples and my treatment group's samples.
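In case it helps, here's roughly what that assembly step looks like (a simplified sketch; object names like `otu_df`, `tax_df`, and `metadata_df` are placeholders for my actual data):

```r
library(phyloseq)

# Placeholder inputs: otu_df and tax_df are data frames read from the
# exported QIIME 2 files, with the OTU IDs in a column called "OTU_ID".
otu_mat <- as.matrix(tibble::column_to_rownames(otu_df, "OTU_ID"))
tax_mat <- as.matrix(tibble::column_to_rownames(tax_df, "OTU_ID"))

ps <- phyloseq(
  otu_table(otu_mat, taxa_are_rows = TRUE),
  tax_table(tax_mat),
  sample_data(metadata_df),  # data frame with sample IDs as row names
  phy_tree(tree)             # e.g. read with ape::read.tree()
)

# The two working copies described above:
ps_dropped <- subset_taxa(ps, !is.na(Family))  # drop unclassified (one rank shown)
ps_fixed   <- microViz::tax_fix(ps)            # relabel unknowns instead of dropping
```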

HERE'S where I'm lost! Now that I've made these two objects, I don't know what the next appropriate step is, or whether the order even matters. Should I convert my absolute counts to relative abundance now, then agglomerate at each rank level, and then begin plotting alpha diversity/beta diversity/heatmaps/a visual of the phylogenetic tree at different rank levels? Should NAs be dropped for these to be accurate visualizations? Should I subset my data by rank level, or is agglomerating equivalent/better?

Other than plotting the relative abundances at each rank level to show compositional shifts between my variables of interest and groups, should I be using the relative abundances for all the other plots/analyses, or go back to absolute counts for heatmaps and alpha/beta metrics? Is there a separate step to calculate alpha and beta diversity, and THEN plug the results into the code for the respective plots, or does the plotting code do the calculations in the background? Some examples go straight from merging the initial object to plotting alpha diversity, while others have an interim step of calculating indices that is separate from the visualization.

All the workflows are different from one another, and many don't explain the in-between steps: they jump between the code for the plots without specifying the characteristics my data needs (absolute or relative counts, low reads trimmed off or kept in, the object with all the hierarchies or one filtered so only a single rank remains before the step).

I realize I've packed a LOT of questions in here (which might indicate how lost I currently am). If you're able to help with even just one of these questions, I'd be grateful! Ultimately, I'm stuck after the step of "dealing" with my NAs and need to produce the standard visualizations I've listed, plus diversity metrics. Do all plots need both agglomeration and pruning beforehand? Are these functions the same thing? Do any of them require the NAs to be handled first, or will they work either way?

I don't know where to go from here. Thank you in advance for any guidance!! (And for reading all of this.)


Hello @kida_miska,

Welcome to the forums! :qiime2:

Yes, I can help answer some of these questions. But first, I want to commend you for diving into the informatics side of things. There is a lot to learn here, and it can feel overwhelming. But you are making great progress (full Qiime2 pipeline, imported into R, learned the Tidyverse, etc., etc.). You are also asking all the right questions.

It sounds like we do analysis in a similar way:
16S reads -> :qiime2: -> R (phyloseq and Tidyverse)

Now that you have run your reads through Qiime2, it's the perfect time to review the detailed flowcharts on the Overview of Qiime2 page. These flowcharts serve as a map on which you can trace the path your data followed. :world_map:

The big question: What to do with NAs?
It sounds like you are speaking about features (ASVs/OTUs) that have NA in their taxonomy identification, is that right?
p__Firmicutes, c__Bacilli, o__Lactobacillales, f__?

This is a feature only classified down to the Order level, i.e. the Family is unclassified. We might also call this unclassified at the family level, or underclassified.

(I am avoiding the term 'NA' because that is more commonly used to describe unobserved features, zeros in your data that may not be truly zero! This is 'under-sampling' and is related to sparsity in statistics. That's a different issue!)

When I work with unclassified features, I try to preserve as many levels as possible for as long as possible. When I build a stacked bar plot at the phylum level, I sum up the counts for that one plot just before making it (see the sketch below). Then the rest of the analysis continues with the unmerged ASVs.
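Something like this, using phyloseq built-ins (a minimal sketch; `ps` stands in for your un-agglomerated phyloseq object):

```r
library(phyloseq)

# Agglomerate to Phylum only for this one plot; 'ps' (the unmerged,
# feature-level object) stays untouched for everything downstream.
ps_phylum <- tax_glom(ps, taxrank = "Phylum", NArm = FALSE)  # keep NA-labeled taxa
plot_bar(ps_phylum, fill = "Phylum")
```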

This reduces database bias: if you merge or drop features based on taxonomy, anything missed by the database will be missed by your stat tests. I recommend doing as much analysis as possible on the feature level (as if you didn't have tax labels at all!).
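That also touches on one of your questions: phyloseq's plotting helpers do the calculations in the background, so you can point them at the raw, unmerged object directly. A sketch (assuming a metadata column called `Group`):

```r
# Alpha diversity at the feature level, from raw counts: plot_richness()
# computes the indices internally (via estimate_richness()).
plot_richness(ps, x = "Group", measures = c("Observed", "Shannon"))

# Beta diversity, also on unmerged ASVs: ordinate() does the distance and
# ordination math; plot_ordination() just draws the result.
ord <- ordinate(ps, method = "PCoA", distance = "bray")
plot_ordination(ps, ord, color = "Group")
```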

A similar strategy is helpful for normalization. People seem to like stacked bar plots that go to 100%, but DESeq2 and ANCOM-BC do their own normalization. So I only rescale when it's needed for a specific graph, and keep my primary source data mostly raw.
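In code, that pattern is just a throwaway transformed copy at plot time (a sketch, continuing with the hypothetical `ps_phylum` from above):

```r
# Make a relative-abundance copy only when a specific plot needs it,
# e.g. a stacked bar plot that sums to 100%.
ps_phylum_rel <- transform_sample_counts(ps_phylum, function(x) x / sum(x))
plot_bar(ps_phylum_rel, fill = "Phylum")

# 'ps' itself stays as raw counts for DESeq2 / ANCOM-BC,
# which normalize internally.
```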

I hope this helps you keep moving forward.


As you have noticed, multi-part questions tend to take longer for someone to answer, while specific questions get fast answers. :microscope:


This was very helpful for me, as I am going through the same thought process for the first time. Thank you!
