My labmate, undergraduate research assistant, and I have spent a significant amount of time searching the forum for a direct answer to our question and haven’t had any luck. In our lab, we are currently using QIIME2/2019.4.
We use QIIME2 to analyze microbial data from a variety of clinical trials and in collaboration with other labs. As such, we often only need assigned taxonomy with relative abundances for a subset of samples per plate/run, rather than for all samples. We were under the impression that it is best practice to avoid obtaining relative abundances for all merged samples and then manually omitting samples from the .tsv file in Excel. However, we are confused, because the compositional nature of relative abundances means that they are defined within a participant, not between participants (i.e. if we add up the relative abundances within a participant, they sum to 1).
So our question is: What is the proper thing to do? Can we run all merged runs through the workflow once to get the assigned taxonomy/relative abundances, and then manually remove the samples we do not need from the .tsv file? Or should we filter/re-analyze our sequence data each time we want the assigned taxonomy/relative abundances for a subset of the full sample set?
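For reference, the "remove samples from the .tsv file" option could be scripted instead of done by hand in Excel. This is only a minimal sketch: it assumes the exported table has one row per taxon, the taxon label in the first column, and one column per sample ID. The function name `keep_samples` and the sample IDs are made up for illustration.

```python
import csv
import io

def keep_samples(tsv_text, wanted):
    """Return TSV text with the taxon column plus only the wanted sample columns."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    # Assumption: column 0 holds the taxon label; all other columns are sample IDs.
    keep = [0] + [i for i, name in enumerate(header) if i > 0 and name in wanted]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([header[i] for i in keep])
    for row in reader:
        writer.writerow([row[i] for i in keep])
    return out.getvalue()

# Hypothetical relative-abundance table with three samples.
example = (
    "taxon\tS1\tS2\tS3\n"
    "g__Bacteroides\t0.50\t0.20\t0.10\n"
    "g__Prevotella\t0.50\t0.80\t0.90\n"
)
subset = keep_samples(example, {"S1", "S3"})
print(subset)
```

Scripting the step at least makes the subsetting reproducible, but it does not answer whether subsetting after the fact is equivalent to re-running the workflow on the subset.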
From what we’ve found, we believe that subsetting a QIIME2 taxonomic run to a specific group isn’t going to make the numbers wrong per se (i.e. they will still add up to the correct amounts), but it would be better to re-run QIIME2 with just that subset.
Based on our understanding, QIIME2 denoises and filters the input for chimeras, then picks ASVs, and then produces relative abundances, which are basically the number of reads assigned to each ASV in a sample divided by the total read count for that sample. But if we have a different set of inputs, then we might end up finding a different set of chimeras and ASVs. It is possible that some sequences are filtered as chimeras or classified as different ASVs when groups A and B are combined, but not when group A and group B are considered alone, especially if the groups were sequenced differently (i.e. different runs or plates) and not randomly assigned. That’s not to say that the relative abundances of group A+B are wrong if we subset to just group A, but if we were really only going to look at group A, it would be better to do a QIIME2 analysis of just group A, since that would remove any confounding effects of group B on the analysis.
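To make the normalization step concrete, here is a small sketch of per-sample relative abundance computed from a count table. It is not QIIME2 code. The sample IDs, ASV names, and counts are invented, and it assumes the count table itself is identical whether or not group B is present.

```python
def relative_abundance(counts):
    """counts: {sample_id: {feature_id: read_count}} -> same shape, as fractions."""
    result = {}
    for sample, features in counts.items():
        total = sum(features.values())  # total reads in this sample only
        result[sample] = {f: n / total for f, n in features.items()}
    return result

# Hypothetical counts: A1 and A2 belong to group A, B1 to group B.
counts = {
    "A1": {"ASV1": 30, "ASV2": 70},
    "A2": {"ASV1": 10, "ASV2": 90},
    "B1": {"ASV1": 50, "ASV2": 50},
}

all_samples = relative_abundance(counts)
group_a_only = relative_abundance({s: counts[s] for s in ("A1", "A2")})

# Each sample is divided by its own total, so group B never enters the arithmetic.
for s in ("A1", "A2"):
    assert all_samples[s] == group_a_only[s]
```

Under that assumption, dropping group B after the fact gives the same numbers for group A. Any difference between the two approaches would therefore have to come from the upstream steps (denoising, chimera filtering, ASV picking) producing different counts, which is the part we are unsure about.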
Some other posts we’ve found that support our thoughts are as follows:
Poster says that samples combined across runs should be analyzed together as a unified batch instead of separately: https://forum.qiime2.org/t/best-way-to-merge-or-group-runs-samples/2855
Poster says that the OTU table differs for the same batch of samples across different sequencing batches: https://www.biostars.org/p/432062/
Sequence counts varied across the groups, so OTU counts did as well: https://forum.qiime2.org/t/how-to-normalize-bacterial-abundance-for-unequal-sample-size/7358
QIIME2 internally performs normalization, but the post doesn’t say specifically whether taxonomy is affected: https://forum.qiime2.org/t/normalization-use/4708/2