Need to rerun QIIME2 workflow for assigned taxonomy for a subset of samples?

nutrishinn · May 20, 2020, 9:33pm

My labmate, undergraduate research assistant, and I have spent a significant amount of time searching the forum for a direct answer to our question and haven’t had any luck. In our lab, we are currently using QIIME2/2019.4.

We use QIIME2 to analyze microbial data from a variety of clinical trials and in collaboration with other labs. As such, we often only need assigned taxonomy with relative abundances for a subset of samples per plate/run, rather than for all samples. We were under the impression that it is best practice to avoid obtaining relative abundances for all merged samples and then manually omitting samples from the .tsv file in Excel. However, we are confused, because we understand that the compositional nature of relative abundances means they are meaningful within a participant but not between participants (i.e. if we add up the relative abundances within a participant, they will equal 1).

So our question is: What is the proper thing to do? Can we obtain relative abundances for all merged runs through the workflow once to get the assigned taxonomy/relative abundances and manually remove those we do not need from the .tsv file? Or should we filter/re-analyze our sequence data each time we want the assigned taxonomy/relative abundances for a subset of this entire sample?

From what we’ve found, we believe that subsampling a QIIME2 taxonomic run to a specific group isn’t going to make the numbers wrong per se (i.e. not add up to the correct amounts), but it would be better to re-run QIIME2 with just that subsample.

Based on our understanding, QIIME2 denoises and filters the input for chimeras, then picks ASVs, and then produces relative abundances which are basically the # of ASVs detected in a sample divided by total read count for that sample. But if we have a different set of inputs, then we might end up finding a different set of chimeras and ASVs. It is possible that some sequences in combined groups A and B are filtered as chimeras or classified as different ASVs, but not when group A and group B are considered alone - especially if the groups were sequenced differently (i.e. different runs or plates) and not randomly assigned. That’s not to say that the relative abundances of group A+B are wrong if we subsample to just group A, but if we were really only going to look at group A it would be better to do a QIIME2 analysis of just group A since that would remove any conflating effects or factors of group B on the analysis.

Some other posts we've found that support our thoughts are as follows:

Poster says that combining samples across runs should be run together as a unified batch instead of separately: https://forum.qiime2.org/t/best-way-to-merge-or-group-runs-samples/2855
Poster says that OTU table is different for same batch sample with different sequencing batches: https://www.biostars.org/p/432062/
Sequence counts varied across the groups, so OTU counts did as well: https://forum.qiime2.org/t/how-to-normalize-bacterial-abundance-for-unequal-sample-size/7358
QIIME2 internally performs normalization, it doesn't say specifically if taxonomy is affected though: https://forum.qiime2.org/t/normalization-use/4708/2

jwdebelius · May 20, 2020, 11:24pm

Hi @nutrishinn,

There are multiple issues here in terms of running samples. So, let's break down the steps in the bioinformatic workflow and talk about where samples interact and where they don't. (I'm going to assume that you're not dealing with technical replicates and therefore each sample in the table is unique. If you're doing technical replicates, then the "Best way to merge or group runs/samples" thread is probably better advice.)

So, in your pipeline you probably have

1. Demultiplexing and primer removal
2. Denoising
3. Taxonomic assignment
4. Tree building

I'm going to add 4. Tree building because I happen to think that's an important step in this process, too! We're going to ignore the upstream effects, such as how extraction kit, lab, etc. have a large effect on the sample, and just focus on the bioinformatic problem at hand.

Demultiplexing

This is run dependent and done as a per-run unit, but you should assume that you can subset your data here independently. You should try to use a consistent demux approach across your studies, if only because your sequencing center probably has an approach that they like. If you switch sequencing centers, you can re-evaluate. I would run this once, and only once, per sample. (The exception is if you discover something was totally wrong the first time: for example, the barcode sheet got messed up, primers were flipped, etc.)

Pretty much all your downstream analyses will want the primers removed. You should do this early. Store the result so you don't need to reprocess.

Denoising

Denoising defines the set of sequences (composition) of your sample, and you get a table of counts and a representative set of sequences out. The denoising algorithm and parameters have a larger impact on the observed composition than whether or not samples are run together. However, one algorithm is sequencing-run dependent and one is independent.

DADA2

The DADA2 algorithm trains its error prediction/correction model on the sequences that are present in the same run and does chimera selection based on the ASVs that remain. So, it's more likely to be sensitive to sample-to-sample and run-to-run variation. Best practice (AFAIK) for DADA2 is to trim the primers and run the algorithm per sequencing batch because of the way it is trained. (Don't do any additional quality filtering or pre-processing.) If you use consistent batches across different runs, you should be able to combine the reads.

Deblur

The Deblur algorithm uses an upper-limit error model, and if you apply the same parameters to the same sample, it will produce a consistent result whether it's run alone or in combination. Chimera status is inferred from an external database, so it's also less sensitive to other samples. The downsides to Deblur are that it tends to produce lower sequence counts than DADA2 because of the way the algorithm deals with errors, and programmatically, it requires a few more manual steps from you.

Taxonomy Annotation

For a given sequence and classifier, you should get the same taxonomic annotation whether the sequence is alone or with a bunch of other sequences.

Tree Building

Tree building may be dependent on the other sequences involved, but I don't think it's been fully benchmarked. I would assume that it doesn't make a huge difference, and proceed from there. If you're interested in working on a benchmark, though, let me know and I'd love to talk further!

Export and Filtering

At this point, you have a set of samples that contain some sequences associated with some names and maybe a tree. The composition of the individual sample is independent and stable. Now, you want to work with a subset of that run. If you filter samples at this point, the composition of an individual sample won't change.
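To make the point about filtering concrete, here is a minimal sketch in plain Python. The table, sample IDs, and the `relative_abundance` helper are all made up for illustration (this is not QIIME 2's API), and it assumes relative abundance is simply each feature's count divided by the sample's total.

```python
# Toy feature table: sample -> {feature: read count}. All values are invented.
table = {
    "A1": {"asv1": 60, "asv2": 40},
    "A2": {"asv1": 10, "asv3": 90},
    "B1": {"asv2": 25, "asv3": 75},
}

def relative_abundance(counts):
    """Divide each feature's count by that sample's total read count."""
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}

# Relative abundances computed on the full (project A + project B) table.
full = {sample: relative_abundance(c) for sample, c in table.items()}

# Keep only project A samples, then recompute from the filtered counts.
subset_counts = {s: c for s, c in table.items() if s.startswith("A")}
subset = {sample: relative_abundance(c) for sample, c in subset_counts.items()}

# Each retained sample has the same composition either way,
# because the denominator only ever uses that sample's own reads.
same = all(full[s] == subset[s] for s in subset)
print(same)
```

The normalization never looks at any other sample, which is why dropping samples after the fact leaves the remaining rows untouched.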

Okay, so as I come back to your approach,

...I disagree... with some caveats.

Within a single sequencing run for mixed projects A and B, it becomes computationally expensive and slightly obnoxious to have to re-process the data for every new subset of samples, particularly if you're denoising with Deblur. It doesn't affect the taxonomic annotation. It probably doesn't affect the tree enough to justify the computational expense. And processing once limits the need to re-run data every time you change your mind about the subset you want to work with.

If I have a plate that's project A and a plate that's project B, I would run them independently and not combine them at all. I might have a standard processing pipeline, or standard processing considerations, but I probably wouldn't do that. If I wanted to meta-analyze, I might re-process together if my original parameters weren't the same.

If I have a mixed set of samples from project A and project B across plate 1 and plate 2, I would denoise by sequencing run, and then merge the sequences and do a single, pooled taxonomic classification and build my tree.
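The "denoise per run, then merge" idea can be sketched as follows. This is a toy illustration in plain Python, not QIIME 2 code: `run1`, `run2`, and `merge_tables` are hypothetical names, and it assumes both runs were trimmed to the same region and length so that identical sequences end up with identical feature IDs.

```python
# Toy per-run tables (sample -> {ASV sequence: count}),
# as if each sequencing run had been denoised separately.
run1 = {"A1": {"ACGT": 50, "TTGA": 50}, "B1": {"ACGT": 30}}
run2 = {"A2": {"ACGT": 20, "GGCC": 80}, "B2": {"TTGA": 10}}

def merge_tables(*tables):
    """Combine per-run tables; sample IDs are assumed unique across runs."""
    merged = {}
    for table in tables:
        for sample, counts in table.items():
            if sample in merged:
                raise ValueError(f"duplicate sample ID: {sample}")
            merged[sample] = dict(counts)  # copy so the inputs stay untouched
    return merged

merged = merge_tables(run1, run2)

# The pooled set of unique sequences is what would be classified once
# and used to build a single tree.
features = sorted({f for counts in merged.values() for f in counts})
print(features)
```

Classifying the pooled `features` once means every sample, whichever run it came from, shares the same feature-to-taxonomy mapping.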

There are a few places where the subsetting must come first. These will include

Data-driven feature-based filtering. For example, if you only keep features that are present in at least 10 samples, a feature can pass that threshold in the combined table but be present in only one sample of your subset, and that will change your filtered composition.
PCoA. Your between-sample distances are metric, depth, and feature dependent, but they're stable. (The distance in miles from LA to San Diego depends on the route you take and maybe the car you drive, but it's not affected by the distance between San Diego and Santa Fe.) However, the set of samples you choose to project in your PCoA may change their orientation in space.
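Both points in this list can be illustrated with a small, hedged sketch. The table and the helper names `prevalent_features` and `bray_curtis` are invented for this example, the prevalence threshold of 2 stands in for the "10 samples" above, and Bray-Curtis is computed on raw counts with no rarefaction, purely to keep the arithmetic visible.

```python
# Toy feature table: sample -> {feature: read count}. All values are invented.
table = {
    "A1": {"asv1": 60, "asv2": 40},
    "A2": {"asv1": 10, "asv3": 90},
    "B1": {"asv2": 25, "asv3": 75},
    "B2": {"asv1": 50, "asv2": 50},
}

def prevalent_features(table, min_samples):
    """Return features observed (count > 0) in at least min_samples samples."""
    prevalence = {}
    for counts in table.values():
        for feature, n in counts.items():
            if n > 0:
                prevalence[feature] = prevalence.get(feature, 0) + 1
    return {f for f, k in prevalence.items() if k >= min_samples}

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count dictionaries."""
    features = set(u) | set(v)
    num = sum(abs(u.get(f, 0) - v.get(f, 0)) for f in features)
    den = sum(u.get(f, 0) + v.get(f, 0) for f in features)
    return num / den

subset = {s: c for s, c in table.items() if s.startswith("A")}

# Prevalence filtering depends on which samples are in the table.
kept_full = prevalent_features(table, min_samples=2)
kept_subset = prevalent_features(subset, min_samples=2)

# A pairwise distance only depends on the two samples being compared.
d_full = bray_curtis(table["A1"], table["A2"])
d_subset = bray_curtis(subset["A1"], subset["A2"])
print(kept_full != kept_subset, d_full == d_subset)
```

So the order of operations matters for the prevalence filter (subset first, then filter), while the A1 to A2 distance is the same number whether or not project B is in the table. The ordination axes are a separate matter, since they are fit to whichever samples are included.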

Best,
Justine

nutrishinn · May 21, 2020, 9:29pm

Thanks so much for the detailed clarification @jwdebelius! This really clears things up and makes perfect sense, thanks again!

Leila
