Need to rerun the QIIME2 workflow to assign taxonomy for a subset of samples?

Hi @nutrishinn,

There are multiple issues here in terms of running samples, so let's break down the steps in the bioinformatic workflow and talk about where samples interact and where they don't. (I'm going to assume that you're not dealing with technical replicates, and that each sample in the table is therefore unique. If you are working with technical replicates, then Best way to merge or group runs/samples is probably better advice.)

So, in your pipeline you probably have

  1. Demultiplexing and primer removal
  2. Denoising
  3. Taxonomic assignment
  4. Tree Building

I'm going to add 4. Tree building because I happen to think that's an important step in this process, too! We're going to ignore the upstream effects, like how extraction kit, lab, etc. have a large effect on the sample, and just focus on the bioinformatic problem at hand.

Demultiplexing

This is run dependent and done as a per-run unit, but you can assume that you can subset your data here independently. You should try to use a consistent demux approach across your studies, if only because your sequencing center probably has an approach that they like. If you switch sequencing centers, you can re-evaluate. I would run this once, and only once, per sample. (The exception is if you discover something was totally wrong the first time: for example, the barcode sheet got messed up, primers were flipped, etc.)
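For example, if your data happen to be in EMP paired-end format, the demux step is a single command per run; a minimal sketch, where the file names and barcode column are placeholders for whatever your sequencing center hands you:

```bash
# Demultiplex one sequencing run; all names here are placeholders.
qiime demux emp-paired \
  --i-seqs emp-paired-end-sequences.qza \
  --m-barcodes-file sample-metadata.tsv \
  --m-barcodes-column barcode-sequence \
  --o-per-sample-sequences demux.qza \
  --o-error-correction-details demux-details.qza
```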

Pretty much all of your downstream analyses will want the primers removed, so you should do this early and store the result so you don't need to reprocess.
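A minimal sketch of that primer-trimming step with q2-cutadapt, assuming paired-end reads and 515F/806R primers (swap in whatever your protocol actually used):

```bash
# Trim primers once, right after demux, and keep the artifact around.
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f GTGYCAGCMGCCGCGGTAA \
  --p-front-r GGACTACNVGGGTWTCTAAT \
  --p-discard-untrimmed \
  --o-trimmed-sequences demux-trimmed.qza
```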

Denoising

Denoising defines the set of sequences (the composition) of your sample; you get a table of counts and a set of representative sequences out. The denoising algorithm and its parameters have a larger impact on the observed composition than whether or not samples are run together. However, one algorithm is sequencing-run dependent and one is independent.

DADA2

The DADA2 algorithm trains its error prediction/correction model on the sequences present in the same run and does chimera selection based on the ASVs that remain, so it's more likely to be sensitive to sample-to-sample and run-to-run variation. Best practice (AFAIK) for DADA2 is to trim the primers and run the algorithm per sequencing batch because of the way the model is trained. (Don't do any additional quality filtering or pre-processing.) If you use consistent parameters across different runs, you should be able to combine the results.
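A sketch of that per-batch pattern, where the run names and truncation lengths are placeholders you'd pick from each run's own quality profile:

```bash
# Denoise each sequencing run separately so DADA2 trains its error
# model per batch; truncation lengths are run-specific placeholders.
for run in run1 run2; do
  qiime dada2 denoise-paired \
    --i-demultiplexed-seqs ${run}-trimmed.qza \
    --p-trunc-len-f 250 \
    --p-trunc-len-r 180 \
    --o-table ${run}-table.qza \
    --o-representative-sequences ${run}-rep-seqs.qza \
    --o-denoising-stats ${run}-stats.qza
done
```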

Deblur

The Deblur algorithm uses an upper-limit error model, and if you apply the same parameters to the same sample, it will produce a consistent result whether it's run alone or in combination. Chimera status is inferred from an external database, so it's also less sensitive to other samples. The downsides to Deblur are that it tends to produce lower sequence counts than DADA2 because of the way the algorithm deals with errors, and that, programmatically, it requires a few more manual steps from you.
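Those extra steps are basically an explicit quality filter before the denoiser itself; a sketch, assuming 16S data and a 150 nt trim (both placeholders):

```bash
# Deblur expects quality-filtered input, so that step is explicit.
qiime quality-filter q-score \
  --i-demux demux-trimmed.qza \
  --o-filtered-sequences demux-filtered.qza \
  --o-filter-stats filter-stats.qza

# Same parameters + same sample = same result, alone or pooled.
qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-filtered.qza \
  --p-trim-length 150 \
  --p-sample-stats \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-stats deblur-stats.qza
```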

Taxonomy Annotation

For a given sequence and classifier, you should get the same taxonomic annotation whether the sequence is alone or with a bunch of other sequences.
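So you can safely classify everything once, pooled; a sketch with a pre-trained sklearn classifier (the classifier file is a placeholder for whichever reference and region you use):

```bash
# One pooled classification; each sequence's annotation doesn't
# depend on the other sequences in rep-seqs.qza.
qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza
```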

Tree Building

Tree building may be dependent on the other sequences involved, but I don't think it's been fully benchmarked. I would assume that it doesn't make a huge difference, and proceed from there. If you're interested in working on a benchmark, though, let me know and I'd love to talk further!
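If you go the run-it-once route, the usual one-shot pipeline is something like this sketch (output names are placeholders):

```bash
# MAFFT alignment -> masking -> FastTree -> midpoint rooting.
qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza \
  --o-tree unrooted-tree.qza \
  --o-rooted-tree rooted-tree.qza
```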

Export and Filtering

At this point, you have a set of samples that contain some sequences associated with some names and maybe a tree. The composition of the individual sample is independent and stable. Now, you want to work with a subset of that run. If you filter samples at this point, the composition of an individual sample won't change.
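A sketch of that subsetting, assuming your metadata has a project column (a hypothetical name; adjust the where clause to your own metadata):

```bash
# Pull just project A's samples out of the fully processed table.
# The 'project' column is a hypothetical metadata field.
qiime feature-table filter-samples \
  --i-table table.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-where "[project]='A'" \
  --o-filtered-table table-projectA.qza
```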

Okay, so as I come back to your approach,

...I disagree... with some caveats.

Within a single sequencing run with mixed projects A and B, it becomes computationally expensive and slightly obnoxious to have to re-process the data for every new subset of samples, particularly if you're denoising with Deblur. Subsetting after processing doesn't affect the taxonomic annotation, it probably doesn't affect the tree enough to justify the computational expense, and it limits the need to re-run data every time you change your mind about the subset you want to work with.

If I have a plate that's project A and a plate that's project B, I would run them independently and not combine them at all. I might have a standard processing pipeline, or standard processing considerations, but I probably wouldn't process them together. If I wanted to meta-analyze, I might re-process them together if my original parameters weren't the same.

If I have a mixed set of samples from project A and project B across plate 1 and plate 2, I would denoise by sequencing run, and then merge the sequences, do a single, pooled taxonomic classification, and build my tree.
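A sketch of that merge, with placeholder file names from the per-run denoising:

```bash
# Combine per-run outputs before the pooled classification and tree.
qiime feature-table merge \
  --i-tables run1-table.qza run2-table.qza \
  --o-merged-table merged-table.qza

qiime feature-table merge-seqs \
  --i-data run1-rep-seqs.qza run2-rep-seqs.qza \
  --o-merged-data merged-rep-seqs.qza
```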

There are a few places where the subsetting must come first. These include:

  • Data-driven, feature-based filtering. For example, if you're removing features that aren't present in at least 10 samples, a feature that's present in 10 samples in the combined table but only one sample in your subset will survive the combined filter and fail the subset filter, so the order changes your filtered composition. (See the sketch after this list.)
  • PCoA. Your between-sample distances are metric, depth, and feature dependent, but they're stable. (The distance in miles from LA to San Diego depends on the route you take and maybe the car you drive, but it's not affected by the distance between San Diego and Santa Fe.) However, the samples you choose to project in your PCoA may change their orientation in space.
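For the first point, the order matters because the prevalence threshold is evaluated against whatever table you hand it; a sketch (file names are placeholders):

```bash
# Subset first, then filter, so "present in >= 10 samples" is
# judged within project A rather than within the combined table.
qiime feature-table filter-features \
  --i-table table-projectA.qza \
  --p-min-samples 10 \
  --o-filtered-table table-projectA-ms10.qza
```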

Best,
Justine
