What is the best application strategy for analyzing a large-scale dataset?

Hi @Brandon,

If you want the ASVs to be consistent across all your batches, you need to use the same trimming parameters across all the batches. Otherwise, you will get different ASVs. For example, think about a sequence which is 36 characters long:

Sphinx of black quartz, judge my vow

In data set 1 I'm going to trim to 22 characters:

Sphinx of black quartz

In the second, I'll trim to 32 characters:

Sphinx of black quartz, judge my

You can maybe compare the two batches by eye and say that they look similar:

```
>batch 1
Sphinx of black quartz
>batch 2
Sphinx of black quartz, judge my
```

But we don't typically have sentences you can compare easily; we have 100-140 nt sequences. Often, we MD5 hash them to make them more readable. If I do that, I get:

| Original | Hash |
| --- | --- |
| Sphinx of black quartz | a52ba90095471cc6b2a8c989cb0c3a1a |
| Sphinx of black quartz, judge my | 81eec2b8bc66b29cdffac927ab267002 |

And now we have super different hash sequences and we're identifying super different features! The data sets may play nicely in phylogenetic metrics (unweighted UniFrac, weighted UniFrac), but you won't be able to use ASV-level beta diversity like Jaccard distance or Bray-Curtis dissimilarity, or do feature-based analysis, because different batches will have different feature names and they won't be compatible.
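
If you want to see this for yourself, here's a minimal Python sketch (standard library only) that trims the same "sequence" to the two lengths above and MD5 hashes each one; the point isn't the exact values, just that the IDs diverge:

```python
# Minimal sketch: the same "sequence" trimmed to two different lengths gets
# two completely different MD5 feature IDs.
import hashlib

sequence = "Sphinx of black quartz, judge my vow"  # stand-in for an ASV sequence

batch1 = sequence[:22]   # trimming used in data set 1
batch2 = sequence[:32]   # trimming used in data set 2

for label, seq in [("batch 1", batch1), ("batch 2", batch2)]:
    feature_id = hashlib.md5(seq.encode("utf-8")).hexdigest()
    print(f"{label}: {seq!r} -> {feature_id}")
```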

So, you either have to sacrifice your taxonomic resolution (length) in your better data or your read count (depth) in your worse data. Sometimes, you have to balance the two.

DADA2 is designed to be run per sequencing run. I'm assuming you have sequencing run designations. There's a stochastic element to the way DADA2 runs, but if you denoise by sequencing batch, it's wrapped into a single confounding variable. I would not recommend mixing across sequencing runs because that will violate the assumptions of the algorithm and lead to worse performance.
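
As a rough sketch of what that can look like, assuming the QIIME 2 Artifact API and demux artifacts already split by run (the file paths and truncation values below are placeholders, not recommendations):

```python
# Sketch only: per-run denoising with identical trimming parameters, then a merge.
from qiime2 import Artifact
from qiime2.plugins import dada2, feature_table

TRIM_LEFT = 0     # must be identical for every run
TRUNC_LEN = 150   # must be identical for every run

run_paths = ["run1-demux.qza", "run2-demux.qza", "run3-demux.qza"]  # hypothetical

tables, rep_seqs = [], []
for path in run_paths:
    demux = Artifact.load(path)
    # DADA2 learns its error model per sequencing run, so each run is denoised separately
    result = dada2.actions.denoise_single(
        demultiplexed_seqs=demux,
        trim_left=TRIM_LEFT,
        trunc_len=TRUNC_LEN,
    )
    tables.append(result.table)
    rep_seqs.append(result.representative_sequences)

# Because the trimming parameters match, the ASV IDs are comparable across runs
# and the per-run tables can be merged into one feature table.
merged_table, = feature_table.actions.merge(tables=tables)
merged_seqs, = feature_table.actions.merge_seqs(data=rep_seqs)
```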

Anecdotally, any issues are noise within the larger trash fire that is microbiome biases, and not something I'd worry about as much as, say, trimming parameters; training on the wrong set of samples leading to bad error correction; or copious pre-processing that would violate the DADA2 assumptions.

That was not what I suggested. I said to look at the 5 batches with the worst sequencing quality.

I assume that your data is already batched technically based on a sequencing run designation. That's the unit DADA2 looks for, and it won't hurt if you use Deblur to manage them that way. If there are denoising effects, they're rolled into the sequencing run effects, and then you only have one random intercept/adjustment you need to think about in modeling.
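
For the modeling side, a minimal sketch of what "one random intercept" can look like, using a statsmodels mixed model; the metadata column names here (shannon, treatment, sequencing_run) are hypothetical:

```python
# Sketch: one random intercept per sequencing run absorbs the combined
# run + denoising batch effect. Column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

metadata = pd.read_csv("sample-metadata.tsv", sep="\t")  # hypothetical file

model = smf.mixedlm(
    "shannon ~ treatment",             # e.g. alpha diversity vs. treatment
    data=metadata,
    groups=metadata["sequencing_run"]  # random intercept per sequencing run
)
fit = model.fit()
print(fit.summary())
```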

Computationally, this is both a pleasantly parallel problem and one that begs for a high-performance computer. I'd recommend contacting your local HPC group or renting one from someplace like EC2. Your processing would only require technical metadata (sample name, sample plate).
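
To illustrate the "pleasantly parallel" part: each sequencing run can be denoised completely independently, so runs map cleanly onto separate HPC jobs or worker processes. A toy sketch, where denoise_one_run is a hypothetical wrapper around whatever per-run command you settle on:

```python
# Each sequencing run is independent, so they can be processed in parallel.
from concurrent.futures import ProcessPoolExecutor

def denoise_one_run(run_id: str) -> str:
    # Hypothetical placeholder: denoise this run's demultiplexed reads and
    # return the path of the resulting feature table.
    return f"{run_id}-table.qza"

run_ids = ["run1", "run2", "run3"]  # hypothetical run designations

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=3) as pool:
        table_paths = list(pool.map(denoise_one_run, run_ids))
    print(table_paths)
```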

Best,
Justine
