What is the best strategy for analyzing a large-scale dataset?

Hi QIIME2 team,

I am planning to analyze a large-scale dataset with over 10,000 samples, all paired-end 16S data from different sequencing batches. The sequencing quality varies between batches, with some showing higher quality and others lower, leading to differences in read length after quality control. I would appreciate your suggestions on how best to approach this analysis in QIIME 2 with DADA2 or Deblur denoising.

Thank you!

Hi @Brandon,

My recommendation would be to find the 5 batches with the lowest quality reads and set your trimming parameters based on those. I would then apply those same trimming parameters across every batch you have. If you don't have a 515F-806R 2x150 primer set, you're confident in the chain of custody/prior processing of your data, and you have the compute to burn, I'd probably use DADA2, TBH. I think you'll be happier with the results.

The reason you optimize for the worst data is that your ability to have consistent ASVs depends on having consistent processing parameters. So, if you trim one batch to 100 nt and another to 150 nt, the ASV sequences will be different and you won't be able to combine them directly.
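If it helps, a rough sketch of what I mean: summarize each of your worst runs and read the quality plots before committing to a trim length. The file names below are just placeholders for your own per-run artifacts.

# Summarize one of the lowest-quality runs; the interactive quality plots in
# the resulting .qzv are what you use to pick truncation lengths that every
# batch can tolerate.
qiime demux summarize \
  --i-data demux-run01.qza \
  --o-visualization demux-run01.qzv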

Good luck, and enjoy your big data set!

Best,
Justine


Hi @jwdebelius,

Thank you for your suggestions; they’re quite reasonable and helpful! I have a few follow-up questions and would love to hear more of your thoughts.

  1. If I apply the same trimming cutoff across all batches, the higher-quality batches might be fine. However, the lower-quality batches might need to be trimmed down to 100 bp, while I'd ideally trim the higher-quality ones to 140 bp. Would the results still be reliable if the lengths differ, or would it be safer to keep all read lengths at 100 bp?
  2. DADA2 denoises based on self-training of the reads. If I divide the dataset into 5 batches, could this lead to batch effects due to the data processing?
  3. Regarding your suggestion to split 10k samples into 5 batches, does this imply that 2k samples is the computational limit?

Thank you very much!

Hi @Brandon,

If you want the ASVs to be consistent across all your batches, you need to use the same trimming parameters across all of them. Otherwise, you will get different ASVs. For example, think about a sequence that is 36 characters long:

Sphinx of black quartz, judge my vow

In data set 1 I'm going to trim to 22 characters:

Sphinx of black quartz

In the second, I'll trim to 32 characters:

Sphinx of black quartz, judge my

You can maybe compare the two batches by eye and say they look similar:

>batch 1
Sphinx of black quartz
>batch 2
Sphinx of black quartz, judge my

But we don't typically have sentences you can compare easily; we have 100-140 nt sequences. Often, we md5 hash them to make them more readable. If I do that, I get:

Original                         | md5 hash
Sphinx of black quartz           | a52ba90095471cc6b2a8c989cb0c3a1a
Sphinx of black quartz, judge my | 81eec2b8bc66b29cdffac927ab267002
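(You can see the same effect on the command line; the -n flag keeps echo from appending a newline, which would change the hash.)

# Hash the two differently trimmed versions of the "same" sequence; the
# resulting IDs have nothing in common, even though one is a prefix of the other.
echo -n "Sphinx of black quartz" | md5sum
echo -n "Sphinx of black quartz, judge my" | md5sum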

And now we have super different hashes and we're identifying super different features! The data sets may play nicely in phylogenetic metrics (unweighted UniFrac, weighted UniFrac), but you won't be able to use ASV-level beta diversity like Jaccard distance or Bray-Curtis dissimilarity, or do feature-based analysis, because different batches will have different feature names and they won't be compatible.

So, you either have to sacrifice your taxonomic resolution (length) in your better data or your read count (depth) in your worse data. Sometimes, you have to balance the two.

DADA2 is designed to be run per sequencing run. I'm assuming you have sequencing run designations. There's a stochastic element to the way DADA2 runs, but if you denoise by sequencing batch, it's wrapped into a single confounding variable. I would not recommend mixing across sequencing runs, because that will violate the assumptions of the algorithm and lead to worse performance.
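As a sketch of what per-run denoising with shared parameters looks like (the run names, truncation lengths, and thread count below are placeholders you'd swap for your own):

# Denoise every sequencing run separately, but with identical trimming, so the
# resulting ASVs are directly comparable across runs.
for run in run01 run02 run03; do
  qiime dada2 denoise-paired \
    --i-demultiplexed-seqs demux-${run}.qza \
    --p-trim-left-f 0 \
    --p-trim-left-r 0 \
    --p-trunc-len-f 140 \
    --p-trunc-len-r 140 \
    --p-n-threads 8 \
    --o-table table-${run}.qza \
    --o-representative-sequences rep-seqs-${run}.qza \
    --o-denoising-stats stats-${run}.qza
done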

Anecdotally, any issues are noise within the larger trash fire that is microbiome biases, and not something I'd worry about as much as, say, trimming parameters; training on the wrong set of samples leading to bad error correction; or copious pre-processing that would violate the DADA2 assumptions.

That was not what I suggested. I said to look at the 5 batches with the worst sequencing quality.

I assume that your data is already batched technically based on a sequencing run designation. That's the unit DADA2 looks for, and it won't hurt if you use Deblur to manage them that way. If there are denoising effects, they're rolled into the sequencing effects, and then you only have one random intercept/adjustment you need to think about in modeling.

Computationally, this is both a pleasantly parallel problem and one that begs for a high-performance computer. I'd recommend contacting your local HPC group or renting one from someplace like EC2. Your processing would only require technical metadata (sample name, sample plate).
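Since each run is independent, you can farm the denoising jobs out across nodes and merge the per-run outputs afterwards; something along these lines, again with placeholder file names:

# Combine the per-run feature tables and representative sequences into single
# artifacts for downstream analysis.
qiime feature-table merge \
  --i-tables table-run01.qza \
  --i-tables table-run02.qza \
  --i-tables table-run03.qza \
  --o-merged-table merged-table.qza
qiime feature-table merge-seqs \
  --i-data rep-seqs-run01.qza \
  --i-data rep-seqs-run02.qza \
  --i-data rep-seqs-run03.qza \
  --o-merged-data merged-rep-seqs.qza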

Best,
Justine


@jwdebelius Thank you!!!! :partying_face: Pretty helpful!!