It is a little unclear to me how to best utilize DADA2. According to the FMT tutorial, "the DADA2 denoising process is only applicable to a single sequencing run at a time, so we need to run this on a per sequencing run basis and then merge the results" for multiple sequencing runs. However, the forum topic DADA2 multiple runs with different number of samples discussed analyzing one sample at a time within a single sequencing run in DADA2.
I understand that I do not want to analyze more than one sequencing run at a time in DADA2, but if I have 32 samples in a single sequencing run, can I analyze each sample separately in DADA2 and then merge the 32 DADA2 results afterwards, rather than analyzing all 32 samples in one DADA2 run, so that I can take advantage of my high-throughput computing cluster? Or would that not be recommended because of the way the DADA2 error model works?
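For reference, my understanding of the per-run workflow from the FMT tutorial is roughly the following sketch (the file names, the denoise-single command, and the trimming parameters are just placeholders, not my actual commands):

```
# Denoise each sequencing run separately (placeholder file names and parameters)
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux-run1.qza \
  --p-trim-left 0 \
  --p-trunc-len 150 \
  --o-table table-run1.qza \
  --o-representative-sequences rep-seqs-run1.qza \
  --o-denoising-stats stats-run1.qza

qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux-run2.qza \
  --p-trim-left 0 \
  --p-trunc-len 150 \
  --o-table table-run2.qza \
  --o-representative-sequences rep-seqs-run2.qza \
  --o-denoising-stats stats-run2.qza

# Merge the per-run results afterwards
qiime feature-table merge \
  --i-tables table-run1.qza \
  --i-tables table-run2.qza \
  --o-merged-table merged-table.qza

qiime feature-table merge-seqs \
  --i-data rep-seqs-run1.qza \
  --i-data rep-seqs-run2.qza \
  --o-merged-data merged-rep-seqs.qza
```

What I am really asking is whether that same merge step could also be used to combine per-sample DADA2 results from within a single run.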
Let's say you did one MiSeq run containing 32 samples. As far as I understand, you run DADA2 ONCE; DADA2 does the analysis per sample and outputs a frequency table containing the information for all 32 samples. You do NOT need to run DADA2 32 times (once per sample). Hopefully I have (correctly) answered your question.
Thanks for your quick response, @Richard_Rodrigues1! You are understanding my question correctly.
I understand that I can run DADA2 once for all 32 samples, but doing that can take a long time depending on the sequence quality. I want to (potentially) speed up the process by taking advantage of the high-throughput computing cluster and analyzing each sample in DADA2 separately. What I want to know is:
Is this recommended?
Is the error model for DADA2 different depending on whether I run each of the 32 samples separately or run all 32 samples as one DADA2 analysis?
You can run all samples together (once); I think DADA2 parallelizes based on the --p-n-threads argument. Use --p-n-threads 0 to use all available cores. This will save you time compared to running DADA2 separately on each sample.
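For example, something along these lines (the input file name and the trim/truncation values are placeholders for your own data; the --p-n-threads part is the point):

```
# One DADA2 call for the whole run; --p-n-threads 0 uses all available cores
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 150 \
  --p-n-threads 0 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats stats.qza
```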
I do know it will affect --p-chimera-method. I am not sure about the error model, but from what I have read it seems DADA2 uses the information from a single (full) run to build the error model, so running the 32 samples together (or, really, all samples in the MiSeq run) would make sense.
Again, please wait to hear from the QIIME 2 developers about these questions.
I am aware of those parameters and have used them, but DADA2 still takes a long time, which is why I am wondering whether there is a disadvantage to analyzing each sample separately in the DADA2 step and then combining all samples in downstream analyses.
With only 32 samples, going through an HTC shouldn't take too long. What kind of wait-times are you experiencing?
This is essentially how DADA2 already works: it builds the error model using all the samples from the same run, then uses that model to denoise each sample separately and independently, which is why it parallelizes so nicely.
If you still really wanted to run DADA2 separately on each sample, it may be technically doable, not in the QIIME 2 version but possibly with DADA2 in R (I'm not sure; you could check with the DADA2 forum), although I don't know whether it makes sense to do so. If your goal is to divide the workload across n cores (through different cluster nodes), this would be roughly the same as just setting --p-n-threads n. But as @Richard_Rodrigues1 mentioned (thanks for that, by the way!), you would not be able to properly use the chimera removal methods, since those depend on the collective output.
Alternatively, check out the DADA2 tutorial on dealing with big data; perhaps that is more what you are looking for?
7 days is certainly abnormally long for 32 samples, especially if running in parallel. On a PC running a virtual machine with 8 GB of memory and 6 dedicated cores, I've completed ~80 samples (4-6 million reads) in ~12 hours. Of course this can be highly variable, but 7 days still seems pretty atypical; there may be something else causing this issue. How big are the files, how many sequences are you working with across those 32 samples, and how long are the reads? What kind of machine were you running this on (CPU, memory, and the number of cores dedicated to the task, i.e. --p-n-threads)?
You can certainly try running it in R, though you'll probably want to make sure your R environment is set up for parallel tasks. That link should hopefully explain this. Keep us posted!
We have an HTCondor computing cluster at our facility. Each of the 12 Linux nodes in the cluster has 38 CPUs, 4 TB of storage, and 128 GB of RAM. I usually specify 38 for --p-n-threads. It could very well be that I am using our HTCondor cluster incorrectly, or that maybe I should be using a high-performance computer instead so that I can increase --p-n-threads...
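To show what I mean, here is a minimal sketch of how I understand a job like this gets submitted; this is not my actual setup, and the wrapper script name, resource requests, and file names are made up for illustration. The main idea is just that request_cpus should match whatever I pass to --p-n-threads:

```
# Illustrative only: write a minimal HTCondor submit description and submit it.
# run_dada2.sh is a hypothetical wrapper that activates QIIME 2 and runs
# qiime dada2 denoise-* with --p-n-threads 38 (matching request_cpus below).
cat > dada2.sub <<'EOF'
universe       = vanilla
executable     = run_dada2.sh
request_cpus   = 38
request_memory = 64GB
log            = dada2.log
output         = dada2.out
error          = dada2.err
queue
EOF

condor_submit dada2.sub
```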
I'm working with between 70 million and over 250 million reads, so it makes sense to me that it would take a lot longer to analyze my data. This is why I am asking whether analyzing per sample vs. per sequencing run makes sense, and whether the error model is affected at all...
Makes sense — I think you may just need to wait this out.
I would not advise splitting. I can't say for sure what impacts this would have on the error model, but chimera filtering and singleton filtering would be impacted.