Dear Qiimers!
I am excited to be working with a large metagenomic dataset (>400 samples).
I am facing some difficulties running it, mostly due to the large size of the imported reads (1.6 TB). Am I safe to assume that splitting this dataset into smaller chunks for taxonomy annotation with kraken2, then merging the resulting taxonomy files and feature tables, will be equivalent to running all samples together?
PS, for the devs: did you consider, as an alternative, importing each pair of metagenomic read files as a separate .qza in its own directory, instead of pooling all samples together, to reduce the memory and storage required to handle the data?
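Concretely, for the splitting approach I was planning something like the sketch below. The file names and batch size are made up for illustration; the `--type` and `--input-format` values are the standard QIIME 2 ones for a paired-end FASTQ manifest.

```bash
# Sketch: split one big FASTQ manifest into batches and import each batch
# as its own .qza. Manifest path and batch size (~50 samples) are placeholders.
head -n 1 manifest.tsv > header.tsv             # keep the manifest header
tail -n +2 manifest.tsv | split -l 50 - batch_  # one chunk of rows per batch

for b in batch_*; do
  cat header.tsv "$b" > "manifest_${b}.tsv"     # re-attach the header
  qiime tools import \
    --type 'SampleData[PairedEndSequencesWithQuality]' \
    --input-path "manifest_${b}.tsv" \
    --input-format PairedEndFastqManifestPhred33V2 \
    --output-path "reads_${b}.qza"
done
```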
I can't speak for others, but I typically run a smaller number of metagenomic samples in parallel. Most of the big pipelines (metaphlan, kraken2) can be run one sample at a time, so there should be no issue in splitting them up.
I'm not a huge fan of the idea of processing each sample individually, but I think some kind of batching (by sequencing plate? by sequencing run?) would be a more efficient way to go.
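For reference, outside of QIIME 2 the per-sample pattern I mean looks roughly like this; the database path and file layout are placeholders, but the kraken2 flags are the standard ones. Since each sample is classified independently, batching is just a matter of which samples you loop over.

```bash
# Hypothetical per-sample kraken2 loop; $DB and the reads/ layout are placeholders.
DB=/path/to/kraken2-db
mkdir -p reports outputs

for r1 in reads/*_R1.fastq.gz; do
  sample=$(basename "$r1" _R1.fastq.gz)
  kraken2 --db "$DB" --threads 8 --paired \
    --report "reports/${sample}.k2report" \
    --output "outputs/${sample}.k2out" \
    "$r1" "reads/${sample}_R2.fastq.gz"
done
```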
Thank you for the response!
Yes, usually outside of QIIME 2 I run my samples in parallel. Just wanted to make sure that I am not messing up by doing the same in QIIME 2!
This is an assumption based on how the tool runs outside of QIIME 2!
Plus, we batch and merge with DADA2 and Deblur all the time!
I guess the same caveats apply: make sure the parameters are consistent across runs, and don't split up steps that depend on seeing the pooled data. (Like, it may matter a lot more in co-assembly, but I think that's done on a per-gradient basis and won't be a concern here. IDK. I aspire to assembly.)
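And if it helps, the merge itself is just the standard q2-feature-table actions; the per-batch file names below are placeholders. As long as every batch was classified against the same database with the same parameters, the merged table and taxonomy should match a single big run.

```bash
# Hypothetical merge of per-batch outputs; batch file names are placeholders.
qiime feature-table merge \
  --i-tables table_batch_aa.qza \
  --i-tables table_batch_ab.qza \
  --o-merged-table table_merged.qza

qiime feature-table merge-taxa \
  --i-data taxonomy_batch_aa.qza \
  --i-data taxonomy_batch_ab.qza \
  --o-merged-data taxonomy_merged.qza
```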