Large metagenomic dataset

timanix · December 6, 2023, 8:17am

Dear Qiimers!
I am happy to play with a large metagenomic dataset (> 400 samples).
I am facing some difficulties running it, mostly due to the large size of the imported reads (1.6 TB). Am I safe to assume that splitting this dataset into smaller chunks for taxonomy annotation with kraken2, and then merging the resulting taxonomy files and feature tables, will be equivalent to running all samples together?

PS: For devs: Did you consider as an alternative importing each pair of metagenomic reads as a separate .qza file to a separate directory instead of pooling all samples together to decrease the amount of memory and storage required to handle the data?

jwdebelius · December 6, 2023, 3:19pm

Hi @timanix,

I can't speak for others, but I typically run a smaller number of metagenomic samples in parallel. Most of the big pipelines (metaphlan, kraken2) can be run a single sample at a time, so there should be no issue in splitting them up.

I'm not a huge fan of the idea of having each sample processed separately, but I think maybe some kind of batching (sequencing plate? sequencing run?) would be a more efficient way to go.
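
As a rough illustration of the batching idea, here is a minimal Python sketch that groups sample IDs either by a metadata key (such as sequencing run) or into fixed-size chunks. The function names and the `sample_to_run` mapping are hypothetical and not part of QIIME 2; each resulting batch would then be imported and classified on its own.

```python
from collections import defaultdict


def batch_by_key(sample_to_run):
    """Group sample IDs by a metadata value, e.g. sequencing run or plate.

    `sample_to_run` is a hypothetical mapping of sample ID -> run label.
    """
    batches = defaultdict(list)
    for sample_id, run in sample_to_run.items():
        batches[run].append(sample_id)
    # Sort IDs so the batches are reproducible between invocations.
    return {run: sorted(ids) for run, ids in batches.items()}


def batch_by_size(sample_ids, size):
    """Split sample IDs into consecutive chunks of at most `size` samples."""
    if size < 1:
        raise ValueError("size must be at least 1")
    ids = sorted(sample_ids)
    return [ids[i:i + size] for i in range(0, len(ids), size)]
```

Batching by run or plate keeps samples that share technical conditions together, while fixed-size chunks are easier to fit to a given disk or memory budget.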

I'll leave this for other devs to answer though!

Best,
Justine

timanix · December 6, 2023, 3:32pm

Thank you for the response!
Yes, usually outside of Qiime2 I run my samples in parallel. Just wanted to make sure that I am not messing up by doing the same in Qiime2!

jwdebelius · December 6, 2023, 5:56pm

This is an assumption based on how the tool runs outside of QIIME 2!
Plus, we batch and merge with DADA2 and Deblur all the time!
I guess the same caveats apply: make sure the parameters are consistent across the runs, and don't do things that depend on the internal data separately. (Like, it may matter a lot more in co-assembly, but I think that's done on a per-gradient basis and won't be a concern here. IDK. I aspire to assembly.)
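
To make those caveats concrete, here is a small Python sketch of the merge step under some assumptions: each batch's feature table is represented as a plain nested dict (`{sample_id: {feature_id: count}}`), and the parameters used for each batch are recorded as a dict. Both representations and the function names are hypothetical stand-ins for illustration; with real QIIME 2 artifacts you would use the merge actions in the feature-table plugin instead.

```python
def check_consistent(params_per_batch):
    """Raise if any batch was run with parameters that differ from the first."""
    if not params_per_batch:
        return
    first = params_per_batch[0]
    for index, params in enumerate(params_per_batch[1:], start=1):
        if params != first:
            raise ValueError(f"batch {index} used different parameters: {params}")


def merge_tables(tables):
    """Merge per-batch tables of the form {sample_id: {feature_id: count}}.

    Samples are assumed to be unique across batches; features missing from a
    sample are filled with zero so every sample has the same feature set.
    """
    merged = {}
    for table in tables:
        overlap = merged.keys() & table.keys()
        if overlap:
            raise ValueError(f"samples present in more than one batch: {sorted(overlap)}")
        for sample_id, counts in table.items():
            merged[sample_id] = dict(counts)
    features = sorted({f for counts in merged.values() for f in counts})
    return {
        sample_id: {f: counts.get(f, 0) for f in features}
        for sample_id, counts in merged.items()
    }
```

Because each sample is classified independently, merging only needs to take the union of samples and features; refusing overlapping sample IDs guards against accidentally processing the same sample in two batches.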

gregcaporaso · December 6, 2023, 7:56pm

We also have several optimizations that will help with very large sequence datasets like this, coming soon. These include:

support for doing less validation of fastq.gz files when importing, which should drastically reduce the runtime of importing;
support for importing directly to an artifact-cache, which allows you to bypass zipping the .qza;
parallel support in classify-kraken2, as well as some other actions, that will take care of the splitting and then joining for you.

Stay tuned!

timanix · December 7, 2023, 8:43am

That sounds like a significant improvement!
Good to know, I will wait for the new release.

system · January 7, 2024, 2:43pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.