Making tables for 1000+ samples

A general question about strategy and approach. I am particularly interested in making tables for samples from both short-read shotgun metagenomics and 16S rRNA sequencing experiments, and comparing them. After much research, it seems QIIME 2 and Greengenes2 can enable me to make such tables.

Some of my short-read shotgun metagenomics data sets consist of over 1000 samples. For other data sets with only a few dozen samples, I already see that QIIME 2 (the DADA2 step) takes a very long time. So I am wondering whether making tables for short-read shotgun data is even feasible. Or rather, it must be feasible, because the corresponding publications already did it, but I cannot use those published tables because they were not built against Greengenes2.

Are there any tricks, advice, or general approaches for making tables for studies with 1000+ samples of shotgun reads? I am worried that computing them will take an absurd amount of time. I am using an HPC cluster; RAM and CPUs are available.

P.S.: By "making tables" I mean starting from FASTQ files and producing tables where the rows are species and the columns are samples (patients). This is about the human microbiome (gut, saliva...).

P.P.S.: Where I'm coming from: I have collected fastq for several studies, some shotgun, some 16S. I have experience in RNA-seq but am new to microbiomics.

Hello again,

This is a great question! Let's start here.

Welcome! I'm glad you found the forums. :qiime2:

The workflow for amplicons is very similar to RNA-seq, with just a few key differences. Because you already know the shape of this field, I think you are ready for the QIIME 2 overview, which covers both how QIIME 2 works (compared to typical Linux programs) and the data flow for amplicon analysis.
https://docs.qiime2.org/2023.9/tutorials/overview/

Keep separate things separate, and merge only when needed.

For example, process all your shotgun data separately from your amplicon data.
Also, process each Illumina sequencing run separately.

This is what DADA2 recommends for quality, because its error model is learned from each sequencing run, and it also means you can process your runs in parallel.
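
To make that concrete, here is a minimal sketch in Python, assuming 16S amplicon data with one directory per sequencing run, each holding a `demux.qza` artifact. The directory layout, file names, truncation lengths, and thread count are my assumptions, not something from your data. It only builds the `qiime dada2 denoise-paired` command for each run and the `qiime feature-table merge` command that combines the per-run tables afterwards; on a cluster you would submit each denoise command as its own job. DADA2 is an amplicon method, so this does not apply to the shotgun samples.

```python
from pathlib import Path


def denoise_command(run_dir: Path, threads: int = 8) -> list[str]:
    """Build a `qiime dada2 denoise-paired` command for one sequencing run."""
    return [
        "qiime", "dada2", "denoise-paired",
        # Assumed layout: each run directory contains its own demux.qza.
        "--i-demultiplexed-seqs", str(run_dir / "demux.qza"),
        # Placeholder truncation lengths: pick real values from the quality
        # plots and keep them identical across runs you plan to merge.
        "--p-trunc-len-f", "240",
        "--p-trunc-len-r", "200",
        "--p-n-threads", str(threads),
        "--o-table", str(run_dir / "table.qza"),
        "--o-representative-sequences", str(run_dir / "rep-seqs.qza"),
        "--o-denoising-stats", str(run_dir / "stats.qza"),
    ]


def merge_command(run_dirs: list[Path], out: Path) -> list[str]:
    """Build a `qiime feature-table merge` command over the per-run tables."""
    cmd = ["qiime", "feature-table", "merge"]
    for run_dir in run_dirs:
        # --i-tables is repeated once per table to merge.
        cmd += ["--i-tables", str(run_dir / "table.qza")]
    cmd += ["--o-merged-table", str(out)]
    return cmd


if __name__ == "__main__":
    # Hypothetical run directories; replace with your own.
    runs = [Path("run01"), Path("run02")]
    for run in runs:
        # Each of these is independent, so each can be its own cluster job.
        print(" ".join(denoise_command(run)))
    print(" ".join(merge_command(runs, Path("merged-table.qza"))))
```

If you merge runs this way, use the same truncation and trimming settings for every run so the resulting sequences are comparable. The representative sequences can be combined in the same spirit with `qiime feature-table merge-seqs`.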

And finally: keep posting! We are here to help.
