Large Multi-run Project

Mahasti · December 18, 2021, 4:40am

Hello everyone,

I run a project that consists of over 2,000 fecal samples. This project is ongoing, and the aim is to build a database whereby we can compare the microbial profile of a new sample to the existing average, and add the new sample to the database, and so on.
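To make the goal concrete, here is a minimal sketch of what "compare to the existing average, then add the sample" could look like. It is only an illustration: the `ReferenceProfile` class and `relative_abundance` helper are hypothetical names, not part of QIIME2, and using Bray-Curtis on relative abundances as the comparison is an assumption, not our actual protocol.

```python
def relative_abundance(counts):
    """Convert raw feature counts to proportions that sum to 1."""
    total = sum(counts.values())
    return {f: c / total for f, c in counts.items()} if total else {}


class ReferenceProfile:
    """Running mean of relative-abundance profiles (hypothetical helper)."""

    def __init__(self):
        self.n = 0       # number of samples folded into the mean
        self.mean = {}   # feature ID -> mean relative abundance

    def distance_to(self, profile):
        # Bray-Curtis between two profiles that each sum to 1 reduces to
        # half the summed absolute differences. Only meaningful once at
        # least one sample has been added.
        features = set(self.mean) | set(profile)
        return 0.5 * sum(
            abs(self.mean.get(f, 0.0) - profile.get(f, 0.0)) for f in features
        )

    def add(self, profile):
        # Incremental mean update: no need to revisit earlier samples.
        self.n += 1
        for f in set(self.mean) | set(profile):
            old = self.mean.get(f, 0.0)
            self.mean[f] = old + (profile.get(f, 0.0) - old) / self.n


reference = ReferenceProfile()
reference.add(relative_abundance({"ASV_A": 10, "ASV_B": 10}))
reference.add(relative_abundance({"ASV_A": 20}))

new_sample = relative_abundance({"ASV_B": 5})
distance = reference.distance_to(new_sample)  # compare first...
reference.add(new_sample)                     # ...then fold it in
```

The running mean itself is cheap to update; the expensive part in practice is everything upstream of it.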

I currently analyze these samples with a protocol similar to the Fecal Microbiota Transplant (FMT) study tutorial in QIIME2. Each Illumina 16S sequencing run is filtered and denoised separately with DADA2; the runs are then merged and analyzed with core diversity metrics and taxonomic classification (statistics are done externally in SAS and R). Right now, our external QIIME2 dedicated server can handle this dataset, but the analysis is becoming time-consuming due to file size.

My question is how to build a microbiome database, and what steps to take to properly analyze a project of this scale. Can QIIME2 handle something like this in the long run? Do I have to keep re-analyzing the entire dataset with each additional run?

I would appreciate any advice or directions to resources to accomplish something like this.

Thank you in advance QIIME2 developers and community! I really appreciate all of your hard work!

Mehrbod_Estaki · December 18, 2021, 6:53am

Hi @Mahasti,
Based on everything you said, I think you should really check out Qiita if you haven't already.
There you can start a project and continuously add new samples to it as they come in. In fact, this is how the American Gut Project is run, and it currently has 33,656 samples in it. You can process and analyze right there online as well, so there is no issue with resources. Qiita uses Deblur instead of DADA2 for denoising. For analysis, Qiita uses QIIME 2 under the hood, so you'll be able to run all the core analyses you do with Q2 over there too. For analysis outside of Qiita (e.g., in SAS or R), you can just download the artifacts of interest (they will be Q2 artifacts) and carry on with your regular pipeline.

As for re-analyzing the entire dataset with each additional run: some parts yes, some parts no. For example, denoising (e.g., DADA2 or Deblur) only needs to be run on the new samples, but if you wanted to do OTU clustering or calculate distance matrices, you would have to rerun those steps on everything once the new samples are added.
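A rough sketch of that split, in plain Python rather than actual QIIME 2 calls: each run's denoised table is produced once and kept, merging is cheap, and only the distance step runs over the full merged table. The function names and toy counts below are made up for illustration. In QIIME 2 itself, the merge step corresponds to `qiime feature-table merge` (and `qiime feature-table merge-seqs` for the representative sequences).

```python
from itertools import combinations


def merge_tables(*tables):
    """Merge per-run feature tables (sample ID -> {feature ID -> count})."""
    merged = {}
    for table in tables:
        for sample_id, counts in table.items():
            if sample_id in merged:
                raise ValueError(f"duplicate sample ID: {sample_id}")
            merged[sample_id] = dict(counts)
    return merged


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count profiles."""
    features = set(a) | set(b)
    numerator = sum(abs(a.get(f, 0) - b.get(f, 0)) for f in features)
    denominator = sum(a.values()) + sum(b.values())
    return numerator / denominator if denominator else 0.0


def distance_matrix(table):
    """Pairwise distances over every sample in the (merged) table."""
    return {
        (s1, s2): bray_curtis(table[s1], table[s2])
        for s1, s2 in combinations(sorted(table), 2)
    }


# Already-denoised runs are reused as-is; only run2 is new work.
run1 = {"S1": {"ASV_A": 10, "ASV_B": 5}, "S2": {"ASV_A": 3, "ASV_C": 7}}
run2 = {"S3": {"ASV_B": 4, "ASV_C": 6}}

merged = merge_tables(run1, run2)
distances = distance_matrix(merged)  # recomputed over all samples
```

Note that merging denoised runs like this assumes the feature IDs are comparable across runs, which in practice means using the same trimming and truncation parameters for every run.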

Mahasti · December 19, 2021, 6:22pm

Hi @Mehrbod_Estaki,

Thank you for your quick reply and for the information! I have never heard of Qiita before, so I will check that out!