Detriments of running samples in batches

Hello all,

I am in the process of running 25 fairly large samples, and unfortunately my computer is unable to complete the "qiime phylogeny align-to-tree-mafft-fasttree" step. It appears that the computer runs out of memory while creating the distance matrix.

As a result, I thought that I could run the samples in batches. The 25 samples come from 5 sites, with 5 samples collected at each site. I now know that my computer can complete all the desired steps on 10 of the 25 samples; for context, my workflow runs from importing the data in .fastq format all the way to a taxa bar plot. Essentially, my question is: what are the detriments of running samples in batches like this?


Hello and Welcome to the forum!

First of all, running your samples in batches is fine if:

  • in each batch, you have at least ~1M reads for DADA2
  • each batch is processed with identical parameters in every step up to and including DADA2.

In that case you can merge your output files after DADA2.
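
To illustrate why identical parameters matter for the merge, here is a minimal Python sketch, not QIIME 2 code. It assumes each ASV is identified by an MD5 hash of its sequence (as far as I know, this is how DADA2 output is labelled in QIIME 2 by default), so the same sequence found in two batches collapses onto one feature. The names `feature_id`, `merge_batches`, and the toy batches are hypothetical. In QIIME 2 itself the merge would be done with `qiime feature-table merge` and `qiime feature-table merge-seqs`.

```python
import hashlib


def feature_id(sequence):
    """MD5 hex digest of a sequence, used here as a stand-in feature ID."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def merge_batches(batches):
    """Merge per-batch tables shaped like {sequence: {sample_id: count}}.

    Returns {feature_id: {sample_id: count}}. Identical sequences from
    different batches end up under the same feature ID.
    """
    merged = {}
    for table in batches:
        for sequence, sample_counts in table.items():
            row = merged.setdefault(feature_id(sequence), {})
            for sample_id, count in sample_counts.items():
                row[sample_id] = row.get(sample_id, 0) + count
    return merged


# Toy data (made up): two batches that share one ASV sequence.
batch_a = {"ACGTACGT": {"site1_s1": 120, "site1_s2": 80},
           "TTGACCGA": {"site1_s1": 5}}
batch_b = {"ACGTACGT": {"site2_s1": 200},
           "GGCATTAC": {"site2_s1": 12}}

merged = merge_batches([batch_a, batch_b])
print(len(merged))  # 3 features: the shared ASV is counted once
```

If one batch had been truncated to a different length, the "same" ASV would have a different sequence, hence a different ID, and would show up as two separate features after merging.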

However, you are getting an error at the tree creation step, and to create one tree to rule them... for all batches, you still need to merge the representative sequences from all batches, so the tree step would face the same full set of sequences. So I guess running in batches is not an option for you.

Instead, you may consider the following steps:

  1. Use the "parttree" option at the tree construction step.
  2. Remove rare features from the table and representative sequences.

I prefer removing rare features (for example, ASVs counted fewer than 10 times overall and detected in fewer than 2-3 samples).
This step reduces the number of unique ASVs by removing those that are not important for the analyses. It will also significantly speed up the tree construction step and decrease its memory requirements.
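
The filtering logic can be sketched in plain Python as follows. This is only an illustration on a made-up table, under the assumption that a feature is kept only when it meets both thresholds (total count of at least 10 and presence in at least 2 samples); `filter_rare_features` and the ASV names are hypothetical. In QIIME 2 the equivalent would be `qiime feature-table filter-features` with `--p-min-frequency` and `--p-min-samples`, followed by `qiime feature-table filter-seqs` to drop the same ASVs from the representative sequences.

```python
def filter_rare_features(table, min_frequency=10, min_samples=2):
    """Keep features that meet both thresholds (assumed interpretation).

    table: {feature_id: {sample_id: count}}
    """
    kept = {}
    for fid, sample_counts in table.items():
        total = sum(sample_counts.values())
        n_samples = sum(1 for c in sample_counts.values() if c > 0)
        if total >= min_frequency and n_samples >= min_samples:
            kept[fid] = sample_counts
    return kept


# Toy feature table and representative sequences (made up).
table = {
    "asv_common": {"s1": 120, "s2": 80, "s3": 200},
    "asv_singleton": {"s1": 3},
    "asv_one_sample": {"s2": 50},
    "asv_low_count": {"s1": 2, "s3": 4},
}
rep_seqs = {
    "asv_common": "ACGTACGT",
    "asv_singleton": "TTGACCGA",
    "asv_one_sample": "GGCATTAC",
    "asv_low_count": "CCATGGTA",
}

filtered = filter_rare_features(table)
# Keep only the sequences whose features survived the table filter.
filtered_seqs = {fid: seq for fid, seq in rep_seqs.items() if fid in filtered}

print(sorted(filtered))    # ['asv_common']
print(len(filtered_seqs))  # 1
```

Only the sequences left in `filtered_seqs` would go on to alignment and tree building, which is where the time and memory savings come from.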



Hello Joshua,

You made it to this step okay, so the previous steps must have worked!

I concur with Timur: aligning fewer sequences by filtering out ASVs is your best bet.

There are also other tree-building pipelines in the amplicon distribution you could try:

Or maybe SEPP?
