Minimum number of reads/data needed in a sample for successful data analysis with QIIME2

Hi all,

I am performing ITS and 16S amplicon sequencing on a set of samples. Within this set I included two negative process controls (just Butterfield buffer). The 16S amplicons from these controls yielded some output: the demultiplexed, interleaved sample files are 680 and 698 KB according to WinSCP, and the demultiplexing summary reports a Yield (mBases) of 2. When I ran these controls through data processing and analysis with the rest of the samples, I got results from them and was able to identify organisms.
The ITS amplicons for these controls are smaller: the demultiplexed, interleaved sample files are both 50 KB according to WinSCP, and the Yield (mBases) is 0. When I processed my samples with the QIIME pipeline, these two samples did not appear in my output files at all. I lowered my PHRED quality filtering score to 20 to let lower-quality reads through, thinking that the negative control samples would have low-quality reads, but the samples still didn't appear in the output files.

So I have a few questions:

  1. Is there a minimum number of reads/output data in a sample file needed in order to successfully process your samples, and does this change between ASV/DADA versus clustering?
  2. Is there a way to estimate the number of reads from the file size (as seen in WinSCP)?
  3. More broadly, how do folks use negative controls in their sample analysis? As a lower-limit threshold, or as a way to judge 'background noise' and subtract it from all the other samples (or see if it's present in other samples)? Or some other way?

It makes sense that these negative control samples don't have any fungi present but do have some background bacteria. We would like to set thresholds and understand this kind of thing pre hoc, not post hoc.

Many thanks!

Hello Tammy,

Welcome to the forums!

These are some great questions and I can answer a few of them.

On the minimum number of reads: technically no, but more is better. Initial studies in this field had a few hundred reads per sample. Now, I would hope for at least thousands of reads per sample!

On estimating read counts from file size: not really, because file size depends on both the length and the complexity of the reads. But there's a better way. After downloading and decompressing the file, you can count the number of reads in each file directly. QIIME 2 does this after import.
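As a rough illustration of counting reads directly, here is a minimal Python sketch. It assumes standard FASTQ files with four lines per record (no wrapped sequence lines), and the function name `count_fastq_reads` is made up for this example; it is not part of QIIME 2.

```python
import gzip
from pathlib import Path


def count_fastq_reads(path):
    """Count records in a FASTQ file, plain or gzip-compressed.

    Assumes the standard four-lines-per-record layout.
    """
    path = Path(path)
    # Choose the opener from the file extension (assumption: ".gz" means gzip).
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    # Each FASTQ record is 4 lines: header, sequence, "+", qualities.
    return n_lines // 4
```

For an interleaved paired-end file, the forward and reverse reads are separate records, so the number of read pairs would be half of the returned count.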

On negative controls: this is my favorite question and by far the hardest. I use negative controls to establish 'background noise', as you mention, but what to do next is an open question.
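One simple first step, sketched below, is just to list which features show up in the negative controls and which real samples also contain them. This is only an illustration, not a recommended decontamination method: the function name `control_feature_report` and the nested-dict table layout (sample ID to feature ID to read count) are assumptions made for this example.

```python
def control_feature_report(table, control_ids):
    """Summarize features seen in negative controls.

    table: dict mapping sample ID -> dict of feature ID -> read count
           (hypothetical layout, chosen for this sketch).
    control_ids: iterable of sample IDs that are negative controls.
    """
    control_ids = set(control_ids)

    # Collect every feature with at least one read in any control.
    control_features = set()
    for sample_id in control_ids:
        for feature, count in table.get(sample_id, {}).items():
            if count > 0:
                control_features.add(feature)

    report = {}
    for feature in sorted(control_features):
        # Total reads for this feature across all controls.
        control_reads = sum(
            table.get(sample_id, {}).get(feature, 0) for sample_id in control_ids
        )
        # Real (non-control) samples that also contain the feature.
        shared = sorted(
            sample_id
            for sample_id, counts in table.items()
            if sample_id not in control_ids and counts.get(feature, 0) > 0
        )
        report[feature] = {
            "control_reads": control_reads,
            "samples_with_feature": shared,
        }
    return report
```

The report only describes overlap between controls and samples; deciding whether to subtract, filter, or simply note those features is still the judgment call described above.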

