Too many rep-seqs?


I have some sequencing data from a MiSeq run, and what I am finding is unexpected for the researchers I've been working with.

They claim that I am detecting too many species, so they asked me to plot a "saturation curve" for my samples, and I couldn't see one...

So I revisited my sequencing QC, and although I have some issues regarding base quality in the middle of the reads (~25% of bases are below Q20 but above Q10), I would not expect so many rep-seqs.
I have run cutadapt to remove those low-quality bases, but I still have too many rep-seqs.

I wonder whether the dada2 parameters are too relaxed and I am finding false positives?
Should I discard this sequencing?
Are the researchers wrong, and do we really have large variability in our samples?
Should I filter low-coverage rep-seqs? I have good coverage for some samples, like ~96187 reads. Should I consider only rep-seqs above a certain % of frequency?

I know these questions cannot be fully answered here, but maybe with some guidance I would be more confident in my pipeline and could go on with the analysis.

Any thoughts will be very helpful and appreciated.


Hi @borgesrodrigo,
Thanks for posting! Some answers to your questions:

The “unexpected” is encountered more frequently when:

  1. comparing high-throughput sequencing results to older technologies
  2. comparing results obtained with very high coverage (e.g., with newer seq platforms) to lower-coverage results (e.g., older studies with pyrosequencing or GAIIx)
  3. comparing results obtained with dada2 to detect ASVs (e.g., with QIIME2) vs. OTU picking (e.g., results from qiime1 pipelines)

I am assuming that one or more of these conditions are present in your case, and if so there is no reason to be alarmed!

Species? Sequence variants? OTUs? If you are using dada2 to detect ASVs, this number can frequently be higher than expected OTU counts, because ASVs are essentially 100% OTUs and thus much more sensitive for detecting variants. (dada2 can also often yield fewer ASVs than OTUs because of the quality filtering used in this method, so this can be unpredictable.)

You can plot a saturation (rarefaction) curve in QIIME2 with the command described in this tutorial
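To make the idea behind a saturation curve concrete, here is a minimal sketch in plain Python. It is not the QIIME2 implementation; the function name `rarefaction_curve` and the toy counts are made up for illustration. It subsamples reads without replacement at several depths and reports the mean number of features observed. A curve that flattens suggests that deeper sequencing would add few new features.

```python
import random


def rarefaction_curve(counts, depths, n_iter=10, seed=0):
    """Mean number of observed features at each subsampling depth.

    counts: per-feature read counts for one sample (hypothetical input).
    depths: depths to evaluate; depths above the sample total are skipped.
    """
    rng = random.Random(seed)
    # Expand counts into one feature label per read so that reads can be
    # drawn without replacement.
    reads = [feature for feature, n in enumerate(counts) for _ in range(n)]
    curve = {}
    for depth in depths:
        if depth > len(reads):
            continue
        observed = [len(set(rng.sample(reads, depth))) for _ in range(n_iter)]
        curve[depth] = sum(observed) / n_iter
    return curve


# Toy sample (made-up numbers): a few dominant features plus many singletons.
toy_counts = [500, 300, 100, 50, 20] + [1] * 30
for depth, richness in rarefaction_curve(toy_counts, [10, 100, 500, 1000]).items():
    print(depth, round(richness, 1))
```

In this toy sample the curve keeps climbing up to the full depth because of the many singletons, which is the kind of shape that makes a sample look "unsaturated".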

dada2 would probably wind up removing many of those low-quality sequences, depending on the overall read quality, but maybe not. Quality scores of 10-20 are quite low, so you may want to adjust the trim length parameters with dada2 to trim reads to the area where read quality starts dropping off.
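One rough way to pick a candidate trim position is to scan the per-position median quality scores (for example, read off a quality plot) and stop at the first window whose average falls below a threshold. The sketch below assumes you have those medians as a list; `suggest_trunc_len`, the threshold of 20, and the window size are illustrative choices, not dada2 defaults. Note that if quality dips in the middle of the reads and then recovers, this approach will suggest cutting at the dip, so treat the result as a starting point only.

```python
def suggest_trunc_len(median_quals, min_q=20, window=5):
    """Return how many leading bases to keep.

    The read is cut at the start of the first window whose mean
    median-quality drops below min_q; if no window does, keep everything.
    """
    for start in range(0, len(median_quals) - window + 1):
        win = median_quals[start:start + window]
        if sum(win) / window < min_q:
            return start
    return len(median_quals)


# Toy per-position medians (made-up): quality drops off after position 200.
toy_medians = [36] * 150 + [30] * 50 + [18] * 50
print(suggest_trunc_len(toy_medians))
```

The windowed mean is used so that a single noisy position does not trigger the cut.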

You could try adjusting the dada2 parameters to make the quality filtering more stringent. @benjjneb may have some advice on particular parameters to try — sharing a read quality plot may help us get an idea of what your read quality looks like.

It does not sound like you need to discard this sequencing run, but sharing a read quality plot would help. A few more things that could help:

  1. do you have positive or negative control samples to examine? This could help identify any putative contaminants.
  2. what does the taxonomic composition look like relative to prior high-throughput sequencing analyses of these sample types?
  3. what sample types are you using?
  4. you could attempt to replicate prior protocols — use q2-vsearch to perform OTU picking (I’m assuming that is what would have been used), filter low-abundance OTUs (if performed previously), rarefy to prior sequencing depths (to make sure that higher read counts are not responsible for the higher diversity detected), and compare compositions to prior results.

As for whether your samples really are this variable: anything is possible. See my comments above about protocol changes that could explain surprising results, and approaches to infer contaminants and compare to prior results.

Filtering low-abundance rep-seqs should not be necessary with dada2, but if you are concerned (e.g., you do have a very large number of unique sequence variants with low abundances) and have reason to believe they may be spurious, you could do this. (Why would you think that, though? I always prefer evidence and thorough testing to “gut feeling”.)
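If you do decide to filter, the arithmetic of a relative-frequency cutoff is simple. The sketch below is plain Python rather than a QIIME2 command; the function `filter_low_abundance`, the 0.1% cutoff, and the feature counts are all hypothetical and only illustrate what such a filter keeps and drops.

```python
def filter_low_abundance(feature_counts, min_fraction=0.001):
    """Keep features whose share of the total reads is at least min_fraction."""
    total = sum(feature_counts.values())
    if total == 0:
        return {}
    return {
        feature: n
        for feature, n in feature_counts.items()
        if n / total >= min_fraction
    }


# Made-up counts for one sample with 100,000 reads in total.
toy_table = {
    "asv1": 60000,
    "asv2": 30000,
    "asv3": 9000,
    "asv4": 950,
    "asv5": 40,
    "asv6": 10,
}
kept = filter_low_abundance(toy_table, min_fraction=0.001)
print(sorted(kept))
```

Comparing results with and without such a cutoff is one way to test whether the low-abundance variants actually drive the unexpected diversity.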

I hope that helps! All in all, we could use more context on your experiment (e.g., sample types) and more concrete examples of the issues you are having (e.g., show us taxonomic composition of prior data compared to your results) in order to provide more detailed support for your issue! Some other questions that would help us:

  1. what sequencing protocol are you using (and how were prior results obtained)?
  2. what read length?
  3. how many samples? how variable are these samples?

Good luck!

