Clarification on Sequence counts

Dear All
We have given 59 samples to service provider for sequencing V1-V9 region using Miseq. With the demultiplexed data, we have started QIIME2 pipeline.
Primers used:

3.Total length of the amplicon
V1V3 500 ±10
V4V6 550 ± 10
V7V9 440 ± 10
I followed the following workflow for pair-end

  1. qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path pe-33-manifest --output-path paired-end-demux.qza --source-format PairedEndFastqManifestPhred33
  2. qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza --p-trunc-len-f 280 --p-trunc-len-r 200 --o-representative-sequences rep-seqs.qza --o-table table.qza --o-denoising-stats stats.qza

After quality filtering, we lost more than 50% of sequence counts. This may be due to thw trimming parameter (not enough reads to merge).
Hence we tried with Forwards reads alone.

qiime dada2 denoise-single --i-demultiplexed-seqs single-end-demux.qza --p-trim-left 0 --p-trunc-len 280 --o-representative-sequences rep-seqs.qza --o-table table.qza --o-denoising-stats stats-dada2.qza --p-n-threads 4

In this , we could retain nearly 40% of the reads. I have attached table.qzv for your reference. I have also performed rarefaction curve

Alpha rarefaction curve: Observed OTUs, Shannon Index and faith pd

Sampling Depth: 30443

Retained Sequences: 1,552,593

Retained No.of samples: 51 out of 59

  1. Will the reads which I have got after filtering be sufficient enough for the taxonomic classifications in the publication point of view?
  2. Can I take this results as authenticated one?
  3. Will this sequence counts are good enough?
  4. Is there any publications to support our sequence counts? What is the minimum sequence counts for a sample?
  5. Are my rarefaction curve and sampling depth good?

Kindly help. We are at the crucial point to decide to go on or not?
alpha-rarefaction_30443.qzv (461.2 KB)

table.qzv (745.9 KB)
demux.qzv (283.3 KB)

Hi @steffi,

There’s a lot here, so hopefully I can address at least some of it.

Okay, so, first, how much do your sequence counts differ between the paired and single end picking? Because it sounds like you may have a quality problem overall, rather than simply an issue with the read joining. I would recommend considering your filtering parameters, as well as the read length and overlap for each region.

Okay, next, you’re dealing with multiple hypervariable regions. Do your counts take this into effect? So, are there 30,000 sequences when you combine the V1-3, V4-6, and V7-9 from a single sample, or are they seperated by region?

The hypervariable region for sequencing has a big effect on a lot of aspects of data, larger than a lot of biological effects you’re likely to see when dealing with a single body site on humans, as an example.

My recommendation would be to separate your table by the hypervariable region, and then perform parallel analyses targeting what you’re interested in. However, if you want to combine the data, you need to use closed reference picking, rather than a de-novo technique.

Without this information, it’s hard to answer the additional questions.



I have attached the sequence counts variation for both paired-end and single-end analysis. In paired-end, I could retain only 20% of the sequence counts.

counts_variation.txt (1.9 KB)
I am really sorry to ask. How do I know whether it is a combined or separated by region? How can I tell from my raw data?

If it is a combined data, How can I separate from my table.qza and continue my analysis?

Hi @steffi,

Thanks so much for the information. I definitely agree: single end was the way to go, although I worry about the quality of your reads if you’re losing so many.

The answer to this question depends a lot on the experimental design and how you barcoded and sequenced the data. Did you do the PCR in-house, or did the company do it for you?

You’ll be able to separate the regions if either (a) the samples were assigned the same barcode and sequenced on different runs or (b) the samples were assigned different barcodes. In either case, after demultiplexing, you should have 3 times the number of samples (S1-V13, S1-V46, S1-V79; S2-V13, S2-V46, S2-V79…). In case (a), you’ll need to load each region as its own demultiplexed file, denoise/cluster, and then combine the data. In case (b) you can load all the samples with their unique barcodes, and then process them.

If you don’t have that but you haven’t trimmed the primers, you might be able to somehow tease the demultiplexed files apart based on that primer sequence, and separate them.

I guess my other question, which is sort of high level and methological, was about the motivation for sequencing multiple hypervariable regions?


1 Like

Hi @ jwdebelius
Thank you for your explanation

I contacted my service provider. They told that the read numbers are total of all 3 region.

We have given our samples to the service provider. They extracted the DNA, amplified and sequenced.

Regarding this, I have raised a query to my company and waiting for their results. I am not sure, how they have sequenced the 3 regions.

My general query: Is it necessary to separate three hyper variant region from my table.qza? or will it be go wrong if I do not separate them?

Hi @steffi,

It may be wrong. It will definately have some biases that make it hard to compare with other datasets. (Or, it will have different biases that make the comparison hard.) It will also likely alter your results for compositionality compared to a single region. This is because there organism-specific primer biases, and because your ability to detect differences in closely related organisms varies region-to-region.
So, for instance, your ANCOM results might be different. You also (potentially) have multiple very different ASVs representing the same organism but that can’t be combined into a phylogenetic tree. So… I guess, based on that, it is wrong. I think to get a quasi functional from the tree-picking perspective, you might want to run a closed reference picking against a full-sequence reference database.

I don’t know if there’s been a comparison about how combining multiple regions affects your data (although it’s an interesting technical question.)

Also, I guess the idea of running multiple regions without a specific motivation kind of squicks me.
Unless there’s a specific motivation for running multiple regions (like you want to be able to access specific databases or perform a meta analysis), I guess Im wondering why. Like, I don’t think you can assemble reads off the 16s gene to get a better taxonomic resolution, which might be a motivation. And, even in those cases, I think you’d keep the regions separate?



Hi @jwdebelius
Thank you for your well-defined explanation. I have asked my sequence provider for further details on bar-code sequences and waiting for their reply.

In the meanwhile I am trying to separate from demultiplexed files apart based on that primer sequence, and separate them. Is there any method to separate in the QIMME2 environment?

So, Ive not ever done this, so I have no idea about whether or not this will actually work, but I think I might try using qiime cut-adapt demux-single where the barcode you use is the primer sequence, if it’s still in your data. This should (:crossed_fingers:) let you split your files into 3 primer specific files.



Thank you for your time. Just now I got the report from my sequence provider. They said, after the PCR amplification using three primers, they have pooled all three amplicons together and then they have taken into sequencer. They did not use any separate barcodes for regions. I am left with no choice here.

Is it possible to align my reads with 80% identity against V1-V3 and V4-V6 region from 16srRNA databases and retrieve the mapped reads and continue with QIIME2 analysis?

Coming back to my original query: Generally what is the lowest acceptable sequence counts which can be taken for further analysis

Hi @steffi,

I think based on that, I’d trim to the same length, and then do closed reference picking against your favorite reference database. The database you choose will depend on your sample type. Silva may be better for enviromental samples; a lot of people used/are still using greengenes 13_8 for their human guts. So, one of those. At least the identifiers of your sequences may be the same. (You’re essentially doing a weird sort of mixed meta analysis.) I think I’d still align at 97 or 99 identity, so you’re getting confidence in what’s there.
This will let you use a phylogenetic metric which (:crossed_fingers:) will be more robust.
But, over all, your results are likely to be influenced by the mix of regions in the sample as much as anything.

As far as sequencing depth goes, it’s again, optimisation problem. I tend to work in the 5000-10000 seq/sample range for most of my analyses. I find 1000 to be really unstable, and too low, but 5K-10K works well for me. At your sample size, you need to keep as many high quality samples as possible, and 20K rarefaction will let you do that.


1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.