Inflated number of representative sequences after merging

feature-table
dada2
(Roger Huerlimann) #1

Hi all,

I’ve run about 320 samples across three MiSeq cartridges, using the V3-V4 region with 300 bp PE sequencing. I then processed each sequencing run separately through DADA2, and finally merged the feature tables and rep-seqs files into one.

For the three runs I got 24k, 11k and 21k representative sequences, which is already higher than I expected in the first place. When I merged the three runs, I got 45k representative sequences. This is a lot higher than in other analyses I’ve done in the past.

In contrast, another analysis of someone else’s data (three runs, V3 only, 300 bp SE sequencing) ended up with 5k, 5k and 10k representative sequences, which merged into 10k. That study had slightly fewer samples overall, but not enough to explain the difference.

Does anyone have any idea what the problem could be and how to fix it? Please let me know if I should provide any additional information.

Here is the command I used for the DADA2 denoising:

    qiime dada2 denoise-paired \
      --i-demultiplexed-seqs manifestout.qza \
      --p-trim-left-f 20 \
      --p-trim-left-r 20 \
      --p-trunc-len-f 280 \
      --p-trunc-len-r 240 \
      --p-max-ee 6 \
      --p-chimera-method consensus \
      --p-n-threads 30 \
      --o-table dada2_table_${1}.qza \
      --o-representative-sequences dada2_repseqs.qza \
      --o-denoising-stats dada2_denoising_stats.qza \
      --verbose
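
For reference, the merge step was along these lines; this is just a sketch with placeholder file names standing in for my per-run outputs:

    # merge the three per-run feature tables into one
    qiime feature-table merge \
      --i-tables dada2_table_run1.qza \
      --i-tables dada2_table_run2.qza \
      --i-tables dada2_table_run3.qza \
      --o-merged-table dada2_table_merged.qza

    # merge the three per-run representative sequence files into one
    qiime feature-table merge-seqs \
      --i-data dada2_repseqs_run1.qza \
      --i-data dada2_repseqs_run2.qza \
      --i-data dada2_repseqs_run3.qza \
      --o-merged-data dada2_repseqs_merged.qza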

Kind regards,
Roger

0 Likes

(Mehrbod Estaki) #2

Hi @Roger_Huerlimann,
What are the samples you are looking at, and why do you think the number of rep-seqs is too high? I ask because the numbers you are seeing are not particularly high, in my opinion. For comparison, in mouse gut/stool I generally see somewhere between 8-10k features using the same conditions and primers as you, and those samples are still considered less diverse than, say, human or soil samples. So even 24k isn’t necessarily too high for ASVs, depending on the sample. But better be sure!
So for sanity’s sake, here are a few items worth checking:

  1. Prior to DADA2, all barcodes, primers, heterogeneity spacers, etc. have been removed from the reads (see the cutadapt sketch after this list)
  2. You have used the exact same trimming/truncation and error parameters of DADA2 in all 3 runs (this one is very important)
  3. Have a look at the sequence length summary after DADA2. Are all the reads around the expected amplicon size, or are there many that are shorter or suspiciously longer? I’ve had a lot of issues in the past with host contamination when looking at tissues with an abundance of host cells (e.g. mouse intestine) using V3-V4. Those reads, however, were much shorter than the expected amplicon, so I was able to filter them out easily with positive filtering using qiime quality-control exclude-seqs, sort of like what Deblur does by default.
  4. When you assign taxonomy to these reads, do they all get assigned to a known group, or are they mainly Unassigned or assigned only at the kingdom level? You can BLAST a few of the unassigned ones to see whether they are true taxa or some form of contaminant.
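
For point 1, if it turns out the primers are still in the reads, q2-cutadapt is the usual way to strip them before DADA2. A rough sketch, using the common 341F/805R V3-V4 primers purely as an example (substitute your actual primer sequences and file names):

    # trim primers off paired-end reads; discard read pairs where no primer is found
    qiime cutadapt trim-paired \
      --i-demultiplexed-sequences manifestout.qza \
      --p-front-f CCTACGGGNGGCWGCAG \
      --p-front-r GACTACHVGGGTATCTAATCC \
      --p-discard-untrimmed \
      --o-trimmed-sequences demux_trimmed.qza

If you go this route, the trim-left parameters in the DADA2 call would no longer be needed, since the primers are already gone by then.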

Let’s see where we are after checking these items. Others may have some pro tips as well!

3 Likes

(Mehrbod Estaki) #3

Just one more addition:

In my experience this difference is to be expected between single-end and paired-end reads. A few reasons come to mind. With single-end denoising, all singletons are discarded after denoising, whereas with paired-end reads you may still end up with some singletons after merging, and those are not discarded by default. You also get higher resolution with PE reads, because the longer reads are better at differentiating taxa that would have been collapsed into the same ASV with single-end reads. Finally, in the rare event that even a single bp is called incorrectly, that will lead to a different ASV being called, and you can imagine that this scenario becomes more likely as read length increases.

0 Likes

(Roger Huerlimann) #4

Hi Mehrbod,

Thank you for the insanely fast answer!

It was more a gut feeling than anything else. The samples come from oysters grown in mariculture; they are mantle and tissue samples, distributed haphazardly across the three runs. So I expected similar diversity across the three runs, and the fact that they add up to 45k features made me a bit suspicious.

Hmm… let’s see, going through your checklist:

  1. I didn’t do any trimming before DADA2, but I did trim the primers during the DADA2 step (via the trim-left parameters). There are no barcodes or heterogeneity spacers present in the reads.

  2. Yes, I ran the exact same parameters for all three runs.

  3. This looks ok too, based on the size distribution output:

     | Percentile | 2% | 9% | 25% | 50% | 75% | 91% | 98% |
     |---|---|---|---|---|---|---|---|
     | Length* (nts) | 291 | 401 | 402 | 422 | 426 | 427 | 427 |

  4. Coupled with the seeming inflation of features across the three runs, I also have the problem that my 124 GB machine runs out of memory when I try to do the taxonomic assignment using the sklearn classifier. I can move the analysis to my 1 TB server, but before doing that I thought I’d check the data first; I wanted to avoid crap in, crap out. :slight_smile: I ran a subset of the samples and around 1k out of 27k features came back unassigned.

  5. That makes sense to me. I’ve done a preliminary test run with a smaller number of samples, also paired-end etc., which came up with a much smaller number of features, but that could very well be a function of the smaller sample set.

I need to refresh my memory of DADA2 and QIIME 2… but you mention that even a single bp difference will lead to a different ASV. So the 45k features I find are not necessarily 45k different species, but could be different variants within a species, and in the taxonomic assignment they could end up with the same species name, I assume?

I guess my main concern is really that the three runs add up to 45k features, which suggests that not many features are shared across the three runs, whereas in reality they should share most of their features and I’d expect something closer to 27k.

Kind regards,
Roger

0 Likes

(Mehrbod Estaki) #5

One man’s procrastination, another man’s treasure.

Sounds like everything has been done correctly, and there are no real issues that I can see. I would consider looking closer at those sequences that are shorter than, say, the 400 nt mark. Though the shorter reads make up only 2% of your total, 2% of a big dataset can still account for a lot of potential contamination and singletons that may significantly inflate alpha diversity. These days I tend to get rid of all of those using pretty lenient positive-filtering parameters before going downstream. This might be one place you could play around with to ease your worries.
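
To make that concrete, the positive-filtering step I have in mind looks something like the sketch below; the reference file name and the lenient identity/coverage thresholds are only illustrative, so adjust them to your own data:

    # keep only rep-seqs that hit a 16S reference at lenient thresholds
    qiime quality-control exclude-seqs \
      --i-query-sequences dada2_repseqs_merged.qza \
      --i-reference-sequences reference_seqs.qza \
      --p-method vsearch \
      --p-perc-identity 0.65 \
      --p-perc-query-aligned 0.50 \
      --o-sequence-hits repseqs_hits.qza \
      --o-sequence-misses repseqs_misses.qza

    # then drop the non-hitting features from the table as well
    qiime feature-table filter-features \
      --i-table dada2_table_merged.qza \
      --m-metadata-file repseqs_hits.qza \
      --o-filtered-table dada2_table_filtered.qza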

As for your memory issue, I would have thought 124 GB of RAM would be enough, especially if SILVA isn’t involved, but I’ve seen enough people running into the same issue here that I believe it. There must be a lot of sequences! Hope it works out.
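
One knob that sometimes helps with classify-sklearn memory is lowering reads-per-batch and keeping n-jobs at 1, since more parallel jobs generally means a bigger memory footprint. A sketch with placeholder file names and an example batch size to tune:

    # classify in smaller batches to keep peak memory down
    qiime feature-classifier classify-sklearn \
      --i-classifier classifier.qza \
      --i-reads dada2_repseqs_merged.qza \
      --p-reads-per-batch 10000 \
      --p-n-jobs 1 \
      --o-classification taxonomy.qza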

On whether those variants could collapse to the same species name: exactly! The whole point of these ‘exact sequence variants’ is, as you said, to be able to potentially identify even strain-level variation. So the 45k ASVs you have are very likely to collapse down to a much smaller number of species when you get around to identifying them.

As for the three runs adding up to 45k features: maybe, but not necessarily. That all depends on the total number of sequences you have combined across the 3 runs. You could still have mostly shared taxa, just a lot of them. There are a lot of things we could speculate about, but let’s see how the data reveals itself and go from there :slight_smile:

2 Likes