Too many singletons after Dada2 with paired-end reads

dada2

(Kara Mosovsky) #1

We just ran DADA2 on paired-end reads for a soil microbiome project. For each of 3 different time points (each run separately through DADA2) we are getting 50-70+ singletons in our feature table. By singleton I mean a feature/ASV that was found only once, in a single sample (small sample in screenshot below).
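
To make that definition concrete, here is a minimal Python sketch of counting singletons. The table layout (feature ID mapped to per-sample counts), the function name, and the toy counts are all assumptions for illustration, not our real data.

```python
# Toy feature table: feature ID -> {sample ID: read count}.
# All IDs and counts are made up for illustration.
table = {
    "asv_a": {"S1": 120, "S2": 98, "S3": 0},
    "asv_b": {"S1": 1, "S2": 0, "S3": 0},
    "asv_c": {"S1": 0, "S2": 0, "S3": 1},
    "asv_d": {"S1": 1, "S2": 1, "S3": 0},
}


def find_singletons(feature_table):
    """Return IDs of features seen exactly once, in exactly one sample."""
    singletons = []
    for feature_id, counts in feature_table.items():
        total = sum(counts.values())
        n_samples = sum(1 for c in counts.values() if c > 0)
        if total == 1 and n_samples == 1:
            singletons.append(feature_id)
    return singletons


singletons = find_singletons(table)
print(singletons)
```

Note that `asv_d` is not a singleton under this definition: it has a count of 1 in two samples, so its total is 2.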


At first we were concerned because DADA2 is supposed to remove singletons, and we always thought that singletons are most likely errors (they shouldn’t naturally be present, except in error). However, previous forum posts say that some singletons may be expected with paired-end data, because DADA2 only removes singleton reads before merging, and the merging process can actually create some singletons. BUT, those forum posts still said that singletons are rare (<15 or so?) for paired-end data, and made them seem like an error still. We’re trying to figure out whether these singletons are valuable or not, and what we should do with them (filter them out or keep them in).

As a first troubleshooting step, we looked at our raw sequencing data and confirmed that the barcodes were still in our samples (multiplexed), but not the primers (so we are confident we’re using the right importing and demux commands). It’s interesting that all three time points show the same trend. Does that indicate it’s less likely to be error? Or are all singletons some sort of “noise” or “error” by definition? Before we continue with analysis, we wanted to get some input as to whether this is “normal” or not and whether we should filter this many out of our table.

Brainstorming here…would truncating/trimming our data more aggressively, to only use the highest quality bases, help reduce the number of singletons we are getting? If it was a matter of noise that might help, right?
We have plenty of overlapping room to do so, so I suppose it couldn’t hurt to try!


We’d really appreciate any advice on how to proceed from here.

Kara


(Justine) #2

Hi @Kara,

I’ve got a theory, but not a good answer. My suspicion is that because of the way DADA2 handles chimeras, by running each set of time points separately, you’re getting different sequences that survive chimera handling stochastically. It may be that if you denoise your data together, or use a secondary chimera handling approach (don’t use the built-in chimera handling in DADA2, and then use vsearch), you get fewer singletons because more sequences will survive the chimera filtering.

As a secondary suggestion, it may be easier to combine your data in parallel if you use sequences rather than MD5s. It’s a pain in the ass, but it makes sure that you’re getting the exact same thing. MD5s are good the vast majority of the time, except when they’re not. :confused:
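
A rough Python sketch of what "sequences rather than MD5s" means in practice. It assumes the hashed feature ID is the MD5 hex digest of the ASV sequence; the helper names, sequences, and counts are made up for illustration.

```python
import hashlib


def md5_id(sequence):
    """Hash a sequence to a 32-character hex ID (assumed ID scheme)."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


def merge_by_sequence(*tables):
    """Sum counts across tables that are keyed directly by sequence."""
    merged = {}
    for table in tables:
        for seq, count in table.items():
            merged[seq] = merged.get(seq, 0) + count
    return merged


# Made-up per-run tables: ASV sequence -> total count in that run.
run1 = {"ACGTACGT": 1, "TTGACCAA": 40}
run2 = {"ACGTACGT": 1, "GGCATTAC": 1}

merged = merge_by_sequence(run1, run2)
# Identical sequences always hash to the identical ID.
same_id = md5_id("ACGTACGT") == md5_id("ACGTACGT")
print(merged, same_id)
```

Keying on the sequence itself means two runs agree on a feature only when the sequences are exactly identical, including length, which is why truncation settings need to match across runs before tables are combined.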

Best,
Justine


(Kara Mosovsky) #3

Hi Justine,

Thanks for your speedy response, Justine. The thing that makes our situation a bit odd is that each time point was processed and run in a different batch (DNA isolated separately, PCR run separately, barcodes added separately, MiSeq separately, etc.). I was under the impression that quality filtering should be done separately for each batch of data, since each batch will introduce its own quality and error issues. Then, the data can be merged into a single feature table later.

I have a feeling that the “time point” aspect of our experiment isn’t to blame, as each time point could really serve as its own mini-project anyway. I think for starters we’ll try truncating and trimming the reads a bit differently before they are merged and see if the lower quality nucleotides were part of the problem…it can’t hurt! If it works, I’ll re-post!


(Nicholas Bokulich) #4

This probably depends on many factors: number of runs, sequencing depth per run, length variation in the sequences, and anything else that is likely to increase the odds of losing some sequences during merging. Since you are merging 5 runs, that could explain the large number of singletons. Maybe you can point us to the post that you mention?

There is also nothing wrong with filtering out these singletons after dada2 if you are concerned that they may be noise.
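
As a minimal sketch of what such a filter does, here it is in plain Python on a toy table. The function name, the `min_frequency` parameter, and the counts are hypothetical; on a real QIIME 2 table, I believe the usual route is `qiime feature-table filter-features` with `--p-min-frequency 2`.

```python
def filter_min_frequency(feature_table, min_frequency=2):
    """Keep features whose total count across all samples is >= min_frequency."""
    return {
        feature_id: counts
        for feature_id, counts in feature_table.items()
        if sum(counts.values()) >= min_frequency
    }


# Toy table: feature ID -> {sample ID: read count}. Values are made up.
table = {
    "asv_a": {"S1": 120, "S2": 98},
    "asv_b": {"S1": 1, "S2": 0},
    "asv_c": {"S1": 1, "S2": 1},
}

filtered = filter_min_frequency(table)
print(sorted(filtered))
```

With the default threshold of 2, only features with a total count of 1 are dropped; raising the threshold is a separate decision that would also remove low-abundance features that are not singletons.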

I agree, that’s probably the best place to start! Let us know what you find!


(Kara Mosovsky) #5

I suppose the problem is that I don’t know how to tell if they are noise or not. We went back and were a little more aggressive with truncating and trimming this time, and I think it helped a little. We still have lots of singletons (and we haven’t combined any time points yet), but we figured that if they ARE noise/errors, they won’t match to anything when we assign taxonomy, and many of them might land in the “unassigned” category. We plan to filter out the “unassigned” taxa strings anyway, so we’d be getting rid of them eventually.
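
One way to test that expectation before relying on it is to tally how many singletons actually end up unassigned. A minimal Python sketch, assuming a list of singleton IDs and a feature-ID-to-taxonomy mapping; the names, labels, and the exact "Unassigned" string are made up for illustration.

```python
from collections import Counter

# Made-up inputs: singleton feature IDs and a feature ID -> taxonomy lookup.
singleton_ids = ["asv_b", "asv_c", "asv_e"]
taxonomy = {
    "asv_a": "k__Bacteria; p__Proteobacteria",
    "asv_b": "Unassigned",
    "asv_c": "k__Bacteria; p__Acidobacteria",
    "asv_e": "Unassigned",
}


def tally_assignment(feature_ids, taxonomy_lookup):
    """Count how many of the given features are assigned vs. unassigned."""
    tally = Counter()
    for feature_id in feature_ids:
        # Features missing from the lookup are treated as unassigned.
        label = taxonomy_lookup.get(feature_id, "Unassigned")
        key = "unassigned" if label == "Unassigned" else "assigned"
        tally[key] += 1
    return tally


tally = tally_assignment(singleton_ids, taxonomy)
print(dict(tally))
```

If a large share of the singletons turn out to be assigned, the "unassigned" filter alone would not remove them.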


(Justine) #6

I would check your combined table, then! They may be rare enough that they only show up in one sample at each time point, but if they show up multiple times, then they’re more likely to be real. Although, I like your filtering approach.
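
A minimal Python sketch of that check, assuming the per-time-point tables are keyed by ASV sequence so features match across runs. The function name, sequences, sample IDs, and counts are made up for illustration.

```python
from collections import defaultdict

# Made-up per-time-point tables: ASV sequence -> {sample ID: read count}.
timepoints = {
    "T1": {"ACGTACGT": {"T1_S1": 1}, "TTGACCAA": {"T1_S1": 30, "T1_S2": 12}},
    "T2": {"ACGTACGT": {"T2_S4": 1}, "GGCATTAC": {"T2_S2": 1}},
    "T3": {"CCATGGTA": {"T3_S1": 1}, "TTGACCAA": {"T3_S3": 25}},
}


def singleton_recurrence(tables):
    """Split per-time-point singletons by whether they recur in the combined data."""
    combined = defaultdict(int)
    singleton_somewhere = set()
    for table in tables.values():
        for feature, counts in table.items():
            total = sum(counts.values())
            combined[feature] += total
            if total == 1:
                singleton_somewhere.add(feature)
    recurring = sorted(f for f in singleton_somewhere if combined[f] > 1)
    still_single = sorted(f for f in singleton_somewhere if combined[f] == 1)
    return recurring, still_single


recurring, still_single = singleton_recurrence(timepoints)
print(recurring, still_single)
```

Features in `recurring` were singletons within one time point but appear again elsewhere in the combined data, which makes them more plausible as real; features in `still_single` remain candidates for filtering.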


(Kara Mosovsky) #7

Genius! Thanks Justine!