Do you guys still remove singletons or doubletons these days

Hi, I am wondering if you guys still practice removing singletons or doubletons in QIIME2 these days. Particularly, if you are doing soil microbiome, what do you normally do?

A couple of years ago, when my friend attended a QIIME 1 workshop, they suggested removing them. They even suggested removing any OTUs with fewer than 10 reads?

Can anyone tell me the rule of thumb? Is 10 OK? I need to make sure my manuscript is publishable.



Hi @sdpapet,
Unfortunately there is no universally right answer to your question, but perhaps a right answer for your case can be figured out.
Plugins like Deblur and DADA2 actually make an active effort not to include singletons. With paired-end data from DADA2 I find that I occasionally still get singletons, since merging occurs at the end and may introduce some. But if you do see a lot of singletons in those results, you may want to look deeper into your data in case you accidentally left in your barcodes or primers or something else.
For alpha and beta diversity analyses, singletons can be quite influential and should be kept in, as long as you can be confident that they are true features. In fact, plugins like breakaway require you to keep those singletons/rare taxa in, since the model relies on them.
In other instances, such as differential abundance testing with ANCOM or gneiss, singletons are never a good idea, since they offer no meaningful information to those models and instead add noise. For those it is often recommended to remove not only singletons but rare taxa in general, as you described. How aggressively you remove those rare taxa depends on the dataset and the community source.
This is MY approach, and as far as I know there are no benchmarks of it. So please take it with a grain of salt; others can chime in with their approach too.
If I have a very diverse community and good read coverage, then I lean towards increasing the min frequency threshold to 50-100. In less diverse samples, like mouse gut, I lower that to 10-20. In addition, I also filter out features that don't occur in at least 25-50% of my samples. Of course this means that your differential abundance analysis is now not very sensitive to low-abundance/very rare taxa, so if that is what you are actually interested in, these methods might not be the best choice.
Ultimately, you might have to choose different settings for different questions and analyses of your data.
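For what it's worth, this kind of frequency + prevalence filter is easy to sketch outside QIIME 2 on a plain samples × features table (within QIIME 2, `qiime feature-table filter-features` with `--p-min-frequency` and `--p-min-samples` does the same job). A minimal pandas sketch, assuming samples as rows and ASVs as columns; the thresholds and ASV names are just made-up illustrations, not recommendations:

```python
import pandas as pd

def filter_table(table: pd.DataFrame, min_frequency: int,
                 min_prevalence: float) -> pd.DataFrame:
    """Drop features below a total-count and a prevalence threshold.

    table: samples as rows, features (ASVs/OTUs) as columns.
    min_frequency: minimum total reads a feature needs across all samples.
    min_prevalence: minimum fraction of samples a feature must appear in.
    """
    total = table.sum(axis=0)                 # reads per feature
    prevalence = (table > 0).mean(axis=0)     # fraction of samples present
    keep = (total >= min_frequency) & (prevalence >= min_prevalence)
    return table.loc[:, keep]

# Toy table: 4 samples x 3 hypothetical ASVs
toy = pd.DataFrame(
    {"asv_a": [30, 25, 40, 20],   # abundant and prevalent -> kept
     "asv_b": [1, 0, 0, 0],       # a singleton -> dropped
     "asv_c": [60, 0, 0, 0]},     # abundant but in only 1 sample -> dropped
    index=["s1", "s2", "s3", "s4"])

filtered = filter_table(toy, min_frequency=50, min_prevalence=0.5)
print(list(filtered.columns))  # ['asv_a']
```

The two thresholds are independent, which is why asv_c gets dropped despite its 60 reads: total frequency says nothing about how many samples a feature shows up in.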
Hope that helps a little and didn’t add more uncertainty :stuck_out_tongue:



Okay, so my biggest thought currently on the microbiome is one of those wishy-washy "use your best intuition for your community" answers. But also: be very explicit about the filtering parameters you used and the choices you made. It's good to at least have a motivation for why you made those choices. For instance, the concern that for de novo and denoising methods, singletons may be spurious artefacts. And note that your test of choice often influences how you filter.

Remember kids, the only difference between screwing around and science is writing it down.

I had a long response typed out that basically agreed with the previous post, then deleted it, but honestly, this is an excuse to use the Adam Savage gif.

I'll also add that if you're not sure, it's always useful to run a Procrustes and a Mantel test on your filtered data, if you need the reassurance that what you did recapitulates the original. You get a pretty picture, an R² and a p-value, which should hopefully help your reviewers' confidence in the results.
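QIIME 2 and scikit-bio provide these tests ready-made; just to show what a Mantel test is actually doing, here is a bare-bones numpy sketch (a permutation test on the correlation of two distance matrices), using toy matrices rather than real filtered/unfiltered data:

```python
import numpy as np

def mantel(d1: np.ndarray, d2: np.ndarray, permutations: int = 999,
           seed: int = 0) -> tuple:
    """Pearson correlation between two square distance matrices, with a
    permutation p-value (rows and columns are shuffled together)."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(d1, k=1)      # condensed upper triangle
    r_obs = np.corrcoef(d1[iu], d2[iu])[0, 1]
    count = 0
    for _ in range(permutations):
        p = rng.permutation(d1.shape[0])
        r_perm = np.corrcoef(d1[np.ix_(p, p)][iu], d2[iu])[0, 1]
        if abs(r_perm) >= abs(r_obs):
            count += 1
    return r_obs, (count + 1) / (permutations + 1)

# Two distance matrices over the same 5 samples: the second is a noisy
# copy of the first, so they should correlate strongly.
rng = np.random.default_rng(42)
pts = rng.random((5, 3))
d1 = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
d2 = d1 + rng.normal(0, 0.01, d1.shape)
d2 = (d2 + d2.T) / 2
np.fill_diagonal(d2, 0)
r, p = mantel(d1, d2)
print(round(r, 3), p)
```

In practice you would feed in the beta-diversity distance matrices from the unfiltered and filtered tables; a high r with a small p is the "what I did recapitulates the original" reassurance.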



My two cents here, as I've been knee-deep in filtering considerations this week, and would greatly appreciate the thoughts of others who have seen more datasets than I have (if your sample size is > 1, you win!)...

It seems like there are two major reasons to remove a singleton, or doubleton (fun new word for me, @jwdebelius!), or tripleton...

  1. you don't believe they are of biological origin (they are artifacts of the sequencing process)
  2. you think they might be of biological origin, but they mess with your statistical measure of choice

I've spent a lot of time this week thinking about #1 (and those with lots of experience tend to acknowledge there is no single answer for #1). In trying to address what I could filter out, I've taken a sort of three-pronged approach:
A. Remove whatever obvious ASVs appear to be contaminants, or, are so much a concern that you don't want to include them in your dataset further
B. Use positive and negative controls to identify background contamination likely due to sequencing artifacts.
C. Consider what wholesale removal of reads, below some depth, would do to the resulting dataset.

I wrote about points A and B in a previous post. What I wanted to share here is an observation and/or consideration with regards to read depth. At this point, the data I'm working with is exclusively true samples. The ASVs I suspect are contaminants have been removed from the dataset entirely, and any ASV associated with my positive control has been eliminated (the risk of doing that is a discussion for another time...)

I've filtered the same dataset at 3 different levels: an observation (an element of the OTU matrix) must have more than 2, 20, or 45 reads. The plot below is generated from these three separately filtered datasets. The plot shows the number of times you expect to observe an ASV in N samples... so it's a frequency plot of a frequency... But it's easy to understand: the left-most bar represents the number of ASVs you'd expect to see just once in the dataset. That's not a "singleton" read, it's a singleton ASV, regardless of how many reads it contained. The next bar represents the number of ASVs you'd expect to see in two samples. The third bar is how many ASVs you'd see in 3 samples, and so on.

What you can see is that the shape of the distribution really changes as a function of how many reads you trim from each sample. When you remove just 1 read per observation (figure A), you end up with a disproportionately greater number of ASVs present in just one sample. You also get a lot of ASVs in just two samples. But along the x-axis, it's sort of flat(ish) until you get out to about 30 or so counts.
In figures B and C, the distributions are steeper: you have a much greater chance of seeing an ASV in just one or two samples than in three or four samples, etc. Note that in figures B and C, we've eliminated any element from our OTU table that had fewer than 20 or 45 reads, respectively.
One thing I found interesting about that figure: the drop-off after singletons and doubletons is quite steep in figure B, but tripletons aren't as affected. Likewise, in figure C, there is a steep curve only for singletons. I think this speaks to an intuitive consequence of wherever you set your trimming thresholds: the higher you set them, the more you bias your dataset towards less rare things (that is, things that are more likely to generate amplicons that get sequenced).
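That "frequency of a frequency" is quick to compute yourself. A sketch with synthetic skewed counts standing in for a real OTU table (the thresholds mirror the 2/20/45 levels above; the distribution parameters are arbitrary):

```python
import numpy as np

def prevalence_counts(table: np.ndarray, min_reads: int) -> np.ndarray:
    """For a samples x ASVs count table, zero out observations below
    min_reads, then count how many ASVs occur in exactly 1, 2, ... samples."""
    kept = np.where(table >= min_reads, table, 0)
    prevalence = (kept > 0).sum(axis=0)       # samples each ASV appears in
    prevalence = prevalence[prevalence > 0]   # drop ASVs filtered out entirely
    n_samples = table.shape[0]
    return np.bincount(prevalence, minlength=n_samples + 1)[1:]

rng = np.random.default_rng(1)
# 30 samples x 500 ASVs of heavily skewed counts (lots of small observations)
counts = rng.negative_binomial(0.5, 0.05, size=(30, 500))
for t in (2, 20, 45):
    hist = prevalence_counts(counts, t)
    print(t, hist[:4])  # ASVs seen in 1, 2, 3, 4 samples
```

Raising the per-observation threshold can only shrink each ASV's prevalence, so the total number of surviving ASVs drops monotonically, and the left end of the histogram changes shape the way the figures show.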

And as @Mehrbod_Estaki (and others!) noted, the diversity metrics you're planning on using are impacted by the rarity of your data.

I've found QIIME's interactive plots helpful when thinking about what thresholds to set going into certain diversity tests, but I think the distribution of ASV occurrences is as interesting as read abundances per sample when deciding whether to remove singletons, doubletons, or twentytons.


Another way I've been thinking about how read depth selection impacts my data is by viewing the distribution of ASVs per sample (rather than the distributions of the frequency/occurrence of the ASV itself).

I was curious if there were sequencing-run-specific effects, so the following plot shows how the distribution of ASVs observed in a given sample shifts depending on:

  • what your minimum read abundance/depth is (2, 20, or 45 reads)
  • the sequencing run the sample came from

Like the histograms in the previous post, things seem to start harmonizing around N=20. There's a lot of variability in how many ASVs a sample has when we remove observations with just 1 read; there's quite a bit less variability when we require at least 20 reads per observation. It's also interesting to me that there are certainly run-specific differences in some, but not most, of the sequencing batches.
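The per-sample view is a quick groupby once you have run labels for each sample. Again a sketch with made-up counts and run labels, not the real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# 12 samples x 200 ASVs of skewed counts, split across two pretend runs
counts = pd.DataFrame(rng.negative_binomial(0.5, 0.05, size=(12, 200)),
                      index=[f"s{i}" for i in range(12)])
runs = pd.Series(["run1"] * 6 + ["run2"] * 6, index=counts.index)

richness = {}
for min_reads in (2, 20, 45):
    kept = counts.where(counts >= min_reads, 0)   # zero out shallow observations
    richness[min_reads] = (kept > 0).sum(axis=1)  # observed ASVs per sample
    print(min_reads,
          richness[min_reads].groupby(runs).agg(["mean", "std"]).round(1))
```

Plotting those per-run distributions side by side at each threshold is essentially the figure described above: per-sample richness shrinks as the threshold rises, and run-to-run differences either persist or wash out.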