My two cents here, as I've been knee-deep in filtering considerations this week, and I would greatly appreciate thoughts from others who have seen more datasets than I have (if your sample size is > 1, you win!)...
It seems like there are two major reasons to remove a singleton, or doubleton (fun new word for me @jwdebelius!), or tripleton...
1. you don't believe they are of biological origin (and are instead an artifact of the sequencing process)
2. you think they might be of biological origin, but they mess with your statistical measure of choice
I've spent a lot of time this week thinking about #1 (and those with lots of experience tend to acknowledge there is no single answer for #1). In trying to address what I could filter out, I've taken a sort of three-pronged approach:
A. Remove any ASVs that are obviously contaminants, or that concern you enough that you don't want them in your dataset any further.
B. Use positive and negative controls to gauge the background contamination likely due to sequencing artifacts.
C. Consider what wholesale removal of reads, below some depth, would do to the resulting dataset.
I wrote about points A and B in a previous post. What I wanted to share here is an observation and/or consideration with regard to read depth. At this point, the data I'm working with is exclusively true samples. The ASVs I suspect are contaminants have been removed from the dataset entirely, and any ASV associated with my positive control has been eliminated (the risk of doing that is for another discussion...)
I've filtered the same dataset at 3 different levels: an observation (an element of the OTU matrix) must have more than either 2, 20, or 45 reads. The plot below is generated from these three uniquely-filtered datasets. The plot shows the number of times you'd expect to observe an ASV in N samples... so it's a frequency plot of a frequency... But it's easy to understand: the leftmost bar represents the number of times you'd expect to see an ASV in just one sample in the dataset. It's not a "singleton" read, it's a singleton ASV, regardless of however many reads were in it. The next bar represents the number of times in a dataset you'd expect to see an ASV in two samples. The third bar is how many times you'd see an ASV in 3 samples, and so on.
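For anyone who wants to build the same kind of "frequency of a frequency" table, here's a minimal sketch in Python with NumPy. It is not the code I used for the plot. The function name `prevalence_histogram`, the toy table, and the choice to keep cells with at least `min_reads` reads (rather than strictly more) are all assumptions for illustration, and the table is assumed to be laid out as samples (rows) by ASVs (columns).

```python
import numpy as np

def prevalence_histogram(table, min_reads):
    """Zero out cells below min_reads, then tally ASV prevalence.

    table: samples x ASVs array of read counts (assumed layout).
    Returns the filtered table and hist, where hist[n] is the number
    of ASVs that are still present in exactly n samples.
    """
    table = np.asarray(table)
    # Per-observation filter: a cell survives only with >= min_reads reads.
    filtered = np.where(table >= min_reads, table, 0)
    # Prevalence: in how many samples does each ASV still appear?
    prevalence = (filtered > 0).sum(axis=0)
    # Drop ASVs that the filter removed from every sample.
    prevalence = prevalence[prevalence > 0]
    hist = np.bincount(prevalence, minlength=table.shape[0] + 1)
    return filtered, hist

# Toy table: 3 samples (rows) x 4 ASVs (columns). Made-up numbers.
toy = np.array([
    [50, 3,  0, 1],
    [60, 0, 25, 0],
    [45, 2,  0, 0],
])

for threshold in (2, 20, 45):
    _, hist = prevalence_histogram(toy, threshold)
    print(threshold, hist.tolist())
```

Plotting `hist[1:]` as a bar chart gives the same kind of picture as the figure: the first bar is the count of single-sample ASVs, the second is the count of ASVs found in two samples, and so on.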
What you can see is that the shape of the distribution really changes as a function of how many reads you trim from each sample. When you remove just 1 read per observation (figure A) you end up with a disproportionately large number of ASVs present just once in the dataset. You also get a lot of ASVs in just two samples. But along the x-axis, it's sort of flat(ish) until you get out to about 30 or so counts.
In figures B and C, the distributions are steeper - you have a much greater chance of seeing an ASV in just one or two samples than you do in three or four samples, etc. Note that in figures B and C, we've eliminated any element from our OTU table that had fewer than 20 or 45 reads, respectively.
One thing I found interesting about that figure: the curve is quite steep for singletons and doubletons in figure B, but not for tripletons. Likewise, in figure C, there is a steep curve only for singletons. I think this speaks to an intuitive feature of where you decide to set your trimming thresholds: the higher you set them, the more you bias your dataset toward less rare things (that is, things that are more likely to generate amplicons that are sequenced).
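To put rough numbers on that bias, something like the sketch below can summarize what each threshold leaves behind. Again, this is only an illustration under assumptions: `summarize_threshold` is a made-up helper, the toy counts are invented, the table is assumed to be samples by ASVs, and a cell is kept when it has at least `min_reads` reads.

```python
import numpy as np

def summarize_threshold(table, min_reads):
    """Report what survives a per-observation read threshold."""
    table = np.asarray(table)
    # Cells that pass the filter (empty cells never count as present).
    kept = (table >= min_reads) & (table > 0)
    prevalence = kept.sum(axis=0)  # samples per ASV after filtering
    return {
        "min_reads": min_reads,
        "asvs_retained": int((prevalence > 0).sum()),
        "single_sample_asvs": int((prevalence == 1).sum()),
        "fraction_reads_retained": int(table[kept].sum()) / int(table.sum()),
    }

# Toy table: 3 samples (rows) x 4 ASVs (columns). Made-up numbers.
toy = np.array([
    [50, 3,  0, 1],
    [60, 0, 25, 0],
    [45, 2,  0, 0],
])

for threshold in (2, 20, 45):
    print(summarize_threshold(toy, threshold))
```

On a toy table like this, the low-count ASVs disappear long before many reads do, which is the same pattern as above: raising the threshold costs you rare features much faster than it costs you sequencing depth.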
And like @Mehrbod_Estaki (and others!) have noted, the diversity metrics you're planning on using are impacted by the rarity of your data.
I've found QIIME's interactive plots helpful when thinking about what thresholds to set going into certain diversity tests, but I think looking at the distribution of ASV occurrences across samples is as interesting as looking at read abundances per sample when it comes to deciding whether to remove singletons, doubletons, or twentytons.