Singletons and diversity/richness indices

benjjneb · February 13, 2018, 4:07pm

You don't need any singletons to calculate alpha-diversity metrics like Shannon/Simpson. You do need singletons to try to estimate "richness", the total number of seen and unseen types. But you almost certainly shouldn't be trying to estimate richness... copying this post from the dada2 GitHub issues tracker:

Short answer on how to calculate richness: Don't.

Long answer (with why): The problem of calculating richness comes down to the problem of estimating the number of types (ASVs here) that were observed zero times in the data. For some background, see the two lectures by Amy Willis on alpha-diversity at STAMPS this year: https://stamps.mbl.edu/index.php/Schedule

The information on the zero-observation class comes almost entirely from the number of things you saw 1 time, and 2 times. That is because you are trying to estimate how many rare things are around that you didn't see, and the rare things you did see (that inform you about the unseen others) will be seen 1 or 2 times. Different methods (e.g. rarefaction or Chao's S1) all have that basic dependence.
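To make that dependence concrete, here is a minimal Python sketch of the bias-corrected Chao1 estimator, assuming that is what "Chao's S1" refers to. The function name `chao1` and the read counts are made up for illustration and are not taken from DADA2 or any package. The point is only that the estimated number of unseen types is computed from the singleton count (F1) and doubleton count (F2), and from nothing else in the abundance distribution.

```python
def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1*(F1-1) / (2*(F2+1))."""
    observed = [c for c in counts if c > 0]
    s_obs = len(observed)                      # types seen at least once
    f1 = sum(1 for c in observed if c == 1)    # singletons
    f2 = sum(1 for c in observed if c == 2)    # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical per-ASV read counts for one sample.
sample = [100, 50, 20, 5, 2, 2, 1, 1, 1, 1]
# Same rare tail, but the abundant ASVs are sequenced 10x deeper.
deeper = [1000, 500, 200, 50, 2, 2, 1, 1, 1, 1]

print(chao1(sample))   # 12.0
print(chao1(deeper))   # 12.0
```

Both samples give the same estimate because they share the same four singletons and two doubletons; the counts of the abundant ASVs never enter the correction term.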

The problem is that the various statistical techniques that have been developed don't account for the most important error mode in next-gen amplicon data: some sequences are being misclassified as new types that simply don't exist in reality.

That is, some errors/chimeras/artefacts/contaminants get interpreted as real variants (because they are different enough) and those show up largely in the singleton-doubleton class and can almost entirely drive richness estimates to nonsensical values. The literature is replete with massive order-of-magnitude richness overestimates due to this.
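A toy illustration of that effect, in Python, under invented numbers: a sample with ten real ASVs, and the same sample with 50 spurious singletons added. The helper names `chao1` and `shannon` and all counts are hypothetical, and the bias-corrected Chao1 formula stands in for richness estimators generally. This is a sketch of the mechanism, not a simulation of real sequencing error.

```python
import math

def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1*(F1-1) / (2*(F2+1))."""
    observed = [c for c in counts if c > 0]
    f1 = sum(1 for c in observed if c == 1)    # singletons
    f2 = sum(1 for c in observed if c == 2)    # doubletons
    return len(observed) + f1 * (f1 - 1) / (2 * (f2 + 1))

def shannon(counts):
    """Shannon index (natural log) from raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Ten real ASVs (hypothetical counts), no singletons.
clean = [5000, 3000, 1000, 500, 300, 100, 50, 30, 10, 10]
# The same sample plus 50 artefactual singleton "variants".
noisy = clean + [1] * 50

print(chao1(clean), chao1(noisy))      # 10.0 1285.0
print(shannon(clean), shannon(noisy))  # differ by well under 0.1
```

Fifty spurious reads out of roughly ten thousand move the richness estimate by two orders of magnitude, while Shannon barely changes because each singleton carries almost no relative abundance.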

DADA2 confronts the difficult problem of calling very low-frequency things by not calling singletons, because it's too hard to get right and the false positives outweigh the false negatives (some other methods, e.g. UPARSE, do the same). However, that means all those old richness estimate methods break explicitly, rather than just failing silently (i.e. by giving terribly wrong values).
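As a rough sketch of what "breaking explicitly" could look like, the hypothetical Python function below refuses to return a richness estimate when the input has no singletons, instead of quietly reporting the observed count as if nothing were unseen. The name `chao1_strict`, the error message, and the counts are all assumptions for illustration; real packages handle this case in their own ways.

```python
def chao1_strict(counts):
    """Bias-corrected Chao1 that raises when there are no singletons."""
    observed = [c for c in counts if c > 0]
    f1 = sum(1 for c in observed if c == 1)    # singletons
    f2 = sum(1 for c in observed if c == 2)    # doubletons
    if f1 == 0:
        # Without singletons there is no information about unseen types.
        raise ValueError("no singletons in the data; richness is not estimable")
    return len(observed) + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical denoised sample in which no singletons were called.
denoised = [5000, 3000, 1000, 500, 300, 100, 50, 30, 10, 2]

try:
    chao1_strict(denoised)
except ValueError as err:
    print("refused:", err)
```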

I haven't yet seen a method that I believe is accurate enough at calling singletons/doubletons in deep NGS amplicon data to make richness estimates worth doing. Pooled DADA2 is maybe as close as you'll get, but I still wouldn't do it.

There is more in that GitHub discussion as well from Joey McMurdie and Amy Willis, so I recommend checking it out.

You can also see the dada2 documentation on pooling and pseudo-pooling (a new feature), which do allow per-sample singletons to be detected, at the cost of higher computation time. Right now only independent sample inference is available through the plugin, but we intend to expose the pooling options to the plugin soon. You can follow progress on that here: Add pooling options to Q2 workflows · Issue #87 · qiime2/q2-dada2 · GitHub