After processing my 16S sequences in QIIME 2, I have imported the biom table and taxonomy information in Phyloseq. I would now like to calculate diversity and richness indices. However, I am having troubles as singletons seem to be required to estimate alpha diversity and richness, and my dataset doesn't seem to have any...
During the QIIME 2 processing, I used DADA2 for quality filtering and trimmed the primers and low-quality bases out of my sequences. However, I did not filter out singletons or other features during the pipeline (with feature-table filter-features).
May I please ask for your advice on how to solve this issue?
The phyloseq developer, Paul J. McMurdie, has strong opinions about how to normalize and filter 16S data sets. This is a friendly warning that your data set my not match his expectations, and mostly serves to raise awareness. (If it was an error, the data would not run, while this warning is shown along with the results.)
I would love to hear from a qiime dev why none of the feature counts in your table were equal to one. "Highly suspicious" indeed!
Thank you very much for your answer and insights, yes that's the warning!
Does it mean that we really can't rely on the diversity and richness indices if we have no singleton?
And I am not sure to understand if by "un-trimmed data" this warning message means "data from which taxa haven't been filtered out" or "data in in which the sequences weren't trimmed, such as removing primers or low-quality nucleotides"...
At the moment, my analysis is being done at the family level, so maybe it isn't so unexpected that there is no singleton...?
@Kat, the DADA2 denoise methods remove chimeras by default --- here is the relevant parameter for chimera detection and removal:
--p-chimera-method [pooled|none|consensus]
The method used to remove chimeras. "none":
No chimera removal is performed. "pooled":
All reads are pooled prior to chimera
detection. "consensus": Chimeras are
detected in samples individually, and
sequences found chimeric in a sufficient
fraction of samples are removed. [default:
consensus]
The default method is consensus, which identifies and removes chimeric sequences --- this is probably why you don't have any feature counts of one in your resulting features table (@benjjneb, please correct me if I am wrong!). If you are interested, you could re-rerun your denoising step with --p-chimera-method none, then you can compare the results!
As far as that phyloseq message, I too would be curious to know if that message is still applicable in the world of ASVs, maybe @benjjneb can comment on that? I don't think Paul McMurdie is on this forum (at least not under a username I could find), so your best bet might be to ask these phyloseq questions on the phyloseq support page --- if you do, please report back here on what you learn!
You don't need any singletons to calculate alpha-diversity metrics like Shannon/Simpson. You do need singletons to try to estimate "richness", the total number of seen and unseen types. But you almost certainly shouldn't be trying to estimate richenss... copying this post from the dada2 github issues tracker:
Short answer on how to calculate richness: Don't.
Long answer (with why): The problem of calculating richness comes down to the problem of estimating the number of types (ASVs here) that were observed zero times in the data. For some background, see the two lectures by Amy Willis on alpha-diversity at STAMPS this year: https://stamps.mbl.edu/index.php/Schedule
The information on the zero-observation class comes almost entirely from the number of things you saw 1 time, and 2 times. That is because you are trying to estimate how many rare things are around that you didn't see, and the rare things you did see (that inform you about the unseen others) will be seen 1 or 2 times. Different methods (e.g. rarefaction or Chao's S1) all have that basic dependence.
The problem is that the various statistical techniques that have been developed don't account the most important error mode in next-gen amplicon data: some sequences are being misclassified as new types that simply don't exist in reality.
That is, some errors/chimeras/artefacts/contaminants get interpreted as real variants (because they are different enough) and those show up largely in the singleton-doubleton class and can almost entirely drive richness estimates to nonsensical values. The literature is replete with massive order-of-magnitude richness overestimates due to this.
DADA2 confronts the difficult problem of calling very low-frequency things by not calling singletons, because its too hard to get right and the FPs outweight the FNs (some other methods, e.g. UPARSE, do the same). However, that means all those old richness estimate methods break explicitly, rather than just failing silently (i.e. by giving terribly wrong values).
I haven't seen a method yet that I believe is accurate enough at calling singletons/doubletons in deep NGS amplicon data that would make richness estimates worth doing. Pooled DADA2 is maybe as close as you'll get, but I still wouldn't do it.
There is more there in that github discussion as well from Joey McMurdie and Amy Willis, so I recommend checking it out.
You can also see the dada2 documentation on pooling and pseudo-pooling (new feature) that do allow per-sample singletons to be detected, at the cost of higher computation time. Right now only independent sample inference is available through the plugin, but we intend to expose the pooling options to the plugin soon, you can follow progress on that here: Add pooling options to Q2 workflows · Issue #87 · qiime2/q2-dada2 · GitHub
I wish to deeply thank you all for your detailed explanations and very interesting references to other discussions and articles! I will take a little bit of time to absorb and explore everything. Of course, I'll let you know if I learn anything more during this process and from the phyloseq support page.