Singletons and diversity/richness indices

Kat · February 12, 2018, 5:06am

Hello,

After processing my 16S sequences in QIIME 2, I have imported the biom table and taxonomy information in Phyloseq. I would now like to calculate diversity and richness indices. However, I am having troubles as singletons seem to be required to estimate alpha diversity and richness, and my dataset doesn't seem to have any...

During the QIIME 2 processing, I used DADA2 for quality filtering and trimmed the primers and low-quality bases out of my sequences. However, I did not filter out singletons or other features during the pipeline (with feature-table filter-features).

May I please ask for your advice on how to solve this issue?

Thank you very much!

Kat

colinbrislawn · February 12, 2018, 5:07pm

Good morning Kat,

Ah yes, this classic phyloseq warning:

github.com

joey711/phyloseq/blob/master/R/extend_vegan.R#L199-L204


      
          			"The data you have provided does not have\n",
          			"any singletons. This is highly suspicious. Results of richness\n",
          			"estimates (for example) are probably unreliable, or wrong, if you have already\n",
          			"trimmed low-abundance taxa from the data.\n",
          			"\n",
          			"We recommended that you find the un-trimmed data and retry."

The phyloseq developer, Paul J. McMurdie, has strong opinions about how to normalize and filter 16S data sets. This is a friendly warning that your data set my not match his expectations, and mostly serves to raise awareness. (If it was an error, the data would not run, while this warning is shown along with the results.)

I would love to hear from a qiime dev why none of the feature counts in your table were equal to one. "Highly suspicious" indeed!

Colin

Kat · February 13, 2018, 12:24am

Hi @colinbrislawn,

Thank you very much for your answer and insights, yes that's the warning!

Does it mean that we really can't rely on the diversity and richness indices if we have no singleton?

And I am not sure to understand if by "un-trimmed data" this warning message means "data from which taxa haven't been filtered out" or "data in in which the sequences weren't trimmed, such as removing primers or low-quality nucleotides"...

At the moment, my analysis is being done at the family level, so maybe it isn't so unexpected that there is no singleton...?

Cheers,

Kat

thermokarst · February 13, 2018, 1:12pm

Thanks for the sleuthing @colinbrislawn!

@Kat, the DADA2 denoise methods remove chimeras by default --- here is the relevant parameter for chimera detection and removal:

  --p-chimera-method [pooled|none|consensus]
                                  The method used to remove chimeras. "none":
                                  No chimera removal is performed. "pooled":
                                  All reads are pooled prior to chimera
                                  detection. "consensus": Chimeras are
                                  detected in samples individually, and
                                  sequences found chimeric in a sufficient
                                  fraction of samples are removed.  [default:
                                  consensus]

The default method is consensus, which identifies and removes chimeric sequences --- this is probably why you don't have any feature counts of one in your resulting features table (@benjjneb, please correct me if I am wrong!). If you are interested, you could re-rerun your denoising step with --p-chimera-method none, then you can compare the results!

As far as that phyloseq message, I too would be curious to know if that message is still applicable in the world of ASVs, maybe @benjjneb can comment on that? I don't think Paul McMurdie is on this forum (at least not under a username I could find), so your best bet might be to ask these phyloseq questions on the phyloseq support page --- if you do, please report back here on what you learn!

Thanks!

benjjneb · February 13, 2018, 4:07pm

You don't need any singletons to calculate alpha-diversity metrics like Shannon/Simpson. You do need singletons to try to estimate "richness", the total number of seen and unseen types. But you almost certainly shouldn't be trying to estimate richenss... copying this post from the dada2 github issues tracker:

Short answer on how to calculate richness: Don't.

Long answer (with why): The problem of calculating richness comes down to the problem of estimating the number of types (ASVs here) that were observed zero times in the data. For some background, see the two lectures by Amy Willis on alpha-diversity at STAMPS this year: https://stamps.mbl.edu/index.php/Schedule

The information on the zero-observation class comes almost entirely from the number of things you saw 1 time, and 2 times. That is because you are trying to estimate how many rare things are around that you didn't see, and the rare things you did see (that inform you about the unseen others) will be seen 1 or 2 times. Different methods (e.g. rarefaction or Chao's S1) all have that basic dependence.

The problem is that the various statistical techniques that have been developed don't account the most important error mode in next-gen amplicon data: some sequences are being misclassified as new types that simply don't exist in reality.

That is, some errors/chimeras/artefacts/contaminants get interpreted as real variants (because they are different enough) and those show up largely in the singleton-doubleton class and can almost entirely drive richness estimates to nonsensical values. The literature is replete with massive order-of-magnitude richness overestimates due to this.

DADA2 confronts the difficult problem of calling very low-frequency things by not calling singletons, because its too hard to get right and the FPs outweight the FNs (some other methods, e.g. UPARSE, do the same). However, that means all those old richness estimate methods break explicitly, rather than just failing silently (i.e. by giving terribly wrong values).

I haven't seen a method yet that I believe is accurate enough at calling singletons/doubletons in deep NGS amplicon data that would make richness estimates worth doing. Pooled DADA2 is maybe as close as you'll get, but I still wouldn't do it.

There is more there in that github discussion as well from Joey McMurdie and Amy Willis, so I recommend checking it out.

You can also see the dada2 documentation on pooling and pseudo-pooling (new feature) that do allow per-sample singletons to be detected, at the cost of higher computation time. Right now only independent sample inference is available through the plugin, but we intend to expose the pooling options to the plugin soon, you can follow progress on that here: Add pooling options to Q2 workflows · Issue #87 · qiime2/q2-dada2 · GitHub

ebolyen · February 13, 2018, 4:13pm

Here's a short article by Amy on the same topic: https://www.biorxiv.org/content/early/2017/12/11/231878

colinbrislawn · February 13, 2018, 4:35pm

Good morning, Kat,

Yes, yes it could! When merging OTUs by their taxonomy, it seems reasonable to me that every family could appear >1 times and triggering this warning.

This is seperate from the larger discussion about measuring richness raised in this thread.

Colin

Edit: When I said 'untrimmed' I meant "data from which taxa haven’t been filtered out." Preprocessing quality control is still needed.

Kat · February 14, 2018, 5:16am

Hi @colinbrislawn, @thermokarst, @benjjneb and @ebolyen,

I wish to deeply thank you all for your detailed explanations and very interesting references to other discussions and articles! I will take a little bit of time to absorb and explore everything. Of course, I'll let you know if I learn anything more during this process and from the phyloseq support page.

Many thanks again!

Kat