Alpha diversity statistics including singletons

arwqiime · February 5, 2025, 9:48am

Most metabarcoding pipelines, such as dada2, do not include singletons in their analysis pipeline. In some situations, however, it is interesting to include singletons, at least for alpha diversity calculations (the singletons were obtained by q2 vsearch cluster-features-de-novo). This paper (Estimating and comparing microbial diversity in the presence of sequencing errors [PeerJ]) described how to handle spurious singletons in alpha statistics.
I looked at the q2 diversity alpha parameters and tried to open the link given in the help menu (https://data.qiime2.org/a_diversity_metrics), but it seems that this link no longer exists.
Is there an updated link available?

Does anybody know if one of the available parameters already corresponds to the procedures described in the paper mentioned above?

Best,

gregcaporaso · February 5, 2025, 8:38pm

Hi @arwqiime,
This document has references for the different available metrics (for both alpha and beta diversity). Does that get you the information that you're looking for?

colinbrislawn · February 5, 2025, 10:52pm

I think it's been moved here: Alpha diversity measures (skbio.diversity.alpha) — scikit-bio 0.6.3 documentation

Related, it's cool to see Dr. Anne Chao still working on alpha diversity. Her 1984 Chao1 method appears to not be a good fit for singleton-free ASV methods like DADA2:

(Thank you @timanix for recommending this paper to me!)

It appears that this problem can be addressed by using DADA2 'pooled'

It looks like Chao has been aware of this problem for a while, so thank you for bringing that paper to my attention! I'll have to look into Hill numbers.
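To illustrate why singleton-free tables are a problem for Chao1, here is a minimal sketch in plain Python of the commonly used bias-corrected form, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of singletons and doubletons. The function name and the counts are made up for illustration; this is not the QIIME 2 or scikit-bio code.

```python
from collections import Counter


def chao1_bias_corrected(counts):
    """Bias-corrected Chao1 for one sample, given per-feature counts."""
    observed = [c for c in counts if c > 0]
    s_obs = len(observed)          # observed richness
    freq = Counter(observed)
    f1 = freq[1]                   # number of singletons
    f2 = freq[2]                   # number of doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical counts for a single sample (not real data).
with_singletons = [10, 8, 5, 2, 2, 1, 1, 1, 1]
without_singletons = [c for c in with_singletons if c != 1]

print(chao1_bias_corrected(with_singletons))     # 11.0
print(chao1_bias_corrected(without_singletons))  # 5.0
```

Once the singletons are gone, F1 is zero and the estimate collapses to the observed richness, so the estimator no longer says anything about unseen features.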

arwqiime · March 17, 2025, 12:12pm

Hello @colinbrislawn and @benjjneb

May I ask a technical question as a follow-up to the post Alpha diversity statistics including singletons, which was closed a few days ago?
I have read the paper on Richness estimation in microbiome data obtained from denoising pipelines - PMC and applied the modified dada2 parameters to my own sequencing data.

According to the paper, the authors chose two modifications to the standard dada2 pipeline:
a) sample-wise processing (standard) vs. pool information across samples (modified)
b) without prior information (standard) vs. include prior information (modified)

After comparing the feature tables obtained by dada2 with and without the 'pseudo' pooling method, I observed a considerable number of 'singletons' in samples, while the total number of ASVs did not change much (about 22 fewer ASVs in 'pseudo' mode than in 'default' mode, out of a total of 2624 ASVs). When looking at various alpha diversity indices (as shown in Fig. 4 of Bardenhorst et al.), I do observe similar overall tendencies of specific indices (ACE, Chao1, Good's coverage, Margalef, Menhinick, and to a lesser extent Shannon) between "independent" and "pseudo" results.
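For anyone who wants to repeat this kind of comparison, a minimal sketch of the counting step could look like the following. It assumes each feature table has already been exported into a nested dict of the form {sample_id: {feature_id: count}} (the export itself is not shown), and it treats a "singleton" as a feature with a count of exactly 1 within a sample. The function names and the two tiny tables are hypothetical.

```python
def singletons_per_sample(table):
    """Return {sample_id: number of features with a count of exactly 1}."""
    return {
        sample: sum(1 for count in features.values() if count == 1)
        for sample, features in table.items()
    }


def total_features(table):
    """Return the number of distinct features seen in any sample."""
    seen = set()
    for features in table.values():
        seen.update(f for f, count in features.items() if count > 0)
    return len(seen)


# Hypothetical tables standing in for the 'independent' and 'pseudo' runs.
independent = {
    "S1": {"ASV1": 120, "ASV2": 15, "ASV3": 2},
    "S2": {"ASV1": 90, "ASV3": 7},
}
pseudo = {
    "S1": {"ASV1": 120, "ASV2": 15, "ASV3": 2},
    "S2": {"ASV1": 90, "ASV2": 1, "ASV3": 7},
}

for name, table in (("independent", independent), ("pseudo", pseudo)):
    print(name, total_features(table), singletons_per_sample(table))
```

In this toy case the total number of features is the same in both tables, but one sample gains a singleton in the 'pseudo' table, which is the kind of pattern described above.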

I have also read the great documentation on Increasing the sensitivity of DADA2 with prior information, which explains the use of prior information during the pseudo-pooling method. Here, three pool methods were tested (independent: pool=FALSE, pseudo-pooling: pool="pseudo", and pooled samples: pool=TRUE). I assume that the pool=TRUE method is not available in q2-dada2.

Since I am using dada2 as implemented in q2-amplicon-2024.10, I am wondering whether --p-pooling-method 'pseudo' includes both modifications from the paper above in the dada2 version of q2-2024.10?

Best regards,

colinbrislawn · March 17, 2025, 3:39pm

Good morning ARW,

I'm not a statistician, so I'm not sure if I should be giving advice here.

I'm also interested in this question, so let's see what others have to say!

benjjneb · March 20, 2025, 3:23pm

The first modification, pooled processing, is equivalent to pool=TRUE and is not available through the Q2 plugin.

The second modification described in the paper is somewhat confusing to me, as it involves providing a user-generated(?) list of priors before denoising? This is also not supported in the Q2 plugin. What is supported is pseudo-pooling, which in its implementation uses a first-pass of sample-by-sample denoising to develop a list of ASVs independently detected in at least two samples that is then passed to a second round of sample-by-sample denoising with that prior information. This is pseudo-pooling, pool="pseudo".
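As a rough illustration of the prior-selection step only, the idea can be sketched in plain Python as below. This is not the DADA2 code: the denoising itself is not reproduced, the function name, the min_samples parameter, and the toy sequences are made up, and the real implementation may differ in detail.

```python
from collections import Counter


def select_priors(first_pass, min_samples=2):
    """Pick sequences detected in at least `min_samples` samples.

    first_pass: {sample_id: iterable of ASV sequences from round one}
    Returns a sorted list of sequences to use as priors in round two.
    """
    samples_with_seq = Counter()
    for seqs in first_pass.values():
        # set() so a sequence counts once per sample
        samples_with_seq.update(set(seqs))
    return sorted(
        seq for seq, n in samples_with_seq.items() if n >= min_samples
    )


# Hypothetical first-pass results (toy sequences, not real ASVs).
first_pass = {
    "S1": ["ACGT", "GGCA"],
    "S2": ["ACGT", "TTAG"],
    "S3": ["GGCA", "ACGT"],
}

print(select_priors(first_pass))  # ['ACGT', 'GGCA']
```

The resulting list is what would then be handed to the second round of sample-by-sample denoising as prior information; "TTAG" is left out because it was only seen in one sample.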

Hopefully that answers your question?

arwqiime · March 20, 2025, 4:40pm

Hello @benjjneb
Thank you for the explanations!
I am aware that the procedure used in the paper was executed in part with third-party tools and is not implemented in dada2 (standalone or q2).

However, I assume that pseudo-pooling in q2-dada2 goes in the same direction as described in the paper, or in your tutorial: identifying rare features in sequencing data by using the dada2-generated ASVs of the first round for the second round.

Great! Thank you for making this possible
Best regards