Should alpha diversity with kmerizer be recalculated based on results of alpha-rarefaction?

Greetings Qiime2 team, :qiime2: :hot_beverage:

I am a new user of qiime2, analyzing ITS amplicons in the q2-boots-amplicon-2025.7 conda environment. I am not experienced in working with alpha and beta diversity measures.

I hope that General Discussion is the right category to post this question about alpha diversity. Please let me know if otherwise.

I believe I understand the basic interpretation of the two types of graphs that are the outputs of the alpha-rarefaction action:

The top graph shows that a read depth of around 50,000 per sample was sufficient to capture the total Shannon diversity present in each sample.

The bottom graph shows that for this metadata column, "Season", a sampling depth of 50,000 would include all of my samples in order to calculate the Shannon index. However, a sampling depth of 275,000 would exclude a few samples from each category (losing 2 samples from the Wet Season, and 5 samples from the Dry Season categories).

My alpha diversity measures were calculated using kmer-diversity action with a sampling depth of 275,000, like this:

qiime boots kmer-diversity \
  --i-table table.qza \
  --i-sequences asvs.qza \
  --m-metadata-file metadata.tsv \
  --p-sampling-depth 275000 \
  --p-n 10 \
  --p-replacement \
  --p-alpha-average-method median \
  --p-beta-average-method medoid \
  --p-alpha-metrics pielou_e \
  --p-alpha-metrics observed_features \
  --p-alpha-metrics shannon \
  --p-beta-metrics aitchison \
  --p-beta-metrics jaccard \
  --output-dir path/

I get that the sampling depth chosen for kmer-diversity is independent of the maximum depth chosen for alpha-rarefaction. I chose a max depth of 325,000 for performing alpha-rarefaction, like this:

qiime diversity alpha-rarefaction \
  --i-table table.qza \
  --p-max-depth 325000 \
  --m-metadata-file metadata.tsv \
  --o-visualization path/

If I were to do statistical significance testing of alpha measures between Wet and Dry season at this kmer-diversity sampling depth of 275,000, I can be confident that any difference (or lack of one) detected in the Shannon measure between these two groups will be based on most of my samples, and should therefore be representative of my library as a whole.

But, what if that were not the case for a different metadata category; say Male vs Female? At a sampling depth of 50,000 it includes most of my samples, but for 275,000, all but one Male sample drops out. Is the recommended practice to then re-run kmer-diversity with the sampling depth of 50,000? Or are we recommended to stick to the higher sampling depth, and just say insufficient sample size for M v F?
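The trade-off described here can be tabulated before committing to a depth. Below is a minimal sketch, assuming that a sample is dropped whenever its total read count is below the sampling depth. The sample IDs, read totals, and group labels are hypothetical and are not taken from this dataset; in practice the totals would come from a feature table summary.

```python
# Hypothetical per-sample read totals and metadata (illustration only).
sample_depths = {
    "S1": 310_000, "S2": 290_000, "S3": 120_000, "S4": 60_000,
    "S5": 280_000, "S6": 75_000, "S7": 55_000, "S8": 300_000,
}
sex = {
    "S1": "Female", "S2": "Female", "S3": "Male", "S4": "Male",
    "S5": "Male", "S6": "Male", "S7": "Female", "S8": "Female",
}


def retained_per_group(depths, groups, sampling_depth):
    """Count, per metadata group, the samples with at least `sampling_depth` reads."""
    counts = {group: 0 for group in sorted(set(groups.values()))}
    for sample, total in depths.items():
        if total >= sampling_depth:
            counts[groups[sample]] += 1
    return counts


# Compare how many samples each candidate depth keeps in every group.
for depth in (50_000, 275_000):
    print(depth, retained_per_group(sample_depths, sex, depth))
```

Running the same tabulation for every metadata column of interest shows at a glance which comparisons would be left with too few samples at a given depth.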

My confusion is about this part (from tutorials): "When grouping samples by metadata, it is therefore essential to look at the bottom plot to ensure that the data presented in the top plot is reliable." I understand how to use the alpha-rarefaction plots to identify the sweet spot that maximizes both the number of samples retained and the diversity measured. But I'm confused about whether/how to apply that information so that our final alpha and beta diversity measures are based on that ideal sampling depth.

My questions are:

  1. Since my sampling depth for kmer-diversity was 275,000, does that mean the alpha diversity measures that were outputted from kmer-diversity are excluding some of my wet and dry samples, as described above?
  2. Does it also mean that I can/should rerun kmer-diversity with a sampling depth of 50,000, so that the new alpha diversity measures would include all of my samples in the Wet and Dry categories?
  3. From the moving pictures tutorial, it seems another use of the alpha-rarefaction plot is in choosing sampling depths for core-metrics-phylogenetic actions (e.g. the faith-pd alpha measure), which require phylogenetic trees. There's not currently a good method to make trees with ITS sequences of uneven lengths, so instead, I would use the result of alpha-rarefaction to choose the sampling depth for kmer-diversity, a non-tree method of measuring diversity for ITS, as described in question 2 above. Am I understanding the uses of the alpha-rarefaction curves correctly for tree and kmer approaches, i.e. the output of rarefaction can be used to determine input sampling depths of diversity measures?
  4. Tutorials such as gut-to-soil and moving pictures have the diversity measures calculated first, and then alpha-rarefaction is run next. But, alpha-rarefaction does not require kmer-diversity output to run, for example. So, why not run alpha-rarefaction first, so it can be used to choose a sampling depth for kmer-diversity?

The fact that rarefaction is not run first in the tutorials to determine sampling depth for measuring diversity, is really making me question whether I am understanding any of this correctly.

Thanks very much for any info or suggestions you could provide! :teacup_without_handle:

Hi @sibilant ,

Short answer: yes, I would also run alpha-rarefaction first to select a suitable sampling depth based on standard diversity metrics, then use that depth for all subsequent diversity tests (including with kmerizer).

Correct.

Yes definitely. 50k is also a very high sampling depth, and based on your alpha-rarefaction visualization it should suffice. That said, you should increase the number of steps or run a lower max-depth to increase resolution in the 10-50k range, as this is where the curve will "bend" more. Also check out the observed features metric, which will be more sensitive to sampling depth.
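To see why observed features tends to react more strongly to sampling depth than Shannon does, here is a small simulation. It is a sketch under stated assumptions: the community (three dominant features plus 100 rare ones) is made up, subsampling is done without replacement, and Shannon is computed in base 2. It is not the QIIME 2 implementation.

```python
import math
import random


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from a list of feature counts."""
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    out = [0] * len(counts)
    for i in rng.sample(pool, depth):
        out[i] += 1
    return out


def observed_features(counts):
    """Number of features with at least one read."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon entropy in bits (log base 2)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Hypothetical community: 3 dominant features and 100 rare ones (10,000 reads).
community = [5000, 3000, 1500] + [5] * 100

rng = random.Random(42)
for depth in (100, 500, 2000, 10_000):
    sub = rarefy(community, depth, rng)
    print(depth, observed_features(sub), round(shannon(sub), 3))
```

In a community like this one, rare features are the first to be lost at shallow depths, which lowers richness directly, while Shannon is driven mostly by the dominant features.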

Yes!

Yes perfect. These metrics will also rely on rarefaction/bootstrapping (if run via q2-boots) to normalize sampling depth. Rarefaction is performed before decomposing sequences into kmers, so the same sampling depth that you select for standard ASV/OTU-based metrics should be used for kmerizer (and tree-based metrics).
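The order of operations described above (normalize depth first, then decompose into kmers) can be sketched as follows. This is a simplified illustration with hypothetical toy sequences and counts; the helper names `rarefy` and `kmerize` are made up for this example and do not correspond to q2-boots or q2-kmerizer functions.

```python
import random
from collections import Counter

# Hypothetical toy data: three ASVs in one sample and their sequences.
sequences = {
    "asv1": "ACGTACGTAC",
    "asv2": "ACGTTTGTAC",
    "asv3": "GGGTACGTAA",
}
asv_counts = {"asv1": 60, "asv2": 30, "asv3": 10}


def rarefy(counts, depth, rng):
    """Draw `depth` reads without replacement from one sample's ASV counts."""
    pool = [asv for asv, n in counts.items() for _ in range(n)]
    return Counter(rng.sample(pool, depth))


def kmerize(counts, seqs, k):
    """Add each ASV's count to every k-mer window found in its sequence."""
    kmer_counts = Counter()
    for asv, n in counts.items():
        seq = seqs[asv]
        for i in range(len(seq) - k + 1):
            kmer_counts[seq[i:i + k]] += n
    return kmer_counts


depth, k = 50, 4
rarefied = rarefy(asv_counts, depth, random.Random(0))  # step 1: normalize depth
kmer_table = kmerize(rarefied, sequences, k)            # step 2: decompose into k-mers
print(sum(rarefied.values()), len(rarefied), len(kmer_table))
```

Because the subsampling acts on ASV counts, the depth is expressed in reads per sample, which is why a depth chosen from ASV-based rarefaction curves carries over to the kmer step.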

Yes I agree. Personally, I always run alpha-rarefaction first to select a suitable sampling depth for diversity analyses. I am not sure why it is the other way around in the tutorials but I think it might be so that alpha-rarefaction does not break up the narrative, e.g., when this tutorial is used in teaching workshops.

It sounds like you are understanding everything! The tutorials are meant to explain basic workflows but are not a cookie-cutter recipe for "the right way" to use QIIME 2... there are many possible options for processing data depending on the research need.

One last thing: check the original publication for q2-kmerizer to read the caveats for using this method for alpha diversity estimates. Alpha diversity estimation and interpretation with this method can be a bit more complicated than with standard alpha diversity metrics, because the different parameter settings will also impact the results. These parameters give you more control, but you need to use them carefully, as they will also affect interpretation of some alpha diversity metrics.

Good luck!

Good morning @Nicholas_Bokulich; thanks so much for the helpful reply!
Everything makes sense. :partying_face:

Thanks for the heads-up! I did enjoy both the q2-boots and q2-kmerizer papers.

For alpha diversity estimation, the kmer-specific caveats as I understood them were that:

  1. kmers will deliver a higher alpha diversity measure on average than ASV methods, because each ASV contains many kmers. So, one should not compare Shannon diversity between studies that use ASV vs kmerizer measures (apples to oranges). Likewise for comparing studies where different kmer lengths or sampling depths were used. For this caveat, I plan to also run the ASV-based (non-tree) alpha measures, to help build my intuition for interpreting kmer-based measures, and for cross-study comparison of non-tree ASV measures.
  2. q2-kmerizer parameter settings such as the optional TfidfVectorizer and/or prevalence-based filtering (e.g. the filter-features-conditionally action) can modulate the effect on alpha measures of a few ASVs with an unusually large number of unique kmers compared to other ASVs in that sample. This is relevant here, as we seem to have stumbled on a truly novel taxon at relatively high abundance in many of our samples (a 330 bp read with no meaningful homology from BLASTn, and it doesn't seem to be an artifact such as a chimera). Thankfully there are a handful of samples where this feature is absent, so I can do a few sensitivity analyses on a subset of library samples with and without this feature to try and choose reasonable parameter settings for TfidfVectorizer. I can use the ASV-based measures to help me evaluate the results of these sensitivity tests. If I understand your paper's discussions, I would anticipate NOT using the optional TfidfVectorizer for best results in this case.
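Both caveats can be illustrated with a toy richness comparison. The sketch below assumes made-up 16 bp sequences and a kmer length of 4 (neither comes from this dataset): two near-identical ASVs share most of their kmers, while one unrelated "novel" ASV contributes only kmers of its own.

```python
def kmers(seq, k):
    """Set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_richness(seqs, k):
    """Number of distinct k-mers across all sequences in a sample."""
    found = set()
    for seq in seqs:
        found |= kmers(seq, k)
    return len(found)


# Hypothetical toy sequences (illustration only).
asv_a = "ACGTACGTACGTAAGG"
asv_b = "ACGTACGTACGTAAGC"      # differs from asv_a by one base
asv_novel = "TTGCCATGGATCCAAT"  # shares no 4-mers with asv_a or asv_b

without_novel = [asv_a, asv_b]
with_novel = [asv_a, asv_b, asv_novel]

k = 4
print("ASV richness:", len(without_novel), "->", len(with_novel))
print("kmer richness:", kmer_richness(without_novel, k), "->", kmer_richness(with_novel, k))
```

Adding one divergent ASV raises ASV richness by one but raises kmer richness by many more, which is the kind of effect a with/without sensitivity analysis on a subset of samples would be looking for.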

Thanks again for your helpful reply, and the great new software tools! :wrapped_gift:

Hi @sibilant ,

Yes exactly X2!

Yeah, the tfidf is selecting the most variable features, so it is okay for beta diversity (where you often want to highlight differences between samples), but what this really means for alpha diversity is less certain. There is also the max features parameter, which will filter based on rank abundance. This is not so dissimilar from doing abundance-based filters of ASVs prior to diversity analysis, so it will not distort alpha diversity results per se, but it is based on count, not on an abundance threshold, so it just needs to be used with caution and interpreted in context. This is an important parameter to use when running large datasets, because otherwise the feature tables get massive and full of singletons and memory explodes... but it will influence alpha diversity results (beta diversity less so), so just keep this in mind during interpretation.
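A rough sketch of how a rank-based cap on features can change per-sample richness follows. It assumes features are ranked by their total count across all samples, which may not match how q2-kmerizer ranks them internally; the table contents and the `top_n_features` helper are hypothetical.

```python
from collections import Counter


def top_n_features(table, n):
    """Keep only the n features with the highest total count across samples."""
    totals = Counter()
    for counts in table.values():
        totals.update(counts)
    keep = {feature for feature, _ in totals.most_common(n)}
    return {
        sample: {f: c for f, c in counts.items() if f in keep}
        for sample, counts in table.items()
    }


def richness(table):
    """Observed features per sample."""
    return {sample: len(counts) for sample, counts in table.items()}


# Hypothetical kmer count table: sample -> {kmer: count}.
table = {
    "S1": {"AAAA": 50, "CCCC": 30, "GGGG": 1, "TTTT": 1},
    "S2": {"AAAA": 40, "CCCC": 5, "ACGT": 2},
}

filtered = top_n_features(table, 3)
print("before:", richness(table))
print("after: ", richness(filtered))
```

In this toy table the cap removes rare features from one sample but none from the other, so richness shifts unevenly between samples, which is one reason to interpret alpha diversity in the context of the setting used.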

Oh, that makes perfect sense about the max features param for large datasets re: memory management. Thanks again so much! :qiime2: :flexed_biceps: