q2-kmerizer: a QIIME 2 plugin for k-mer-based diversity analysis

Nicholas_Bokulich · February 21, 2025, 12:08pm

Hi everyone, I am excited to share a new plugin, q2-kmerizer, for k-mer-based diversity analysis, supervised learning, and more. It allows comparisons of samples/communities while accounting for the genetic relatedness of the ASVs/OTUs observed in each community.

Installation instructions/source code: GitHub - bokulich-lab/q2-kmerizer: QIIME 2 plugin for generating kmers from biological sequences
Article: https://doi.org/10.1128/msystems.01550-24

K-mer decomposition is the process of breaking down DNA sequences into their constituent k-mers (subsequences of length k). This provides a rapid method for comparing sequences by evaluating subsequence information. The presence/absence and frequency of individual k-mers in a sequence can be used to measure the similarity of DNA sequences, e.g., for taxonomic classification, as well as to measure genetic relatedness. K-mer decomposition is already used in some places in QIIME 2, e.g., in q2-feature-classifier for taxonomic classification using naive Bayes classifiers.
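
To make the idea concrete, here is a minimal plain-Python sketch of k-mer decomposition. It is only an illustration and not the code used by q2-kmerizer or q2-feature-classifier; the function names and the two toy sequences are made up for this example.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping subsequence of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_similarity(a, b):
    """Fraction of distinct k-mers shared by two k-mer count tables."""
    kmers_a, kmers_b = set(a), set(b)
    union = kmers_a | kmers_b
    return len(kmers_a & kmers_b) / len(union) if union else 1.0


# Hypothetical toy sequences that differ at a single position.
seq_a = "ACGTACGT"
seq_b = "ACGTTCGT"

counts_a = kmer_counts(seq_a, k=4)
counts_b = kmer_counts(seq_b, k=4)

print(counts_a)  # frequency of each 4-mer in seq_a
print(jaccard_similarity(counts_a, counts_b))  # presence/absence similarity
```

Both the frequencies (the Counter values) and the presence/absence information (the Counter keys) are available from the same decomposition, so either can be used to compare sequences.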

q2-kmerizer uses k-mer decomposition to generate k-mer frequency profiles (as FeatureTable[Frequency] Artifacts) for microbial communities, by counting the frequency of individual k-mers from each sequence detected in a community. This allows you to compare entire communities based on their genetic similarity, e.g., using beta diversity, supervised learning, or (with some caveats) alpha diversity measured on k-mer frequency profiles instead of ASV/OTU/taxon counts. Hence, diversity estimates, supervised learning, and other analyses can compare samples/communities based on subsequence information (i.e., genetic signatures), taking this genetic information into account when measuring distances or making predictions. This complements standard diversity metrics by providing a different "view" of an ecosystem: standard metrics (which do not account for genetic similarity) treat all features as equally unrelated when making estimates based on presence/abundance of individual features, whereas k-mer decomposition allows these estimates to be weighted by genetic similarity, so that samples that contain genetically similar features become more similar (as they share more k-mers), and those that contain genetically dissimilar features become more dissimilar (as they share fewer k-mers). Hence, k-mer-based diversity metrics provide similar diversity estimates to phylogeny-aware diversity metrics (Fig 1; see article above for full details). But k-mer profiles can be used for more than just diversity estimates; the profiles are created as a feature table, which can be used with many downstream methods in QIIME 2, e.g., for machine learning with q2-sample-classifier (for which k-mer-based predictions also show some advantages).
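
A toy sketch of how a per-sample k-mer profile changes the picture, compared with counting ASVs directly. This is an illustration in plain Python, not the plugin's implementation: the ASV IDs, sequences, and counts are invented, and weighting each ASV's k-mer counts by its abundance in the sample is an assumption made here for simplicity.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a single sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def sample_kmer_profile(asv_counts, asv_seqs, k):
    """Sum per-ASV k-mer counts, weighted by ASV abundance (an assumption)."""
    profile = Counter()
    for asv_id, abundance in asv_counts.items():
        for kmer, n in kmer_counts(asv_seqs[asv_id], k).items():
            profile[kmer] += n * abundance
    return profile


def jaccard_distance(a, b):
    """1 - (shared features / all features), using presence/absence only."""
    feats_a, feats_b = set(a), set(b)
    union = feats_a | feats_b
    return 1 - len(feats_a & feats_b) / len(union) if union else 0.0


# Hypothetical data: asv1 and asv2 differ by one base; asv3 is unrelated.
asv_seqs = {
    "asv1": "ACGTACGTAC",
    "asv2": "ACGTACGTAG",
    "asv3": "TTGGCCAATT",
}
samples = {
    "A": {"asv1": 10},
    "B": {"asv2": 10},
    "C": {"asv3": 10},
}

k = 4
profiles = {name: sample_kmer_profile(counts, asv_seqs, k)
            for name, counts in samples.items()}

# ASV-level distances: no two samples share an ASV, so all look equally distant.
asv_dist_ab = jaccard_distance(samples["A"], samples["B"])
asv_dist_ac = jaccard_distance(samples["A"], samples["C"])

# k-mer-level distances: A and B share most k-mers, A and C share none.
kmer_dist_ab = jaccard_distance(profiles["A"], profiles["B"])
kmer_dist_ac = jaccard_distance(profiles["A"], profiles["C"])

print(asv_dist_ab, asv_dist_ac)
print(kmer_dist_ab, kmer_dist_ac)
```

At the ASV level, samples A and B are as distant from each other as A is from C, because none of them share a feature. On the k-mer profiles, A and B move much closer together because their ASVs are nearly identical, while A and C stay maximally distant.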

Figure 1. Comparison of Jaccard distance PCoA based on ASV frequency (left) and k-mer frequency (middle) to UniFrac distance based on ASVs (right), using data from the Earth Microbiome Project.

K-mer-based analysis has several advantages that make it a flexible and useful complement to existing approaches for diversity and other analyses:

K-mer counting allows rapid diversity comparison between samples, yielding ordinations that are well correlated with those derived from phylogenetically aware diversity metrics.
No sequence alignment/phylogeny estimation is necessary. Hence, k-mer counting allows very rapid comparison of samples based on sequence similarity, e.g., for diversity estimates or supervised learning, making such analyses orders of magnitude faster than alignment/phylogeny estimation methods (e.g., seconds for k-mer analysis vs. hours for alignment/phylogeny in a typical 16S rRNA gene sequencing experiment).
This approach also allows sequence-based diversity estimates for non-coding regions with high mutation rates and poor phylogenetic signal, e.g., ITS sequences, for which phylogeny-aware metrics can be inappropriate (due to the inaccurate phylogenies with distorted branch lengths that these regions can sometimes generate).
K-mer counting can be used with any standard diversity metric, providing a high degree of flexibility and compatibility. Hence, k-mer profiles can be compared with both qualitative (e.g., Jaccard distance) and quantitative metrics (e.g., Bray-Curtis dissimilarity), as well as compositionally aware diversity metrics (e.g., Aitchison distance).
Totally reference-free! No reference data are used when counting k-mers, eliminating the need for curated reference sequences, alignments, or phylogenies.
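
As a small illustration of the point about metric choice, here is a hedged plain-Python sketch comparing two k-mer frequency profiles with a qualitative metric (Jaccard distance, presence/absence only) and a quantitative one (Bray-Curtis dissimilarity, which uses the frequencies). The two profiles are hypothetical, and in practice you would use the diversity methods already available in QIIME 2 rather than hand-written functions like these.

```python
def jaccard_distance(a, b):
    """Qualitative: compares which k-mers are present, ignoring frequency."""
    present_a = {kmer for kmer, n in a.items() if n > 0}
    present_b = {kmer for kmer, n in b.items() if n > 0}
    union = present_a | present_b
    return 1 - len(present_a & present_b) / len(union) if union else 0.0


def bray_curtis(a, b):
    """Quantitative: compares k-mer frequencies between two profiles."""
    kmers = set(a) | set(b)
    diff = sum(abs(a.get(kmer, 0) - b.get(kmer, 0)) for kmer in kmers)
    total = sum(a.values()) + sum(b.values())
    return diff / total if total else 0.0


# Hypothetical profiles: same k-mers present, very different frequencies.
sample_x = {"ACGT": 30, "CGTA": 10}
sample_y = {"ACGT": 10, "CGTA": 30}

print(jaccard_distance(sample_x, sample_y))  # identical k-mer membership
print(bray_curtis(sample_x, sample_y))       # frequencies differ
```

The two samples contain exactly the same k-mers, so a presence/absence metric sees them as identical, while an abundance-based metric still separates them. Which view is more useful depends on the question being asked.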

For usage instructions, see the GitHub repository linked above. To read more about q2-kmerizer, its uses and benefits, as well as important caveats to consider, see the mSystems article linked above. For any questions, bug reports, etc., please open a new issue on this forum. Thanks in advance!

jwdebelius · February 21, 2025, 2:36pm

This looks super cool! Thanks @Nicholas_Bokulich and the team behind it!

gregcaporaso · February 21, 2025, 2:51pm

Agreed, thanks for sharing @Nicholas_Bokulich. Let me know if you need any assistance getting this up on the new QIIME 2 Library (the most up-to-date instructions are here).