q2-kmerizer: a QIIME 2 plugin for k-mer-based diversity analysis

Hi everyone, I am excited to share a new plugin, q2-kmerizer, for k-mer-based diversity analysis, supervised learning, and more. It allows comparisons of samples/communities while accounting for the genetic relatedness of the ASVs/OTUs observed in each community.

Installation instructions/source code: GitHub - bokulich-lab/q2-kmerizer: QIIME 2 plugin for generating kmers from biological sequences
Article: https://doi.org/10.1128/msystems.01550-24

K-mer decomposition is the process of breaking down DNA sequences into their constituent k-mers (sub-sequences of length k). This provides a rapid method for comparing sequences by evaluating sub-sequence information. The presence/absence and frequency of individual k-mers in a sequence can be used to measure the similarity of DNA sequences, e.g., for taxonomic classification or for measuring genetic relatedness. K-mer decomposition is already used in some places in QIIME 2, e.g., in q2-feature-classifier for taxonomic classification using naive Bayes classifiers.
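
To make the idea concrete, here is a minimal sketch of k-mer decomposition in plain Python. It is only an illustration and not q2-kmerizer's own code; the function names `count_kmers` and `jaccard_similarity` are made up for this example, and overlapping k-mers with a step of 1 are an assumption.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer (window step of 1) in a DNA sequence."""
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty Counter.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def jaccard_similarity(seq_a, seq_b, k):
    """Fraction of distinct k-mers shared by two sequences (presence/absence)."""
    kmers_a = set(count_kmers(seq_a, k))
    kmers_b = set(count_kmers(seq_b, k))
    union = kmers_a | kmers_b
    return len(kmers_a & kmers_b) / len(union) if union else 0.0


if __name__ == "__main__":
    print(count_kmers("ACGTACGT", 4))
    print(jaccard_similarity("ACGTACGT", "ACGTTCGT", 4))
```

Two sequences that differ by a single base still share most of their k-mers, which is why k-mer overlap works as a quick proxy for sequence similarity.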

q2-kmerizer uses k-mer decomposition to generate k-mer frequency profiles (as FeatureTable[Frequency] Artifacts) for microbial communities, by counting the frequency of individual k-mers from each sequence detected in a community. This allows you to compare entire communities based on their genetic similarity, e.g., using beta diversity, supervised learning, or (with some caveats) alpha diversity measured on k-mer frequency profiles instead of ASV/OTU/taxon counts. Diversity estimates, supervised learning, and other analyses can therefore take subsequence information (i.e., genetic signatures) into account when measuring distances or making predictions.

This is complementary to standard diversity metrics, as it provides a different "view" of an ecosystem. Standard metrics do not account for genetic similarity, so they treat all features as equally unrelated when making estimates based on the presence/abundance of individual features. K-mer decomposition allows these estimates to be weighted by genetic similarity: samples that contain genetically similar features become more similar (as they share more k-mers), and those that contain genetically dissimilar features become more dissimilar (as they share fewer k-mers). Hence, k-mer-based diversity metrics provide diversity estimates similar to those of phylogeny-aware diversity metrics (Fig 1; see article above for full details). But k-mer profiles can be used for more than just diversity estimates: the profile is created as a feature table, so it is compatible with many downstream methods in QIIME 2, e.g., machine learning with q2-sample-classifier (for which k-mer-based predictions also show some advantages).
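
As a rough sketch of what a per-sample k-mer frequency profile looks like, the snippet below combines a toy feature table with toy ASV sequences. It assumes each ASV's k-mer counts are weighted by that ASV's frequency in the sample; the plugin's actual implementation, parameters, and defaults may differ, and `kmer_profiles`, the sample IDs, and the sequences are all hypothetical.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a DNA sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_profiles(table, sequences, k):
    """Build one k-mer frequency profile per sample.

    table:     {sample_id: {feature_id: frequency}}  (toy feature table)
    sequences: {feature_id: DNA sequence}            (toy representative seqs)
    Assumption: a k-mer's frequency in a sample is the sum, over features,
    of (k-mer count in the feature's sequence) * (feature frequency).
    """
    profiles = {}
    for sample_id, feature_counts in table.items():
        profile = Counter()
        for feature_id, frequency in feature_counts.items():
            for kmer, n in count_kmers(sequences[feature_id], k).items():
                profile[kmer] += n * frequency
        profiles[sample_id] = profile
    return profiles


if __name__ == "__main__":
    # Two samples with no ASV in common, but with closely related sequences.
    sequences = {"asv1": "ACGTAC", "asv2": "ACGTTT"}
    table = {"s1": {"asv1": 3}, "s2": {"asv2": 2}}
    for sample_id, profile in kmer_profiles(table, sequences, k=3).items():
        print(sample_id, dict(profile))
```

In this toy case the two samples share no ASVs, so an ASV-based metric sees them as completely different, while their k-mer profiles still overlap (both contain ACG and CGT).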

Figure 1. Comparison of Jaccard distance PCoA based on ASV frequency (left) and k-mer frequency (middle) to UniFrac distance based on ASVs (right) using data from the Earth Microbiome Project.

K-mer counting has several advantages that make it a flexible and useful complement to existing approaches for diversity and other analyses:

  1. K-mer counting allows rapid diversity comparison between samples, yielding ordinations that are well correlated with those derived from phylogenetically aware diversity metrics.
  2. No sequence alignment/phylogeny estimation is necessary. Hence, k-mer counting allows very rapid comparison of samples based on sequence similarity, e.g., for diversity estimates or supervised learning, running orders of magnitude faster than alignment/phylogeny estimation methods (e.g., seconds for k-mer analysis vs. hours for alignment/phylogeny in a typical 16S rRNA gene sequencing experiment).
  3. This approach also allows sequence-based diversity estimates for non-coding regions with high mutation rates and poor phylogenetic signal, e.g., ITS sequences, for which phylogeny-aware metrics can be inappropriate (because these regions can sometimes yield inaccurate phylogenies with distorted branch lengths).
  4. K-mer counting can be used with any standard diversity metric, providing a high degree of flexibility and compatibility. Hence, k-mer profiles can be compared with both qualitative (e.g., Jaccard distance) and quantitative metrics (e.g., Bray-Curtis dissimilarity), as well as compositionally aware diversity metrics (e.g., Aitchison distance).
  5. Totally reference-free! No reference data are used when counting k-mers, eliminating the need for curated reference sequences, alignments, or phylogenies.
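
To illustrate how standard metrics apply to k-mer profiles, here is a small pure-Python sketch of Jaccard distance (computed on presence/absence) and Bray-Curtis dissimilarity (computed on counts) between two toy profiles. The profiles and function names are hypothetical; in practice you would run the diversity methods available in QIIME 2 on the k-mer feature table rather than hand-rolled code like this.

```python
def jaccard_distance(profile_a, profile_b):
    """Qualitative: 1 - (shared k-mers / all k-mers observed in either sample)."""
    present_a = {kmer for kmer, n in profile_a.items() if n > 0}
    present_b = {kmer for kmer, n in profile_b.items() if n > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


def bray_curtis(profile_a, profile_b):
    """Quantitative: sum of absolute count differences / sum of all counts."""
    kmers = set(profile_a) | set(profile_b)
    numerator = sum(abs(profile_a.get(k, 0) - profile_b.get(k, 0)) for k in kmers)
    denominator = sum(profile_a.get(k, 0) + profile_b.get(k, 0) for k in kmers)
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Toy k-mer frequency profiles for two samples (k = 3).
    sample_1 = {"ACG": 3, "CGT": 3, "GTA": 3}
    sample_2 = {"ACG": 2, "CGT": 2, "GTT": 2}
    print("Jaccard distance:", jaccard_distance(sample_1, sample_2))
    print("Bray-Curtis dissimilarity:", bray_curtis(sample_1, sample_2))
```

Because the k-mer profile is an ordinary table of counts, nothing about these metrics needs to change to use them on k-mers instead of ASVs.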

For usage instructions, see the GitHub repository linked above. To read more about q2-kmerizer, its uses and benefits, as well as important caveats to consider, see the mSystems article linked above. For any questions, bug reports, etc., please open a new issue on this forum. Thanks in advance!

This looks super cool! Thanks @Nicholas_Bokulich and the team behind it!

Agreed, thanks for sharing @Nicholas_Bokulich. Let me know if you need any assistance getting this up on the new QIIME 2 Library (the most up-to-date instructions are here).
