Using metagenome-derived abundances as weights for NBC


I recently read the paper "Species abundance information improves sequence taxonomy classification accuracy" and found the QIIME 2 clawback implementation. However, instead of using the weights provided by readytowear, I want to use abundance estimates from metagenomic data to improve the 16S classifier.

E.g., my estimates might be species A: 60%, B: 20%, C: 20%, and the remaining species (D-Z) would be 0%.

How would I go about implementing this?


Hi @Microbiome_analyzer ,
Good question. We actually tested this in the paper (creating weights based on shotgun data) and we saw minimal improvements (see supplemental results), mostly because:

  1. Depending on the method used, shotgun often does not yield better/deeper taxonomic classifications. Doing this using MAGs would be best (e.g., to get full 16S assemblies or similar).
  2. Shotgun methods have their own biases, distinct from those of amplicon data, so the species abundances can differ depending on the methods used (e.g., whether read-mapping counts are corrected for genome size). This makes the weights a poor fit without a good bit of effort.
  3. Depending on the methods/databases used, shotgun may use different taxonomic nomenclature. The taxonomies must align in order to generate weights.
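
Point 3 is the easiest of these to check up front. A minimal sketch, assuming the shotgun profile and the 16S reference taxonomy are both available as plain label strings (the function name and the example labels are hypothetical, not part of clawback or QIIME 2), is to measure how much of the shotgun abundance lands on labels that actually exist in the reference:

```python
def taxonomy_overlap(shotgun_abundance, reference_taxa):
    """Report how well shotgun taxon labels line up with a 16S reference.

    shotgun_abundance: dict mapping taxon label -> relative abundance
    reference_taxa: iterable of taxon labels present in the 16S reference
    Returns (matched_fraction, unmatched_labels).
    """
    reference = set(reference_taxa)
    total = sum(shotgun_abundance.values())
    if total <= 0:
        raise ValueError("shotgun abundances must sum to a positive value")
    # Abundance carried by labels that also exist in the reference.
    matched = sum(a for t, a in shotgun_abundance.items() if t in reference)
    # Labels with non-zero abundance that have no counterpart in the reference.
    unmatched = sorted(t for t, a in shotgun_abundance.items()
                       if a > 0 and t not in reference)
    return matched / total, unmatched


# Hypothetical labels: the third shotgun label uses a different genus name
# than the reference does, so it cannot be turned into a weight as-is.
shotgun = {
    "g__Bacteroides; s__fragilis": 0.6,
    "g__Escherichia; s__coli": 0.2,
    "g__Lactobacillus; s__casei": 0.2,
}
reference = [
    "g__Bacteroides; s__fragilis",
    "g__Escherichia; s__coli",
    "g__Lacticaseibacillus; s__casei",
]
fraction, missing = taxonomy_overlap(shotgun, reference)
print(f"{fraction:.0%} of shotgun abundance maps onto the reference")
print("unmatched labels:", missing)
```

If a large share of the abundance is unmatched, the labels need to be reconciled (or the shotgun data reclassified against a compatible database) before any weights are generated.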

All code used in that paper is linked in the paper, so you can check out the GitHub repo to see exactly how everything was done. The shotgun notebook is rather long and would be challenging to adapt (as it is rather specific to the taxonomy and database used there). I plan to revisit this some day soon to make a more generalizable implementation that others could use.

But to answer your question directly:

  1. perform taxonomic classification of shotgun data using your database of choice
  2. run through the clawback tutorial to generate weights from those data
  3. train a 16S classifier using the exact same database and those weights
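
To make step 2 a bit more concrete: conceptually, the weights are one non-negative number per taxon in the reference database. The sketch below is not the clawback code, only an illustration of that idea using the A/B/C example from the question; the function name and the small floor value for unobserved taxa are assumptions. A floor is used because a weight of exactly 0 would act as a zero prior in a naive Bayes classifier, so species D-Z could never be predicted, even if the shotgun profile had simply missed them.

```python
import string


def build_class_weights(observed, reference_taxa, unobserved_weight=1e-6):
    """Turn observed relative abundances into one weight per reference taxon.

    observed: dict mapping taxon label -> relative abundance (e.g. 0.6)
    reference_taxa: every taxon label in the 16S reference database
    unobserved_weight: floor for taxa with no (or zero) observed abundance;
        the default here is an illustrative assumption.
    Returns a dict of weights that sums to 1.
    """
    reference = list(reference_taxa)
    # Observed labels must exist in the reference, otherwise they are lost.
    unknown = sorted(set(observed) - set(reference))
    if unknown:
        raise ValueError(f"taxa missing from the reference: {unknown}")
    raw = {t: max(float(observed.get(t, 0.0)), unobserved_weight)
           for t in reference}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}


# The example from the question: species A-Z, of which only A, B and C
# were detected in the shotgun data.
reference_taxa = [f"Species {letter}" for letter in string.ascii_uppercase]
observed = {"Species A": 0.6, "Species B": 0.2, "Species C": 0.2}

weights = build_class_weights(observed, reference_taxa)
for taxon in ("Species A", "Species B", "Species D"):
    print(taxon, f"{weights[taxon]:.3g}")
```

In practice the resulting table still has to be brought into whatever weights format the classifier training step expects, which is what the clawback tutorial walks through.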

Those steps would be "easy", but the output should then be compared against standard classification and ideally validated against a ground truth to see how trustworthy the shotgun weights are.
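
As a rough sketch of that comparison, assuming each classifier's output has been exported to a mapping of feature ID to taxonomy string and that a ground truth (e.g., a mock community) is available in the same form, agreement and accuracy could be tabulated like this. The function names and the toy data are hypothetical and only there to show the bookkeeping.

```python
def agreement(calls_a, calls_b):
    """Fraction of shared feature IDs that received identical labels."""
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        raise ValueError("the two sets of calls share no feature IDs")
    same = sum(calls_a[f] == calls_b[f] for f in shared)
    return same / len(shared)


def accuracy(calls, truth):
    """Fraction of features whose label matches the ground truth."""
    return agreement(calls, truth)


# Toy data only: four features with known labels, classified twice.
truth = {"f1": "A", "f2": "B", "f3": "C", "f4": "A"}
standard_calls = {"f1": "A", "f2": "B", "f3": "D", "f4": "B"}
weighted_calls = {"f1": "A", "f2": "B", "f3": "C", "f4": "B"}

print("agreement between classifiers:", agreement(standard_calls, weighted_calls))
print("standard accuracy:", accuracy(standard_calls, truth))
print("weighted accuracy:", accuracy(weighted_calls, truth))
```

Exact string matching is the strictest possible criterion; comparing at a fixed taxonomic rank (e.g., genus) may be more informative when the two classifiers differ mainly in classification depth.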