Are full length 16S bespoke taxonomy references ok to use on specific regions?

Good questions, @Mehrbod_Estaki!

Yes. I am still benchmarking this, but anecdotally I would say these are safe to use; in my hands they give very reasonable results.

Really, the quality of these weights will depend to some degree on how well full-length sequences perform with uniform-weight classification (since that is how the weights are built from the source data) and, as you say:

Even though the different amplicon targets will have their own biases (and hence the true species frequencies may be slightly different), these weights will almost certainly still outperform uniform weights. We find that even weights from related sample types outperform uniform weights, and the accuracy improvement yielded by bespoke weights is correlated with the fitness of the weights (see the clawback paper for more details).

So we actually added the full-length 16S weights recently for this very use case (which implies our approval of the methodology): weights may be assembled from taxonomic frequencies observed using one region of 16S, and then used to classify another 16S region. Even with primer bias factored in, the weights should still be quite close to the true taxonomic frequencies of the target region, and much better than assuming uniform weights!
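To make the intuition concrete, here is a toy sketch in plain Python (not the q2-feature-classifier or clawback implementation) of how class weights act as priors in a Bayes-style classifier. All names and numbers are made up for illustration: the likelihoods, the "full-length" weights, and the "primer-biased" weights are assumptions, not measured values.

```python
def posterior(likelihoods, weights):
    """Bayes' rule: P(taxon | read) is proportional to P(read | taxon) * weight."""
    unnormalized = {t: likelihoods[t] * weights[t] for t in likelihoods}
    total = sum(unnormalized.values())
    return {t: v / total for t, v in unnormalized.items()}


# Hypothetical likelihoods of one read under two closely related species.
likelihoods = {"species_A": 0.52, "species_B": 0.48}

# Uniform weights: every taxon is assumed equally likely a priori.
uniform = {"species_A": 0.5, "species_B": 0.5}
# Hypothetical weights assembled from full-length 16S data for this habitat.
full_length = {"species_A": 0.05, "species_B": 0.95}
# The same weights distorted by an invented primer bias in the target region.
primer_biased = {"species_A": 0.10, "species_B": 0.90}

results = {}
for name, weights in [("uniform", uniform),
                      ("full_length", full_length),
                      ("primer_biased", primer_biased)]:
    post = posterior(likelihoods, weights)
    results[name] = post
    best = max(post, key=post.get)
    print(f"{name}: best={best}, posterior={post[best]:.3f}")
```

In this toy case the uniform prior leaves the call close to a coin flip, while both sets of habitat-informed weights point to the same species even though one of them is distorted. That is the sense in which slightly mismatched weights can still be better than uniform ones.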

As always, though, take a look at your results and decide whether the classifications you get make sense. Bespoke classification quality will be tied to the quality of the source data, so the method will improve over time but will be constrained by current limitations in the reference databases (e.g., misannotated sequences) and source data (e.g., misannotated samples, contaminants). Don't like the result? Build better weights! Then share them on readytowear so others can try them on.

Not at all, to my knowledge. The feature classifier just provides more feature metadata (no matter what reference, weights, classification method, etc. you use) and does not touch the feature IDs. Fragment insertion does not do anything with that feature metadata; it operates on the sequences themselves. As long as neither alters the feature IDs (they don't), they should not interfere with each other.
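As a rough illustration of that point, the sketch below uses hypothetical stand-in functions (`classify` and `insert_fragments` are invented here and are not QIIME 2 APIs) to show two steps that both read the same sequences and key their outputs by the unchanged feature IDs.

```python
# Hypothetical representative sequences: feature ID -> sequence.
rep_seqs = {"feat1": "ACGTACGT", "feat2": "TTGACCGA"}


def classify(seqs):
    """Stand-in for taxonomy classification: feature ID -> taxon label."""
    return {fid: "k__Bacteria" for fid in seqs}


def insert_fragments(seqs):
    """Stand-in for fragment insertion: the tip names placed in a tree."""
    return set(seqs)


taxonomy = classify(rep_seqs)
tree_tips = insert_fragments(rep_seqs)

# Both outputs are keyed by the same, unmodified feature IDs.
ids_match = set(taxonomy) == tree_tips == set(rep_seqs)
print(ids_match)
```

Because neither step renames or drops IDs in this sketch, the taxonomy and the tree can be joined on feature ID afterwards regardless of which one was run first.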
