q2-clawback: are longer contexts or more abundant contexts better?

aghudson · May 5, 2023, 5:06am

Hi,

I was wondering if anyone had any further advice on choosing an appropriate context for use in q2-clawback.

I want to classify 16S V4 gut samples. I have samples taken from different sections of the gut, but having read the associated article, it appears that having some taxonomic weighting, even if derived from a different gut section for certain samples, would be more accurate than basic uniform weighting of microbial taxa. So, I will choose "Animal distal gut" as my empo_3 environment as this has the largest count size.

For my choice of context, I am a bit more uncertain. The tutorial suggests using the context for sequence variants with the longest length, which ends up being Deblur-Illumina-16S-V4-150nt-780653. However, the tutorial was created a while ago, and there are now similar 16S V4 contexts with longer sequence lengths, e.g. Deblur_2021.09-Illumina-16S-V4-250nt-8b2bff. Should I use this context instead? The counts for this context are much lower than for the one mentioned in the tutorial (4621 vs. 218297). Is the context count a useful criterion for choosing the context?

Thanks in advance for any advice on choosing the most appropriate context.

Nicholas_Bokulich · May 5, 2023, 8:53am

Hi @aghudson,

The short answer: I recommend trying both and seeing whether you get a different result (I would be interested to hear!). @BenKaehler might have some ideas.

We have not tested with these newer contexts, so to be honest I don't think we can say with certainty. For V4 specifically, 250nt is not that much more informative than the first 150nt... but for other markers this will differ.

In general, longer contexts would give more resolution, and a large number of contexts is not needed to train either (but we do not have a benchmark on how many are needed to give an accurate prediction).

Sorry there's not a clear-cut answer here as we have not tried. Please let me know what you find!

aghudson · May 9, 2023, 8:46pm

Hi @Nicholas_Bokulich,

The short answer is that the 200nt and 250nt contexts did not work, giving the error message:

"Plugin error from clawback:
max() arg is an empty sequence"

The 150nt context did run successfully. My (naive) interpretation is that, because the 200nt and 250nt 16S V4 contexts have much lower counts than the 150nt context (9666 and 4621 vs. 218297), my chosen empo_3 environment, "animal proximal gut" (count: 4215), was not represented in these smaller contexts.

I would be happy to hear if there is another reason for this. I still need to compare the taxonomic assignments of the clawback-weighted classifier with those of the uniform classifier.

Nicholas_Bokulich · May 10, 2023, 10:22am

Hi @aghudson,

Thanks for testing and confirming. As you inferred, it sounds like the context exists but does not have any samples associated with animal proximal gut, so the function fails with an empty-sequence error (i.e., there is no data to work with).
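
For anyone who hits the same message: it is the text Python raises when `max()` is called on an empty collection (the exact wording varies slightly between Python versions). The sketch below is not clawback's actual code. The sample table and the `largest_sample_count` helper are hypothetical, and the numbers are made up. It only illustrates how filtering on an environment label that is absent from a context can leave nothing for `max()` to work with.

```python
# Hypothetical sample table for one context: sample ID -> (empo_3 label, read count).
# These values are invented for illustration only.
samples = {
    "s1": ("Animal distal gut", 1200),
    "s2": ("Animal distal gut", 950),
    "s3": ("Soil (non-saline)", 400),
}


def largest_sample_count(samples, empo_3):
    """Return the largest read count among samples with the given empo_3 label."""
    counts = [count for label, count in samples.values() if label == empo_3]
    # If no sample carries the label, counts is empty and max() raises ValueError.
    return max(counts)


# A label that is present in the table works as expected.
print(largest_sample_count(samples, "Animal distal gut"))

# A label with no matching samples reproduces the kind of error reported above.
try:
    largest_sample_count(samples, "Animal proximal gut")
except ValueError as err:
    print(f"ValueError: {err}")
```

So checking that the chosen environment actually has samples in a given context before running is probably the quickest way to rule this out.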

Excellent! Glad to hear that you can move ahead with this.

You may also want to try building custom weights, e.g., if you have human or mouse gut samples then you might want weights specific to the species. The number of samples may be lower but may yield more accurate weights for your species.
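
As a rough picture of what "weights" means here: a set of class weights is essentially a prior probability for each taxon, estimated from how often that taxon is seen in samples from the habitat of interest. The sketch below is a conceptual illustration only, not q2-clawback's implementation. The taxon names, counts, `class_weights` function, and `pseudocount` parameter are all assumptions made up for the example.

```python
def class_weights(observed_counts, reference_taxa, pseudocount=1.0):
    """Turn observed taxon counts into normalised weights over all reference taxa.

    Taxa that were never observed still receive a small non-zero weight
    through the pseudocount, so they are down-weighted rather than excluded.
    """
    raw = {t: observed_counts.get(t, 0) + pseudocount for t in reference_taxa}
    total = sum(raw.values())
    return {t: value / total for t, value in raw.items()}


# Hypothetical reference taxa and counts from a handful of gut samples.
reference_taxa = ["Bacteroides", "Lactobacillus", "Prochlorococcus"]
observed = {"Bacteroides": 700, "Lactobacillus": 297}

weights = class_weights(observed, reference_taxa)
uniform = {t: 1 / len(reference_taxa) for t in reference_taxa}

# Compare habitat-informed weights with the uniform assumption.
for taxon in reference_taxa:
    print(f"{taxon}: weighted={weights[taxon]:.3f} uniform={uniform[taxon]:.3f}")
```

With fewer samples the estimated frequencies are noisier, which is the trade-off mentioned above: species-specific weights may fit better, but they rest on less data.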

Good luck!