Choosing contexts for clawback

Dear all,

I would like to perform taxonomic annotation using naive Bayes with bespoke weights. I am having trouble weighing the pros and cons of the various Qiita contexts. Here are my problems:

  1. All ASV-related contexts were generated using Deblur. Deblur and DADA2 produce (slightly?) differently biased results. Would it therefore be invalid, or at least not recommended, to use DADA2-denoised data with Deblur-generated weights?

  2. I think that when selecting a Qiita context to use for a study, one needs to balance a) the number of samples that the context includes against b) the similarity of the context to one’s own study characteristics.
    In my situation: my data was sequenced using 341f-785r primers, which correspond to the V3-V4 region. For this data I could select e.g. the Deblur-Illumina-16S-V4-150nt-780653 context (139516 samples with data) or the Deblur-Illumina-16S-V34-150nt-780653 context (72 samples). I assume that prioritizing (a) will increase the precision of the weight estimates, i.e. the weights based on Deblur-Illumina-16S-V4-150nt will be closer to the true weights for the V4 region than the weights based on Deblur-Illumina-16S-V34-150nt will be to the true weights for the V3-V4 region. Prioritizing (b) should increase the accuracy of the weight estimates, i.e. my bias should (in theory) be more similar to the Deblur-Illumina-16S-V34-150nt bias than to the Deblur-Illumina-16S-V4-150nt bias. Is there a better way than to follow my nose on this issue?

  3. Is there a way to learn the exact nucleotide ranges for the contexts? Without this information I am wary of using my primer-selected reference database (i.e. a database with sequences trimmed to include only the region captured by the primers used in my sequencing experiment) for clawback. My primers (341f-785r) do not capture the whole V3-V4 region, so it is possible that the nucleotide range in my reference database will not fully overlap with the Qiita context range. I would assume that this is not an optimal situation.
    Of course, I could use a full-length reference database, but I would also like to benefit from the slightly higher accuracy and shorter processing times of a shorter reference database. Finally, I thought I could first use the full-length reference database for estimating bespoke weights and then, from the same reference database, generate a primer-selected reference database for all other purposes. The potential caveat that I see in this method is the dereplication step done after primer selection: would this produce any problems downstream?

I would very much appreciate any help!

@BenKaehler @Nicholas_Bokulich @wasade @gregcaporaso


Hi @AdrianS85,
Great questions!

On question 1: probably not. Sure, DADA2 and Deblur work differently and could lead to slightly different results, but (unless something has gone horribly wrong) the taxonomic results should not look too dissimilar, and that is what counts for clawback. In my hands, DADA2 and Deblur give similar enough results at the taxonomic level, but I have not explicitly tested how differences between the two could impact accuracy.

But I do not think it will impact the results: just look at the clawback paper, where we found that using the wrong weights was still usually better than using uniform weights. The improvement correlated directly with the “correctness” of the weights. So small taxonomic differences between DADA2 and Deblur will not lead to major differences.

On question 2: indeed, another great point. You do not need a very large number of samples (in the clawback paper we found that even a couple of hundred samples were enough), but more is probably better.
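To make the sample-count intuition concrete, here is a minimal toy simulation in plain Python. It is not clawback and not the analysis from the paper: the "true" weights, the Dirichlet concentration, and every function name are made-up assumptions for illustration. It only shows how the sampling noise of averaged taxon proportions shrinks as the number of samples grows. It says nothing about bias from a mismatched context, which is the other half of the trade-off.

```python
import random

# Hypothetical "true" habitat-level taxon weights (made up for illustration).
TRUE_WEIGHTS = [0.5, 0.2, 0.15, 0.1, 0.05]


def simulate_sample(rng, mean_weights, concentration=5.0):
    """Draw one sample's taxon proportions from a Dirichlet centred on mean_weights."""
    draws = [rng.gammavariate(concentration * w, 1.0) for w in mean_weights]
    total = sum(draws)
    return [d / total for d in draws]


def estimate_weights(rng, mean_weights, n_samples):
    """Estimate weights as the mean of per-sample taxon proportions."""
    totals = [0.0] * len(mean_weights)
    for _ in range(n_samples):
        for i, p in enumerate(simulate_sample(rng, mean_weights)):
            totals[i] += p
    return [t / n_samples for t in totals]


def l1_error(estimated, truth):
    """Sum of absolute differences between two weight vectors."""
    return sum(abs(e - t) for e, t in zip(estimated, truth))


def mean_error(n_samples, mean_weights, replicates=20, seed=0):
    """Average L1 error over several simulated studies of the same size."""
    rng = random.Random(seed)
    errors = [
        l1_error(estimate_weights(rng, mean_weights, n_samples), mean_weights)
        for _ in range(replicates)
    ]
    return sum(errors) / replicates


# 72 mirrors the V3-V4 context size mentioned above; the other sizes are arbitrary.
errors_by_n = {n: mean_error(n, TRUE_WEIGHTS) for n in (10, 72, 200, 2000)}
for n, err in errors_by_n.items():
    print(f"N={n:>5}  mean L1 error={err:.4f}")
```

In this toy setting the error falls roughly with the square root of the sample count, so going from a few hundred to many thousands of samples buys progressively less.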

Similarity to one’s own study characteristics is probably more important, given that different primer biases and degrees of taxonomic resolution could influence results. I am inclined to say that using V4 sequences to build weights for V3V4 is not a great idea, but N=72 for V3V4 makes me nervous too.

One option would be to use V3V4 data directly from other studies outside of Qiita, in addition to or instead of the Qiita context, but that is obviously not convenient.

Sure, try it both ways. You have the “two watches problem”, but we have not explicitly benchmarked this, so my advice might not be much better than following your nose!
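If you do try it both ways, one simple way to see how far apart the two sets of weights are is to compare them taxon by taxon. The sketch below is plain Python with made-up numbers, not output from any Qiita context, and it assumes you have already exported each set of weights into a dictionary keyed by taxonomy label. Bray-Curtis is just one possible dissimilarity measure.

```python
def normalize(weights):
    """Rescale a {taxon: weight} mapping so that the weights sum to 1."""
    total = sum(weights.values())
    return {taxon: w / total for taxon, w in weights.items()}


def bray_curtis(weights_a, weights_b):
    """Bray-Curtis dissimilarity between two weight vectors.

    For vectors that each sum to 1 this equals 1 minus the summed
    per-taxon minimum: 0 means identical, 1 means no shared taxa.
    """
    a, b = normalize(weights_a), normalize(weights_b)
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in taxa)
    return 1.0 - shared


# Hypothetical weights, for illustration only.
weights_v4 = {"g__Bacteroides": 0.5, "g__Prevotella": 0.3, "g__Faecalibacterium": 0.2}
weights_v34 = {"g__Bacteroides": 0.4, "g__Prevotella": 0.3, "g__Lactobacillus": 0.3}

dissimilarity = bray_curtis(weights_v4, weights_v34)
print(f"Bray-Curtis dissimilarity between the two weight sets: {dissimilarity:.2f}")
```

A small value would suggest that the choice of context matters little for your data. A large one tells you the two watches disagree, but not which one is right.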

On question 3: exact overlap probably does not matter too much, except for the caveats above regarding different primer biases. Since clawback assembles taxonomic weights from taxonomic frequency information rather than from ASVs, the exact nucleotide range should matter little (though again, this is just intuition; I have not tested it).

So the same goes for using full-length 16S: as long as you use the same reference database (so that the taxonomies match), it should be compatible, again with the same caveat about primer biases (but otherwise better taxonomic resolution).
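On the dereplication worry, the thing to check is whether the taxonomy labels still match after primer selection and dereplication. Here is a toy sketch in plain Python of what can happen and how to check for it. The sequences, labels, and function names are all hypothetical, and this naive dereplication is not how any particular tool does it.

```python
from collections import defaultdict


def dereplicate(seqs, taxonomy):
    """Collapse identical sequences.

    Returns {representative_id: sorted list of taxonomy labels that
    shared that exact sequence}.
    """
    groups = defaultdict(list)
    for seq_id, seq in seqs.items():
        groups[seq].append(seq_id)
    derep = {}
    for ids in groups.values():
        representative = sorted(ids)[0]
        derep[representative] = sorted({taxonomy[i] for i in ids})
    return derep


def labels_missing_from_weights(weight_labels, reference_labels):
    """Reference labels for which the weights carry no entry."""
    return sorted(set(reference_labels) - set(weight_labels))


# Toy primer-selected reference: ref1 and ref2 become identical after trimming.
trimmed_seqs = {"ref1": "ACGTACGT", "ref2": "ACGTACGT", "ref3": "TTGACCGA"}
taxonomy = {
    "ref1": "g__A; s__a1",
    "ref2": "g__A; s__a2",
    "ref3": "g__B; s__b1",
}

derep = dereplicate(trimmed_seqs, taxonomy)
ambiguous = {rep: labels for rep, labels in derep.items() if len(labels) > 1}
print("Sequences shared by several labels:", ambiguous)

# Suppose dereplication relabels ref1 with a consensus label (an assumption).
# Weights built on the full-length labels then have no entry for it.
derep_labels = ["g__A; s__", "g__B; s__b1"]
missing = labels_missing_from_weights(taxonomy.values(), derep_labels)
print("Labels with no weight:", missing)
```

If a check like this comes back empty for your real data, the weights and the primer-selected reference use the same labels. If not, the labels need to be reconciled before training the classifier.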

Great questions! Please let us know what you find.


Thank you so much for your help and time! :smile:

