I would like to perform taxonomic annotation using a naive Bayes classifier with bespoke weights. I have trouble -weighting- the pros and cons of the various Qiita contexts. Here are my problems:
All ASV-related contexts were generated using Deblur. Deblur and DADA2 produce (slightly?) differently biased results. Would using DADA2-denoised data with Deblur-generated weights therefore be invalid or not recommended?
I think that when selecting a Qiita context for a study, one needs to balance a) the number of samples that the context includes against b) the similarity of the context to one’s own study characteristics.
In my situation, my data was sequenced using the 341f-785r primers, which correspond to the V3-V4 region. For this data I could select e.g. the Deblur-Illumina-16S-V4-150nt-780653 (139516 samples with data) or the Deblur-Illumina-16S-V34-150nt-780653 (72 samples with data) context. I assume that favouring (a), i.e. the V4 context, increases the precision of the weight estimates: the weights based on Deblur-Illumina-16S-V4-150nt should be closer to the true weights for the V4 region than the weights based on Deblur-Illumina-16S-V34-150nt are to the true weights for the V3-V4 region. Favouring (b), i.e. the V34 context, increases the accuracy of the weight estimates: in theory, the bias in my data should be more similar to the Deblur-Illumina-16S-V34-150nt bias than to the Deblur-Illumina-16S-V4-150nt bias. Is there a better way than to follow my nose on this issue?
Is there a way to learn the exact nucleotide ranges covered by the contexts? Without this information I am wary of using my primer-selected reference database (i.e. a database with sequences trimmed to include only the region captured by the primers used in my sequencing experiment) for clawback. My primers (341f-785r) do not capture the whole V3-V4 region, so it is possible that the nucleotide range in my reference database will not fully overlap with the Qiita context range. I would assume that this is not an optimal situation.
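To make the overlap concern concrete, here is a minimal Python sketch of how I imagine checking it, assuming the context coordinates on a reference sequence were known. The toy reference, the short primers, and the context span are all made up for illustration (they are not the real 341f/785r sequences or real Qiita coordinates), and the helper names are my own, not part of any QIIME 2 plugin.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a sequence, including degenerate bases."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Compile a (possibly degenerate) primer into a regex."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def amplicon_span(reference, fwd_primer, rev_primer):
    """Return (start, end) of the region between the primers, or None.

    Both primers are given 5'->3'; the reverse primer is searched for
    as its reverse complement, downstream of the forward primer.
    """
    fwd = primer_regex(fwd_primer).search(reference)
    if fwd is None:
        return None
    rev = primer_regex(reverse_complement(rev_primer)).search(reference, fwd.end())
    if rev is None:
        return None
    return fwd.end(), rev.start()


def overlap_length(span_a, span_b):
    """Number of positions shared by two half-open (start, end) spans."""
    return max(0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))


# Toy data: hypothetical reference, primers, and context coordinates.
reference = "TTTT" + "ACGTA" + "GGGGCCCCAAAATTTTGGGG" + "CCGGA" + "TTTT"
my_span = amplicon_span(reference, "ACGWA", "TCCGG")
context_span = (20, 40)  # assumed; the real ranges are what I am asking about
shared = overlap_length(my_span, context_span)
print(my_span, shared)
```

If the shared length were much shorter than either span, I would expect the primer-selected database to be a poor match for that context.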
Of course, I could use a full-length reference database, but I would also like to benefit from the slightly increased accuracy and lower processing times of a shorter reference database. Finally, I thought of first using the full-length reference database to estimate the bespoke weights and then generating a primer-selected reference database from that same database for all other purposes. The potential caveat that I see in this method is the dereplication step done after primer selection: would this cause any problems downstream?
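To illustrate what worries me about dereplication, here is a small Python sketch with toy sequences and made-up taxonomy labels. It only shows that references which are distinct at full length can become identical after trimming, so that one trimmed sequence carries several labels. How a real dereplication tool resolves such groups (and whether that matters for the weights) is exactly what I am unsure about.

```python
from collections import defaultdict


def dereplicate(sequences, taxonomy):
    """Group identical sequences; return {sequence: sorted taxonomy labels}."""
    groups = defaultdict(set)
    for seq_id, seq in sequences.items():
        groups[seq].add(taxonomy[seq_id])
    return {seq: sorted(labels) for seq, labels in groups.items()}


def ambiguous_groups(groups):
    """Keep only sequences that map to more than one taxonomy label."""
    return {seq: labels for seq, labels in groups.items() if len(labels) > 1}


# Toy full-length references and hypothetical taxonomy labels.
full_length = {
    "ref1": "AAAACCCCGGGG",
    "ref2": "TTTTCCCCGGGG",
    "ref3": "AAAACCTTGGGG",
}
taxonomy = {
    "ref1": "g__A; s__x",
    "ref2": "g__A; s__y",
    "ref3": "g__B; s__z",
}

# Pretend the primers select positions 4..10 of every reference.
trimmed = {seq_id: seq[4:10] for seq_id, seq in full_length.items()}

full_groups = dereplicate(full_length, taxonomy)
trimmed_groups = dereplicate(trimmed, taxonomy)

print(len(full_groups), ambiguous_groups(full_groups))
print(len(trimmed_groups), ambiguous_groups(trimmed_groups))
```

In this toy case the two species of genus A are indistinguishable after trimming, so weights estimated against the full-length labels may refer to classes that the primer-selected database can no longer tell apart.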
I would very much appreciate any help!