Read length selection during extraction from reference databases

Hi Mehrbod,

Thanks for this. How did you find your expected length? I am using a different V3V4 primer pair (EUB_F 5’-TCCTACGGGAGGCAGCAGT / EUB_R 5’-GGACTACCAGGGTATCTAATCCTGTT) and I know the amplicon size is 466 bp, but I could not find the length range for it. What range did you use for training your classifier?

Thanks

Hi @Negin,
Just as a heads up, the classifier in this post is a bit old, and there are certainly newer versions of this classifier, trained with more recent versions of scikit-learn, floating around the forum.
I didn’t actually set any range in my extraction, just left it at the defaults (which is a minimum length of 50). With curated databases I wasn’t too worried about the ranges during extraction because, well, there shouldn’t be any junk in there anyway. If I were extracting reads from uncurated sources like NCBI, I would probably set ranges. By the way, you can check the parameter settings in the provenance tab of these artifacts too!
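In case it helps, the usual route is to drag the .qza onto https://view.qiime2.org and open the Provenance tab, but as a rough command-line alternative you can unpack the artifact and search its provenance files. File names below are placeholders, and the exact provenance layout can differ between QIIME 2 versions:

```
# Unpack the artifact so its provenance files are visible on disk
qiime tools extract \
  --input-path classifier.qza \
  --output-path extracted-classifier/

# Parameters for each step are recorded in YAML files under provenance/
grep -r "min_length" extracted-classifier/*/provenance/
```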

Hi Mehrbod,

Yes, I found the range in the provenance file after I sent this message. I hadn’t specified a range, but what I got spans quite a range (min 52 to max 1871 bp) against an expected amplicon size of 466 bp, so it might be useful to extract the reads with a specified range.

Hi @Negin,
Do you mean that when you extracted the reads from the Greengenes database using your primers you got a range of 52–1871 bp? Or do you mean the range in your own feature table?

This was when I extracted reads from SILVA using my primers but without trimming.

Hi @Negin,
I moved this discussion to a new thread.
So to clarify: when you extract reads from the SILVA database using your primer set, you get some reads that are well outside your expected range, as short as 52 bp, etc.
I never really thought about this as a potential issue, and it may not really be an issue, assuming your feature table doesn’t have any short reads like these. But if your primers hit short regions in the reference database, then it is possible the same happens in your real data too. I would certainly impose some size restrictions in that case; something like 100 bp above and below your expected amplicon length should be good enough. If you still have doubts about reads that are, say, 99 bp shorter than what you expect, I would BLAST those and make a decision about what to do next: keep or discard. You can always do a second round of trimming after denoising using this nifty approach by @thermokarst, so you don’t need to re-run DADA2.
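For illustration, an extraction with your primers and your expected amplicon length (466 bp) ± 100 bp might look something like the sketch below; file names are placeholders and those min/max values are just one reasonable choice:

```
qiime feature-classifier extract-reads \
  --i-sequences silva-seqs.qza \
  --p-f-primer TCCTACGGGAGGCAGCAGT \
  --p-r-primer GGACTACCAGGGTATCTAATCCTGTT \
  --p-min-length 366 \
  --p-max-length 566 \
  --o-reads ref-seqs-v3v4.qza
```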
Keep us posted.

Hi Mehrbod,

Thanks for your response. My reads are between ~213–435 bp. I ended up using a min and max of 200–600 for training my classifier. This decision was based on my reads, the expected amplicon length of 466 bp, and the following boxplot that shows a range for V3V4: https://help.ezbiocloud.net/comparison-between-v3v4-and-full-length-sequencing-of-16s-rrna-genes/
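For anyone following along, the training step with those extracted reads would look roughly like this; file names are placeholders, and it assumes the reference reads were already extracted with --p-min-length 200 --p-max-length 600:

```
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs-v3v4-200-600.qza \
  --i-reference-taxonomy silva-taxonomy.qza \
  --o-classifier v3v4-classifier.qza
```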

Sounds good @Negin, just something to keep in the back of your head in case more troubleshooting is required down the line: the primers used on the page you linked are different from yours. Even though they still target V3V4, there may be something about your primers specifically that hits other, non-16S regions. Actually, I’ve used the V3V4 primers from that page on mouse colon tissues, and I often find they hit mouse host DNA in samples with a high host:bacterial DNA ratio. I tend to do a very lenient negative filter on those outputs to toss out anything that isn’t really bacterial-looking. Deblur does this by default using a very permissive 65% sequence identity threshold and a 50% coverage threshold, and I find this to be a very quick way of getting rid of those problematic reads. This might be useful if you are using DADA2, which doesn’t have this positive filtering step.
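If you are on DADA2, one way to approximate that permissive filter is qiime quality-control exclude-seqs against a 16S reference; the sketch below is just an illustration, and the file names and reference choice are placeholders:

```
qiime quality-control exclude-seqs \
  --i-query-sequences rep-seqs.qza \
  --i-reference-sequences 16s-reference-seqs.qza \
  --p-method vsearch \
  --p-perc-identity 0.65 \
  --p-perc-query-aligned 0.50 \
  --o-sequence-hits rep-seqs-16s-like.qza \
  --o-sequence-misses rep-seqs-non-16s.qza
```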

Hi Mehrbod,

Thanks for the information. By negative filter, do you mean filtering out taxa found in the negative control? Or do you mean cross-checking against NCBI?

Hi @Negin,
Sorry, I should have been clearer about that. Negative filter here refers to filtering out any reads that don’t hit anything in a 16S reference database to at least some degree. Deblur uses the Greengenes database with 65% sequence identity and 50% coverage (I believe), and this is super fast and works really well at removing host DNA for me when I do V3V4 on intestinal tissues (high-host-DNA samples). I would recommend it. You could use any other reference database, but since the idea is just to get rid of reads that look wildly irregular and non-16S-ish, there’s no need for a really comprehensive, large database.
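As a rough sketch of the follow-up step (placeholder file names, assuming the exclude-seqs “misses” output from the earlier example), the non-16S features can then be dropped from both the feature table and the representative sequences:

```
qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file rep-seqs-non-16s.qza \
  --p-exclude-ids \
  --o-filtered-table table-16s-only.qza

qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --m-metadata-file rep-seqs-non-16s.qza \
  --p-exclude-ids \
  --o-filtered-data rep-seqs-16s-only.qza
```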