Deblur: why does the highest-sequenced sample get removed?

I’m using the forward reads of a paired-end 16S sequencing run and denoising with Deblur. These are the summary artifacts of my samples pre- and post-denoising: pre-deblur & post-deblur
I noticed that following the denoising I lose 2 samples, which happened to be the 2 problematic samples to begin with: one had only 104 sequences, which is a no-brainer to get rid of, but the other happens to be the sample with the (abnormally) highest number of sequences, ~3.4 million. I was surprised to see this sample drop, so I wanted to clarify how Deblur handles this scenario. Does this sample get dropped simply because it has far more sequences than the rest, or did it contain problematic reads that Deblur detected as nonsense? If the former, would it make sense to rarefy that one sample down to a depth acceptable to Deblur, so as not to waste the data?
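Something like the sketch below is what I had in mind, i.e. subsampling just that one sample's forward reads before import. This is only an illustration; seqtk is a separate tool outside QIIME 2, and the file names and target depth are placeholders.

```bash
# Subsample the over-sequenced sample's forward reads to ~100,000 reads
# before importing into QIIME 2 (seqtk, file names, and depth are
# placeholder assumptions, not part of the Deblur workflow itself).
seqtk sample -s 42 sample-X_R1.fastq.gz 100000 | gzip > sample-X_sub_R1.fastq.gz
```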


Hi @Mehrbod_Estaki,

Each sample in Deblur is processed independently, with the exception of the post-processing filter --p-min-reads, which drops low-abundance sOTUs across samples and can be disabled by setting it to zero. In other words, whether you run Deblur explicitly per sample or on a collection of samples, the results will be identical if you set --p-min-reads to 0.
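For example, disabling that cross-sample filter in the plugin looks like the following (file names and trim length are placeholders):

```bash
# With --p-min-reads 0, no sOTUs are dropped by the cross-sample
# minimum-reads filter, so each sample is effectively processed on its own.
qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-length 150 \
  --p-min-reads 0 \
  --p-sample-stats \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-stats deblur-stats.qza
```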

In q2-deblur, we presently only save the reads which pass the positive filter, whereas running deblur workflow directly will also save the results which fail to pass the positive filter. What most likely happened is that the reads in your 3.4M sample were dominated by artifacts or non-16S sequence. We are planning on exposing this “non-reference hit” data in the plugin but haven’t done so yet. Which reminds me, we hadn’t created an issue about that until just now…
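For reference, a minimal sketch of the standalone run (the input path, output directory, and trim length here are assumptions on my part):

```bash
# Running the standalone workflow also keeps the sequences that fail the
# positive filter; paths and trim length are placeholders.
deblur workflow \
  --seqs-fp demux-seqs/ \
  --output-dir deblur-out/ \
  --trim-length 150
# deblur-out/ should then contain reference-hit.*, reference-non-hit.*,
# and all.* sequence/table files, so the "non-reference hit" reads can be
# inspected directly.
```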

Best,
Daniel


Thanks for clarifying, @wasade. One follow-up question: in the event that there are real 16S sequences hidden in the mound of artifacts/non-16S, would there be any way to rescue those reads? Perhaps by subtracting the “non-reference-hit” data from the full data and working with the remainder? Not sure if that’s logistically feasible at all or not…


It’s unlikely, as the parameters are permissive. We ran an exploration against V4 reads derived from Greengenes to determine the parameters used for the positive filter.

