Figuring out why sequences are lost in Deblur

wasade · November 5, 2018, 9:06pm

Dear David,

Thank you for sending the stats output at the start of this thread. It looks like, in some samples, the number of singletons is rather high. This is affected by --p-min-size. You can see how --p-min-size is used here, where it is handed off to vsearch as part of the dereplication.

These sequences appear to be flagged as artifacts because they are singletons prior to applying Deblur.
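
To make the effect of a minimum-size threshold concrete, here is a toy Python sketch of dereplication followed by an abundance filter. It is only an illustration of the idea, not the actual Deblur or vsearch code; the function name `dereplicate` and the example reads are made up for this post.

```python
from collections import Counter

def dereplicate(reads, min_size=2):
    """Collapse identical reads, then drop unique sequences seen fewer than min_size times.

    Hypothetical helper for illustration only; not part of Deblur or vsearch.
    Returns (kept, dropped): a dict of sequence -> count, and the number of reads discarded.
    """
    counts = Counter(reads)
    kept = {seq: n for seq, n in counts.items() if n >= min_size}
    dropped = sum(n for n in counts.values() if n < min_size)
    return kept, dropped

# Made-up reads: two sequences occur more than once, two are singletons.
reads = ["ACGT", "ACGT", "ACGA", "TTGC", "ACGT", "TTGC", "GGCA"]

kept, dropped = dereplicate(reads, min_size=2)
print(kept)     # sequences that survive the threshold, with their counts
print(dropped)  # number of reads lost because their sequence was a singleton
```

With a threshold of 2, every read whose exact sequence occurs only once in the sample is discarded before denoising, so a sample with many unique sequences (real or error-derived) will lose a large share of its reads at this step.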

It is possible that those samples have a very high number of truly unique sequences. It's also possible that those samples have a high amount of error creating a large number of unique sequences. It's important to note that reads which pass the quality filter are not assured to be error free.

What I recommend is running the forward reads through without joining and seeing how your results look. For many analyses, the forward read is sufficient anyway, and the reverse read tends to have more error. If the forward read alone is sufficient for the questions you're asking, it may be feasible to proceed from there. If you want to use these singletons with Deblur, I believe it is possible to do so by setting --p-min-size=0.

In terms of what is normal, I'm observing 885,044 singletons at 150nt out of 12M raw reads on the latest American Gut MiSeq run. I've never had a need to join reads, so I cannot comment on what the number of singletons would be if operating on joined data.

Best,
Daniel