Deblur trim length

Why is it that the shorter I trim in Deblur, the more sequences I have left?

My sequence quality is very high: almost all quality scores are >39 (positions 0-500). The minimum sequence length identified during subsampling was 245 bases.

| Sample ID | Count-demux | Count-filter | Deblur (180) | Deblur (200) | Deblur (250) |
|---|---|---|---|---|---|
| sample1 | 65,437 | 65,248 | 13,965 | 13,109 | 11,084 |
| sample2 | 89,114 | 88,634 | 23,670 | 22,240 | 18,886 |

So how should I trim my sequences?

Thank you very much!

Longer trimming leaves low-quality bases at the ends of some sequences, risking those sequences being dropped due to excessively low quality.

It looks like a fairly small percentage of sequences are being dropped. I would personally opt for longer trimming and losing those reads, unless you need to squeeze out every possible sequence, e.g., due to low sampling depth.
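If it helps, a trim-length sweep with the q2-deblur CLI might look something like the sketch below (file names are hypothetical, and it assumes your reads have already been quality-filtered):

```bash
# Run Deblur at several trim lengths and keep the per-sample stats.
# Hypothetical file names. Note that --p-trim-length -1 disables trimming,
# but Deblur then requires all reads to already be exactly the same length.
for LEN in 180 200 250; do
  qiime deblur denoise-16S \
    --i-demultiplexed-seqs demux-filtered.qza \
    --p-trim-length "$LEN" \
    --p-sample-stats \
    --o-representative-sequences "rep-seqs-$LEN.qza" \
    --o-table "table-$LEN.qza" \
    --o-stats "deblur-stats-$LEN.qza"
done
```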

I hope that helps!

Thank you very much for your explanation.

First: when I checked the Q scores, I found the minimum sequence length is 245 bp, which means there is no sequence shorter than that in my test data. So when I trim at 180, 200, or 250, the number of remaining sequences should not decrease, but it actually did.

Second: the Q score is >38 across positions 0 to 500 bp, which means all of my sequences are high quality, so I should not need to trim at all; but when I use -1 to disable trimming, Deblur produces errors.

Third: those numbers represent the sequences left after trimming at 180, 200, and 250. It seems 78% (1 - 13965/65437 = 0.7866) of my sequences are lost after using Deblur:

|Sample ID|Count-demux|Count-filter|Deblur (180)|Deblur (200)|Deblur (250)|
|---|---|---|---|---|---|
|sample1|65,437|65,248|13,965|13,109|11,084|
|sample2|89,114|88,634|23,670|22,240|18,886|

Is this correct: the longer the length I trim at, the more complete my sequences will be, and therefore the more precise the classification I will get?

Thank you very much

I have some questions about Deblur.

(1) I am just curious which reads remain after deblurring. Is it reads-hit-reference? When I check the deblur-stats file, I find a lot of terminology, like unique-reads-derep, reads-deblur, reads-hit-artifact, reads-chimeric, reads-hit-reference, and reads-missed-reference.

(2) When I check the Q scores, I find the minimum sequence length is 245 bp, which means there is no sequence shorter than that in my test data; so when I trim at 180, 200, or 250, the number of remaining sequences should not decrease, but it actually did.
|Sample ID|Deblur (180)|Deblur (200)|Deblur (250)|
|---|---|---|---|
|sample1|13,965|13,109|11,084|
|sample2|23,670|22,240|18,886|

(3) Based on the numbers above, it seems 78% (1 - 13965/65437 = 0.7866) of my sequences are lost after using Deblur.

What can I do about this?
Thank you very much

Could you please share your quality score profiles? (e.g., with `qiime demux summarize`)

That will make it a lot easier to answer your question — I am still not sure what you mean about the minimum sequence length, so looking at the profiles will help.
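For example (a sketch with hypothetical file names):

```bash
# Summarize demultiplexed reads, including interactive quality-score plots.
qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv
```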

@laibinhuang, sorry for the slow response here. I hope my comments below help to clear up some of the confusion here.

Longer sequences increase the likelihood of observing a singleton, which may be real or artifactual. Singletons are disregarded by Deblur by default. Longer sequences may also increase the number of small clusters of erroneous sequences. High quality scores do not ensure that the observed DNA sequence was derived from a real DNA molecule. Whether longer sequences lead to improved precision is fundamentally tied to the algorithm and reference database used for the subsequent classification. On the assumption that these are 16S data, longer sequences don't necessarily greatly improve classification by naive Bayes (as implemented in RDP; see figure 1 here).

Deblur uses a greedy algorithm with a subtractive procedure, where variants (based on Hamming distance) of the most abundant read are subtracted. If the number of singletons or small clusters of variants is large, then the "read count" for the inferred sOTUs will be small. These data are also relative abundances, not absolute counts, so a loss of read count does not necessarily mean the ratios of the organisms are different.
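Schematically (this is my simplified notation, not the exact error model from the paper), each pass takes the currently most abundant sequence $s$ and reduces the count $a_t$ of every other sequence $t$ by the number of reads that $s$ is predicted to have generated at that distance:

$$a_t \leftarrow a_t - a_s \cdot \epsilon\big(d_H(s, t)\big)$$

where $d_H$ is the Hamming distance and $\epsilon(d)$ is the assumed error rate at distance $d$. Sequences whose counts drop to zero or below are discarded, which is one way many small clusters of variants can shrink the final read counts.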

The error profile for Deblur was based on 150 nt reads, so it is possible that it is not a good fit for longer reads. Please also recall that the Deblur algorithm requires reads to be the same length, which is likely why Deblur fails when trimming is disabled: it suggests there is some length heterogeneity in your data.

The reads returned by q2-deblur are the ones which recruit to the positive reference database.
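If you want to dig into those per-sample columns yourself, something like the following should work (hypothetical file names, assuming --p-sample-stats was set when running denoise-16S):

```bash
# Render the per-sample Deblur statistics as a QIIME 2 visualization.
qiime deblur visualize-stats \
  --i-deblur-stats deblur-stats.qza \
  --o-visualization deblur-stats.qzv
```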

If you haven’t had a chance to see it, the algorithm is described both as a mathematical proof and in pseudocode in the supplemental of the manuscript. I particularly recommend the description of the algorithm in the supplemental.

Best,
Daniel


I have no idea what trim length to use.

Whichever length I trim at, only 20-30% of my sequences are left; I am losing a lot of sequences.

Please check my Q-score and sequence-count files at the different cutoffs. Any suggestions?

Qscores.csv (14.3 KB)
sequence-counts left after deblur.csv (7.1 KB)

@laibinhuang,

Could you please attach QIIME 2 outputs (e.g., of `qiime demux summarize` and the Deblur stats)? The CSVs are not as straightforward to review.
