What proportion of reads discarded in the deblur step is normal?

Hi there,

I more or less understand each column in the deblur-stats.qzv file.
But I don't know what proportion of raw reads discarded in the deblur step is normal/acceptable.

For example, say the sampling depth chosen in the deblur step is 120.
In sample A: raw reads 10000; reads-derep 7000; reads-deblur 6060; reads-artifact 60; reads-chimeric 2000; reads-missed-reference 3000; reads-hit-reference 1000.
In this example, lots of reads are flagged as chimeric or miss the reference, and only 10% of the reads are left after deblur. I would suspect that there is something wrong with this sample.
So what proportion of reads-hit-reference, reads-artifact, reads-chimeric, reads-missed in the deblur step is normal/acceptable?

Thank you in advance!


Hello Jericho,

Welcome back to the forums!

I think these questions depend a lot on the microbial community in question and how the samples were prepared.

Having ~30% of your reads be chimeric is a little high, but that could be expected if you had to use a lot of PCR cycles to get a signal from low biomass samples.
Having ~50% (:scream_cat:) of your reads miss the database is quite high for a human microbiome project, but might be expected if you are working with samples from a novel environment comprised of understudied taxa :crying_cat_face:
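
To make the arithmetic explicit, here is a minimal sketch that recomputes those percentages from the example counts in the first post. The `percent_of` helper is hypothetical (not part of QIIME 2 or Deblur), the `reads-raw` key is assumed from the naming pattern of the other columns, and which denominator is the right one for Deblur is an open question, so both are shown.

```python
# Example counts copied from the first post (sample A); this is not real data.
counts = {
    "reads-raw": 10000,  # key name assumed; the post calls this "raw reads"
    "reads-derep": 7000,
    "reads-deblur": 6060,
    "reads-artifact": 60,
    "reads-chimeric": 2000,
    "reads-missed-reference": 3000,
    "reads-hit-reference": 1000,
}


def percent_of(counts, column, denominator):
    """Return counts[column] as a percentage of counts[denominator]."""
    return 100.0 * counts[column] / counts[denominator]


# Compare each column of interest against both candidate denominators.
for column in ("reads-chimeric", "reads-missed-reference", "reads-hit-reference"):
    for denominator in ("reads-raw", "reads-deblur"):
        pct = percent_of(counts, column, denominator)
        print(f"{column} / {denominator}: {pct:.1f}%")
```

With these counts, the ~30% and ~50% figures correspond to using reads-deblur as the denominator, while the 10% figure for reads-hit-reference is relative to the raw reads.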

Have you tried processing your data with a database-independent method like DADA2? What percentage of your reads does DADA2 think are chimeric?


Thanks, Colin.
The extreme example above is not my real data. I made the proportion of chimeric and missed reads too high on purpose, to give a “bad” example.

You said that ~30% chimeric reads is a little high, and that ~50% of reads missing the database is quite high. Then what proportion of chimeric reads and of reads missing the reference is acceptable for a human microbiome project, such as a gut microbiome study?
And most importantly, what proportion of reads hitting the reference is acceptable in a normal situation? Is 30% too low? I understand that the proportion depends on the study, but is there a recommended range?

Another question: should I compare the chimeric/missed/hit reads with reads-deblur rather than reads-raw?

Thanks for all your help!


Good morning,

There is not a recommended range that I know of, because what is acceptable is based on what is expected. Anything is acceptable if it makes sense biologically (and you can convince reviewer #3 :smile_cat:)!


I do something like that when looking at DADA2 results, but I’m not sure what the best denominator is for Deblur.

@pitaman, what do you recommend?

