Analyzing variable length joined paired-end reads with Deblur

toconnell · May 1, 2018, 12:05pm

Hi there,

I am working with some variable length amplicon sequencing data that has been demultiplexed and the paired-end reads have been joined already. I realize this makes the data unsuitable for analysis with DADA2, so I've decided to use Deblur instead. I have a few questions on using Deblur for my analysis.

1. Does Deblur require all technical sequences (adapters and primers) to be trimmed in advance, or will it handle the trimming? I've read that it will remove adapters and PhiX, but am not sure if it trims adapters from the reads, or if it tosses reads that contain adapter.
2. After looking over the documentation, it looks like Deblur requires all reads to be the same length, which can be done with the --p-trim-length option. The quality scores are really good for the full length of the joined reads, so would you recommend just setting this value to the shortest observed joined read length? If so, would you expect this to negatively impact taxonomic classification downstream?
3. On a similar note, would you then recommend trimming reference sequences to the same length as the trimmed amplicon sequences with the feature-classifier extract-reads function before training a taxonomy classifier with fit-classifier-naive-bayes? Do you think this would make a significant difference in taxonomic assignment?

I would really like to squeeze out the best performance I can and would appreciate input on the best approach to take with this data that I have. Thanks so much for any advice you have!

Best,
TO

Nicholas_Bokulich · May 1, 2018, 12:51pm

Hi @toconnell,

The latter. You can use q2-cutadapt to trim out adapters/primers.

Yes, unless those reads are absurdly short. If you are using 16S amplicons, there should not be a vast amount of length variation, though sometimes those variants are interesting: rare but real organisms. There should probably be a narrow length distribution around the most abundant seqs, and anything much shorter than that may be an artifact or a poorly joined read. If you're concerned, you could BLAST a couple of these before deciding whether to use a higher length threshold.
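To make the "narrow distribution" check concrete, here is a minimal sketch for tabulating joined-read lengths and picking a candidate trim length. It assumes an uncompressed FASTQ with exactly four lines per record; `read_lengths`, `suggest_trim_length`, and `keep_fraction` are hypothetical names for illustration, not part of Deblur or QIIME 2. It also relies on my understanding that reads shorter than the trim length are dropped rather than padded, so a longer trim length costs you reads.

```python
from collections import Counter
from io import StringIO


def read_lengths(handle):
    """Yield the sequence length of each record in a 4-line-per-record FASTQ stream."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record holds the sequence
            yield len(line.strip())


def suggest_trim_length(lengths, keep_fraction=0.99):
    """Return the largest length L such that at least keep_fraction of reads are >= L."""
    counts = Counter(lengths)
    total = sum(counts.values())
    kept = 0
    # Walk from the longest observed length downward, accumulating retained reads.
    for length in sorted(counts, reverse=True):
        kept += counts[length]
        if kept / total >= keep_fraction:
            return length
    raise ValueError("no reads supplied")


if __name__ == "__main__":
    # Tiny in-memory FASTQ standing in for a real joined-reads file.
    demo = StringIO("@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTAC\n+\nIIIIII\n")
    lengths = list(read_lengths(demo))
    print(Counter(lengths))
    print(suggest_trim_length(lengths, keep_fraction=1.0))
```

With `keep_fraction=1.0` this returns the shortest observed length. Lowering it slightly shows how much length you gain by giving up the handful of very short reads that may be joining artifacts.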

You will of course lose a certain amount of information, but no, it should not impact classification too much. As I mentioned above, 16S gene domains should have a fairly narrow length distribution, so trimming to the shortest joined read should still get "in the ballpark" of this length distribution. If you are using ITS or another length-variable marker gene then yes, you may be losing useful information for classification (but such variable genes are often also so heterogeneous that the truncated read may still contain enough information for a good classification).

See the text and notes in this tutorial section. We do recommend trimming (for 16S) and yes, it does impact quality, but in my experience it does not make that much of a difference. Extracting the correct domain with the correct PCR primers is more impactful than trimming to the precise trim length. If you are constrained (e.g., by memory or time) then this step is not critical and using an appropriate pre-trained classifier will be fine.
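For intuition about what "extracting the correct domain" and then truncating means, here is a rough pure-Python sketch of the idea. It is not how feature-classifier extract-reads is implemented: it requires exact primer matches (allowing only IUPAC degenerate bases), excludes the primers from the result, and `extract_amplicon` and `trunc_len` are hypothetical names. The 515F/806R primer pair is used purely as an example.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[GC]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a sequence, including IUPAC degenerate codes."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def to_regex(primer):
    """Turn a (possibly degenerate) primer into a regex fragment."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_amplicon(reference, fwd_primer, rev_primer, trunc_len=0):
    """Return the region between the primers, optionally truncated.

    Returns None if either primer site is missing. trunc_len=0 disables truncation.
    """
    pattern = re.compile(
        to_regex(fwd_primer) + "(.+?)" + to_regex(reverse_complement(rev_primer))
    )
    match = pattern.search(reference.upper())
    if match is None:
        return None
    amplicon = match.group(1)
    return amplicon[:trunc_len] if trunc_len > 0 else amplicon


if __name__ == "__main__":
    fwd, rev = "GTGCCAGCMGCCGCGGTAA", "GGACTACHVGGGTWTCTAAT"  # 515F / 806R
    # Toy reference: flank + forward site + insert + reverse site + flank.
    ref = "TTTT" + "GTGCCAGCAGCCGCGGTAA" + "ACGTACGTACGG" + "ATTAGAAACCCCAGTAGTCC" + "GGGG"
    print(extract_amplicon(ref, fwd, rev))
    print(extract_amplicon(ref, fwd, rev, trunc_len=5))
```

The primer-matching step decides which part of the gene the classifier ever sees, while the truncation step only shaves the end off a region that is already correct. That is why the former tends to matter more.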

I hope that helps!
