How low can I go? (into and out of denoising)

devonorourke · November 16, 2018, 2:24pm

I want to know how low I can go. I'd love to know why I shouldn't go any lower.

Specifically, I'm trying to understand how the inclusion (or exclusion) of a sample with some small number of reads influences the denoising steps (with Deblur, DADA2, or unoise). On one hand, it seems like including the entire dataset provides the truest representation of the error present in the dataset. On the other, I wonder whether samples with low numbers of reads have a greater propensity for chimeras or greater error.

If time and computational resources are no issue, is it best to include everything going into denoising?

There are lots of great posts in this forum about sampling depth considerations (two, for example here and here) but I'm pretty sure these relate to the data going into diversity tests (so presumably after denoising).

Thanks for your consideration and comments!

Nicholas_Bokulich · November 16, 2018, 3:47pm

This is a very interesting question. I am not sure there is a good answer, but my feeling is that the current workflow for pre-filtering is adequate to address your concerns, and that without a good benchmark I would not venture into untested territory. I have not heard of anyone benchmarking this — if you do please share your results!

Low quality issues are what pre-filtering steps in the dada2 pipeline are meant to take care of — e.g., the max-ee parameter is going to drop all sequences with (by default) > 2 expected errors. So I would not worry too much about this.
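For anyone curious what "expected errors" means in practice: the usual definition is the sum of the per-base error probabilities implied by the Phred scores, p = 10^(-Q/10). Below is a minimal Python sketch of that calculation, not the dada2 implementation itself. It assumes Phred+33 encoded quality strings, and the function names `expected_errors` and `passes_max_ee` are made up for illustration.

```python
def expected_errors(quality_line, phred_offset=33):
    """Sum the per-base error probabilities for one FASTQ quality string.

    Assumes Phred+33 encoding unless another offset is given.
    """
    return sum(10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_line)


def passes_max_ee(quality_line, max_ee=2.0, phred_offset=33):
    """Keep a read only if its expected errors do not exceed max_ee."""
    return expected_errors(quality_line, phred_offset) <= max_ee


# Hypothetical reads: 250 bases at Q40 ('I') versus 30 bases at Q10 ('+').
high_quality = "I" * 250
low_quality = "+" * 30

print(passes_max_ee(high_quality))  # well under 2 expected errors
print(passes_max_ee(low_quality))   # about 3 expected errors
```

Because the sum runs over the whole read, a long read with middling quality can be dropped while a short read with the same per-base quality is kept, which is one reason truncation length and the expected-error threshold interact.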

Yep, those are all read depth questions, so post-denoising.

But let's see what others have to say! Maybe @benjjneb or @colinbrislawn have some ideas about this!

devonorourke · November 16, 2018, 3:53pm

Thanks @Nicholas_Bokulich - glad to hear I'm not crazy.

I have about 4000 samples that generated some amount of sequence data; these data were generated by pooling between 200-600 sequences per lane (HiSeq or MiSeq). There are 12 lanes of sequencing data in all.

The following visualization should give you a sense of how variable read depth can be per run. All of these data were generated on a MiSeq (HiSeq data still being dedup'd at the moment). What's crazy is the bimodality to a lot of these runs.

Nicholas_Bokulich · November 16, 2018, 4:04pm

That bimodality is really interesting. Is there any reason you would expect that? E.g., you have a mixture of high- and low-biomass samples? Or differences in amount of DNA inputs?

It would be interesting to plot per-base quality scores using violin plots or something along those lines to see if you see bi-modal quality scores in different libraries.

I still think you should just let dada2 do its job unless:

you want to try benchmarking first!
you get funky results downstream and have reason to suspect that low-depth reads are interfering with the error model.

devonorourke · November 16, 2018, 4:25pm

Great ideas; can you expand on the violin plot dimensions? I'm not sure how to produce the visualization.

Because of the nature of my reads, I've been doing the preprocessing steps up to and including denoising outside of QIIME. I've used the unoise3 tool to denoise lately because it's much quicker than DADA2 (at least the version I've been using). I'd like to use QIIME for the denoising step, but can't figure out what's needed to import my data...

Per library, my preprocessing includes adapter trimming and paired-end merging (with usearch), and the result is a single .fastq file of all the reads from that sequencing run. I'm wondering if there is any way to import that into QIIME, and if so, what the format of the header needs to be (and/or what the format of the metadata file needs to be). I'm guessing there's a way to do this in QIIME but I couldn't find it in the tutorials.

Nicholas_Bokulich · November 16, 2018, 4:36pm

looks like you are using R so here

QIIME 2 produces barplots for quality score plotting, but those would not display bimodality and cannot be easily segregated by group, so I think this calls for a custom R job.

If you want help please open a separate topic describing your data and the problem.

Sounds a bit like this outputs something akin to EMP format? Check out the EMP format importing tutorial. But where are the barcodes? Are they now in the headers? That is not EMP — EMP format always has barcodes in a separate file.

Honestly, I'd say start with the rawest data, since any preprocessing tends to mangle the data (from a format compatibility perspective). Otherwise, how to handle usearch outputs is a question best answered by the usearch developer.

devonorourke · November 16, 2018, 4:52pm

Regarding the violin plot, I was just curious what the input data are. I'm pretty comfortable in R as far as plotting; I just need to know the expected data structure used for the plot to generate the viz. It sounds like this is an output specifically from QIIME, so maybe I can't do it unless I start at the beginning with the raw data and move through the preprocessing within QIIME itself.

I'll set up a new post about the data import question in a bit; I'll give that a whack on my own first before I bug someone again, though.

Many thanks!

Nicholas_Bokulich · November 16, 2018, 4:54pm

The input would be the quality scores from the fastq data. Quality scores would be converted to error probabilities and plotted as violin plots of the error probability distribution at each base. It's not specific to QIIME 2, and it's a pretty tall order: it would take a lot of work, probably more than it's worth.
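To make the data structure concrete, here is a rough Python sketch of the conversion step (Python rather than R, purely for illustration). It assumes unwrapped four-line FASTQ records and Phred+33 encoding, and the function name `per_base_error_probs` is hypothetical. It only builds the per-position lists of error probabilities; plotting is left out.

```python
import io
from collections import defaultdict


def per_base_error_probs(fastq_handle, phred_offset=33):
    """Map each 1-based read position to the error probabilities seen there.

    Assumes four lines per record (no line wrapping) and Phred+33 qualities.
    """
    probs = defaultdict(list)
    for line_no, line in enumerate(fastq_handle):
        if line_no % 4 != 3:  # the quality string is the 4th line of a record
            continue
        for pos, ch in enumerate(line.rstrip("\n"), start=1):
            q = ord(ch) - phred_offset
            probs[pos].append(10 ** (-q / 10))
    return probs


# Two made-up reads, only to show the shape of the result.
demo = io.StringIO("@r1\nACGT\n+\nII5+\n@r2\nACG\n+\nI5+\n")
probs = per_base_error_probs(demo)
for pos in sorted(probs):
    print(pos, probs[pos])
```

Each position's list is one violin. Running this once per library (or per sample) and plotting the lists side by side, for example with matplotlib's `violinplot` or an equivalent in R, would show whether quality is bimodal between libraries. Error probabilities span several orders of magnitude, so a log scale is probably worth trying.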
