Feature count after DADA2 ~ 25% of initial features

mamillerpa · December 12, 2017, 1:31pm

I'm using DADA2 to process 250 nt paired-end MiSeq reads (515F/926R)

If I run my sequences artifact through the "demux summarizer", the average sample has about 30,000 reads. After running DADA2, I put the feature table through its summarizer and the average sample has about 8,000 reads. 25% seems really low to me. The median FASTQ qualities are good... high 30s up until the end of R2, which goes down to the low 20s. When I put the same sequences through FLASH, almost 100% of the reads are extended.

Why would I get such a low rate of sequences/features returned by DADA2?
If I need to do my own detective work, what output should I be looking at?
I also tried the deblur approach, but I'm getting "Argument to parameter 'demux' is not a subtype of SampleData[SequencesWithQuality]"... is deblur only for single-ended reads?

thanks,
Mark

jairideout · December 12, 2017, 6:11pm

Hi @mamillerpa!

Those results aren't necessarily surprising: DADA2 and other novel denoising algorithms are much more stringent about separating true biological sequences from sequencing errors and artifacts, particularly compared to the results you'd obtain from OTU picking or read joining alone.

Those sound like reasonable quality scores. When you run dada2 denoise-paired, use --p-trunc-len-r to truncate the reverse reads at the point where their quality drops off.

To my knowledge, FLASH only joins the reads for you -- it does not perform denoising, which is what DADA2 and Deblur accomplish (denoising will throw away reads that aren't identified as "amplicon sequence variants").

If you'd like to compare read-joining with FLASH vs. read-joining with VSEARCH, you can use qiime vsearch join-pairs to join your reads without denoising or any other quality control. You can then compare those results to read-joining with FLASH. The paired end reads community tutorial will show you how to join your reads explicitly with VSEARCH to accomplish this.

If your goal is to compare DADA2 and Deblur denoising results, you'll have two different processing pipelines:

When using DADA2 to denoise your paired end reads, you'll want to avoid explicitly joining your reads beforehand (e.g. with VSEARCH, FLASH, etc). DADA2 performs best when provided paired end reads that have not already been joined (i.e. SampleData[PairedEndSequencesWithQuality]).
When using Deblur to denoise your paired end reads, you'll need to explicitly join the reads first before handing them off to Deblur (see the paired end tutorial I linked above for examples).

As for why DADA2 returns so few sequences, it's hard to say without more details, but here are some ideas:

You'll want to trim off low-quality bases from your forward and reverse reads (see earlier in my post for an example).
You'll also want to make sure that any non-biological sequences (e.g. primers, adapters, barcodes) have been removed from your reads.
After the trimming/filtering described above, do your reads still overlap sufficiently?
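
The overlap question in the last point can be checked with some quick arithmetic. The sketch below is only a rough estimate, not something DADA2 or QIIME 2 provides. The function names are hypothetical, the ~411 nt amplicon length for 515F/926R (primers included) is an approximation that varies by organism, and the 12 nt minimum overlap is an assumption about DADA2's default merging requirement. It also assumes the reads start at the primers, so the truncation lengths count inward from each end of the amplicon.

```python
def expected_overlap(amplicon_len, trunc_len_f, trunc_len_r):
    """Estimate how many bases the truncated forward and reverse reads share.

    A negative result means the reads leave a gap and cannot be merged.
    """
    return trunc_len_f + trunc_len_r - amplicon_len


def overlaps_enough(amplicon_len, trunc_len_f, trunc_len_r, min_overlap=12):
    """Return True if the estimated overlap meets the assumed minimum."""
    return expected_overlap(amplicon_len, trunc_len_f, trunc_len_r) >= min_overlap


# Assumed amplicon length for 515F/926R, primers included (approximate).
AMPLICON_LEN = 411

# Keep the forward reads at 250 nt and try several reverse truncation lengths.
for trunc_r in (250, 200, 170):
    overlap = expected_overlap(AMPLICON_LEN, 250, trunc_r)
    ok = overlaps_enough(AMPLICON_LEN, 250, trunc_r)
    print(f"trunc-len-r={trunc_r}: overlap ~{overlap} nt, sufficient={ok}")
```

Under these assumptions, truncating the reverse reads to 200 nt still leaves roughly 39 nt of overlap, while truncating to 170 nt would leave only about 9 nt, which is likely too little for the reads to merge.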

For your own detective work, the output from demux summarize is useful for choosing appropriate parameters for dada2 denoise-paired. Sometimes the output from DADA2 itself (i.e. the R package being run by QIIME 2) can be useful: supply --verbose to your command to see the output as the command is running, or take a look at the "debug log" that is created when your command finishes (the path to the debug log will be displayed and you can open that file in a text viewer/editor). Finally, feature-table summarize can be useful for looking at the final results.
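
One way to combine the demux summarize and feature-table summarize outputs is to compare per-sample counts before and after denoising. This is a minimal sketch, assuming the per-sample counts have been copied out of the two visualizations by hand. The function name and the sample ID are hypothetical, and the numbers are the averages quoted earlier in this thread.

```python
def retention(input_counts, output_counts):
    """Fraction of input reads per sample that survive denoising.

    Samples missing from output_counts are treated as having lost all reads.
    Samples with zero input reads are skipped to avoid dividing by zero.
    """
    return {
        sample: output_counts.get(sample, 0) / n_in
        for sample, n_in in input_counts.items()
        if n_in > 0
    }


# Hypothetical sample ID; counts are the averages reported in this thread.
before = {"average-sample": 30000}  # reads per sample from demux summarize
after = {"average-sample": 8000}    # reads per sample from feature-table summarize

for sample, frac in retention(before, after).items():
    print(f"{sample}: {frac:.1%} of reads retained")
```

Samples with unusually low retention compared to the rest are the ones worth inspecting first.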

We have an open issue to provide more logging info when running DADA2, which will help with this detective work in the future.

If the suggestions I provided above are still not producing reasonable results with DADA2 or Deblur, please provide your .qzv file that is created by demux summarize and I can take a look. You can send me a direct message on the forum if you don't wish to share the data publicly (Dropbox, Google Drive, etc. are good platforms for sharing these data).

I'd also suggest searching around this forum for other suggestions re: debugging DADA2 results. The DADA2 FAQ also has some useful info to consider when inspecting results.

Regarding the Deblur error: it sounds like you're using an older release of QIIME 2. The latest release (2017.11) allows you to supply a SampleData[PairedEndSequencesWithQuality] file to deblur denoise-16S. If you supply that type of file, Deblur will only denoise the forward reads, and the reverse reads will be ignored (Deblur does not do any read-joining on its own).

If you'd like to have Deblur process your joined reads (instead of only the forward reads in your paired end data), see the paired end reads tutorial I linked above for examples.

Hope this helps!

thermokarst · December 22, 2017, 5:37pm

QIIME 2 2017.12 is now out, and it includes more detailed DADA2 logging.
