The dataset in question was generated on a run with very poor throughput, so I'm trying to salvage what was provided. Specifically, all I intend to do with this dataset is assign taxonomy - a list of what taxa are likely present among all samples. No abundance considerations, no per-sample richness... just a simple list.
I'm starting with about 280,000 read pairs split among a few hundred samples, generated on a 2x300 MiSeq run. Yep, that's only about 1,000 reads per sample on average. I began by importing the demultiplexed data into QIIME and removed the primers with the cutadapt plugin. Then I ran DADA2 using the same parameters I've used in the past:
```shell
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs my.demux_seqs.qza \
  --p-trunc-len-f 175 \
  --p-trunc-len-r 175 \
  --o-table my.table.qza \
  --o-representative-sequences my.repSeqs.qza \
  --o-denoising-stats my.denoisingStats.qza
```
yet generated an error:
Plugin error from dada2: No reads passed the filter. trunc_len_f (175) or trunc_len_r (175) may be individually longer...
That was a bit strange given the error profiles (see below).
A few things to note:
- The expected amplicon lengths of the forward and reverse reads differ: R1 should sum to 273 bp, so there is about 27 bp of noise at the tail of the 300 bp read (which you can kind of make out in the above error profile). R2 should sum to 278 bp, but the error profile is clearly tanking around 200 bp.
- The expected read isn't the same as the expected amplicon mentioned above, because what we're actually sequencing contains primer/adapter sequence at the 5' end of each R1 and R2 read, and adapter read-through at the 3' end. When properly trimmed, both R1 and R2 should ultimately be reduced to just 181 bp.
And just to highlight the point here: I'm using the expected read, not the expected amplicon, when running DADA2.
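To sanity-check myself, here's the arithmetic I'm working from, as a rough sketch (the 12 bp minimum overlap is my assumption about DADA2's merging default, and the ~200 bp merged length is an inference from the >=160 bp overlaps vsearch reports further below; neither number comes from the actual run output):

```python
# Quick sanity check of DADA2-style filtering/merging arithmetic
# (a sketch, not DADA2 itself). Assumption: reads shorter than
# trunc_len are discarded, and ~12 bp of overlap is needed to merge.
MIN_OVERLAP = 12

def passes_truncation(read_len, trunc_len):
    """A read survives length filtering only if it is at least trunc_len long."""
    return read_len >= trunc_len

def overlap_after_truncation(trunc_f, trunc_r, merged_len):
    """Bases of R1/R2 overlap remaining after truncation, for a merged_len amplicon."""
    return trunc_f + trunc_r - merged_len

# My trimmed reads are 181 bp, so truncating at 175 should be fine:
print(passes_truncation(181, 175))              # True
# And if the merged amplicon is ~200 bp, 175/175 truncation leaves
# far more than 12 bp of overlap:
print(overlap_after_truncation(175, 175, 200))  # 150
```

By this arithmetic, the 175/175 settings shouldn't be anywhere near the failure point, which is exactly why the error surprised me.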
I thought perhaps my overlap wasn't working as expected, so I reduced the `--p-trunc-len-*` parameters, substituting 175 with 150 (the same value for both `-f` and `-r`). I get the same message. I then wondered if I could just avoid the truncation entirely by substituting the values with 0, and still got the same message.
I then went back to the drawing board and ran cutadapt manually so that I could visually confirm that I do indeed have the exact trimmed amplicons I intend. And indeed, my dataset is trimmed as expected.
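For the record, my "visual confirmation" amounts to a spot-check along these lines (a sketch; `PRIMER_F` is a stand-in for the real forward primer, and the filename is illustrative):

```python
import gzip

# Hypothetical forward primer; substitute the real one.
PRIMER_F = "GTGYCAGCMGCCGCGGTAA"

# IUPAC ambiguity codes so degenerate primer bases still match.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
         "S": "CG", "W": "AT", "B": "CGT", "D": "AGT",
         "H": "ACT", "V": "ACG", "N": "ACGT"}

def starts_with_primer(seq, primer):
    """True if seq still begins with the primer (IUPAC-aware)."""
    if len(seq) < len(primer):
        return False
    return all(s in IUPAC[p] for s, p in zip(seq, primer))

def count_untrimmed(fastq_gz, primer):
    """Count reads whose 5' end still matches the primer, and total reads."""
    hits = total = 0
    with gzip.open(fastq_gz, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # sequence lines in a FASTQ record
                total += 1
                hits += starts_with_primer(line.strip(), primer)
    return hits, total

# e.g. count_untrimmed("sample_L001_R1_001.trimd.fastq.gz", PRIMER_F)
# should return (0, N) on a properly trimmed file.
```

If the hit count is zero after trimming, the primers really are gone and the problem lies elsewhere.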
Side note: it would be great to have an option to pass linked adapters within QIIME, as described in the cutadapt docs. As it stands now, I can trim these reads using a combination of the `--p-front-*` parameters, but as far as I can tell these aren't recognized as linked in the QIIME implementation.
I can thus confirm that neither standalone cutadapt nor the plugin version within QIIME makes a difference - I get the same error no matter what I feed into DADA2.
Now for the weird part...
I can run DADA2 in single end mode and the process doesn't generate an error:
```shell
qiime dada2 denoise-single \
  --i-demultiplexed-seqs my.trimd_seqs.qza \
  --p-trunc-len 180 \
  --o-table my.table.qza \
  --o-representative-sequences my.repSeqs.qza \
  --o-denoising-stats my.denoisingStats.qza
```
but when I look at the stats, I think it's only passing because it isn't actually denoising: for every sample, the number of filtered reads is identical to the number of denoised reads.
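Rather than eyeballing the visualization, I checked this by diffing the two columns in the exported stats table (a sketch; I'm assuming the exported TSV uses QIIME's usual `sample-id` / `filtered` / `denoised` column names, so adjust to whatever your export actually contains):

```python
import csv
import io

def unchanged_samples(stats_tsv_text):
    """Return sample IDs whose filtered and denoised read counts are identical."""
    rows = csv.DictReader(io.StringIO(stats_tsv_text), delimiter="\t")
    return [r["sample-id"] for r in rows
            if r["sample-id"] != "#q2:types"  # skip QIIME's type row if present
            and int(r["filtered"]) == int(r["denoised"])]

# Illustrative data, not my real stats:
example = """sample-id\tinput\tfiltered\tdenoised
S001\t1021\t978\t978
S002\t993\t940\t911
"""
print(unchanged_samples(example))  # -> ['S001']
```

In my case, every single sample came back in that list, which is what makes me think no denoising happened at all.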
And this isn't what I should be doing anyway - I'm performing a denoising step on paired data while pretending it's single-end. I certainly could join the read pairs first and then run this command, which is my current plan...
One last thought: I took the standalone Cutadapt data (now just my reads, with all primers removed) and tried joining the read pairs with the QIIME installed version of Vsearch. And it worked just fine!
```shell
# "$i" represents one of the 288 samples...
vsearch --threads 24 \
  --fastq_mergepairs "$i"_L001_R1_001.trimd.fastq.gz \
  --reverse "$i"_L001_R2_001.trimd.fastq.gz \
  --fastq_minovlen 160 \
  --fastq_maxdiffs 15 \
  --fastqout "$i".merged.fastq
```
One peculiar thing to note: at the end of this merging step, 1 of the 288 samples contains zero reads. The remaining 287 samples all contain > 0 reads, and after BLASTing a random sampling of those reads I can confirm they're the kind of data I'd expect.
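Finding the empty sample is easy enough to script (a sketch; it assumes plain, uncompressed `*.merged.fastq` files as written by the vsearch command above):

```python
from pathlib import Path

def read_count(fastq_path):
    """Number of records in an uncompressed FASTQ (4 lines per record)."""
    with open(fastq_path) as fh:
        return sum(1 for _ in fh) // 4

def empty_samples(directory="."):
    """List merged FASTQ files that contain zero reads."""
    return sorted(p.name for p in Path(directory).glob("*.merged.fastq")
                  if read_count(p) == 0)

# e.g. empty_samples("merged/") flags the one dud sample so it can be
# excluded before handing the data back to QIIME.
```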
I'm unclear why DADA2 doesn't see amplicons long enough to overlap when the same kinds of data in previous runs have successfully overlapped. I'm wondering if it has something to do with the fact that there are so few reads per sample. The temporary .log file produced by the failed runs suggests that no samples are overlapping, but perhaps there's a bug in the script where, if one sample ends up with zero reads, things go haywire and the system interprets every sample as having no data?
It's all dark magic to me.
Thanks for helping me get to a better place with my data...