Dada2 too few reads retrieved as output

lca123 · August 28, 2018, 3:27pm

Hi there,
I am running Dada2 for a set of paired-end Illumina sequencing where I have soil samples (very diverse) and isolated samples (one or two species). Eg.: sample 1 came from soil; sample 2 came from an isolate and so on.
What I found is that Dada2 filter out almost all the sequences from my isolated samples while it keeps many sequences from the soil ones.

command (took about 10h)

qiime dada2 denoise-paired
--i-demultiplexed-seqs biblios-33-demux.qza
--o-table table-biblios-pe-demux-dada2
--o-representative-sequences rep-seqs-biblios-dada2
--p-trunc-len-f 224 \
--p-trunc-len-r 223
--p-n-threads 14
--p-n-reads-learn 1000000
--p-chimera-method consensus
--output-dir out_dada2

** I have checked the qualities outside qiime by creating a histogram of sizes from all the libraries and both f and r trunc-len represent the smaller representative size of reads I have inside those libraries. So, I expected, to not loose reads because of trunc. Here is an example of Qiime output for the sizes**

Demultiplexed sequence length summary

Forward Reads

|Total Sequences Sampled|10000|
|2%|227 nts|
|9%|227 nts|
...
|98%|234 nts|

Reverse Reads

|Total Sequences Sampled|10000|
|2%|223 nts|
|9%|223 nts|
...
|98%|230 nts|

Dada2 output

sample-id input filtered denoised merged non-chimeric
IG002 49827 41058 41058 13 13 # isolated bacteria
IG037 816626 685547 685547 256971 231045 # soil sample

I then thought that would be a result of the large number of identical sequences in the isolated sample and that, after downstream tax-classification, I would retrieve the species I know it's there (or at least the genera). It is supposed to be a Bacillus samples. nevertheles what I found is that those 13 sequences are Ochrobactrum sp. I would not say there are no Ochrobactrum there (who knows) but I am sure there are Bacillus.

I made another test to check whether the merging step by Dada2 would be the problem here. As you can see in the output above read number drastically falls after this step. I pre-merged the reads and run Dada2 denoise-single and that's what I've got:
sample-id input filtered denoised non-chimeric
IG002merged 49827 39928 39928 39928 # wow
IG037merged 816626 661909 661909 575996 # more reads retrieved

The donwstream classification results look like that for the isolated sample:
D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;__ 39904.0 ---> 99% Bacillus sp.
Still, the single-end Dada2 took 10x more time to run (56h)

Now I don't know the why I am loosing too many reads before the Dada2 merging step.
Also, I would like to ask if there is any way of running the classification with no dereplication step so my pipeline could run faster, if that makes sense...

Thanks in advance and I am learning a lot reading this forum!

LH22 · August 30, 2018, 5:05am

Hi,

Have you removed primers/barcodes from your sequences?
Try setting –p-n-threads to 0 on an overnight run. This should speed up the process as it will use all available cores ( assuming no parallel processing needed).

lca123 · August 30, 2018, 4:16pm

Hi LH22
Thanks!
Yes, the primers/barcodes were previously removed. I am considering the problem is within the merge step. Maybe those reads are loosing the minimum of 20 nts for merging. About the threads, there is not so much to do, unfortunatelly, because I only have 16 available.

Mehrbod_Estaki · August 31, 2018, 5:01am

Hi @lca123,
I think your hunch about lack of overlap is a great, pull on that thread!
What is the primer set you are using? You're truncating 153 bp which leaves a lot of primer sets with insufficient overlap. And it's also possible that some of your expected targets are being discriminated against with the overlap issue and others are not, which may explain why your isolates and soil are showing different patterns, purely based on their composition.

lca123 · August 31, 2018, 1:30pm

Hi Mehrbod,
I am using a V3-V4 primer set: 541-805; and running 2x 250 bp.
After cutting the primers I see the reads are around 232-235bp in average and the merged R1+R2 ~428bp.
I have found the overlap size is ~30 nts in average (which I am considering too small) and that's may be the why. As you mentioned my truncating is leaving no overlaping regions.

Here some statistics about the merging:
363492 Pairs (363.5k)
359949 Merged (359.9k, 99.03%)
297980 Alignments with zero diffs (81.98%)
168 Too many diffs (> 12) (0.05%)
1 Fwd too short (< 64) after tail trimming (0.00%)
0 Rev too short (< 64) after tail trimming (0.00%)
3374 No alignment found (0.93%)
0 Alignment too short (< 16) (0.00%)
4 Staggered pairs (0.00%) merged & trimmed
29.61 Mean alignment length
_ 427.42 Mean merged length_
0.18 Mean fwd expected errors
0.23 Mean rev expected errors
0.29 Mean merged expected errors

Thanks!

Mehrbod_Estaki · August 31, 2018, 5:37pm

Hi @lca123,
So the way I do rough overlap calculations are, expected amplicon length = 805R-541F = 264bp. Overlap = 2x250 cycles - 264 = 236bp! On second thought, this doesn't look like a typical V3-V4 region, in fact based on the 541F this would be more V4 only. Can you double check to make sure that is indeed the right forward primer, not something like 341F? Especially since you mentioned, your merged amplicons are ~428 bp. Double check and let us know, if it is indeed 541F then we might be dealing with a whole other issue of Forward reads reaching the other side of the Reverse primers etc.

For now I'm going to assume it is 341F-805F So that would make your amplicon length 805-541 = ~464bp. Overlap would be 2x250 - 464 = 36bp. Notice how I don't include the primer trimming in these calculations since those come off the 5' of our reads and overlap occurs at the 3', so won't be affected. With your current truncating parameters you are truncating (250-224) + (250-223) = 53bp which is indeed more than the 36bp overlap. So clearly there's probably some merging issues. Might be worth doing the forward reads only if you can't find enough overlap with this primer set.

Edit: Where are these merging statistics you reported from? They seem different than the DADA2 output from earlier.

lca123 · August 31, 2018, 8:34pm

Sorry, my mistake, these are 341F primers. So your calculations are correct.
These statistics came from Usearch v10 and represent data from one paired-end sample (R1+R2). The Dada2 output is from seven samples (R1+R2) as I run them together.

Mehrbod_Estaki · September 1, 2018, 3:37am

@lca123

Thought so! This primer set is hard to work with unless you have 2x300 reads since the 2x250 barely gives it enough overlap without dipping into rather poor quality on the 3'. You'll have to play around with it to see if your reads are good enough quality otherwise just stick with the forward reads to be safe.

system · October 2, 2018, 9:37am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.