Why do I get more sequences (and therefore more taxonomic groups) when I use only my forward reads?

The parameters I used were the following.

Since I don't have the adapter sequences, only their length (barcodes + adapters), I used that length to set the trim parameters.
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 18 \
  --p-trim-left-r 26 \
  --p-trunc-len-f 315 \
  --p-trunc-len-r 220 \
  --o-table tablen.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza

The results I obtained were

vs the results using only the forward sequences

As you can see, using both forward and reverse reads I end up with 26,855 sequences vs 40,361 using only the forward reads. This is a huge difference when I assign taxonomy.

I am not sure if I'm cutting too much or I'm not cutting enough or I'm missing something.

Thanks in advance !

Hello @DoctorC, welcome to QIIME 2!

This is not unusual if you have many paired-end reads that fail to merge, which can drastically reduce the number of sequences. How many failed to merge? This should be in your DADA2 denoising stats output.

Here are the denoising stats, ordered by % of input merged.

If only the reads that fail to merge are removed, I would expect the sequence count to be higher when I use both forward + reverse reads (assuming that some percentage of the reverse reads is not discarded).

How can some low-quality reverse reads lead to the removal of forward sequences?

I've been reading some related publications, but I haven't found the answer.

Remember, you are using the denoise-paired option of DADA2. That is, both the forward and reverse reads must pass quality checks, otherwise the read pair is discarded. Even after passing this stage, if the two reads cannot be merged, the read pair is discarded.
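
To see which step is responsible, it can help to tally how many read pairs are dropped between consecutive stages of the denoising stats. The following is a minimal sketch, not part of QIIME 2: `stage_losses` is a hypothetical helper, the stage order (input, filtered, denoised, merged, non-chimeric) is assumed to match your stats table, and the counts are made up for illustration.

```python
def stage_losses(input_reads, filtered, denoised, merged, non_chimeric):
    """Return how many read pairs were lost at each step, in pipeline order."""
    counts = [
        ("input", input_reads),
        ("filtered", filtered),
        ("denoised", denoised),
        ("merged", merged),
        ("non-chimeric", non_chimeric),
    ]
    losses = {}
    # Compare each stage with the one before it.
    for (_, before), (name, after) in zip(counts, counts[1:]):
        losses[name] = before - after
    return losses


# Made-up counts for a single sample (not taken from this thread).
example = stage_losses(10000, 8000, 7800, 3000, 2900)
worst_step = max(example, key=example.get)
print(example)
print("largest loss at:", worst_step)
```

If the largest drop is at the merge step, as in this made-up sample, the truncation values and the resulting overlap are the first thing to check.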

The two main reasons why reads can fail merging are:

  1. Low-quality base calls in the region of overlap. That is, there is an increase in mismatching base calls at the same position. For example, the forward read may have a low-quality base call of an A while the reverse read has a low-quality base call of a C at that position. Too many of these mismatches will cause merging to fail.
  2. The reads are not long enough to overlap. For more details see this thread.

As you can see, these two issues can be related. That is, a short overlap combined with low-quality bases in the region of overlap can cause read merging to fail.
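
The first failure mode can be illustrated with a toy comparison of the overlapping ends of a read pair. This is only a sketch of the idea, not DADA2's merging code: the function names and the short sequences are made up, and it assumes the overlap length is already known and the reads contain no indels.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_mismatches(forward, reverse, overlap):
    """Count disagreements between the last `overlap` bases of the forward
    read and the first `overlap` bases of the reverse-complemented reverse read."""
    if overlap <= 0 or overlap > min(len(forward), len(reverse)):
        raise ValueError("overlap must be between 1 and the shorter read length")
    fwd_tail = forward[-overlap:]
    rev_head = reverse_complement(reverse)[:overlap]
    return sum(a != b for a, b in zip(fwd_tail, rev_head))


# Toy 14-base "amplicon"; the two reads overlap on the 4 bases "CAAG".
amplicon = "ACGTTGCAAGGCTA"
forward = amplicon[:10]
clean_reverse = reverse_complement(amplicon[6:])
# Same reverse read, but with one miscalled base inside the overlap (A read as C).
noisy_reverse = reverse_complement("CCAGGCTA")

print(overlap_mismatches(forward, clean_reverse, 4))
print(overlap_mismatches(forward, noisy_reverse, 4))
```

With the clean reverse read the two ends agree, while the noisy one disagrees at one position. The more low-quality bases fall inside the overlap, the more such disagreements accumulate.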

I'd suggest truncating your reads a bit more if you can. See the thread I linked above to guide your truncation values. Also, what gene / gene region are you sequencing? 16S rRNA gene? V3V4?

I didn't know this. That's why I'm losing so many sequences.
Thanks a lot for explaining!

Sorry, I forgot to mention: I'm working with the 16S rRNA gene, V3V4 region, using Illumina MiSeq 300 bp paired-end reads (2x300).

I'd like to know your recommendation.
Should I change the parameters, or should I use only the forward reads?
It is a marine sediment study.

The thread I linked to you earlier should help with that. Another couple of good threads that'll help are:

Determining the truncation values can be tricky. Given these and the earlier thread, you should be able to estimate a reasonable set of truncation values that satisfies the minimum overlap requirement. For DADA2, the minimum overlap is ~12 nucleotides.

That is, the point of truncating is to remove the low-quality bases and reduce the mismatching base calls we discussed earlier. So, as long as you retain enough sequence for the minimum overlap, you should be good to go. If you are unable to retain an acceptable number of merged reads per sample, then you can consider processing only the forward reads.
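
As a rough check of candidate truncation values, the expected overlap can be estimated as the sum of the two truncation lengths minus the amplicon length. The sketch below is only an approximation under stated assumptions: `expected_overlap` and `has_enough_overlap` are hypothetical helpers, the 460 bp amplicon length is an assumed figure for V3V4 that you should replace with the length of your own amplicon (measured the same way as your reads, i.e. including primers if the reads still contain them), and the truncation values shown are illustrative, not a recommendation.

```python
MIN_OVERLAP = 12            # approximate DADA2 minimum overlap mentioned above
ASSUMED_AMPLICON_LEN = 460  # assumption: replace with your own amplicon length


def expected_overlap(trunc_len_f, trunc_len_r, amplicon_len):
    """Estimate the overlap (in bases) left after truncating both reads.

    A negative value means the truncated reads no longer reach each other.
    """
    return trunc_len_f + trunc_len_r - amplicon_len


def has_enough_overlap(trunc_len_f, trunc_len_r, amplicon_len,
                       min_overlap=MIN_OVERLAP):
    """True if the estimated overlap meets the minimum required for merging."""
    return expected_overlap(trunc_len_f, trunc_len_r, amplicon_len) >= min_overlap


# Illustrative truncation values only.
for trunc_f, trunc_r in [(280, 200), (240, 200)]:
    overlap = expected_overlap(trunc_f, trunc_r, ASSUMED_AMPLICON_LEN)
    ok = has_enough_overlap(trunc_f, trunc_r, ASSUMED_AMPLICON_LEN)
    print(trunc_f, trunc_r, overlap, "ok" if ok else "too short")
```

Because amplicon length varies between taxa, it is safer to leave some margin above the minimum rather than aim for exactly 12 bases.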

I appreciate the help and the fast responses!

I'll read those threads and rerun my analysis.
I hope this helps someone else.

Thank you @SoilRotifer !