Losing too many reads in DADA2 denoising step & polyC and polyG tails in NextSeq data

Hello everyone,

I'm working with targeted amplicon data (~460 bp) for 18S, sequenced using a NextSeq 300 bp paired-end kit. However, I'm encountering significant read loss during the DADA2 denoising step.

Issue:

  • Almost all reads are lost during the denoising step.
  • There are polyC and polyG tails present in the majority of the reads.
  • Disabling truncation for both forward and reverse reads slightly improves the number of filtered reads, but all of these are subsequently lost at the chimera removal step.

Steps Taken:

  • I have tried running the DADA2 pipeline without truncation for both forward and reverse reads.
  • Despite this, the issue persists with most reads being lost at the chimera removal step.

Code:

nohup qiime dada2 denoise-paired --i-demultiplexed-seqs tprimer-demux.qza --o-table table-nt --o-representative-sequences rep-seqs-nt --p-trunc-len-f 0 --p-trunc-len-r 0 --output-dir Denoised-nt &
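To pin down which stage actually loses the reads (filtering, denoising, merging or chimera removal), the per-sample denoising stats can be turned into percentages of input. The following is only a rough sketch: it assumes the stats artifact ended up in Denoised-nt as denoising_stats.qza and exports to a tab-separated stats.tsv with columns named "input", "filtered", "denoised", "merged" and "non-chimeric". The function name summarise_stats and the stats-export directory are made up for illustration.

```shell
#!/usr/bin/env bash
# Assumed export step (file and directory names are guesses, adjust to your run):
#   qiime tools export --input-path Denoised-nt/denoising_stats.qza --output-path stats-export
#   summarise_stats stats-export/stats.tsv

# Print, for each sample, the share of input reads surviving each DADA2 stage.
# Assumes a tab-separated table whose first row is a header containing the
# column names "input", "filtered", "denoised", "merged" and "non-chimeric".
summarise_stats() {
  awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }  # map column name -> index
    /^#/    { next }                                          # skip "#q2:types"-style rows
    $(col["input"]) > 0 {
      printf "%s", $1
      n = split("filtered denoised merged non-chimeric", stage, " ")
      for (s = 1; s <= n; s++)
        printf "\t%s=%.1f%%", stage[s], 100 * $(col[stage[s]]) / $(col["input"])
      printf "\n"
    }' "$1"
}
```

A sharp drop between "denoised" and "merged" would point to reads that no longer overlap, while a drop at "non-chimeric" would point more towards leftover primer or adapter sequence.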

Any suggestions on how to address these issues would be greatly appreciated. Specifically, I'm looking for advice on:

  1. Handling the polyC and polyG tails.
  2. Improving read retention through the denoising and chimera removal steps.

Thank you in advance for your help!

The outputs are attached for your expert opinion.


Hi @abdulghafar

Did you trim the adapters/primers from your reads before denoising? Trimming should chop off anything preceding the primer/adapter sequence, and DADA2 requires that your reads contain only biological sequence, i.e. no adapters/primers.

I would use cutadapt (docs for cutadapt are here) like so for 16S data:

qiime cutadapt trim-paired \
--i-demultiplexed-sequences paired_end_filtered.qza \
--p-front-f GTGYCAGCMGCCGCGGTAA \
--p-front-r GGACTACNVGGGTWTCTAAT \
--o-trimmed-sequences paired_end_filtered_cutadapt.qza

best,

Vic


Thanks a lot, dear @buzic. Yes, I have already trimmed the primers and adapters using cutadapt as follows:

# Trim the forward (-g) and reverse (-G) primer/adapter from each library listed in libnames.txt
while read -r line; do
  cutadapt \
    -g GTGACCTATGAACTCAGGAGTCGAGGTAGTGACAAGAAATAACAATA \
    -G CTGAGACTTGCACATCGCAGCTCTTCGATCCCCTAACTTTC \
    -n 3 -m 1 \
    -o subsampletrim2-18S/"${line}_R1_trim.fastq.gz" \
    -p subsampletrim2-18S/"${line}_R2_trim.fastq.gz" \
    18ssubsample/"${line}_R1.fastq.gz" \
    18ssubsample/"${line}_R2.fastq.gz"
done < 18ssubsample/libnames.txt > out-sub2-18S.txt 2> out-sub2-18S.err

I have two separate sets of 16S and 18S data from the same samples, run at the same Illumina facility, and the results are the same for both, i.e. losing almost 90% of reads, with polyC and polyG sequences at the beginning and end of almost all merged reads.

I also thought that trimming primers should remove the polyC/polyG, but it seems like it's not happening. The output file (out-sub2-18S.txt) shows that primers/adapters have been trimmed from all reads.

Could this be caused by sequencing errors from the sequencer itself?

Thanks a lot

Abdul


Hi again,

Oh, how strange, I would also assume the primer trimming would deal with it. Did you receive the data with the Illumina adapters already removed? I guess if you say this:

The trimming has worked as expected. It's odd to me that those are inside the primer regions.

I think cutadapt outside of QIIME 2 has a specific setting for NextSeq data. The two-colour chemistry is known to produce poly-G tails: no signal is read as a G base, so when the chemistry runs out the sequencer thinks it's seeing G's. Cutadapt also has a poly-A trimmer; both are explained here.
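To get a feel for how widespread the G tails are before and after trimming, something like the following could be used. It is only a sketch: the function name polyg_tails and the 10-base threshold are arbitrary choices of mine, and it assumes plain 4-line FASTQ records on standard input. The commented cutadapt call uses the NextSeq-specific option with placeholder file names.

```shell
#!/usr/bin/env bash
# NextSeq-aware quality trimming with cutadapt (file names are placeholders):
#   cutadapt --nextseq-trim=20 -o R1.trimmed.fastq.gz -p R2.trimmed.fastq.gz R1.fastq.gz R2.fastq.gz

# Count reads whose 3' end is a run of at least MIN_RUN G bases (default 10).
# Reads FASTQ from stdin and assumes exactly four lines per record.
polyg_tails() {
  local min_run="${1:-10}"
  awk -v n="$min_run" '
    NR % 4 == 2 {                                   # sequence line of each record
      total++
      if (match($0, /G+$/) && RLENGTH >= n) tails++ # RLENGTH = length of trailing G run
    }
    END { printf "%d/%d reads end in >=%d G\n", tails, total, n }
  '
}
```

For gzipped files it can be fed from a pipe, e.g. `gzip -dc sample_R1.fastq.gz | polyg_tails 10`, where sample_R1.fastq.gz stands in for one of your own files.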

let us know how you get on,

Vic


Thanks @buzic

I have tried cutadapt's --nextseq-trim=20 option.

It removed all the polyC and polyG tails, but in doing so I am left with just under 0.5% of the total input reads at the final denoising step (primarily because both forward and reverse reads are now shorter, so they no longer overlap and cannot be joined). So, to me, it looks like there is something wrong with the sequencing platform, and I am now thinking of contacting the facility manager. I used all the positive controls and a few samples for this trial, and all I am left with is 18 sequences of varying lengths (~220 bp to 380 bp) for a region of 450 bp.

I am looking for your and other colleagues' opinions on the way forward in this scenario.
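The overlap reasoning can be checked with simple arithmetic: the overlap is the forward length plus the reverse length minus the amplicon length. A small sketch, where the function names are just for illustration and the default minimum overlap of 12 bp is my assumption about the DADA2 merging default:

```shell
#!/usr/bin/env bash
# Overlap (bp) between a read pair: forward length + reverse length - amplicon length.
# A negative value means there is a gap between the two reads.
overlap_bp() {
  local amplicon="$1" len_f="$2" len_r="$3"
  echo $(( len_f + len_r - amplicon ))
}

# Say whether a pair could be merged, given a minimum required overlap
# (fourth argument, assumed default of 12 bp).
can_merge() {
  local amplicon="$1" len_f="$2" len_r="$3" min_overlap="${4:-12}"
  if [ "$(overlap_bp "$amplicon" "$len_f" "$len_r")" -ge "$min_overlap" ]; then
    echo "mergeable"
  else
    echo "too short"
  fi
}
```

For example, `overlap_bp 450 300 300` gives 150, whereas `overlap_bp 450 220 220` gives -10, so a pair trimmed down to 220 bp each cannot span a 450 bp region.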

Thanks again

Abdul
