Question about DADA2 filtering and chimera detection

Hi everyone, i was performing 16S 2x300 analysis and i have a few questions about DADA2 filtering and chimera detection.

I've cleaned adapters with cutadapt using this

cutadapt
-g CCTACGGGNGGCWGCAG
-G GACTACHVGGGTATCTAATCC
-Q 0
--discard-untrimmed --pair-filter=any
-m 50
-j 30
-o "cleaned_4_20/{sample}_R1_trimmed.fastq.gz" \ -p "cleaned_4_20/{sample}_R2_trimmed.fastq.gz"
"$r1" "$r2"

Then i merged them with PEAR and then i did the trimming with fastp.

pear -f cleaned_4_20/{sample}_R1_trimmed.fastq.gz \ -r cleaned_4_20/{sample}_R2_trimmed.fastq.gz
-o cleaned_4_20/${sample}_merged
-p 1.0 -v 10 -j 25

fastp -i cleaned_4_20/{sample}_merged.assembled.fastq \ -o cleaned_4_20/{sample}_assembled_trimmed.fastq
-l 35 --cut_mean_quality 20 -5 -3 -w 12

I've seen that the 2 peaks of lenght in the lenght distribution graph is around 400+ , so I've imported the merged reads and then i ran:

qiime dada2 denoise-single --i-demultiplexed-seqs import.qza --p-n-threads 0 --p-trim-left 0 --p-trunc-q 0 --p-trunc-len 380 --p-chimera-method pooled --o-table table.qza --o-representative-sequences rep-seqs.qza --o-denoising-stats denoising-stats.qza

But i have some doubt:

  • Why does filtering happens here if i put --p-trunc-q 0?
  • Also, why do i get such high number of chimeric sequences?

You can find everything in here:
rep-seqs.qzv (540.5 KB)
denoising-stats.qzv (1.2 MB)
summary_import.qzv (297.2 KB)

Thank you for your time and help!

1 Like

Hi @Phoe

On the filtering, the truncation filter excludes any sequences shorter then 380, so you may check how many there are in your initial sequences.
On the number of chimeric sequences you may try to change the chimera detection method, to see if it gives better results.

That said, I would like to highlight the fact that dada2 it is not suitable for pre-merged sequences because the quality scores are defined by pear in your case and dada2 requires the original quality scores. You can pre-merge your sequences if you like, but in this case you should use deblur plug in for the denoising step. Or just use the dada2 pair command and the cutadapt output.

I hope it helps
Luca

2 Likes

Hello! Thank you for your answer.

The sequences excluded are not so much, but i was expecting to lose some of them (just not this much).

I'm curious on the fact you mentioned about "original quality scores" and I'd like to know more about it. What you mean by that and how the "trimmed" reads may influence dada2?

1 Like

Hi @Phoe
Sorry I was not so clear. The trimming does not alter the quality score, neither the cutadapt adapter trimming nor the trimming setting within dada2.
However, the merging step does alter the quality score. The quality score for the sequence merged by pear is derived by aggregating the quality score for read 1 or read 2 (it may be just the sum of them but I don't recall the exact way pear works at the moment), but for sure the quality score for the overlapping sequences is different from the quality score for this region in the separate R1 and R2. This is enough for invalidate dada2 expectations!

I hope it makes more sense now
Luca

2 Likes

Thank you @Ilenzi !

I imported my reads and removed adapter+primer using cutadapt inside qiime2.

qiime cutadapt trim-paired --i-demultiplexed-sequences import.qza --p-cores 30 --p-front-f CCTACGGGNGGCWGCAG --p-front-r GACTACHVGGGTATCTAATCC --p-match-adapter-wildcards --p-discard-untrimmed --o-trimmed-sequences import_cutadapt.qza

Now, my quality looks like this:

summary_import_cutadapt.qzv (322.7 KB)

Would you please guide me to where should i trim my reads in order to get the overlap? Could --p-trunc-len-f 257 --p-trunc-len-r 218 work or am i being too soft?

Hi @Phoe,

Ok well done! So we can start from here!

Before answering, can I ask you if you can open a new thread? Please because this question is now out of topic!

Now, to answer your question. A topic of interest may be the following:

So you can have the general idea on what to do. I do think that the trimming option you propose are very permissive, and I usually go stricter (such as -p-trunc-len-f 245; -p-trunc-len-r 180). However, given that the quality of your reverse read drop so early, you would loose the overlapping region completely (please note dada2 requires at least 12 bp of overlap). So you may start with your parameters and see how many sequence you get at the end of the dada2 process. Then try more stringent settings. This try and repeat is very normal with this type of analysis so I would not expect to get it right at first try!
In worst case, you may savage your data by using forward read only (and dada2 single as well!)

Hope it helps
Luca