Low % of reads pass merging and chimera filter after DADA2 for rSSU (Fungi)

PoitrasAMFresearch · September 29, 2025, 9:06pm

Hi all,

FYI, I am a beginner in bioinformatics. I am using QIIME2 (2024.2) in a conda environment. I am working with 147 samples of AMF rSSU amplicons with a median length of 469bp (should be 550bp in theory). I had a max of 448bp and a min of 281bp. Across my 147 samples, I have 5 replicates per treatment combination.
I removed the sequences beginning with my primer sequences (WANDA & AML2) with Cutadapt, but a few samples still had them afterwards. In 4 samples, 124, 147, 233, and 337 sequences still began with the forward primer sequence. About 45 other samples have 1-3 reads beginning with the forward primer sequence. No sample has any reads beginning with the reverse primer sequence.

My quality plots looked very good as I benefited from the new MiSeq i100 of the sequencing platform I dealt with. See my quality plots photo attached.

So I decided not to trim or truncate. After denoising with DADA2 with these settings:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-trimmed.qza \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 0 \
  --p-n-threads 0 \
  --p-max-ee-f 2 \
  --p-max-ee-r 3 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza

I end up with a very variable percentage of input reads that are non-chimeric: between 5.59% and 69.8% (median 21.17%). Filtering removed only 0.1% to 4% of the input. The bottleneck seems to be the merging of my paired-end reads, where I'm losing between 5% and 91% (median 68%). For the samples that merged well (<20% lost at merging), I lose about 30-40% at chimera removal and end up with a good final percentage of input retained (between 40% and 70%). Those might represent only a third of my samples.
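For anyone wanting to reproduce this kind of per-stage breakdown, here is a minimal sketch. It assumes the stats were exported with `qiime tools export --input-path denoising-stats.qza --output-path exported-stats`, and that the resulting `stats.tsv` has the column order noted in the comments (worth checking against your own file). The function name `stage_losses` and the directory name `exported-stats` are hypothetical.

```shell
#!/bin/sh
# Per-sample percentage of INPUT reads lost at three DADA2 stages.
# Assumed column order in stats.tsv (verify against your own export):
#   1 sample-id, 2 input, 3 filtered, 5 denoised, 6 merged, 8 non-chimeric
# Output columns: sample-id, % lost at filtering, % lost at merging, % lost as chimeras
stage_losses() {
  awk -F '\t' '
    # Skip the header row, the "#q2:types" row, and samples with no input reads
    NR == 1 || $1 ~ /^#/ || $2 == 0 { next }
    {
      printf "%s\t%.1f\t%.1f\t%.1f\n", $1,
        100 * ($2 - $3) / $2,
        100 * ($5 - $6) / $2,
        100 * ($6 - $8) / $2
    }' "$1"
}

# Usage (after exporting the stats):
#   stage_losses exported-stats/stats.tsv | sort -t "$(printf '\t')" -k3,3nr | head
```

Sorting on the third column, as in the usage comment, puts the samples with the heaviest merging losses first. The small loss between the filtered and denoised columns is not reported by this sketch.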

The number of reads retained after all of these denoising steps is between 1,278 and 99,385, with a median of 26,607. Only 6 samples are <10,000. I would probably drop those later when choosing a sampling depth (assuming it makes sense based on the rarefaction plots).

My questions are:

Do you spot any beginner mistake that I am oblivious to?
Should I be concerned with the low % of reads that passed the filters of DADA2 even knowing that I am working with root samples that were inoculated with low AMF abundance/diversity?
Should I modify my pipeline or any settings to increase the retention of reads?
Do you need more information to answer the previous questions? I am happy to provide more.

Thank you for your input and your help, it would be much appreciated.

Jérémie

cherman2 · October 14, 2025, 10:17pm

Hi @PoitrasAMFresearch,

These are not beginner mistakes, but I think AMF sequences are a little tricky, so let's discuss some issues and steps forward!

I personally would spend some time adjusting parameters to get a higher percentage of sequences through. We would expect a smaller number of sequences since AMF abundance is low, but I would still expect a higher percentage to pass the filters.

@SoilRotifer pointed out to me that the more your reads overlap, the greater the chance of a mismatch in the overlap, which will contribute to merge failures if the pair hits the mismatch threshold. He recommends trimming 20-30 bases regardless of quality.

I would point out that there is a chance that your AMF SSU region is too long to be covered by 2x300 sequencing, and in those cases there is nothing to be done about the failure to merge.
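To make the overlap trade-off concrete, here is a rough back-of-the-envelope sketch. The function names are hypothetical, and the numbers are illustrative only: it uses the 2x300 read length mentioned above and the 469bp and 550bp amplicon lengths from the original post, and it assumes read lengths and amplicon length are measured on the same footing (for example, all after primer removal, which will shorten the reads below 300). The 12-base minimum is DADA2's default minimum overlap for merging.

```shell
#!/bin/sh
# Expected overlap (in bases) between the forward and reverse read of one amplicon.
# Arguments: forward read length, reverse read length, amplicon length.
expected_overlap() {
  echo $(( $1 + $2 - $3 ))
}

# DADA2 only merges a pair when the overlap reaches a minimum (12 bases by default)
MIN_OVERLAP=12

# Prints "yes" if the expected overlap is long enough to merge, "no" otherwise
can_merge() {
  overlap=$(expected_overlap "$1" "$2" "$3")
  if [ "$overlap" -ge "$MIN_OVERLAP" ]; then echo yes; else echo no; fi
}

# Median observed amplicon (469bp): a long overlap, with room to truncate
expected_overlap 300 300 469   # 131
expected_overlap 270 270 469   # 71

# Theoretical amplicon (550bp): a short overlap that truncation removes entirely
expected_overlap 300 300 550   # 50
can_merge 270 270 550          # no
```

So the same 30-base truncation that safely shortens a long overlap for the shorter amplicons would make the longest ones impossible to merge, which is one reason to check how amplicon length varies across samples before settling on truncation values.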

I would experiment with truncating and see what that does to your percentage of sequences passed. I personally like to change only one parameter at a time. Once you have that sorted and have maximized the number of sequences that are mergeable, I would look over this post to address your chimera detection issue:
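The one-parameter-at-a-time truncation test could be scripted roughly as follows. This is only a sketch: it reuses the settings from the original command and varies `--p-trunc-len-r` alone, while the candidate lengths, the `trunc_sweep` function name, and the output file names are hypothetical. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Dry run by default: each command is printed, not executed.
# Set RUN to an empty string (RUN= sh sweep.sh) to actually run QIIME 2.
run=${RUN-echo}

# One denoise-paired run per candidate reverse truncation length.
# Every other setting is held fixed at the values from the original post.
trunc_sweep() {
  for trunc_r in "$@"; do
    $run qiime dada2 denoise-paired \
      --i-demultiplexed-seqs demux-trimmed.qza \
      --p-trunc-len-f 0 \
      --p-trunc-len-r "$trunc_r" \
      --p-n-threads 0 \
      --p-max-ee-f 2 \
      --p-max-ee-r 3 \
      --o-table "table-r${trunc_r}.qza" \
      --o-representative-sequences "rep-seqs-r${trunc_r}.qza" \
      --o-denoising-stats "denoising-stats-r${trunc_r}.qza"
  done
}

# Hypothetical candidates; 0 means no truncation and serves as the baseline
trunc_sweep 0 280 270 260
```

Keep in mind that reads shorter than the truncation length are discarded, so after primer removal an aggressive value can cost reads at the filtering step instead of the merging step. Comparing the denoising stats from each run shows which way the trade goes.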

I hope this helps and let us know if you have any follow-up questions.