Hi all,
FYI, I am a beginner in bioinformatics. I am using QIIME 2 (2024.2) in a conda environment. I am working with 147 samples of AMF rSSU amplicons with a median length of 469 bp (should be 550 bp in theory). I had a max of 448 bp and a min of 281 bp. Across my 147 samples, I have 5 replicates per treatment combination.
I removed the sequences beginning with my primer sequences (WANDA & AML2) with Cutadapt, but a few samples still had them afterwards. In 4 samples, 124, 147, 233, and 337 sequences still began with the forward primer sequence. About 45 other samples have 1-3 reads beginning with the forward primer sequence. No sample has any reads beginning with the reverse primer sequence.
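For anyone who wants to reproduce this kind of check, here is a minimal Python sketch of how reads beginning with a primer could be counted. It is not the exact method I used: the function names and the example primer `ACGYN` are hypothetical placeholders (not WANDA or AML2), and it assumes an exact match at the 5' end, with IUPAC degenerate bases allowed and no mismatches or indels.

```python
import gzip
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a (possibly degenerate) primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def count_reads_starting_with(sequences, primer):
    """Count sequences whose 5' end matches the primer exactly."""
    pattern = primer_regex(primer)
    # re.match only matches at the start of the string.
    return sum(1 for seq in sequences if pattern.match(seq.upper()))


def fastq_sequences(path):
    """Yield the sequence line of each record in a FASTQ file (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of every 4-line record
                yield line.strip()


# Hypothetical reads and primer, for illustration only.
reads = ["ACGCTTT", "ACGTATT", "TTACGCT", "ACGAGGG"]
print(count_reads_starting_with(reads, "ACGYN"))
```

On real data, the same count could be obtained with `count_reads_starting_with(fastq_sequences(path), primer)`, where `path` points to one sample's forward-read FASTQ file.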
My quality plots looked very good, as I benefited from the new MiSeq i100 at the sequencing platform I used. See the attached image of my quality plots.
So I decided not to trim or truncate. After denoising with DADA2 using these settings:
qiime dada2 denoise-paired \
--i-demultiplexed-seqs demux-trimmed.qza \
--p-trunc-len-f 0 \
--p-trunc-len-r 0 \
--p-n-threads 0 \
--p-max-ee-f 2 \
--p-max-ee-r 3 \
--o-table table.qza \
--o-representative-sequences rep-seqs.qza \
--o-denoising-stats denoising-stats.qza
I end up with a highly variable percentage of input reads that are non-chimeric: between 5.59% and 69.8% (median 21.17%). Filtering removed only 0.1 to 4% of the input. The bottleneck seems to be the merging of my paired-end reads, where I'm losing between 5% and 91% (median 68%). For the samples that merged well (<20% lost at merging), I lose about 30-40% at chimera removal and end up with a good final percentage of input retained (between 40% and 70%). Those may represent only about a third of my samples.
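To make the arithmetic behind these percentages explicit, here is a minimal Python sketch of how the per-step losses can be derived from the read counts in the denoising stats. The function name and the counts in the example are hypothetical, not values from my data, and it assumes every loss is expressed as a percentage of the input reads. The loss at merging in this sketch also includes any reads dropped at the denoising step itself.

```python
def step_losses(input_reads, filtered, merged, non_chimeric):
    """Return the percentage of input reads lost at each step and retained overall."""
    if input_reads <= 0:
        raise ValueError("input_reads must be positive")

    def pct(n):
        return 100.0 * n / input_reads

    return {
        "lost_at_filtering": pct(input_reads - filtered),
        # Includes reads dropped during denoising as well as failed merges.
        "lost_at_merging": pct(filtered - merged),
        "lost_at_chimera_removal": pct(merged - non_chimeric),
        "retained": pct(non_chimeric),
    }


# Hypothetical counts for one sample, for illustration only.
for step, value in step_losses(1000, 980, 400, 250).items():
    print(f"{step}: {value:.1f}%")
```

The three losses and the retained percentage always sum to 100%, which makes it easy to see which step dominates for a given sample.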
The number of reads retained after all these denoising steps ranges from 1,278 to 99,385, with a median of 26,607. Only 6 samples have fewer than 10,000 reads. I would probably drop those later when choosing a sampling depth (assuming it makes sense based on the rarefaction plots).
My questions are:
- Do you spot any beginner mistake that I am oblivious to?
- Should I be concerned with the low % of reads that passed the filters of DADA2 even knowing that I am working with root samples that were inoculated with low AMF abundance/diversity?
- Should I modify my pipeline or any settings to increase the retention of reads?
- Do you need more information to answer the previous questions? I am happy to provide more.
Thank you for your input and your help; it is much appreciated.
Jérémie
