Hi, I have seen many posts about this subject on the forum, but I noticed they mostly came down to three problems: primers not removed, primers removed before filtering on read length, or low-quality data.
I did control for these steps, but I still lose 60-85% of my reads, which is far more than the percentages table shown in one of the earlier posts (sorry, I cannot find it back, so I can't link it anymore).
So I hope someone can tell me what I am doing wrong or what I can change.
I am denoising gut microbiome data: 16S V3-V4 with 250 bp paired-end reads.
Primers: 341F (CCTACGGGNGGCWGCAG) and 785/805R (GACTACHVGGGTATCTAATCC)
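For context, some back-of-the-envelope arithmetic for this primer pair. The ~465 bp amplicon length is an approximate, commonly cited figure for 341F/805R (real V3-V4 amplicons vary by taxon), so treat the numbers below as ballpark values, not exact:

```python
# Approximate amplicon arithmetic for 341F/805R (V3-V4).
# The 465 bp amplicon length is an approximation; real lengths vary.

FWD_PRIMER = "CCTACGGGNGGCWGCAG"      # 341F, 17 nt
REV_PRIMER = "GACTACHVGGGTATCTAATCC"  # 785/805R, 21 nt

amplicon_with_primers = 465                     # approximate
insert_len = amplicon_with_primers - len(FWD_PRIMER) - len(REV_PRIMER)  # ~427 bp

read_len = 250
fwd_after_trim = read_len - len(FWD_PRIMER)     # ~233 bp left after primer removal
rev_after_trim = read_len - len(REV_PRIMER)     # ~229 bp left after primer removal
overlap = fwd_after_trim + rev_after_trim - insert_len  # ~35 bp to merge with

print(insert_len, fwd_after_trim, rev_after_trim, overlap)
```

So after primer removal the reads are already well under 250 bp, which matters for the truncation settings discussed below.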
Welcome to the forum!
I think the problem is the truncation value for the forward reads:
You need to set a lower truncation value for the forward reads, or disable truncation, because all reads shorter than the truncation length (250 here) are filtered out (check the 'percentage of input passed filter' column).
Since the V3-V4 region is quite long, if after lowering the forward truncation the stats show you are still losing a lot of reads at the merging step, you can also decrease the minimum overlap parameter and/or disable truncation completely.
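To make the "shorter reads are discarded" rule concrete, here is a minimal Python sketch of the truncation behavior (an illustration of the rule, not DADA2's actual code; the 17 nt primer length matches 341F above):

```python
# Sketch of the truncation rule: a read shorter than trunc_len is
# discarded outright, not merely left untrimmed.

def truncate(read, trunc_len):
    """Return the truncated read, or None if it is too short (discarded)."""
    if trunc_len == 0:        # 0 disables truncation in the QIIME 2 plugin
        return read
    if len(read) < trunc_len:
        return None           # whole read filtered out
    return read[:trunc_len]

# A 251 bp raw read minus a 17 nt forward primer leaves ~234 bp,
# so a forward trunc-len of 250 throws away every read:
primer_trimmed = "A" * (251 - 17)
print(truncate(primer_trimmed, 250))        # None -> read lost
print(len(truncate(primer_trimmed, 230)))   # 230 -> read kept
```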
denoising-statsHBSses2.qzv (1.2 MB)
That is this one. I only disabled truncation on the forward reads; the reverse truncation stayed at 232. The trim parameters stayed the same.
I also looked at the length summary in the demux file, and it says all my reads are 251 bp.
Thank you for sharing your stats.
I think the percentage of reads passed through the filters is relatively low due to a high probability of errors in the filtered-out reads. At the same time, as @jwdebelius (thanks!) noticed, you still have more than enough reads to proceed with the analysis.
Because of the natural variation of V3-V4 length among bacteria, I would also like to suggest decreasing the minimum overlap parameter to recover more reads at the joining step.
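The merging arithmetic behind that suggestion can be sketched as follows. Reads merge only when the truncated forward and reverse lengths exceed the (variable) insert length by at least the minimum overlap; the 400-455 bp insert range below is an assumed illustration of V3-V4 length variation, and the default of 12 for `--p-min-overlap` should be checked against your QIIME 2 version:

```python
# Sketch: merged-overlap arithmetic for paired-end joining.

def overlap(trunc_f, trunc_r, insert_len):
    """Bases of overlap between truncated forward/reverse reads."""
    return trunc_f + trunc_r - insert_len

MIN_OVERLAP = 12  # assumed default for --p-min-overlap; check your version

# Primer-trimmed V3-V4 inserts vary by taxon; try a few plausible lengths
# with reads truncated at their full primer-trimmed lengths (233 and 229):
for insert_len in (400, 427, 455):
    ov = overlap(233, 229, insert_len)
    print(insert_len, ov, ov >= MIN_OVERLAP)
```

The longest inserts fall below the minimum overlap first, which is why lowering `--p-min-overlap` (or truncating less) recovers reads from the longer-amplicon taxa.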
In the end, the first option I tried gave the best results, so I will perform the analyses on the remaining reads (which are indeed a lot). Still, I'm curious about the reason why the reads are discarded in the filtering step. Can this introduce a bias? Is it because I had such a high number of reads to start with that it included many duplicates?
Sorry for a long silence.
These new NovaSeq-like quality plots always confuse me. But one can see some decline in the quality scores starting from the middle of the reads in both orientations, and I think that is the main reason for the loss of reads at the filtering step. It is only my opinion, but the cause is probably on the sequencing center's side, or was somehow introduced at the PCR step (I once had one plate that gave a lot of low-quality reads while samples from other plates sequenced in the same run were fine, which is very strange to me).
In any case, the fact that the filters removed those reads will lead to less bias in the data than if they had remained in the analysis.