I ran into some problems when running DADA2. After importing my paired-end data and running DADA2, some samples had very few reads while others had many, and the number of samples dropped from 217 to 180. Because of this, I kept all the sequences when denoising, but the result is still incorrect. What is the reason?
What do you mean by "the sample size was reduced"? You lost all reads from certain samples? What do you mean by "the result is incorrect"? Some results may be less desirable than others, but outcomes usually aren't deemed correct or incorrect.
Can you attach your demux.qzv and dada2-stats.qzv?
Thank you for your question. Before DADA2 denoising I had 217 samples, but after quality control and denoising only a little over 100 samples remained, and some samples had all of their reads filtered out. Furthermore, some of the remaining samples have only a few thousand reads, while others have tens of thousands. Is this because my parameters are not set correctly? My parameters are
demux.qzv (315.0 KB)
denoising-stats.qzv (1.2 MB)
Reviewing denoising stats, what I see is that most of the samples show normal passage of reads through each step, but about 1/8 of the samples have all reads removed at the filtering step. That is not something that is ordinarily observed. This strongly suggests there is something about those 1/8th of the samples that is causing them to fail the bioinformatics.
I would suggest looking into the subset of samples that drop to 0 at the filtering step, and asking what characteristic they might have that separates them from the majority of samples that proceed as normal.
Is there a biological difference? Is there a technical library preparation difference?
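One way to pull out that subset is to export the denoising stats as a TSV and list the samples whose read count is non-zero on input but zero after filtering. The following is only a minimal sketch: it assumes the exported table has columns named `sample-id`, `input`, and `filtered`, plus an optional `#q2:types` row, and the sample IDs and counts in `example` are made up for illustration.

```python
import csv
import io


def samples_lost_at_filtering(stats_tsv_text):
    """Return IDs of samples that had input reads but none after filtering.

    Assumes tab-separated text with 'sample-id', 'input' and 'filtered'
    columns (an assumption about the exported stats table).
    """
    reader = csv.DictReader(io.StringIO(stats_tsv_text), delimiter="\t")
    lost = []
    for row in reader:
        if row["sample-id"].startswith("#"):
            continue  # skip a directive row such as "#q2:types"
        if int(row["input"]) > 0 and int(row["filtered"]) == 0:
            lost.append(row["sample-id"])
    return lost


# Made-up example data, not taken from the attached files.
example = (
    "sample-id\tinput\tfiltered\n"
    "#q2:types\tnumeric\tnumeric\n"
    "S1\t41000\t36500\n"
    "S2\t39000\t0\n"
    "S3\t52000\t47800\n"
)

print(samples_lost_at_filtering(example))
```

The resulting list of IDs can then be compared against the sample metadata (run, primer set, extraction batch, and so on) to see what the dropped samples have in common.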
Thank you for your answer! This is the first time I have used QIIME 2 for analysis. After this problem occurred, I set the parameters --p-trunc-len-f and --p-trunc-len-r to 0, that is, I did not truncate my sequences and ran DADA2 denoising directly. The result was normal and the number of samples did not decrease. The minimum number of reads per sample is more than 30,000. Although there seems to be no problem with the output so far, I would like to ask why whole samples and so many reads were filtered out before. Do you know the specific reason?
Filtering and sequence truncation are performed primarily to improve the quality of the sequences. In particular, in Illumina sequencing the quality often falls off substantially at the ends of reads (particularly reverse reads), so it is beneficial to truncate before the quality crashes (while maintaining enough read length to overlap paired-end reads).
As to why samples can get dropped, this happens if all reads in the sample failed the filtering step -- i.e. there were zero reads left in the sample after filtering.
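To make that concrete, here is a toy model of the filtering step. It is not DADA2's actual code, just a sketch of two rules: a read shorter than the truncation length is discarded rather than kept, and a read whose expected errors exceed a threshold is discarded. The function names, the `max_ee=2.0` value, and the example reads are all assumptions chosen for illustration.

```python
def expected_errors(quals):
    """Sum of per-base error probabilities for a list of Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


def filter_reads(reads, trunc_len=0, max_ee=2.0):
    """Toy filter: truncate to trunc_len, then apply an expected-error cap.

    reads is a list of (sequence, quality_scores) pairs.
    trunc_len=0 means no truncation and no length-based discarding.
    """
    kept = []
    for seq, quals in reads:
        if trunc_len > 0:
            if len(seq) < trunc_len:
                continue  # shorter than the truncation length: discarded
            seq, quals = seq[:trunc_len], quals[:trunc_len]
        if expected_errors(quals) > max_ee:
            continue  # too many expected errors: discarded
        kept.append((seq, quals))
    return kept


# Hypothetical samples: one with short reads, one with longer reads.
short_sample = [("ACGTACGT", [38] * 8), ("ACGTAC", [38] * 6)]
long_sample = [("ACGTACGTACGT", [38] * 12)]

print(len(filter_reads(short_sample, trunc_len=10)))  # reads surviving with truncation
print(len(filter_reads(short_sample, trunc_len=0)))   # reads surviving without truncation
print(len(filter_reads(long_sample, trunc_len=10)))
```

In this model, a sample whose reads are all shorter than the chosen truncation length ends up with zero reads, which would be consistent with the samples reappearing once both truncation lengths were set to 0.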
Thanks for your answer. Now I understand that this may be the correct way to use DADA2: when my sequence quality is high enough, I do not need to filter and truncate the sequences, and I only need to filter and denoise when the sequence quality is poor.