I have a problem with DADA2 (version 2019.4.0) in QIIME2 (qiime2-v2019.04) in conda: the merge results are poor. The region is V3+V4, and the primers are 341F-806R. The primers are still attached to the sequences. I took 3 samples for a test. The original demux.qzv file is attached.
I used the following script in QIIME2:
Working with longer sequences (--p-trunc-len-f 300 --p-trunc-len-r 289) gave worse results.
I tried to use DADA2 in R for further possibilities, but it did not help.
I ran DADA2 on the forward files only, with --p-trunc-len 286, and it doubled the recovery, but the sequences were shorter: 266 bp.
Are these results acceptable?
Is the sequencing OK?
What can I do to improve the results?
The denoising stats look fine except for the chimera removal, which resulted in the loss of most reads. What are the lengths of your forward and reverse primers? Are they both 20 bases? The commonly used V3-V4 forward and reverse primers are 17 and 21 bases long.
In the stats table, the numbers for the last sample are: input 45,586, filtered 37,210, denoised 31,013, merged 18,938, non-chimeric 17,263. The recovery percentage, as I calculated it, is 38%. So it looks like the problem is "merge" and not "chimeras". The same pattern appears in the other 2 samples.
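To make the point about where the reads go easier to check, here is a small Python sketch that computes the retention at each step from the counts quoted above. The function name and the dictionary layout are just illustrative choices, not anything QIIME2 produces.

```python
# Read counts for the last sample, copied from the post above.
counts = {
    "input": 45586,
    "filtered": 37210,
    "denoised": 31013,
    "merged": 18938,
    "non-chimeric": 17263,
}


def stepwise_retention(counts):
    """Return (step, percent of reads kept relative to the previous step)."""
    steps = list(counts.items())
    return [
        (name, round(100 * n / prev_n, 1))
        for (_, prev_n), (name, n) in zip(steps, steps[1:])
    ]


retention = stepwise_retention(counts)
overall = round(100 * counts["non-chimeric"] / counts["input"], 1)

for name, pct in retention:
    print(f"{name}: {pct}% of the previous step")
print(f"overall: {overall}% of input")
```

The step with the lowest retention is "merged" (about 61% of the denoised reads survive), while chimera removal keeps about 91%, which agrees with the 38% overall recovery.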
The primer lengths are probably as you said, but I was told this is not very important. Is it?
Sorry about the confusion. I thought the number in bold was the sequence count after chimera removal. You're right that you lost quite a lot of reads during merging. Because mismatches between the forward and reverse reads are not allowed after denoising, using longer forward/reverse truncation lengths may give worse results. Your reads are of high quality. Based on the quality plot, maybe you can try --p-trunc-len-f 290 with --p-trunc-len-r 261 or 287 (median quality score 30 or 20, respectively)? You'll still have a merged sequence length of ~510/~536 bp.
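As a rough check on whether a pair of truncation lengths still leaves enough overlap to merge, the arithmetic can be sketched in Python. The amplicon length of 465 bp (primers included) is an assumption for a 341F-806R product, not a value from this thread, and real amplicons vary in length; 12 is the default minimum overlap of mergePairs in the DADA2 R package. The function name is hypothetical.

```python
AMPLICON_LEN = 465  # ASSUMPTION: approximate 341F-806R amplicon, primers included
MIN_OVERLAP = 12    # default minOverlap of mergePairs in the DADA2 R package


def expected_overlap(trunc_f, trunc_r, amplicon_len=AMPLICON_LEN):
    """Overlap in bases between truncated forward and reverse reads.

    Assumes both reads start at the primers, so the truncation lengths
    are counted from the ends of the amplicon.
    """
    return trunc_f + trunc_r - amplicon_len


# Truncation settings mentioned in this thread.
for trunc_f, trunc_r in [(290, 261), (290, 287), (300, 289)]:
    overlap = expected_overlap(trunc_f, trunc_r)
    verdict = "enough" if overlap >= MIN_OVERLAP else "too short"
    print(f"f={trunc_f} r={trunc_r}: overlap of about {overlap} bases ({verdict})")
```

Under this assumption all three settings overlap by far more than the minimum, so the loss at the merge step would more likely come from mismatches in the overlap region than from reads that fail to reach each other.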
In my experience, most samples retain more than 50% of their reads after denoising with the DADA2 pipeline. But if samples have really bad read quality towards the 3' end, you may see a huge loss of reads during merging because of mismatches or insufficient overlap. Below are DADA2 stats from one of our studies for your reference, but they may differ greatly between studies.
For the primer length, I assume it's best to trim off the exact lengths, as the presence of primer sequence may affect chimera detection. The new QIIME2 taxonomic assignment method uses machine learning to train the model on the amplicon without primer sequences, so I guess it's best to trim off exactly the primer sequence. But I may be wrong. The QIIME2 developers may have better answers.
I'm not sure whether 3 samples are enough for DADA2 to learn the error rates. Maybe also try a larger number of test samples, say 6?
Losing a large fraction of your reads is not necessarily a bad thing. Check this comment by Dr. Benjamin Callahan. You can run a rarefaction analysis to find out whether the remaining sequences are enough for the downstream analysis in most of your samples. If they are, then you're fine.
The loss of a large fraction of reads is more likely caused by the data than by the DADA2 pipeline. To verify that, you can download some similar data that includes a mock community, denoise the sequences with the QIIME2 version you're using, and see what you get. If the mock community looks as it should, then DADA2 is not the problem.
As mentioned above, a positive control helps you find out whether something is wrong with the bioinformatics. You can test your ideas on the mock sample if you happened to include one in your sequencing run. If not, you may want to go ahead with both approaches and judge the results using your knowledge of your samples. Use the one that makes sense to you.
I compared the best paired-end (F+R) DADA2 results with the single-end (F only) DADA2 results.
The bar plots were very different. F+R is much more detailed than F, and the distribution of taxa is different.
I do not know what to choose. taxa-bar-plots260.220.qzv (410.8 KB) taxa-bar-plots.F.qzv (331.7 KB)
As the number of sequences used for computing alpha-diversity increases, the observed alpha-diversity increases as well. If a rarefaction curve reaches a plateau at a certain sequencing depth, that indicates the sequencing depth is sufficient to uncover the taxonomic composition of that sample. Based on your rarefaction plot, the observed OTUs seem to level out from about 8000 reads. Thus, 12000 sequences seem to be sufficient for your downstream analyses.
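For anyone unfamiliar with how a rarefaction curve is built, here is a minimal Python sketch of the idea: subsample reads to increasing depths and count how many distinct features are seen. The toy sample below is entirely hypothetical (it is not the data from this thread), and the function is an illustration of the principle, not what QIIME2 runs internally.

```python
import random


def rarefaction_curve(feature_counts, depths, seed=0):
    """Observed features at each depth, using one random ordering of reads.

    feature_counts maps a feature ID to its read count. Depths larger than
    the total number of reads are skipped.
    """
    reads = [feat for feat, n in feature_counts.items() for _ in range(n)]
    rng = random.Random(seed)
    rng.shuffle(reads)
    return [(d, len(set(reads[:d]))) for d in depths if d <= len(reads)]


# HYPOTHETICAL toy sample: 5 abundant features plus 100 rare ones (12000 reads).
abundances = [5000, 3000, 2000, 1000, 500] + [5] * 100
toy_sample = {f"feature{i}": n for i, n in enumerate(abundances)}

curve = rarefaction_curve(toy_sample, depths=[1000, 2000, 4000, 8000, 12000])
for depth, observed in curve:
    print(f"depth {depth}: {observed} observed features")
```

When the number of observed features stops growing as the depth increases, additional reads are mostly re-sampling features that have already been seen, which is what a plateau in the plot means.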
For the taxa bar plots, you got very different taxonomic compositions. It's expected that the paired-end reads give you a higher taxonomic resolution, as the sequences used for classification are longer than those from the forward reads only.
However, it's alarming that the number of taxa and their relative abundances differ so much between the paired-end reads and the forward reads only. You may want to check your workflow to see whether anything was not done correctly.
The number of observed OTUs in your samples is quite high. Another thing you can do to help make the decision is to go through papers describing the microbiota profile of samples similar to yours: how diverse the biota is and what the dominant taxa are. I believe the known findings can help you judge the quality of your results.