Can someone explain this to me like I'm five please? It has been years since I have done any 16S analysis and I'm still trying to shift from OTU to denoising.
I will have 900 16S samples split across 5 runs (only have 1 batch right now). We're using primers F515/R806.
I used cutadapt prior to converting to qiime format. Trimmed the primers/adapters and poor quality bases. All my reads are 250bp (apart from under 2% of forward reads, which are 70bp). My thinking with this was that I can run the same cutadapt command across all runs and then don't need to faff around with the trunc length in DADA2 (this may be completely wrong).
However, now looking at the forums I think I should have just input my raw reads into qiime2 and only trimmed the 19 bp off each end for the primer then straight into DADA2 for denoising. I am going round in circles trying to think of the best way. I'm sort of on my own with this so can't ask any colleagues what they would do.
Any help would be greatly appreciated.
I assume that you mean 5 sequencing runs, right? Or 1 batch - 1 sequencing run?
You can use cutadapt inside of qiime2 as well.
Technically, it is possible to just trim the first 19 bp off each read. However, this has a disadvantage: it can artificially inflate diversity metrics, because primers don't necessarily start and end at the same positions in every read, so trimming a fixed 19 bp will produce different ASVs from otherwise identical sequences.
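A toy illustration of that point, as a minimal sketch only: the primer and reads are made-up sequences (not the real 515F primer), and trim_fixed / trim_at_primer are hypothetical helpers, not cutadapt's actual algorithm. Both reads carry the same insert, but one has a single extra base before the primer.

```python
PRIMER = "ACGTACGTAC"        # hypothetical 10 bp primer, for illustration only
INSERT = "TTGGCCAATTGGCCAA"  # the same biological sequence in both reads

read_a = PRIMER + INSERT        # primer starts at position 0
read_b = "N" + PRIMER + INSERT  # primer starts at position 1

def trim_fixed(read, n):
    """Remove a fixed number of bases from the start of the read."""
    return read[n:]

def trim_at_primer(read, primer):
    """Remove everything up to and including the first exact primer match."""
    start = read.find(primer)
    if start == -1:
        return None  # primer not found
    return read[start + len(primer):]

# Collect the distinct sequences each strategy produces from the two reads.
fixed = {trim_fixed(r, len(PRIMER)) for r in (read_a, read_b)}
anchored = {trim_at_primer(r, PRIMER) for r in (read_a, read_b)}

print("distinct sequences after fixed-length trimming:", len(fixed))
print("distinct sequences after primer-anchored trimming:", len(anchored))
```

Fixed-length trimming leaves a stray primer base on the second read, so the two reads no longer match; anchoring on the primer recovers the same insert from both.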
To summarize:
For each sequencing run (with the same settings!):
1. Run cutadapt to remove primers.
2. Run dada2 (again, same settings).
After that, you can merge the feature tables and representative sequences from all runs, then proceed with the merged files (taxonomy, diversity).
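The summary above could be scripted roughly like this. It is only a sketch: it builds and prints the qiime command lines rather than running them, and the file names, run names, primer placeholders and truncation lengths are all assumptions to be replaced with your own values. The point is that one settings dictionary is shared by every run.

```python
import shlex

def run_commands(run, fwd_primer, rev_primer, trunc_f, trunc_r):
    """Build the cutadapt and DADA2 commands for one sequencing run."""
    cutadapt = [
        "qiime", "cutadapt", "trim-paired",
        "--i-demultiplexed-sequences", f"{run}-demux.qza",
        "--p-front-f", fwd_primer,
        "--p-front-r", rev_primer,
        "--p-discard-untrimmed",
        "--o-trimmed-sequences", f"{run}-trimmed.qza",
    ]
    dada2 = [
        "qiime", "dada2", "denoise-paired",
        "--i-demultiplexed-seqs", f"{run}-trimmed.qza",
        "--p-trunc-len-f", str(trunc_f),
        "--p-trunc-len-r", str(trunc_r),
        "--o-table", f"{run}-table.qza",
        "--o-representative-sequences", f"{run}-rep-seqs.qza",
        "--o-denoising-stats", f"{run}-stats.qza",
    ]
    return [cutadapt, dada2]

def merge_commands(runs):
    """Build the commands that merge per-run tables and sequences."""
    merge_tables = ["qiime", "feature-table", "merge"]
    merge_seqs = ["qiime", "feature-table", "merge-seqs"]
    for run in runs:
        merge_tables += ["--i-tables", f"{run}-table.qza"]
        merge_seqs += ["--i-data", f"{run}-rep-seqs.qza"]
    merge_tables += ["--o-merged-table", "merged-table.qza"]
    merge_seqs += ["--o-merged-data", "merged-rep-seqs.qza"]
    return [merge_tables, merge_seqs]

RUNS = ["run1", "run2"]  # hypothetical run names
SETTINGS = dict(         # placeholders: identical for every run
    fwd_primer="FWD_PRIMER_SEQ",
    rev_primer="REV_PRIMER_SEQ",
    trunc_f=200,
    trunc_r=180,
)

commands = []
for run in RUNS:
    commands += run_commands(run, **SETTINGS)
commands += merge_commands(RUNS)

for cmd in commands:
    print(shlex.join(cmd))
```

When the next batch comes back, adding its name to RUNS reuses exactly the same primer and truncation settings before the merge.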
Hi,
The 900 samples are being sequenced in 5 batches. Whenever we have enough DNA to make up a lane we send it off, so we are just submitting them for sequencing on a rolling basis.
Thank you for the explanation. I've gone back to my raw sequences and run qiime cutadapt. Do I need to use the same --p-trunc-len each time I run dada2 with the next batch?
I would try my best to find one set of parameters that works for all runs. Some deviation is allowed, though: if the reads still overlap, small differences in truncation length do not affect the ASVs.
But I play it safe and use absolutely identical settings that work for all runs.
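To check whether a given pair of truncation lengths still leaves the reads overlapping, a rough calculation like the one below can help. It is a sketch under assumptions: the 253 bp amplicon length is a typical figure for this region after primer removal (it varies between taxa), the truncation values are made up, and 12 nt is DADA2's default minimum overlap for merging.

```python
AMPLICON_LEN = 253  # assumed typical amplicon length after primer removal
MIN_OVERLAP = 12    # DADA2's default minimum overlap when merging pairs

def overlap(trunc_f, trunc_r, amplicon_len=AMPLICON_LEN):
    """Bases shared by the forward and reverse reads after truncation."""
    return trunc_f + trunc_r - amplicon_len

# Hypothetical candidate truncation lengths (forward, reverse).
for trunc_f, trunc_r in [(200, 180), (190, 170), (140, 120)]:
    n = overlap(trunc_f, trunc_r)
    status = "ok" if n >= MIN_OVERLAP else "too short to merge"
    print(f"trunc-len-f={trunc_f} trunc-len-r={trunc_r} overlap={n} ({status})")
```

As long as the overlap is comfortably above the minimum, the merged read covers the whole amplicon wherever each read was cut, which is why small differences in truncation length leave the ASV sequences unchanged.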
I am in a similar situation with samples from an ongoing project being sequenced in multiple batches. I used dada2 for each sequencing run and then merged afterwards. I'm wondering about the recommendation for denoising and dereplicating with the same settings for each batch. If the dada2 error models (and the issues that they are attempting to correct) are unique to each run, shouldn't the inputs also be tailored to each run? In my case, there isn't any issue using the same inputs, I'm just curious why this is recommended.
That makes sense if you are not going to compare samples from different runs.
There are two main issues when comparing different runs.
1. Each run has a unique error profile. When runs are denoised together, the error model is biased towards the larger run and performs worse on the smaller one. That is why it is recommended to denoise each run separately.
2. ASVs are unique DNA sequences, so even a 1 nt difference will split sequences into different ASVs with different ids. Non-identical settings for cutadapt and DADA2 increase the chance of technical, non-biological separation of ASVs from different runs. This inflates diversity metrics and can make samples cluster by batch.
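The second issue can be shown with a small sketch. By default, QIIME 2's DADA2 plugin names each feature with an MD5 hash of its sequence, so the same ASV trimmed one base differently in another run gets a different id. The sequence, counts and the merge helper below are made up for illustration; merge is a stand-in for summing counts by feature id, not the plugin's own code.

```python
import hashlib

def feature_id(seq):
    """MD5 hash of the sequence, used here as the feature id."""
    return hashlib.md5(seq.encode("ascii")).hexdigest()

ASV = "TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGG"  # made-up sequence

# Per-run feature tables reduced to {feature id: read count}.
run1 = {feature_id(ASV): 120}
run2_same = {feature_id(ASV): 95}         # run 2 trimmed identically
run2_shifted = {feature_id(ASV[1:]): 95}  # run 2 trimmed 1 nt differently

def merge(a, b):
    """Sum counts for features sharing an id, keeping all other features."""
    merged = dict(a)
    for fid, count in b.items():
        merged[fid] = merged.get(fid, 0) + count
    return merged

print("features with identical trimming:", len(merge(run1, run2_same)))
print("features with shifted trimming:", len(merge(run1, run2_shifted)))
```

With identical trimming the two runs collapse into one feature; with the 1 nt shift the merged table holds two features, each present in only one run, which is the batch-driven separation described above.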