How to remove primers with different lengths

Lei · May 17, 2019, 2:41pm

Hello Mehrbod,

Thank you so much for your extensive explanation, @Mehrbod_Estaki! It answered most of my questions. I ran the DADA2 analysis and had some issues with RAM, as I posted here. After increasing the RAM, I got through DADA2 successfully and finished some basic analysis of my own data, following the "moving pictures" tutorial.
I tried two different runs:
Run 1: --p-trunc-len-f 0 --p-trunc-len-r 230
Run 2: --p-trunc-len-f 220 --p-trunc-len-r 200 (as you suggested)

I did not get good results from run 1: afterwards, around half of my samples had fewer than 500 reads!
For run 2 I followed the parameters you suggested and got reasonable results after DADA2 denoising. All the read counts make sense to me now. My takeaway is to always remove the poor-quality data rather than trying to keep it. If you do not remove it, you will lose more reads at the denoising step.
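
To make the two settings concrete, here is a toy Python sketch of what a truncation length does to a set of reads. It is only an illustration under stated assumptions, not DADA2's or QIIME 2's actual code: `truncate_reads` and the example reads are made up. It relies on the documented behaviour that a truncation length of 0 disables truncation, and that reads shorter than a non-zero truncation length are discarded.

```python
def truncate_reads(reads, trunc_len):
    """Toy model of fixed-length truncation (hypothetical helper).

    trunc_len == 0 -> reads are returned unchanged (no truncation).
    trunc_len > 0  -> reads shorter than trunc_len are dropped,
                      the rest are cut to exactly trunc_len bases.
    """
    if trunc_len == 0:
        return list(reads)
    return [read[:trunc_len] for read in reads if len(read) >= trunc_len]


# Made-up reads of different lengths, for illustration only.
reads = ["ACGTACGT", "ACGTAC", "ACG"]

untruncated = truncate_reads(reads, 0)  # lengths stay mixed
truncated = truncate_reads(reads, 5)    # short read dropped, rest cut to 5

print(sorted({len(r) for r in untruncated}))
print(sorted({len(r) for r in truncated}))
```

With a truncation length of 0 the reads keep their original, differing lengths, which would be consistent with seeing a message about sequences not all being the same length in run 1.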

I still have some follow-up questions:

When I ran DADA2 with the run 1 parameters, the message "all the sequence are not the same length" appeared repeatedly. I assume it is not an error message and does not affect the process. Is this message just informing me that my sequences have different lengths, and is it true that these different lengths won't hurt the denoising step?
The second question is about read counts. One of my samples had a very low read count, around 2000, in the raw fastq data (all other samples had >10000). In my run I kept this sample and included it in the DADA2 analysis. As I understand it, DADA2 uses some kind of machine learning algorithm that is trained on the data. So I am wondering whether it makes a difference to keep this low-read sample in the DADA2 run versus removing it from the dataset beforehand. I think that if DADA2 processes each sample individually, this low-read sample won't affect the results. Is that right?
Regarding read counts, what threshold would you recommend? A former colleague told me that he usually discards samples with fewer than 5000 reads. Do you think this is a reasonable value?
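
For reference, the kind of cutoff my colleague described could be sketched in Python as below. This is a minimal sketch under assumptions: `split_by_depth`, the sample names, and the exact counts are hypothetical, chosen only to mirror the situation above (one sample near 2000 reads, the others above 10000, and a cutoff of 5000). It does not say what the right threshold is.

```python
def split_by_depth(read_counts, min_reads):
    """Split samples into kept and dropped by a minimum read count.

    read_counts: dict mapping sample ID -> number of reads.
    min_reads:   samples with fewer reads than this are dropped.
    """
    kept = {s: n for s, n in read_counts.items() if n >= min_reads}
    dropped = {s: n for s, n in read_counts.items() if n < min_reads}
    return kept, dropped


# Hypothetical per-sample read counts, for illustration only.
counts = {"sample_A": 12500, "sample_B": 2000, "sample_C": 18300}

kept, dropped = split_by_depth(counts, min_reads=5000)
print("kept:", sorted(kept))
print("dropped:", sorted(dropped))
```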

I saw that Mark compared DADA2, Deblur, and VSEARCH in QIIME 2 in this post. I plan to run the same comparison on my own data next and will post a follow-up.
Many thanks