Hi there, I was wondering whether an initial q-score filtering step on my paired-end dataset is needed prior to DADA2 denoising.
To understand the difference with or without the initial q-score filtering, I ran two parallel processing pipelines on my data: 1) initial q-score filtering followed by DADA2 denoising; 2) straight to DADA2 denoising (no initial q-score filtering).
So my questions are:
The first processing method (with q-score filtering) dropped the reverse reads, turning my data into single-end instead of paired-end. Why did this happen?
Comparing the two methods, I noticed that the number of features from method 1 is double that from method 2. I would expect two filtering steps (q-score filter + DADA2 denoising) to filter out more sequences and features than just one (DADA2 denoising alone). Is there something wrong with my steps?
I also noticed that some of the features are quite low in frequency (around 2 to 20) and are observed in only a few samples (just 1 or 2). Given a sample size of 92, do I need to filter these features out before further processing to get more valid results?
Much appreciated if anyone can provide any help / advice!! Thank you so much!
Welcome to the forum! And great questions. Here are some comments though certainly not an exhaustive answer.
Assuming you used the quality-filter q-score plugin, you'll notice that it can take SampleData[SequencesWithQuality], SampleData[PairedEndSequencesWithQuality], or SampleData[JoinedSequencesWithQuality] as input, but it only outputs SampleData[SequencesWithQuality], meaning it operates on the forward reads only. This plugin was primarily designed to work with q2-deblur, so it fits the needs of that tool, which only operates on single-end reads. For DADA2, this initial quality filtering is not needed, and it may even interfere with DADA2's error-model building step, although when I have played with this myself in the past I found that the default q-score filtering doesn't affect DADA2 results all that much. There are various reasons why you don't need this step with DADA2: it already has a rather lenient built-in q-based filter, and more importantly it relies on a combination of maxEE filtering and user-chosen trim/truncation parameters to handle initial filtering. maxEE filtering > q-score based filtering; Robert Edgar has some nice readings/papers on the topic that you can start from here if you want to dig into it a bit more.
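So in practice you can feed your demultiplexed paired-end reads straight into DADA2 and let its own parameters do the filtering. A minimal sketch (filenames are placeholders, and the truncation/maxEE values are illustrative only — pick your own from your interactive quality plots; parameter names as in recent QIIME 2 releases):

```shell
# Run DADA2 directly on the demultiplexed paired-end reads --
# no separate quality-filter q-score step beforehand.
# Truncation lengths and maxEE thresholds below are examples, not
# recommendations; choose them from your own quality profiles.
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired.qza \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 200 \
  --p-max-ee-f 2.0 \
  --p-max-ee-r 2.0 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```

The denoising-stats output is worth checking afterwards — it tells you how many reads survived filtering, denoising, and merging at each step.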
Given that you are now not comparing apples to apples (single-end vs. paired-end reads), I don't think this comparison is really fruitful. We could sit and contemplate the differences for days because the two inputs are just so different. Not to mention we would have to put your DADA2 parameter selection under the microscope too — there are just too many variables to account for here.
That really depends on what those sequences are, how confident we are in their presence in your samples, and what type of analyses you are performing downstream. For things like differential abundance testing, those rare features don't provide any useful information and just add noise, so they are often filtered out; but for some analyses, like alpha/beta diversity, they can be important.
I personally use a positive filter first, which does a good job of getting rid of a lot of those weird-looking and rare sequences (which were likely off-target), and then I usually BLAST some of the remaining rare ones to see if they hit the expected target region or are something else entirely. Then I make a choice based on that. There's no right or wrong way to go about this and everyone has their own thoughts and methods, so you just have to pick one that best suits your question.
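As one possible concrete version of that workflow (filenames are placeholders and the thresholds are illustrative, not recommendations): a taxonomy-based positive filter to keep only features classified at least to phylum level, followed by an abundance/prevalence filter on the table:

```shell
# Positive (keep-only) filter: retain features with at least a
# phylum-level classification; unassigned/off-target features are dropped.
qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-include p__ \
  --o-filtered-table table-with-phyla.qza

# Then drop very rare features: here, anything with total frequency
# below 10 or seen in fewer than 2 samples (example thresholds only).
qiime feature-table filter-features \
  --i-table table-with-phyla.qza \
  --p-min-frequency 10 \
  --p-min-samples 2 \
  --o-filtered-table table-filtered.qza
```

Whatever thresholds you pick, it's worth recording them so the filtering is reproducible and defensible for your particular analysis.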
Hope this gets you started. Keep up the good questioning!