As part of switching from QIIME to QIIME2, I've been reanalyzing some of our older data-sets in QIIME2.
No matter which data-set I choose, there is a huge number of reads (85% to 90%) that are being dropped at the filtering and merging steps. I've read the following thread, and I understand what is being suggested there.
Maybe I'm just not using the right combination of parameters in Dada2, but I feel I should be able to merge more than just 10% of the reads, considering I was always able to merge >85% of the reads using appropriate quality filtering (sickle) and read merging tools (panda) in the past. I've tried different levels of cut off from 200 bp (too short) all the way up to no cut-off at all and always ended up with very few merged reads.
We are using the V3V4 region of about 460 bp and have 2x300 bp reads. I attached the stats.tsv and input qzv files.
Thanks for providing us with your reports, makes troubleshooting much easier.
So it's very obvious from your stats report that the issue is originating from the quality filter step where you lose most of your reads.
One thing I notice is that when I hover the mouse over your quality plots, as early as the 38bp position you have less than 10,000 reads which suggests that the majority of your reads are actually much shorter than we expect. So my initial question is have these reads gone through any sort of quality-control/trimming prior to importing into qiime2? If so, can you import the non-QCed ones instead? This is what DADA2 expects.
As for your truncating parameters the rule you want to follow is to make sure at least 20bp overlap exists for proper merging (I would say up to that 30bp to take in account for natural variation) which means your total truncating should not be more than 110 in total (2x300bp - 460bp amplicon + 30bp overlap = 110bp) between both directions. Unfortunately looking at your reverse primers, the quality seems to drop rather early which in my opinion will make it harder to merge properly. In that case I think using the forward reads only (which seem to be in much better shape anyways) would be your best bet. All this is AFTER figuring out why so many of your reads are shorter than expected.
Deblur works on one direction of reads only, meaning you can provide it paired-end reads but it will simply ignore the reverse reads. If you merge the forward and reverse reads prior to deblur, let's say with q2-vsearch, it will denoise the full length. I should mention however that in this method the high quality scores of the overlap regions (as produced by consensus methods) are wasted since deblur doesn't use those scores, instead uses its own error model which assumes that the quality scores continuously reduce as the length increases.
I used the rawest possible reads for this analysis. Basically directly off the MiSeq without any processing. Looking at the FastQC report, the vast majority (>99.9%) of the reads are exactly 300 bp long. There are a very small number of reads that are around 40-49 bp. If I remember correctly from when I was analyzing this data, they mainly consisted of series of N and I assumed they were just a sequencing artifact. Then there are also a very small number of reads that are between 220-229 bp, which are probably sequenced primer dimer. I'm not sure why they would be over-represented in the quality plot though. Should I remove shorter read pairs before going into Dada2?
The run I used is relatively old and one of our worst runs quality wise. I've just had a look at some of our data from last year and it looks a lot better. Will give that a shot and see how I go.
This is going a bit beyond my original question... We've only ever used V3V4. Is it acceptable to use V3 only?
Sounds like Deblur is not the best option either for this.
Hopefully this shouldn't be a problem in our future research, if we can keep up the higher quality.
I need to backtrack a bit here, you're right that the majority of your reads are 300bp. I had multiple QC plot tabs open (was looking for similar examples to your case) and I must have looked at the wrong one before my reply.
But your observation regarding a series of N is curious
Good! Just make sure your barcodes/primer adapters are removed though, only biological nts should go through DADA2.
This bit is interesting, any more details on this? Or better yet if you could provide us with an example of these reads (or if you rather send the files directly, that works too)
Not needed as DADA2 automatically will discard shorter reads that don't match the truncating value you set.
A couple of more suggestions in troubleshooting this run (even though it sounds like you may have given up on it anyways ) First try running just the forward reads and relaxing the --p-max-ee DADA2 parameter, let's say to 5 instead of the default 2 to see if that helps. Also on the forward read be very conservative on your truncating value say set to 180. See how many reads pass the filter then. It's possible that the issue is mainly coming from the reverse reads and the correspond forward reads are tossed since we are running them as paired-end.
If you're asking about which primers set to use in your PCR then there's plenty of literature out there covering this topic. There's technically nothing wrong with using just the V3 region instead of V3-V4, it really comes down to your experiment and your goal. I will tell you that all primer sets have certain biases towards certain taxa so if you are looking for a particular taxa it might be worth looking through the literature carefully for those as well.
Deblur actually does very well in denoising, see this comparison paper for comparison of different denoising methods. I personally have just found it a bit too conservative when dealing with longer reads, but with shorter reads it does exceptionally well and is much faster than dada2 as well.
Hope this helps a bit!
I had a closer look at the raw data. Turns out the majority of the rare ~45bp fragments consist of the forward and reverse primers, plus a very short sequence in between (see attachment for a small selection). There is a very small number of just Ns. Not quite sure where the Ns are coming from. Maybe it's adapters with barcodes (hence the sequence being demultiplexed), with gibberish or nothing in between, but then, that would not explain how the cluster is identified and phased during the initial cycles of the run... hmm....
I'm always eager to learn more! And there are quite a few older data-sets that have this problem, which we haven't published yet. So I'm grateful for the additional suggestions. In my mind, I can't shake the thought that it's always better to go for the longer sequence for better taxonomic resolution, but I guess it can be better to go for a shorter sequence with higher quality and read coverage.
Actually, something you said in your initial reply made me rethink my trimming approach. So far, I've only used the same cut-off for both F and R. But you mentioned this:
So when I started using different cut-off values for my F and R reads (e.g. in this case 270 and 230), my final reads pretty much doubles from around 10-15% left to 25-35% left. Still not great, but better. And thinking about this now, this makes total sense, since the quality drop off of the forward read occurs much later, and this way I can maximize the overlap. I wish there was an inbuilt tool to quickly iterate through different combinations of F and R trimming, but lacking that I might just write a small script that can do that on a subset of my samples.
I was more asking about having sequenced V3V4, but only using V3. But in the end, you've answered what I wanted to know. I think one common problem is that we often don't have the time/money to extensively test which primers work best, so we default to the same primers we've always used in the past. Time to dive into the literature some more!
Hi @Roger_Huerlimann,
The short sequences are odd, and they could be a number of things, probably not worth losing sleep over since as you suggest they are rare and also DADA2 will just discard them anyways. Especially with N values, who knows what they can be or where they came from
This is a very common belief, one that I personally was too obsessed with but the more data I go through the more I'm aligning myself with quality over quantity. Besides, the difference in resolution is probably not as much as you think it is. See Fig1 of this Wang et al. 2007 paper.
That's a good start! Though I think its still a bit on the lower end side of things. The important bit of info would be to see how many reads you retain using only the forward reads. That should give us a good idea of how many reads we should expect and the difference we can peg on issues with using paired-end.
Now that'd be a handy tool to have! Let us know how that works out for you, I'm sure it would be of use to whole community if it works.
If you have good quality longer (V3-V4) sequences I say use those and don't bother trimming them shorter, unless you have a reason to, say to compare them to other V3 data, even then we now have some more useful tools than discarding data (i.e. q2-fragment-insertion)