Back again with another head-scratcher for myself. I have a new set of data from a low-biomass microbiome. I sequenced the V3 (CCTAYGGGRBGCASCAG) - V4 (GGACTACNNGGGTATCTAAT) region and received my sample data back (~100 samples). I have run this pipeline before, so I was using my previous code and just making updates where needed.
First, I import with a manifest file (qiime tools import) and then trim off the primers (qiime cutadapt trim-paired).
When I check the demux summarize visualization, the sequences look good (right?).
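For reference, the import and trimming steps look roughly like this - the file names and the manifest format are placeholders/assumptions rather than my exact code:

# import paired-end reads listed in a manifest file
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest.tsv \
  --input-format PairedEndFastqManifestPhred33V2 \
  --output-path demux.qza

# trim the V3-V4 primers off the forward and reverse reads
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f CCTAYGGGRBGCASCAG \
  --p-front-r GGACTACNNGGGTATCTAAT \
  --o-trimmed-sequences trimmed.qza

# check per-position quality after trimming
qiime demux summarize \
  --i-data trimmed.qza \
  --o-visualization trimmed.qzv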
I can see that everything is getting filtered out, which is why the command finishes so quickly...
but my sequences look to be good quality and I don't think I am over- or under-truncating (am I?). The trunc values I chose are the positions just before the median Q score drops below 30. So why would it drop almost all of the sequences?
This data is from a low-biomass microbiome, and I do seem to have a very high number of sequences back from sequencing, but I would expect those to drop at a later step.
Question: What exactly happens in the filter step? I assumed this just means anything beyond my chosen truncation position would be dropped. Is there something going on behind the scenes that I am not aware of? Can anyone spot my mistake? Should I not be assuming my data is good quality just because I have a good Q score for most of it?
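For reference, the denoising command I have been running looks roughly like this - the truncation values and artifact names below are placeholders, not my exact ones:

# denoise with DADA2 (trunc values shown are placeholders)
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed.qza \
  --p-trunc-len-f 290 \
  --p-trunc-len-r 250 \
  --p-n-threads 15 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza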
Hi @Labm
I am not sure what happened with your dataset at the filtering step, but could you try decreasing the --p-trunc-len-f value? For example, set it to 280 or lower and see whether it affects the output, since DADA2 will discard sequences that are shorter than this value. I would play with this parameter for both forward and reverse reads, if necessary, to choose optimal values.
--p-trunc-len-f INTEGER
Position at which forward read sequences should be
truncated due to decrease in quality. This truncates
the 3' end of the input sequences, which will
be the bases that were sequenced in the last cycles.
Reads that are shorter than this value will be
discarded. After this parameter is applied there must
still be at least a 12 nucleotide overlap between the
forward and reverse reads. If 0 is provided, no
truncation or length filtering will be performed
Thank you for your reply, Timur. Much appreciated. Since posting, I started a run with a very small overlap. Assuming my expected amplicon is 470 bp, then after removing the primers the amplicon would be 433 bp. I then need a minimum of 12 bp of overlap, so I was aiming for 445 bp in total.
So for this run I chose --p-trunc-len-f 275 and --p-trunc-len-r 170. It is still running (thankfully), so I don't know the results yet; I will post them when it finishes.
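To spell out the maths behind those numbers:

expected amplicon:                      470 bp
minus primers (17 bp + 20 bp):          470 - 37 = 433 bp
minimum merge overlap:                  12 bp
target for trunc-len-f + trunc-len-r:   433 + 12 = 445 bp
chosen:                                 275 + 170 = 445 bp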
Position at which forward read sequences should be truncated due to decrease in quality. This truncates the 3' end of the input sequences, which will be the bases that were sequenced in the last cycles.
Okay, that makes sense; sequence quality gets worse towards the end of the reads, in the later sequencing cycles. This is what I assumed was happening.
Reads that are shorter than this value will be discarded.
Based on your reply and this sentence, it sounds like I have been thinking about the sequences and the truncation values the wrong way. I always thought that if you had better Q scores you should truncate less of your sequence, so the merging would be more accurate. But is it actually better to truncate as much of your sequence as possible? Because if you select a larger trunc value, then your shorter sequences will be discarded.
I didn't realise that sequences shorter than your chosen trunc value were being discarded; I thought that if they were shorter they would just not be truncated but would still remain - what is the reasoning behind this?
As a quick follow-up, this is my output after changing the trunc values as described in my previous reply. I am concerned because the command still finished much quicker than expected (~4 hours), and since DADA2 is known to take a really long time, this is a bit of a head-scratcher (15 CPUs; 65 GB used). I would appreciate any tips on analysing this output (only showing the top half, but the trend is similar throughout). A lot of sequences were removed, but I would say I had more than expected to begin with (low biomass, human sample), so perhaps this is why.
Sorry to step on your toes, @timanix! Just thought I'd address this small part of the question.
DADA2 is much faster than it was a couple years ago. My runs dropped from multiple days to ~1 hour a while back. Your denoising stats are a much better indicator of outcomes than run time.
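If you want to dig into those stats, something like this will render the denoising-stats artifact as a readable table (assuming your stats output is named denoising-stats.qza):

# view per-sample read counts at each denoising stage
qiime metadata tabulate \
  --m-input-file denoising-stats.qza \
  --o-visualization denoising-stats.qzv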
Thanks, Chris, for the reply. To clarify, the last time (and first time) I ran DADA2 was only a year ago. However, it's really good to know I shouldn't be worrying about the journey, only the outcome, in this case haha. I am wondering if the reason it took so long with my last dataset was that I didn't truncate my sequences much - so I made it harder to merge sequences, not easier as I thought. I think I will have to go back to that old data, try a much smaller overlap, and see what the stats say, to understand this step better.
@Labm, how much variation in sequence length do you expect from your samples? Fungal ITS sequences, for example, can be highly variable in length, but for some common bacterial 16S regions, length is pretty consistent, varying by maybe 3-6 bp, with the majority of sequences at exactly the same length (at least in the GI samples I've worked with).
If sequence length variation is minimal in your samples, as it often is with my data, you can probably set yourself a lower bound for the total number of positions retained after truncation of expected amplicon length + 12 (overlap) + 1/2 the expected length variation, as in the quick example below. So long as you're not biasing your data by systematically dropping shorter reads, or breaking things by having forward or reverse reads longer than your expected amplicon length, you have a lot of flexibility to set the truncation parameters wherever you get the best retention.
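To make that concrete with the numbers from this thread (and a purely hypothetical ~6 bp of length variation, since I don't know the actual spread for your region):

primer-trimmed amplicon:                      433 bp
minimum overlap:                              12 bp
1/2 length variation:                         ~3 bp
lower bound for trunc-len-f + trunc-len-r:    433 + 12 + 3 = 448 bp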
There's a relatively detailed section on this in the DADA2 paper - search for the word "Filtering", or better, read the whole thing - it's worth understanding.
I'm not exactly sure why we "drop reads shorter than the truncation length", but maybe we can hash it out together. Let's take, for example, a 2x150 run of an amplicon we expect to be exactly 250 bp long. Presumably, a successful forward or reverse read would be 150 nucleotides long. We want to truncate some low-quality positions, taking 10 off the 3' end of the forward and reverse reads.
If we expect our amplicons to be much longer than 150 bp, what does it mean if these forward/reverse reads are only 130 bp long when they come off the sequencer? Why would that happen, and are they trustworthy? I don't know enough about the process to answer these questions, but my gut says there may be issues at play for these sequences during the extension step of PCR.