I have a question about which read lengths to accept and which to filter out.
The data I have is V4, 2 x 250 paired-end NovaSeq (so a lot of reads).
The overlap should be about 210 bp, and the total merged read length after denoising (dada2 denoise-paired) should be about 250 bp, excluding the 19 bp forward and 20 bp reverse primers (515F / 806R).
I gave this command:
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs WesternDietSertpaired-end-demux.qza \
  --p-trim-left-f 20 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 0 \
  --o-table tableWesternDietSert.qza \
  --o-representative-sequences rep-seqs-WesternDietSert.qza \
  --o-denoising-stats denoising-statsWesternDietSert.qza \
  --verbose
It gives me a large range of read lengths, 69-450, with a mean of 239.
Now which reads should I filter out? I could use only the 25-75% interval, a read length of 247-252? This way I lose 50% of my data, but I keep only the closest match.
Something that puzzles me is that I had previously mistakenly truncated 20 bp forward and reverse, in addition to trimming 20 bp forward and reverse. In that setting, despite cutting 20 bp of biological signal, my read statistics look more consistent, with the range being 210-408 and the 9-75% interval being 249-252. I would have thought that keeping more of each read would improve the read statistics. --p-trim-left-f 20 --p-trim-left-r 20 does mean I cut 20 bp at the beginning of the forward and reverse reads, right?
And --p-trunc-len-f 0 --p-trunc-len-r 0 is intended for cutting the end of the reads, which I have now opted out of (0 means no truncation).
With only trimming 20 bp at the beginning, I now have many more features (4040 vs 2814), so that seems to have improved things.
Note that the Q scores look good (above 30), so I don't have to cut anything at the end. I originally did it because I thought there was a reverse primer at the end of the reads, but there is not.
Thanks for your advice on filtering out read lengths.
Hi @MirjamBloem! Thanks for the detailed post, this is really helpful.
Taking a step back, what did the raw read lengths look like? Can you share some of the demux summarize results with us, in particular the read length distribution table? I am also curious about the denoising stats from DADA2 - that would be really helpful here, too.
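In case it helps, something along these lines should generate those summaries (I'm just reusing the file names from your command above, so adjust as needed):

qiime demux summarize \
  --i-data WesternDietSertpaired-end-demux.qza \
  --o-visualization WesternDietSert-demux-summary.qzv

qiime metadata tabulate \
  --m-input-file denoising-statsWesternDietSert.qza \
  --o-visualization denoising-statsWesternDietSert.qzv

The demux summary visualization includes the sequence length summary, and the tabulated denoising stats show how many reads survive filtering, denoising, merging, and chimera removal per sample.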
Regarding the above question and the unexpected variation in bp length, I wonder if these sequences should be taken care of, e.g. by removing them. Is this a common thing to do in QIIME 2?
Thanks @MirjamBloem - I am going to recategorize this as “General Discussion” - I misunderstood the initial post when I first replied and I thought that you were experiencing a problem, but it actually looks like you’re on the right track, and just need to spend some time getting up to speed on DADA2 - please let me know if I misunderstood. Thanks!
Hi @Joanna, it depends on the target gene and primer set used, but I am not sure I'd describe this as a common scenario. It is certainly possible to filter sequences by length, though! See here:
Indeed, I used this option to filter the sequences:
qiime feature-table filter-seqs \
  --i-data seqs.qza \
  --m-metadata-file seqs.qza \
  --p-where 'length(sequence) > 4' \
  --o-filtered-data filtered-seqs.qza
Though I wonder:
1. How can it be that when (mistakenly) eliminating the last 20 bp using the 'trunc' setting I get more consistent read lengths compared to not cutting the last 20 bp?
2. Where should I put the cut-off for filtering the reads? E.g. if you expect a read length of 250 bp, should you filter out anything below 200 and above 300? This way I would filter out 50% of the reads... i.e. in my opinion really too much.
It may also be that very short/long reads will not be classified and will be filtered out after taxonomy assignment? That would make it unnecessary to filter the reads after denoising based on the read length statistics.
@Nicholas_Bokulich thank you for your reply. So, in this particular example the V4 region is used, so the expected length would be around 250 bp. The percentiles (see the picture at the top of this discussion or below this comment) indicate a bigger range than expected. I wonder what you would do in this situation? Would you leave it as it is, or is it better to filter them out based on the expected length?
That variation is much more than usual, indeed. There is another related discussion somewhere on the forum (I can't find it right now) where it was mentioned that this variation is usually related to non-target DNA... e.g., chloroplast and mitochondrial 16S is shorter than bacterial, and sometimes some eukaryote 18S can be hit by the primers. So unless this is something that interests you, you should just filter out anything too long/short. I'd say filter anything < 240 nt or > 255 nt.
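As a rough sketch (using the same filter-seqs approach you quoted above, with your rep-seqs and table file names substituted in; double-check the names and thresholds for your own data), that could look like:

qiime feature-table filter-seqs \
  --i-data rep-seqs-WesternDietSert.qza \
  --m-metadata-file rep-seqs-WesternDietSert.qza \
  --p-where 'length(sequence) >= 240 AND length(sequence) <= 255' \
  --o-filtered-data rep-seqs-WesternDietSert-length-filtered.qza

qiime feature-table filter-features \
  --i-table tableWesternDietSert.qza \
  --m-metadata-file rep-seqs-WesternDietSert-length-filtered.qza \
  --o-filtered-table tableWesternDietSert-length-filtered.qza

The second command just keeps the feature table in sync with the length-filtered representative sequences.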
You are correct @MirjamBloem, it depends on what this is, but if it is non-target DNA (e.g., eukaryote) it will probably not classify, so removing it now will save you time.
Not sure! It could be that truncating the last 20 nt is causing the unusually long seqs to be filtered out at that stage (since they no longer join), but I can't explain the shorter seqs.
But it should not be 50% of the data. 240 is somewhere between the 9th and 25th percentiles (I'd suspect closer to the 9th), and 255 is past the 91st percentile, so this is probably throwing out ~25% of the data, which is still a lot, but if it's garbage, throw it out...
You could also figure out what these are. Maybe keep them in, see if they classify, and try using NCBI BLAST to classify any unclassifieds... the abnormally long/short reads could yield interesting results.
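If you want to try that, one possible sketch (the classifier artifact name is just a placeholder here; use whichever 515F/806R classifier you already have, and note this assumes unclassifiable reads get labeled "Unassigned", as the standard classifiers do):

qiime feature-classifier classify-sklearn \
  --i-classifier your-515-806-classifier.qza \
  --i-reads rep-seqs-WesternDietSert.qza \
  --o-classification taxonomyWesternDietSert.qza

qiime taxa filter-seqs \
  --i-sequences rep-seqs-WesternDietSert.qza \
  --i-taxonomy taxonomyWesternDietSert.qza \
  --p-include Unassigned \
  --o-filtered-sequences unassigned-seqs-WesternDietSert.qza

qiime feature-table tabulate-seqs \
  --i-data unassigned-seqs-WesternDietSert.qza \
  --o-visualization unassigned-seqs-WesternDietSert.qzv

The tabulate-seqs visualization includes a BLAST link for each sequence, which makes it easy to spot-check what the abnormally long/short features actually are.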