I have a question about which read lengths to accept and which to filter out.
The data i have is 4V 2 x 250 paired-end Novaseq (so a lot of reads).
The overlap should be about 210 bp and total merged read length after denoising (dada2 denoise-paired) should be about 250 bp excluding the 19 bp forward and 20 bp reverse primers (F 515 R 806).
I gave this command:
qiime dada2 denoise-paired --i-demultiplexed-seqs WesternDietSertpaired-end-demux.qza --p-trim-left-f 20 --p-trim-left-r 20 --p-trunc-len-f 0 --p-trunc-len-r 0 --o-table tableWesternDietSert.qza --o-representative-sequences rep-seqs-WesternDietSert.qza --o-denoising-stats denoising-statsWesternDietSert.qza --verbose
It gives me a large range of read lenghts 69-450 with a mean of 239
Now which reads should i filter out? i could use only the 25-75% interval 247-252 read length? This way i loose 50% of my data but i take only the closest match.
Something that puzzles me is that i had previously mistakenly truncated 20 bp forward and reverse besides trimming 20 bp forward and reverse. In this setting, despite cutting 20 bp of biological signal, my read statistics look more consistent with the range being 210-408 and the 9-75% interval being 249-252. I would have thought that giving more of the reads would improve read statistics. --p-trim-left-f 20 --p-trim-left-r 20 does mean i cut the 20 bp at the beginning of the forward and reverse reads right?
and --p-trunc-len-f 0 --p-trunc-len-r 0 is intended for cutting the end of the sequence, that i have now opted out of.
With only trimming 20 bp the beginning i now have many more features 4040 vs 2814, so that seems to have improved…
Note that the q scores look good so (above 30) so i don’t have to cut anything at the end. I did it because i thought there was a reverse primer at the end of the read but there is not.
Thanks for your advice on filtering out read lengths.