I am using this pipeline in analyzing my Illumina 2x250 paired end reads (which I received demultiplexed with barcodes and primers removed) for 16s v4. I checked the demux-joined-filter-stats.qzv file but it seems to have retained all the reads. Is this unusual?
Just to rephrase your question: what’s the usual amount of sequences discarded due to quality filtering?
Well, that’s an excellent question and I’m not aware of any actual publications about this; perhaps @Nicholas_Bokulich has see one. Anyway, in our runs we always see some sequences being drop and the amount changes depending on the run quality but in fairness we know that sequencing is not perfect so we always expect this.
Now, I think you could see this due to a few options:
Your sequences are great,
Your sequences are not really Casava 1.8,
You are working with joined sequences and with my limited experience in this step I thinkvsearch actually generates it’s own quality for each of the position of the sequences joined - remember that there are different methods for this: from taking the average, max, min for each position to overwriting to a high value. Perhaps an easy test to discard this is just to import/filter the forward reads and see what happens.
That sounds a bit unusual (I'd expect at least a few very noisy sequences in there) but maybe not. The default q-score-joined uses a fairly low min q-score (4) and even then the reads will only be discarded if they are trimmed shorter than the min length parameter. Those trimmed but unfiltered seqs will probably be tossed out by deblur, so it all comes out in the wash anyway.
I have not seen such a paper. Such info isn't often reported in individual papers in the literature, so it would be difficult to get good assessments on this except through a very thorough meta-analysis. Info on quality score frequencies is probably more readily available but I can't think of a good source (maybe check out Illumina's documentation).
and I see that the headers are identical to what I have in my samples
Tutorial data header: @M00176:65:000000000-A41FR:1:1101:13767:2175 2:N:0:0
My data header: @M01246:1:000000000-BBF4M:1:1101:17523:1631 1:N:0:1
I see that upon subsampling I have some sequences that are only 35 bases, but there quality is quite high. In my demux-joined.qzv file the sequence quality for the 2nd percentile never drops below 16 for all my sequences, but is that most likely since the paired end reads have been joined?
I’ll try importing only the forward read and let you know what happens.
Just a couple of things from your reply, while you check the forward read.
The 35bp reads might be a problem but for further down analysis and in general they will be discarded from most processing (at least, dada2 and deblur will get rid of them)
The 2nd percentile kind of makes sense as the read joining step assigns arbitrary (“from taking the average, max, min for each position to overwriting to a high value”) Q scores