Quality-filter q-score-joined file retains all reads

Paraqui · March 1, 2018, 2:41pm

I am using this pipeline in analyzing my Illumina 2x250 paired end reads (which I received demultiplexed with barcodes and primers removed) for 16s v4. I checked the demux-joined-filter-stats.qzv file but it seems to have retained all the reads. Is this unusual?

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path fastq_files \
  --source-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux-paired-end.qza

qiime demux summarize \
  --i-data demux-paired-end.qza \
  --o-visualization demux-paired-end.qzv

qiime vsearch join-pairs \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --o-joined-sequences demux-joined.qza

qiime demux summarize \
  --i-data demux-joined.qza \
  --o-visualization demux-joined.qzv

qiime quality-filter q-score-joined \
  --i-demux demux-joined.qza \
  --o-filtered-sequences demux-joined-filtered.qza \
  --o-filter-stats demux-joined-filter-stats.qza

qiime metadata tabulate \
  --m-input-file demux-joined-filter-stats.qza \
  --o-visualization demux-joined-filter-stats.qzv

qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-joined-filtered.qza \
  --p-trim-length 292 \
  --p-sample-stats \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-stats deblur-stats.qza

qiime feature-table summarize \
  --i-table table.qza \
  --o-visualization table.qzv

qiime deblur visualize-stats \
  --i-deblur-stats deblur-stats.qza \
  --o-visualization deblur-stats.qzv

qiime alignment mafft \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza

qiime alignment mask \
  --i-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza

antgonza · March 2, 2018, 4:21pm

Hi @Paraqui,

Just to rephrase your question: what's the usual amount of sequences discarded due to quality filtering?

Well, that's an excellent question and I'm not aware of any actual publications about this; perhaps @Nicholas_Bokulich has see one. Anyway, in our runs we always see some sequences being drop and the amount changes depending on the run quality but in fairness we know that sequencing is not perfect so we always expect this.

Now, I think you could see this due to a few options:

Your sequences are great,
Your sequences are not really Casava 1.8,
You are working with joined sequences and with my limited experience in this step I think vsearch actually generates it's own quality for each of the position of the sequences joined - remember that there are different methods for this: from taking the average, max, min for each position to overwriting to a high value. Perhaps an easy test to discard this is just to import/filter the forward reads and see what happens.

Hope this helps,

Nicholas_Bokulich · March 2, 2018, 4:32pm

That sounds a bit unusual (I'd expect at least a few very noisy sequences in there) but maybe not. The default q-score-joined uses a fairly low min q-score (4) and even then the reads will only be discarded if they are trimmed shorter than the min length parameter. Those trimmed but unfiltered seqs will probably be tossed out by deblur, so it all comes out in the wash anyway.

I have not seen such a paper. Such info isn't often reported in individual papers in the literature, so it would be difficult to get good assessments on this except through a very thorough meta-analysis. Info on quality score frequencies is probably more readily available but I can't think of a good source (maybe check out Illumina's documentation).

Paraqui · March 5, 2018, 2:23pm

I downloaded the Casava 1.8 paired-end demultiplexed fastq files from the importing data tutorial

https://docs.qiime2.org/2018.2/tutorials/importing/#casava-1-8-paired-end-demultiplexed-fastq

and I see that the headers are identical to what I have in my samples

Tutorial data header:
@M00176:65:000000000-A41FR:1:1101:13767:2175 2:N:0:0

My data header:
@M01246:1:000000000-BBF4M:1:1101:17523:1631 1:N:0:1

I see that upon subsampling I have some sequences that are only 35 bases, but there quality is quite high. In my demux-joined.qzv file the sequence quality for the 2nd percentile never drops below 16 for all my sequences, but is that most likely since the paired end reads have been joined?

I'll try importing only the forward read and let you know what happens.

Thanks!

antgonza · March 5, 2018, 2:55pm

Just a couple of things from your reply, while you check the forward read.

The 35bp reads might be a problem but for further down analysis and in general they will be discarded from most processing (at least, dada2 and deblur will get rid of them)
The 2nd percentile kind of makes sense as the read joining step assigns arbitrary ("from taking the average, max, min for each position to overwriting to a high value") Q scores

Let us know how it goes ...

Paraqui · March 12, 2018, 1:22pm

So I ran this to see how the quality filter affects the non-paired reads

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path fastq_files \
  --source-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux-paired-end.qza

qiime demux summarize \
  --i-data demux-paired-end.qza \
  --o-visualization demux-paired-end.qzv

qiime quality-filter q-score \
  --i-demux demux-paired-end.qza \
  --o-filtered-sequences demux-paired-end-filtered.qza \
  --o-filter-stats demux-paired-end-filter-stats.qza

qiime metadata tabulate \
  --m-input-file demux-paired-end-filter-stats.qza \
  --o-visualization demux-paired-end-filter-stats.qzv

and I got the following results. Most of the reads appeared to be retained

demux-paired-end-filter-stats.qzv (1.2 MB)

system · April 12, 2018, 7:22pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.