Quality control

Hi everyone.

There is some confusion about the quality control step, and I am new to this field, so this might be a very basic thing.. :sob:

Well, for the QC step in human genome sequencing, such as GWAS, I remember doing quality control that excludes samples with bad quality. But here, I see that this kind of filtering step is not included in the QIIME 2 tutorial.

There are only steps that trim or exclude "reads" on a per-nucleotide basis (according to the paper referenced in the tutorial, basic quality-score-based filtering), and the other filtering options in the QIIME 2 tutorial seem to be run only after the feature table has been made.
Also, DADA2 and Deblur are other tools that do QC, but I see there is no process that excludes samples, only poor reads.
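
For context, the step I mean looks something like this; it trims and drops individual reads but never removes a sample (a minimal sketch, assuming single-end data; the file names and truncation length are just placeholders):

```
# DADA2 trims/filters individual reads; samples themselves are kept
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 150 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```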

Well, the following example is not about "base quality" but about "read counts": for example, there are some samples with only 43 reads and samples with over 100,000 reads. Very skewed. In this case, should I remove samples with very few reads, and samples with too many reads, in case of contamination or other factors..?

Adjusting the sampling depth (in this case, samples under the sampling depth will be excluded.. right?) is related to sampling-size bias.
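
For reference, by adjusting sampling depth I mean something like this, where any sample with fewer reads than the chosen depth gets dropped (the depth of 1000 and the file names are just placeholders):

```
# Rarefying subsamples every sample down to the same depth;
# samples below --p-sampling-depth are excluded from the output table
qiime feature-table rarefy \
  --i-table table.qza \
  --p-sampling-depth 1000 \
  --o-rarefied-table rarefied-table.qza
```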

So, to make it simple: I'm wondering whether I can exclude "samples", based on "base quality" or "read count", before making the feature table?

Thank you in advance :slight_smile:

Sorry, I'm re-uploading the image at full size.

Hello!
DADA2 and Deblur will perform quality filtering on their own. Since each sample contains numerous sequences, quality control is performed on each sequence, not by sample.
You can obtain a feature table that still includes poorly represented samples and then filter the table to get rid of samples that fall below a certain threshold in the number of sequences. You can also perform taxonomy-based filtering after taxonomy assignment.
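
A minimal sketch of both filtering steps (the threshold of 1000 and the file names are placeholders to adapt):

```
# Drop samples with fewer than 1000 sequences from the feature table
qiime feature-table filter-samples \
  --i-table table.qza \
  --p-min-frequency 1000 \
  --o-filtered-table sample-filtered-table.qza

# After taxonomy assignment, drop unwanted taxa,
# e.g. mitochondrial and chloroplast reads
qiime taxa filter-table \
  --i-table sample-filtered-table.qza \
  --i-taxonomy taxonomy.qza \
  --p-exclude mitochondria,chloroplast \
  --o-filtered-table taxa-filtered-table.qza
```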

BTW, you have one sample (the max) that contains really a lot of reads. Did you perform the demultiplexing yourself, or did you receive already demultiplexed reads?
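
For reference, the per-sample read counts can be viewed with something like this (file names are placeholders):

```
# Summarize per-sample counts and quality plots after demultiplexing
qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv
```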

Hello @timanix, thanks for the reply :slight_smile:

So, we don't usually exclude samples during preprocessing QC,
but if there are samples with "low-quality bases" rather than "low read counts", it would be nice to exclude them.. right..?

And yes, I imported multiplexed data and demultiplexed it myself, but the result looks just like that. In fact, I have 2 other similar datasets and they look exactly the same.. :frowning: I checked whether the same adapters were causing the problem, but the samples with too many reads were not the same across datasets.
(I demultiplexed them separately, since each set used the same adapters, so I couldn't merge the files and demultiplex them together.)

Thank you!

After denoising, reads with low quality will be dropped from the samples, so afterwards each sample contains only good reads; any sample that consisted mostly of "bad" reads will be poorly represented, and you can filter it out if needed.
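
To see how many reads survived denoising per sample before picking a filtering threshold, a sketch assuming a DADA2 stats artifact named denoising-stats.qza:

```
# Tabulate DADA2's per-sample denoising statistics
# (reads in, filtered, denoised, non-chimeric)
qiime metadata tabulate \
  --m-input-file denoising-stats.qza \
  --o-visualization denoising-stats.qzv
```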

I had a similar issue with NovaSeq data and ended up using Sabre to demultiplex it outside of QIIME 2, since it gave me better results than Cutadapt. IDK if that is specific to NovaSeq or just to my barcodes.
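
If you go that route, the per-sample FASTQ files from the external demultiplexer can be imported back with a manifest; a minimal sketch for single-end Phred 33 data (the manifest path and format here are assumptions to adapt to your data):

```
# manifest.tsv is a tab-separated file mapping
# sample-id -> absolute-filepath
qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path manifest.tsv \
  --input-format SingleEndFastqManifestPhred33V2 \
  --output-path demux.qza
```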

Yeah, that makes sense.