Strange Interactive Quality Plot after Importing the data

Hello everyone,

I have 444 samples that were sequenced on a NovaSeq 6000 instrument (250-bp paired-end reads).
I am trying to import my raw reads using the manifest format. The QIIME 2 version is 2022.2.

The code that I am using is:

qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path import.txt \
--output-path demux_paired_end.qza \
--input-format PairedEndFastqManifestPhred33V2 
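
For reference, a manifest for this format is a tab-separated file whose header columns are `sample-id`, `forward-absolute-filepath` and `reverse-absolute-filepath`. A minimal sketch of generating one follows, assuming (hypothetically) that all reads sit in one directory and are named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`; the function name `make_manifest` is illustrative and not part of QIIME 2.

```shell
# Hypothetical helper: print a paired-end manifest (V2 layout) for a directory.
# Assumes files are named <sample>_R1.fastq.gz / <sample>_R2.fastq.gz.
make_manifest() {
  dir=$(cd "$1" && pwd)   # the manifest needs absolute paths
  printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n'
  for fwd in "$dir"/*_R1.fastq.gz; do
    [ -e "$fwd" ] || continue                    # skip if nothing matched
    sample=$(basename "$fwd" _R1.fastq.gz)       # sample ID from the file name
    printf '%s\t%s\t%s\n' "$sample" "$fwd" "$dir/${sample}_R2.fastq.gz"
  done
}
# Example: make_manifest /path/to/fastq_dir > import.txt
```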

For most base positions, the interactive quality plot I get does not show a box plot.
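
The interactive quality plot is presumably the one produced by `qiime demux summarize`. A minimal sketch of that step, wrapped in an illustrative function (`summarize_demux` is a hypothetical name, and the output file name in the example is an assumption):

```shell
# Hypothetical wrapper: build the interactive quality plot (.qzv) from a
# demultiplexed artifact. $1 = input .qza, $2 = output .qzv
summarize_demux() {
  qiime demux summarize \
    --i-data "$1" \
    --o-visualization "$2"
}
# Example: summarize_demux demux_paired_end.qza demux_paired_end.qzv
```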

I have run the same command twice: once for 5 samples and once for the whole set (444 samples).

In both cases I get very similar interactive quality plots, and the box plots are missing at most positions, mainly below 230 bp.

What could be the explanation for that?

The plot for 444 samples:

The plot for 5 samples:

Thank you,
Armin

Hi @ari_sh70,

Welcome back to the QIIME 2 forum!

Apologies for the delay in response on this - I am taking a look at this and will follow up shortly. Thanks!

Hi @lizgehret

Thanks very much. Really appreciate it!!

Armin

Hi @ari_sh70,
The newest generations of Illumina machines (including the NovaSeq in your case) use a more "streamlined" binning system for quality scores. Instead of the traditional continuous Phred scores you may be familiar with, the values are binned into only 4 values, hence the very artificial-looking quality plots you see. This is totally expected. How these newer binning systems affect downstream quality control/denoising is a different discussion.
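
One way to confirm the binning is to tabulate which quality values actually occur in a raw FASTQ file. The sketch below assumes Phred+33 encoding and standard 4-line FASTQ records; `qual_histogram` is an illustrative name, and the two reads in the demo are made up, not taken from the data in this thread.

```shell
# Tabulate Phred+33 quality scores from a FASTQ stream on stdin.
# Prints "<score><TAB><count>" sorted by score.
qual_histogram() {
  awk '
    # Build a character -> ASCII code lookup for printable characters.
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    # Every 4th line of a FASTQ record is the quality string.
    NR % 4 == 0 {
      n = length($0)
      for (j = 1; j <= n; j++) count[ord[substr($0, j, 1)] - 33]++
    }
    END { for (q in count) print q "\t" count[q] }
  ' | sort -n
}

# Demo on two made-up reads:
printf '@r1\nACGT\n+\nFF:F\n@r2\nACGT\n+\nF,#F\n' | qual_histogram
```

On real data the input could be piped in with something like `gzip -dc sample_R1.fastq.gz | qual_histogram` (file name hypothetical). Binned data should show only a handful of distinct scores, whereas older instruments typically show many.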
It looks like the DADA2 developers are going to release a new version that specifically handles this new data type better (as per here), but this is not currently implemented in the q2-dada2 version. I'm not sure how well or badly DADA2 would perform with the binned scores; my guess is that its error-model-building step may not do so well, but that is pure speculation on my part!
You could always try merging your reads with q2-vsearch and running the output through Deblur. Of note, Deblur's pre-packaged error model was based on Illumina MiSeq data, and as far as I'm aware it has never been benchmarked against NovaSeq data. My guess, however, is that it would work fine: the NovaSeq is meant to have more accurate base-calling than the MiSeq, so you may simply be taking a more conservative approach to your QC here.

Btw, the above suggestions are based on the assumption that you have amplicon data (e.g. 16S, ITS), and NOT shotgun data.
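
The merge-then-Deblur route could be sketched roughly as follows for 16S data. This is a sketch under assumptions, not a tested recipe for NovaSeq reads: the function name `merge_and_deblur`, the output file names, and the trim length are all illustrative, and action names may differ between QIIME 2 releases (the ones below are as I know them for 2022.2).

```shell
# Hypothetical wrapper: join paired reads, quality-filter, then run Deblur.
# $1 = paired-end demultiplexed artifact (e.g. demux_paired_end.qza)
# $2 = Deblur trim length (an assumption; pick it from a summary of the
#      joined reads)
merge_and_deblur() {
  demux=$1
  trim_length=$2

  # 1. Merge forward and reverse reads with q2-vsearch.
  qiime vsearch join-pairs \
    --i-demultiplexed-seqs "$demux" \
    --o-joined-sequences demux_joined.qza &&
  # 2. Basic quality filtering before Deblur.
  qiime quality-filter q-score \
    --i-demux demux_joined.qza \
    --o-filtered-sequences demux_joined_filtered.qza \
    --o-filter-stats demux_joined_filter_stats.qza &&
  # 3. Denoise the joined reads with Deblur (16S workflow).
  qiime deblur denoise-16S \
    --i-demultiplexed-seqs demux_joined_filtered.qza \
    --p-trim-length "$trim_length" \
    --p-sample-stats \
    --o-representative-sequences rep_seqs_deblur.qza \
    --o-table table_deblur.qza \
    --o-stats deblur_stats.qza
}
# Example: merge_and_deblur demux_paired_end.qza 250
```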

Hi @Mehrbod_Estaki ,
Thank you very much for your answer.
Yes, I have amplicon data (16S). If I understood you correctly, I should use Deblur instead of DADA2, is that correct?
In your opinion, wouldn't it be better to run Trimmomatic first and then, after QC, continue in QIIME?

Thank you again,
Armin

Hi @ari_sh70,

Yes, Deblur would be my current approach within the QIIME 2 environment. With the upcoming DADA2 release, I would probably switch to that, since it will at least have some benchmarks to back it up.
Another caveat you should be aware of is that Deblur tends to drop more reads as their length increases (see an example calculation here). So, if you find yourself losing too many reads after Deblur, consider using your forward reads only: you would lose some resolution but may retain many more reads. This ultimately depends on your overall goals for the analysis and what you prioritize.

Having never used Trimmomatic, I'm not sure in what respects it would be better than the approaches I mentioned above, or than q2-vsearch if you would rather not use the denoisers. Can you expand on what your plan would be?

Hi @Mehrbod_Estaki ,

Thanks again for your time and all the clear explanations that you are giving.

The reason I was thinking of using Trimmomatic for QC is that, after importing the raw reads into QIIME, I cannot read the interactive quality plot well enough to choose thresholds for truncating and trimming low-quality bases. I have since realized that this is because I am dealing with NovaSeq raw reads.
So I was thinking of cleaning the raw reads with Trimmomatic first, then importing the cleaned reads and performing the rest of the analyses in QIIME, if that makes any sense!

Thank you,
Armin

Hi @ari_sh70,
I'm still not sure I understand the purpose of using Trimmomatic first. Your plots look this way because the underlying quality scores are binned, so they would look basically the same with any other tool you use, and I don't see where the advantage is. Perhaps I don't know all the functions of Trimmomatic, but isn't its main purpose to trim reads? If so, trimming can be done within QIIME 2 using q2-cutadapt, or directly within either q2-deblur or q2-dada2 (whichever you choose to use). That would save you an extra step of transferring data between tools and keep your provenance complete right from the first step.
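
As an illustration of trimming inside QIIME 2, primer removal with q2-cutadapt could look roughly like this. It is a sketch only: `trim_primers` is a hypothetical wrapper name, and the primer sequences and file names must come from your own protocol.

```shell
# Hypothetical wrapper: remove 5' primers from paired-end reads with q2-cutadapt.
# $1 = input demultiplexed artifact, $2 = forward primer sequence,
# $3 = reverse primer sequence,      $4 = output artifact
trim_primers() {
  qiime cutadapt trim-paired \
    --i-demultiplexed-sequences "$1" \
    --p-front-f "$2" \
    --p-front-r "$3" \
    --o-trimmed-sequences "$4"
}
# Example (placeholders for the real primer sequences):
# trim_primers demux_paired_end.qza FWD_PRIMER REV_PRIMER demux_trimmed.qza
```

Length-based trimming and truncation, by contrast, are set through the denoiser's own parameters, so no separate tool is needed for them.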

Hi @Mehrbod_Estaki ,

Thank you. I have been thinking over what you told me and yes, now it is totally clear to me. I will not use Trimmomatic to trim the data. I will follow what you've said and do it in QIIME.

Thanks again Mehrbod.
Armin