FastQC compared to QIIME 2 quality plot

I am seeing something that is making me wonder. The plot that I got using QIIME 2 is very different from the one from FastQC. The only difference in the data is that I demultiplexed with QIIME 2 first, because that was how QIIME 2 gave me the report, whereas in FastQC I used the multiplexed data all together. What I get using FastQC is similar to the plots our sequencing center gave us.

Fastqc

Sequencing center

QIIME2

I am very confused. Everything seems very high quality with fastqc but not qiime 2 unless I am reading that plot the wrong way. Any hint?

Hi @Negin,

Overall, I think the 3 plots for the most part agree with each other quite nicely. Not knowing any more details about the FastQC and your sequencing center’s calculations, I’m going to guess that some of the sources of the ‘minor’ variation you see are as follows:
After demultiplexing in QIIME 2 you may have lost some reads if they failed to be assigned to a sample, meaning the pools of reads used for creating those plots are slightly different. In QIIME 2, the plots are created using a random subset of 10,000 sequences, meaning that if FastQC and your sequencing center are creating plots using all the reads, they will of course be different in that regard as well. Finally, there could be some minor differences in the graphical plotting. For example, FastQC uses the 10% and 90% points for the plot whiskers, while QIIME 2 uses the 9th and 91st percentiles. Further variation may arise if one script includes outliers in its plotting calculations while another excludes them before calculating percentiles. Again, all of these would lead to very minor variation in the plots, but nothing you should be worried about moving forward.
If you really wanted the most accurate plots possible, I would demultiplex first and then plot QC scores using a tool that incorporates all of your reads. Not sure if that’s what FastQC does or not; I couldn’t find it in their docs. Anyway, if I were you, I would just move forward with the QIIME 2 plots!
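
To make the two effects above concrete (a random subsample versus all reads, and 10th/90th versus 9th/91st percentile whiskers), here is a minimal Python sketch. It is only an illustration on simulated scores: the function names, the simulated distribution, and the linear-interpolation percentile are assumptions, not how FastQC or QIIME 2 actually compute their plots.

```python
import random


def percentile(values, p):
    """Return the p-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def whiskers(scores, low, high):
    """Return the (low, high) percentile pair used as box plot whiskers."""
    return percentile(scores, low), percentile(scores, high)


def subsample(scores, n, seed=0):
    """Draw n scores without replacement (hypothetical helper, fixed seed)."""
    if n >= len(scores):
        return list(scores)
    return random.Random(seed).sample(scores, n)


if __name__ == "__main__":
    rng = random.Random(42)
    # Simulated quality scores at a single base position (made-up data).
    all_scores = [min(40, max(2, round(rng.gauss(34, 4)))) for _ in range(50_000)]

    # Whiskers on every read at the 10%/90% points.
    print("all reads, 10th/90th:", whiskers(all_scores, 10, 90))
    # Whiskers on a 10,000-read subsample at the 9th/91st percentiles.
    print("subsample, 9th/91st:", whiskers(subsample(all_scores, 10_000), 9, 91))
```

On data like this the two summaries should land close to each other, which is the point made above: the differences exist but are small.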

I agree with Mehrbod,
FastQC in general outputs nicer plots than QIIME 2, but that is because of how it calculates them. The QIIME 2 plots show the outliers much more prominently, but I would go forward because your main data (in black) is high quality. Also keep in mind that every “sequence base” value is an estimate, so there may be variation. Lastly, you have three QC calculations and all of them look good.

Okay, thanks! Yes, the quality was very high with this round of sequencing. However, I lost most of my negative control reads after the DADA2 filtering step.

Awesome. Thanks! Do you think inputting demultiplexed files to QIIME would make any difference? That is, would it be possible that QIIME would fail to assign some reads compared to the sequencing center? I only used multiplexed files because it was faster to read everything into QIIME at once: just two files (R1 and I1) instead of so many demultiplexed sequence files.

It’s hard to say, probably not. The only difference might occur if different similarity thresholds are set between QIIME 2 demultiplexing and your sequencing facility. For example, some sequencing facilities allow 1 mismatch in their barcodes because the barcodes are designed with a large enough Hamming distance between them that a read with 1 mismatch would not be assigned to a different sample. I’m guessing that in QIIME 2 mismatches are not allowed, but perhaps @thermokarst or @Nicholas_Bokulich can confirm this. That would mean that reads with 1 barcode mismatch might be dropped completely in QIIME 2, whereas they may have been saved with a different algorithm.
Like I said before though, the differences would be so minor that I wouldn’t get caught up with this.
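
As a rough illustration of exact barcode matching versus allowing 1 mismatch, here is a small Python sketch. The barcodes, sample names, and the `assign_sample` helper are all hypothetical, and this is not the demultiplexing code of QIIME 2 or of any sequencing facility.

```python
def hamming(a, b):
    """Count positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read_barcode, barcode_to_sample, max_mismatches=0):
    """Return the sample ID for a read barcode, or None if it is dropped.

    A read is assigned only when exactly one known barcode lies within
    max_mismatches of it; ambiguous or unmatched reads return None.
    """
    hits = [
        sample
        for barcode, sample in barcode_to_sample.items()
        if hamming(read_barcode, barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcodes that differ from each other at 6 of 8 positions.
BARCODES = {"ACGTACGT": "sample_A", "TTGACCAA": "sample_B"}

if __name__ == "__main__":
    read = "ACGTACGA"  # one sequencing error in the last base
    print(assign_sample(read, BARCODES))                    # exact matching
    print(assign_sample(read, BARCODES, max_mismatches=1))  # 1 mismatch allowed
```

With exact matching the read with one error is dropped, while a tolerance of 1 mismatch recovers it, which is the kind of small difference in read counts described above.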

That's correct — no barcode error correction.

Hi Mehrdad,

I decided to try analyzing my data using paired-end reads from the exact same sequences. Weirdly, even though this is the exact same dataset, the QIIME 2 plot I am getting shows much lower read quality. What you see above is from the forward read; what I am uploading here is both reads. They are very different, and even looking only at the dark parts of the plot, it still shows very low quality. FastQC again shows much higher quality.

Also, out of the 35 samples, only 13 are shown here. I don't know what happened to the rest.

Hey there @Negin - can you share the original QZV files used to generate the single and paired versions of the demux summary plots? We are missing a lot of critical information when you only provide the box plots.

Hi @thermokarst

Wouldn’t that mean that everyone would have access to my sequences if I share them here?

No, not if you share the QZVs.

Ah, okay. Cool then. Here are the QZV files:

Here is the one for single-end
V4-20181012-demux.qzv (287.4 KB)

here is the paired-end
V4-20181012-demux-p.qzv (293.0 KB)

Okay, that is really helpful! First, your paired-end plot is based on 13 sequences (!!!), while your single-end plot is based on 555154. So, that is a good first step in tracking down why the plots look so drastically different (the sample sizes for the box plots differ by orders of magnitude). BTW, I pulled that information from the table near the top of the first page of the viz.

Second, by looking through the provenance of the visualizations, I noted that you demuxed the single-end reads using the --p-no-rev-comp-mapping-barcodes flag, while the paired-end reads used the --p-rev-comp-mapping-barcodes flag, which is going to completely change how the demux process works. In the case of the paired-end reads, this took the reverse complement of your barcodes before demuxing.

So, before proceeding, maybe it makes sense to determine exactly what you need first --- do you need to take the reverse complement? It seems like the answer is "no", but that is something you should talk about with whoever did the sample prep (or maybe your sequencing center).
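
To show why reverse-complementing the barcodes can collapse the number of assigned reads, here is a small Python sketch. The barcodes and reads are made up, and `count_assigned` is a hypothetical helper whose keyword argument merely mirrors the name of the flag discussed above; it is not QIIME 2's implementation.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_assigned(read_barcodes, sample_barcodes, rev_comp_mapping_barcodes=False):
    """Count reads whose barcode exactly matches a sample barcode.

    When rev_comp_mapping_barcodes is True, the sample barcodes are
    reverse-complemented before matching.
    """
    if rev_comp_mapping_barcodes:
        lookup = {reverse_complement(b) for b in sample_barcodes}
    else:
        lookup = set(sample_barcodes)
    return sum(1 for barcode in read_barcodes if barcode in lookup)


if __name__ == "__main__":
    sample_barcodes = ["AACG", "GGTA"]        # hypothetical mapping-file barcodes
    reads = ["AACG", "AACG", "GGTA", "CGTT"]  # hypothetical barcode reads

    print(count_assigned(reads, sample_barcodes))
    print(count_assigned(reads, sample_barcodes, rev_comp_mapping_barcodes=True))
```

If the barcodes are in the wrong orientation, only reads that happen to match a reverse-complemented barcode by chance get assigned, so a tiny read count is a hint that the orientation setting may be wrong.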


As for your original post, where you were comparing the quality plots between QIIME 2 and FastQC --- what exactly did you plot in FastQC? Was it the original demuxed data, or was it one single sample? Again, we are missing some critical context there to help you. For example, for the Moving Pictures dataset, here is the FastQC plot for the full, multiplexed reads:

And here is the QIIME 2 demux summary, which is generated on the demultiplexed reads (which means not all reads will be present due to barcode errors). Also, the plot is made using a subsample of sequences (by default, 10,000):

Finally, the last thing to consider is that FastQC is binning nt positions, while QIIME 2 is plotting a boxplot for each nt position.
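
Here is a minimal Python sketch of the difference between summarizing every nt position and summarizing binned positions. The fixed bin size and the use of means are simplifying assumptions for illustration; this does not reproduce FastQC's actual grouping scheme or QIIME 2's box plots.

```python
def per_position_means(quality_rows):
    """Mean quality at each position, given equal-length rows of scores."""
    return [sum(column) / len(column) for column in zip(*quality_rows)]


def binned_means(position_means, bin_size):
    """Average per-position means over consecutive bins of bin_size positions.

    Returns (label, mean) pairs, with 1-based position ranges as labels.
    """
    bins = []
    for start in range(0, len(position_means), bin_size):
        chunk = position_means[start:start + bin_size]
        label = f"{start + 1}-{start + len(chunk)}"
        bins.append((label, sum(chunk) / len(chunk)))
    return bins


if __name__ == "__main__":
    # Two made-up reads with a sharp quality drop at the last position.
    rows = [[40, 40, 30, 10], [40, 38, 30, 20]]
    means = per_position_means(rows)
    print(means)                   # one value per position
    print(binned_means(means, 2))  # one value per pair of positions
```

Binning smooths out a drop at an individual position, so a binned plot can look cleaner than a per-position plot of the same reads.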

Hope that helps!

Hi @thermokarst

Thank you for all the explanations. This makes sense. As for your second question, I mentioned above that I used the multiplexed file for FastQC but the demultiplexed one for QIIME 2.

For your first note, I am pretty sure I should not take the reverse complement of the barcodes because obviously, what I am getting for the single-end makes more sense in terms of number of samples that I have. I used the paired-end code from the Atacama soil microbiome tutorial and I was not aware that this code is taking the reverse complement. I will try again with the correct code and update you.

Thanks so much

Hi @thermokarst

I tried what you said and it worked. Thanks! Just a quick question: how should my metadata change from single-end to paired-end in terms of the LinkerPrimerSequence? I used the V4 primer for read 1 in my metadata when I used single-end. What should I include for the paired-end?

I am not sure that metadata column is actually being used here (i.e., the atacama tutorial), so do not worry about it. Please correct me if I am wrong or if I am contradicting other advice — I have not followed this whole topic thread so am not aware of all commands you are running.

I hope that helps!

But in general, would the LinkerPrimerSequence ever be used? I ask because I don't know how to include both forward and reverse primers in my metadata file if needed.