Differences in species composition between QIIME1 and QIIME2 at the phylum level

Hi!
I have encountered a problem that has not been solved for a long time. I have checked the forum articles, but the effect is not good.
I used QIIME2 to generate the relative abundance table of the phyla level, and used the R language to draw a stacked histogram. The graph showed that the top 4 phyla in relative abundance was completely different from the phyla sequence made by QIIME1.
My data is paired-end 16S rRNA sequencing. First, I use dada2 to generate table and rep seqs, and the interception length is 220 (Is this value setting reasonable?):


time qiime dada2 denoise-paired
--i-demultiplexed-seqs paired-end-demux.qza
--p-trim-left-f 0
--p-trim-left-r 0
--p-trunc-len-f 220
--p-trunc-len-r 220
--o-table table.qza
--o-representative-sequences rep-seqs.qza
--o-denoising-stats dada2-stats.qza
--p-n-threads 0
Then, use the comparison with the silva database to obtain taxonomy:
time qiime feature-classifier classify-sklearn
--i-classifier silva-138-99-515-806-nb-classifier.qza
--i-reads rep-seqs.qza
--o-classification taxonomy.qza
Finally, output the relative abundance table of the phylum level and draw a histogram in R language:
time qiime taxa collapse
--i-table table.qza
--i-taxonomy taxonomy.qza
--p-level 2
--o-collapsed-table phyla-table.qza

qiime feature-table relative-frequency
--i-table phyla-table.qza
--o-relative-frequency-table rel-phyla-table.qza

qiime tools export
--input-path rel-phyla-table.qza
--output-path rel-phyla-table

biom convert -i feature-table.biom -o rel-phyla-table.tsv --to-tsv

However, after this, there was a significant difference between the results of QIIME1 and QIIME2. Figure 1 shows the results of QIIME1 and Figure 2 shows the results of QIIME2. I don't know why this difference occurred.QIIME1
image

Thanks for any suggestions!

Hi!
One quick note that I would like to make is that you are applying very strict filtering by setting --p-trunc-len parameters to 220. All sequences that are shorter than 220 will be discarded. Did you check how many sequences passed through the filtering step after Dada2?
If you don't want to cut the sequences, you should disable above mentioned flags or set it to lower values to keep more sequences for post analysis. So I suggest to try it and check if the issue will reappear.

Hi!@ Timur Yergaliyev
What you said is very correct. After DADA2 filtering, most of the sequences are filtered out. Is it because "--p-trunc-len 220" is not added?
Thanks a lot!

Is it a new output with other parameters or the same that you got before asking here?
I can see on the screenshot that about 90% of reads are passing the filter, but you are loosing a lot on the merging step. What is expected amplicon size? Do you you have enough of pairs in overlapping region (at least 12 bp) to merge forward and reverse reads?

Hi!@ Timur Yergaliyev
The picture just now is the result before I added "--p-trunk-len 220".The code used is as follows:
qiime tools import
--type 'SampleData[PairedEndSequencesWithQuality]'
--input-path manifest_fix.csv
--output-path paired-end-demux.qza
--input-format PairedEndFastqManifestPhred33V2

qiime demux summarize
--i-data paired-end-demux.qza
--o-visualization paired-end-demux.qzv

time qiime dada2 denoise-paired
--i-demultiplexed-seqs paired-end-demux.qza
--p-trim-left-f 0
--p-trim-left-r 0
--p-trunc-len-f 220
--p-trunc-len-r 220
--o-table table.qza
--o-representative-sequences rep-seqs.qza
--o-denoising-stats dada2-stats.qza
--p-n-threads 0

This is the QIIME2 code before adding the "-p-trunk-len 220" statement. Is there a problem with the code?If not, I will try to run DADA2 again after adding "--p-trunk-len 220".

Thanks a lot!

Hi!@ Timur Yergaliyev
When I tried to run dada2, some errors occurred, warning me "(1/1?) no such option: --p-trunc-len (Possible options: --p-trunc-len-f, --p- trunc-len-r)", what should I do next?

time qiime dada2 denoise-paired
--i-demultiplexed-seqs paired-end-demux.qza
--p-trim-left-f 0
--p-trim-left-r 0
--p-trunc-len-f 220
--p-trunc-len-r 220
--p-trunc-len 220
--o-table table0629.qza
--o-representative-sequences rep-seqs0629.qza
--o-denoising-stats dada2-stats0629.qza
--p-n-threads 0

Thank you

Hi again.
Sorry, looks like I wasn't clear enough with --p-trunc-len parameters. There is no such option in Dada2-paired as --p-trunc-len, I used it as short for both --p-trunc-len-f and --p-trunc-len-r and suggested you to disable it or to set them to a lower than 220 value, for example, to 210.
But later you added a screenshot, according to which most of the reads are passing the filter, but not properly merged. So, what is expected amplicon size? Do you you have enough of pairs in overlapping region (at least 12 bp) to merge forward and reverse reads?

According to this commands, you didn't cut primers from your data:

It may also affect taxonomy classification process. It is recommended to use cutadapt to remove primers or apply --p-trim-left-f integer(length of primer) to strip them directly in DADA2.

In addition, to help us understand your workflow, it would be nice if you will provide us with following information:

  1. What did you used to cluster and classify in QIIME1? Are you comparing sklearn vs blastn?
  2. How sequences processed by Q1 where filtered/QCed?
1 Like

Hi!@Profile - timanix - QIIME 2 Forum
Thank you for your patience to help me analyze the cause of the problem.

I don’t have any expected amplicon size, and I don’t know if there is enough overlap, because I don’t know how to view these.

The sequencing company has removed the primers. The manifest I imported is paired-end sequencing data with primers and barcodes removed.
In addition, are there any other possible reasons that cause a large number of sequences to be filtered out during merging?

Thanks again!

Hi again!

You can calculate it if you know targeted region of your library. For example, it can be V4 or V3-V4. You may ask the person who prepared a library before sequencing or who designed the experiment. They should know expected amplicon size.

In my opinion, most probably your reads are to short. Here is the illustration of merging


So if overlapping region is less than 12 (default) DADA2 will not merge it.

If what I wrote above is true for your dataset, you can try to run your data with disabled truncating parameters and decreased overlapping region:

qiime dada2 denoise-paired
  --i-demultiplexed-seqs paired-end-demux.qza
  --p-trim-left-f 0
  --p-trim-left-r 0
  --p-trunc-len-f 0
  --p-trunc-len-r 0
  --p-min-overlap 6
  --o-table table.qza
  --o-representative-sequences rep-seqs.qza
  --o-denoising-stats dada2-stats.qza
  --p-n-threads 0

I set minimum overlap to 6 but you can change it.

3 Likes

Hi again.@ Timur Yergaliyev
The sequencing of my project is V4 region, namely 515/806R, therefore, the expected amplicon size should be 806-515+1=292bp.

I'll try to run my data with "--p-min-overlap 6".But why are the parameters after --p-trunk-len-f and --p-trunk-len-r set to 0 instead of 220?

Thank you very much!

That's should be more than enough for reads to overlap.

I suspected earlier that your reads are not overlapping, so I suggested that you will set

to disable truncating of your reads and additionaly decrease minimum overlapping region.

But since your expected amplicon size is less than 300, your reads should be long enough to overlap. So it is a mystery to me why the merging rate is so low.
With updated information about expected amplicon size, and taking into the account that primers already removed, I would try to run dada2 with two following commands to compare:

qiime dada2 denoise-paired
  --i-demultiplexed-seqs paired-end-demux.qza
  --p-trim-left-f 0
  --p-trim-left-r 0
  --p-trunc-len-f 165
  --p-trunc-len-r 150
  --o-table table.qza
  --o-representative-sequences rep-seqs.qza
  --o-denoising-stats dada2-stats.qza
  --p-n-threads 0

and

qiime dada2 denoise-paired
  --i-demultiplexed-seqs paired-end-demux.qza
  --p-trim-left-f 0
  --p-trim-left-r 0
  --p-trunc-len-f 0
  --p-trunc-len-r 0
  --p-min-overlap 6
  --o-table table.qza
  --o-representative-sequences rep-seqs.qza
  --o-denoising-stats dada2-stats.qza
  --p-n-threads 0

If merging still will be very low, you may consider using only forward reads without merging them with reverse reads. But hopefully, one of them will work.

@lys Updated comment

Unfortunately, I tried to run my data, but it seems to be lower.

After I run this code, the warning shows "(1/1?) no such option: --p-min-overlap" .

Which version of Qiime2 are you using? This option was added recently to q2-plugin and I suspect you need the latest vesion of Qiime2

The version of my Qiime2 is 2021.2. Should I use the newer version?
Thanks!

Yes, 2021.4, I believe

OK.Thank youI will update the version first.
Thank you very much!

Hi! Timur Yergaliyev

After running this code, there are fewer remaining sequences after merging.

But after this code ran, it worked. Filter out only a small amount of data when merging.
So, comparing these two results, why do I have a bad filtering situation?
Is this last kind of code reasonable compared to the setting --p-trunk-len?

Thanks a lot.

Hi,
I am glad to hear that you have some progress. I discussed your dataset with others moderators and we concluded, that quality scores of your reads looks like NovaSeq reads. Moreover, in my experience, according to the graph, it looks like primers are still attached (usually after cutting primers quality scores at the end of reads look different from your data). So, we are suspecting that sequencing center removed barcodes and Illumina adapters, but not primers for V4 region. You can check by searching for your primers in reads. Or you can directly run cutadapt in Qiime2 to remove primers with option to discard sequences without primers enabled. I can help you with commands if necessary later when I will have access to my laptop. After cutadapt, you can visualize resulted reads and check quality and reads length again to decide how to run Dada2.

If I am right and primers are still attached, it will affect the actual length of your reads, and consequently the overlapping region, decreasing it. It would be a better idea to cut primers with cutadapt, create a visualization and decide again how to run Dada2, since primers in sequence will affect taxonomy classification.

Hi! Timur Yergaliyev
Thank you very much for your contribution to this issue

The project is indeed based on the Illumina NovaSeq6000 sequencing platform, but I took a few samples and searched for the V4 region 515F forward primer sequence: CCTAYGGGRBGCASCAG, and 806R reverse primer sequence: GGACTACNNGGGTATCTAAT, but none of them were found. Is cutadapt still needed?

Thank you very much!

HI @lys!
Looks like primers are removed indeed, but

is not the 515F primer but 341F (example)

So you are targeting V3-V4 region instead. Double check it with a person who prepared a library or who designed an experiment.
Now it is clear that --p-min-overlap and disabling truncating parameters helped you to recover your reads after merging.
If primers are removed, you don't need to run cutadapt.
You should check merging stats again, and if you are still loosing a lot on merging step, try to decrease --p-min-overlap to 3, but if stats are reasonably good as it is then it is not recommended. Then you can proceed with further analysis.

1 Like