merging sequence runs with slight primer variation

Hi all,

initial analysis done with qiime version: qiime2-2019.10

I have two data sets of 18S rRNA gene sequences from different sequence providers. The only difference is the reverse primer has an extra trailing TGA

Data set 1:

Data set2:

For the initial analysis (after using cut-adapt for the different primers) i ran both datasets separately using DADA2 and used the same parameters for both for example:

qiime dada2 denoise-paired 
--i-demultiplexed-seqs dataset1_demux-paired-end.qza 
--p-trunc-len-f 270 
--p-trunc-len-r 270 
--p-max-ee-f 2 
--p-max-ee-r 5 
--p-n-threads 2 
--o-table dataset1_table-dada2.qza 
--o-representative-sequences dataset2_rep-seqs-dada2.qza 
--o-denoising-stats dataset1_denoising-stats.qza`

Each dataset is classified with the same PR2 eukaryotic database.

I thought it would be good to combine the datasets. I did this using the qiime feature-table mergeand qiime feature-table merge-seqs commands. However, i don't get a single overlapping ASV. These data sets are from the same waters samples just separated by a different filter size.

I am wondering:

  • If this is from an error in my analysis? could the extra TGA effect something? Does length effect dada2?

  • Should i merge after cutadapt and before DADA2? from reading the forum i thought this was not needed.

  • The merge is correct and the data is to be trusted?

Thanks for any help

Hi @spongebob,
I think what's happening here is that the amplicons from the 565F/948R analysis are slightly longer than the amplicons from the 565F/948R-modified analysis. You should be able to see that difference if you look at the qiime demux summarize output for both of your denoise-paired input files. If that's the case, you shouldn't expect to see any overlapping ASVs, since that overlap is done based on exact matches, and the length difference will throw off the exact match.

What might work best here would be to cluster the ASVs in your merged table using the q2-vsearch plugin. You can try this using the qiime vsearch cluster-features-de-novo --p-perc-identity 1.0, which will cluster your ASV sequences at 100% identity. I think the way the percent identity calculation is performed, your ASVs will cluster if they only differ based on the three bases at the end, as terminal gaps are not counted as differences in the sequences.

Can you let us know how that works out either way? It'll be helpful to know if this works for the future.

One thing to be aware of is that those extra three bases in the 948R-modified primer will introduce biases across the runs. I suspect that the differences would be minor, but any time your primers differ at all primer biases are likely to show up.


1 Like

Attached are the outputs of the qiime demux summarize for each dataset. The only one of different length is the forward read on dataset 1:

Is this a cutadapt error? The reads are all 2*300pb before using cutadapt for both datasets - so i know that something got trimmed. I ran as follows:

For dataset 1 i ran as:
qiime cutadapt trim-paired --i-demultiplexed-sequences dataset1_demux-paired-end.qza --p-cores 2 --p-front-f CCAGCASCYGCGGTAATTCC --p-front-r ACTTTCGTTCTTGATYRA --p-adapter-f TYRATCAAGAACGAAAGT --p-adapter-r GGAATTACCGCRGSTGCTGG --o-trimmed-sequences dataset1_demux-paired-end_PT.qza --verbose

For dataset 2 i ran as:
qiime cutadapt trim-paired --i-demultiplexed-sequences dataset2_demux-paired-end.qza --p-cores 2 --p-front-f CCAGCASCYGCGGTAATTCC --p-front-r ACTTTCGTTCTTGATYRATGA --p-adapter-f TCATYRATCAAGAACGAAAGT --p-adapter-r GGAATTACCGCRGSTGCTG --o-trimmed-sequences dataset2_demux-paired-end_PT.qza

If this is correct then the only other thing is that checking back on my logfiles the two datasets were run with different versions of qiime:

qiime2-2019.10 and qiime2-2021.4

If this does not make a difference then i'll run vsearch and let you know how it goes.

Any further insights much appreciated .


Hi @spongebob, The version of cutadapt that QIIME 2 uses changed between qiime2-2019.10 and qiime2-2021.4, so it's definitely possible that the trimming functionality works differently between the two versions. And, I would have expected the dataset2 reads to be shorter than the dataset1 reads post-cutadapt, since the dataset2 primer is longer.

I recommend re-running both with QIIME 2 2023.5, or at the very least running both through the same version of QIIME 2, and comparing these results again.

Double checked the primers and which dataset they go with - they are correct. I re-ran both datasets through qiime 2023.5.

I get the same lengths as before.

I ran dada2 with the same parameters for both datasets:

qiime dada2 denoise-paired 
--i-demultiplexed-seqs PrimerTrimed.qza 
--p-trunc-len-f 270 
--p-trunc-len-r 270 
--p-max-ee-f 2 
--p-max-ee-r 5 
--p-n-threads 4 
--o-table table.qza 
--o-representative-sequences rep-seqs.qza 
--o-denoising-stats denoising-stats.qza

I retained about 70-80% of the reads

I then tried to merge the ASVs ( using qiime feature-table merge and qiime feature-table merge-seqs commands) --- i get the same issues -- no overlapping ASVs.

I am now running vsearch to cluster the data.

A couple of questions:

  1. I have already classified each dataset - can i filter that taxonomy file to get the ASVs classifications for my clustered data or will i need to run it again?

  2. whats would be wrong with combining the forward and reverse reads from each dataset (after cutadapt) and then running dada2?

here are the vsearch results:

I first merged the tables and the rep-seqs files:

qiime feature-table merge --i-tables PF_table.qza --i-tables SV_table.qza --o-merged-table PF_SV_merged_table.qza

qiime feature-table merge-seqs --i-data PF_rep_seqs.qza --i-data SV_rep_seqs.qza --o-merged-data PF_SV_merged_rep_seqs.qza 

Then i ran vsearch:

qiime vsearch cluster-features-de-novo  --i-sequences PF_SV_merged_rep_seqs.qza --i-table PF_SV_merged_table.qza --p-perc-identity 1.0 --p-threads 14 --o-clustered-table PF_SVEX_clustered_table.qza --o-clustered-sequences PF_SVEX_clustered_rep_seqs.qza

Here is the summary before vsearch

Here is the summary after vsearch

So some ASVs do now overlap.........i need to think about what this means and if i should use this data or if i should keep my datasets separately

Interesting. I suspect the different primer length is resulting in slightly different ASV sequences, which is resulting in the behavior that you're seeing.

A couple of questions:

First, I am not sure what each of the tables you provided in your last message is exactly (i.e., which run, and why four tables for the two runs) - apologies if I'm missing something, just jumping back into this after a couple of weeks.

Second, I would want to know if the shared ASVs make up the majority of the ASVs in terms of relative abundance. If so, that suggests that this is probably working ok, and that you have a lot of low abundance ASVs (which isn't super uncommon). You could figure this out by filtering the feature table, say to features that show up in at least 10 samples, and then reassess the fraction of ASVs that are shared.

You should be able to use the existing FeatureData[Taxonomy], as long as the ASV IDs haven't changed. You can always have extra feature ids in this file, relative to the feature ids in the table that you're working with.

DADA2's error model assumes that all samples were sequenced on the same run, so it won't perform correctly if you've combined multiple sequencing runs.

4 tables as i posted the results of before and after running cutadapt - just to show cutadapt did work.

I played around with the shared ASVs and the amount shared makes sense for our data. So i think we are happy to move forward with it.

Thanks for the tips and help with this issue!

Hi @spongebob, Glad things seem to be looking good now.

I wanted to share a comment that @SoilRotifer shared with me directly:

Let us know if you have questions about this.

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.