Initial analysis was done with QIIME 2 version qiime2-2019.10.
I have two data sets of 18S rRNA gene sequences from different sequence providers. The only difference is that the reverse primer for data set 2 has an extra trailing TGA.
Data set 1:
565F = CCAGCASCYGCGGTAATTCC
948R = ACTTTCGTTCTTGATYRA
Data set 2:
565F = CCAGCASCYGCGGTAATTCC
948R-modified = ACTTTCGTTCTTGATYRATGA
For the initial analysis (after using cutadapt to remove the different primers), I ran both datasets separately through DADA2 and used the same parameters for both, for example:
Each dataset is classified with the same PR2 eukaryotic database.
I thought it would be good to combine the datasets. I did this using the qiime feature-table merge and qiime feature-table merge-seqs commands. However, I don't get a single overlapping ASV. These data sets are from the same water samples, just separated by a different filter size.
I am wondering:
Is this from an error in my analysis? Could the extra TGA affect something? Does length affect DADA2?
Should I merge after cutadapt and before DADA2? From reading the forum, I thought this was not needed.
Or is the merge correct and the data can be trusted?
Hi @spongebob,
I think what's happening here is that the amplicons from the 565F/948R analysis are slightly longer than the amplicons from the 565F/948R-modified analysis. You should be able to see that difference if you look at the qiime demux summarize output for both of your denoise-paired input files. If that's the case, you shouldn't expect to see any overlapping ASVs, since features only overlap when their sequences match exactly, and the length difference prevents an exact match.
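To make the exact-match point concrete, here is a minimal Python sketch. It assumes the ASV feature IDs are MD5 hashes of the ASV sequences, which as far as I know is the q2-dada2 default. The sequence itself is made up for illustration and is not from either data set.

```python
import hashlib


def feature_id(seq: str) -> str:
    """MD5 hex digest of the sequence (assumed to mirror hashed ASV IDs)."""
    return hashlib.md5(seq.encode("ascii")).hexdigest()


# Hypothetical ASV from data set 1, and the same amplicon from data set 2
# where three more bases were trimmed along with the longer reverse primer.
asv_ds1 = "GGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGA"
asv_ds2 = asv_ds1[:-3]

ids_ds1 = {feature_id(asv_ds1)}
ids_ds2 = {feature_id(asv_ds2)}

# The two sequences agree over their shared length but get unrelated IDs,
# so a merge keyed on feature ID finds nothing in common.
shared = ids_ds1 & ids_ds2
print(len(shared))  # 0
```

Even a single extra or missing base changes the hash completely, so two tables can describe the same organisms without sharing a single feature ID.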
What might work best here would be to cluster the ASVs in your merged table using the q2-vsearch plugin. You can try this using qiime vsearch cluster-features-de-novo --p-perc-identity 1.0, which will cluster your ASV sequences at 100% identity. I think that, given the way the percent identity calculation is performed, your ASVs will cluster if they differ only by the three bases at the end, as terminal gaps are not counted as differences between the sequences.
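As a rough illustration of why ignoring terminal gaps helps, here is a toy Python sketch. It is not vsearch's algorithm: it just treats two sequences as identical when the shorter one matches the start or the end of the longer one exactly, and greedily groups them. The function names, feature IDs and sequences are all hypothetical.

```python
from typing import Dict, List


def same_up_to_terminal_gap(a: str, b: str) -> bool:
    """True if the shorter sequence equals the start or end of the longer one."""
    short, long_ = sorted((a, b), key=len)
    return long_.startswith(short) or long_.endswith(short)


def cluster_identical(seqs: Dict[str, str]) -> Dict[str, List[str]]:
    """Greedy toy clustering; the longest sequences become centroids first."""
    clusters: Dict[str, List[str]] = {}
    for fid in sorted(seqs, key=lambda f: len(seqs[f]), reverse=True):
        for centroid in clusters:
            if same_up_to_terminal_gap(seqs[centroid], seqs[fid]):
                clusters[centroid].append(fid)
                break
        else:
            # No existing centroid matched, so this sequence starts a cluster.
            clusters[fid] = [fid]
    return clusters


# Hypothetical ASVs: ds2_asvA is ds1_asvA minus its last three bases.
seqs = {
    "ds1_asvA": "ACGTACGTACGTTGA",
    "ds2_asvA": "ACGTACGTACGT",
    "ds1_asvB": "TTGCAAGGCTTAACC",
}
clusters = cluster_identical(seqs)
print(clusters)
```

In this toy input the two versions of asvA end up in one cluster while asvB stays on its own, which is the behaviour you would hope to see from 100% identity clustering on the real data.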
Can you let us know how that works out either way? It'll be helpful to know if this works for the future.
One thing to be aware of is that those extra three bases in the 948R-modified primer will introduce biases across the runs. I suspect that the differences would be minor, but any time your primers differ at all, primer biases are likely to show up.
Hi @spongebob, The version of cutadapt that QIIME 2 uses changed between qiime2-2019.10 and qiime2-2021.4, so it's definitely possible that the trimming functionality works differently between the two versions. And, I would have expected the dataset2 reads to be shorter than the dataset1 reads post-cutadapt, since the dataset2 primer is longer.
I recommend re-running both with QIIME 2 2023.5, or at the very least running both through the same version of QIIME 2, and comparing these results again.
I then tried to merge the ASVs (using the qiime feature-table merge and qiime feature-table merge-seqs commands), and I get the same issue: no overlapping ASVs.
I am now running vsearch to cluster the data.
A couple of questions:
I have already classified each dataset. Can I filter that taxonomy file to get the ASV classifications for my clustered data, or will I need to run it again?
What would be wrong with combining the forward and reverse reads from each dataset (after cutadapt) and then running DADA2?
Interesting. I suspect the different primer length is resulting in slightly different ASV sequences, which is resulting in the behavior that you're seeing.
A couple of questions:
First, I am not sure what each of the tables you provided in your last message is exactly (i.e., which run, and why four tables for the two runs) - apologies if I'm missing something, just jumping back into this after a couple of weeks.
Second, I would want to know if the shared ASVs make up the majority of the ASVs in terms of relative abundance. If so, that suggests that this is probably working ok, and that you have a lot of low abundance ASVs (which isn't super uncommon). You could figure this out by filtering the feature table, say to features that show up in at least 10 samples, and then reassess the fraction of ASVs that are shared.
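Here is a small Python sketch of that check, in case it helps. It assumes you have exported the merged table into a plain mapping of feature ID to per-sample counts, plus a mapping from sample ID to run; the function name, the layout and the numbers are all made up for illustration. The toy example filters at 2 samples rather than 10 only because it has four samples in total.

```python
def shared_fraction(table, run_of_sample, min_samples=1):
    """Return (fraction of ASVs shared across runs, fraction of reads in them).

    table: feature ID -> {sample ID: count}
    run_of_sample: sample ID -> run label
    min_samples: keep only features observed in at least this many samples
    """
    kept = {
        fid: counts
        for fid, counts in table.items()
        if sum(1 for n in counts.values() if n > 0) >= min_samples
    }
    if not kept:
        return 0.0, 0.0
    total_reads = sum(sum(counts.values()) for counts in kept.values())
    # A feature is "shared" if it has reads in samples from more than one run.
    shared = [
        fid
        for fid, counts in kept.items()
        if len({run_of_sample[s] for s, n in counts.items() if n > 0}) > 1
    ]
    shared_reads = sum(sum(kept[fid].values()) for fid in shared)
    return len(shared) / len(kept), shared_reads / total_reads


# Hypothetical data: one abundant shared ASV and three rare run-specific ASVs.
run_of_sample = {"s1": "run1", "s2": "run1", "s3": "run2", "s4": "run2"}
table = {
    "asvA": {"s1": 400, "s2": 300, "s3": 350, "s4": 250},
    "asvB": {"s1": 5},
    "asvC": {"s3": 3},
    "asvD": {"s4": 2},
}

unfiltered = shared_fraction(table, run_of_sample)
filtered = shared_fraction(table, run_of_sample, min_samples=2)
print(unfiltered, filtered)
```

In this toy table only a quarter of the ASVs are shared, but they carry almost all of the reads, which is the pattern that would suggest the merge is behaving reasonably.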
You should be able to use the existing FeatureData[Taxonomy], as long as the ASV IDs haven't changed. You can always have extra feature ids in this file, relative to the feature ids in the table that you're working with.
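A quick way to confirm this before reusing the file is to check that every feature ID in the clustered table has a taxonomy entry. The sketch below assumes you have exported both to plain Python objects; the helper name, IDs and taxonomy strings are hypothetical.

```python
def missing_taxonomy(table_ids, taxonomy):
    """Return the feature IDs in the table that have no taxonomy entry."""
    return sorted(set(table_ids) - set(taxonomy))


# Hypothetical taxonomy from the earlier classification (note the extra ID).
taxonomy = {
    "asvA": "Eukaryota;groupX",
    "asvB": "Eukaryota;groupY",
    "asvC": "Eukaryota;groupZ",
}
clustered_table_ids = ["asvA", "asvC"]

missing = missing_taxonomy(clustered_table_ids, taxonomy)
print(missing)  # []
```

An empty list means the existing taxonomy covers the clustered table; any IDs that show up here would need to be classified again.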
DADA2's error model assumes that all samples were sequenced on the same run, so it won't perform correctly if you've combined multiple sequencing runs.