Combining Datasets with 2 sets of Primers

jbethany · February 16, 2018, 6:21pm

I have two sets of data (paired reads), with two different primer sets (341/806 and 515/806) that I need to combine. I would like to try trimming my 341/806 set to 515/806 but I'm not sure how to go about doing so before pipelining into dada2.

Thank you!
Julie

colinbrislawn · February 16, 2018, 7:16pm

Hello Julie,

The dada2 developers recommend running dada2 on each run separately, then combining the results, which is a perfect fit for your data set. Even better, the dada2 denoise-paired command comes with two trim-left parameters (--p-trim-left-f and --p-trim-left-r) for trimming off the start of your 341 reads so they start around 515!

Take a look at that command and let me know if it looks like a good fit for your data.

Analyzing a consistent region is important and I'm glad your post will raise awareness of this for future users.

Colin

jairideout · February 17, 2018, 12:09am

Thanks for the help @colinbrislawn! @jbethany, once you've denoised each sequencing run separately, you can merge the feature tables and representative sequences with the qiime feature-table merge and qiime feature-table merge-seqs commands, respectively (check out the FMT tutorial for examples).

Nicholas_Bokulich · February 17, 2018, 12:28am

Hi @jbethany,

This is a great suggestion, but a potential issue occurs to me: the nucleotide positions for these primers are approximate and may not be exact in all bacteria (e.g., slight length variation may cause the V3/V4 domains to be slightly longer or shorter). I do not know off the top of my head how much variation there is in these domains, and it is probably not very extreme, but even a 1 nt offset is enough to cause two otherwise identical sequences to become separate features. So trimming a fixed N nucleotides (e.g., 515 minus 341) to approximate the position could land you in hot water...

You can instead trim at the actual primer sites (in one or both datasets) with q2-cutadapt trim-single or trim-paired. Then denoise with dada2, then merge as @jairideout has suggested.
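
To illustrate why a fixed offset can split identical sequences while primer-based trimming does not, here is a minimal sketch. The primer and reads are toy strings invented for the example (not real 16S sequences), and `trim_at_primer` is a hypothetical helper that uses exact matching only; the real cutadapt handles mismatches and degenerate bases.

```python
from typing import Optional


def trim_fixed(read: str, n: int) -> str:
    """Drop the first n bases, regardless of what they are."""
    return read[n:]


def trim_at_primer(read: str, primer: str) -> Optional[str]:
    """Drop everything up to and including the first exact primer match.

    Returns None when the primer is not found (toy behavior, an assumption).
    """
    start = read.find(primer)
    if start == -1:
        return None
    return read[start + len(primer):]


PRIMER = "GTGCCAGC"          # toy primer, not the real 515F sequence
DOWNSTREAM = "TACGGAGGATCC"  # region that is identical in both organisms

# Two organisms whose upstream region differs in length by 1 nt.
read_a = "ACGTACGTAC" + PRIMER + DOWNSTREAM
read_b = "ACGTACGTACG" + PRIMER + DOWNSTREAM

# Fixed offset chosen from read_a: 10 upstream bases plus the primer.
offset = 10 + len(PRIMER)
fixed_a = trim_fixed(read_a, offset)
fixed_b = trim_fixed(read_b, offset)

primer_a = trim_at_primer(read_a, PRIMER)
primer_b = trim_at_primer(read_b, PRIMER)

print(fixed_a == fixed_b)    # False: the 1 nt shift leaves a stray base
print(primer_a == primer_b)  # True: both start right after the primer
```

With fixed trimming, the second read keeps one leftover primer base, so a denoiser would treat the two as distinct features.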

You can also check out this post from a user who wants to perform what sounds like the same analysis. There are a few different options (trim to the same primer pair, collapse on taxonomy, or use q2-fragment-insertion) to compare datasets. Trimming is possibly the easiest and possibly also the best (depending on your analysis goals).

I hope that helps!

Lu_Yang · February 17, 2018, 2:19am

Hi, @Nicholas_Bokulich,

Currently, I do not think taxonomy assignment performs well after qiime fragment-insertion.
I have processed the reads from the two primer sets, and the results do not satisfy me. Different primers produce different OTUs, with no overlap between them, and I still do not know why. My samples should be similar, since they are from the same source. When I collapse the taxonomy to genus level, the genus-level results make sense.
To my understanding, it is unreasonable for the OTU tables to share nothing.
Please let me know what you think. Maybe I did some step wrong.
Thanks in advance.
Best

Nicholas_Bokulich · February 17, 2018, 3:11pm

Hi @Lu_Yang,

Thanks for the advice regarding fragment insertion! I have never used that tool, but it is designed for analyzing different marker genes together in one analysis (any analysis that uses a phylogeny, that is). It is good to hear your observations regarding the actual usefulness of this tool.

This is where q2-cutadapt is useful when you have overlapping primer sets (such as @jbethany has). You can trim both datasets to the same primer sites, denoise each one separately with dada2, then merge. Identical sequence variants from the two datasets will then merge together.
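
As a rough sketch of why identical variants collapse on merge: with its default settings, q2-dada2 labels each feature with an MD5 hash of its sequence, so the same sequence gets the same ID in both runs. The code below is plain Python with toy data, not the QIIME 2 API; `merge_tables`, the sample names, and the sequences are all made up for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Dict

# A table maps feature ID -> {sample ID -> count}.
Table = Dict[str, Dict[str, int]]


def feature_id(sequence: str) -> str:
    """Hash a sequence to a stable ID, so equal sequences share an ID."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def merge_tables(*tables: Table) -> Table:
    """Union the features of several tables, summing counts on collisions."""
    merged: Dict[str, Dict[str, int]] = defaultdict(dict)
    for table in tables:
        for fid, sample_counts in table.items():
            for sample, count in sample_counts.items():
                merged[fid][sample] = merged[fid].get(sample, 0) + count
    return dict(merged)


# Toy sequences, assumed to be already trimmed to the same primer sites.
shared = "TACGGAGGATCC"
only_a = "TACGGAGGATCA"
only_b = "TACGTAGGGTGC"

run_a: Table = {feature_id(shared): {"A1": 10}, feature_id(only_a): {"A1": 3}}
run_b: Table = {feature_id(shared): {"B1": 7}, feature_id(only_b): {"B1": 5}}

merged = merge_tables(run_a, run_b)
print(len(merged))                 # 3 features, not 4
print(merged[feature_id(shared)])  # {'A1': 10, 'B1': 7}
```

If the two runs were not trimmed to the same start and end, the shared organism would hash to two different IDs and the merged table would show no overlap.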

jbethany · February 17, 2018, 9:22pm

Thanks, Nicholas! I'll try out q2-cutadapt trim-paired; I was having the exact problem you mentioned where the nucleotide positions for the primers are not exact and slight variations were causing identical sequences to be labeled separately.

colinbrislawn · February 17, 2018, 11:47pm

Good afternoon,

This is true of the dada2 denoising algorithm; it's very sensitive to small biological differences, and also to small technical variation. The older OTU picking methods are much less sensitive to this: at the default 97% similarity threshold (3% difference), sequences that differ by up to 0.03 * 300 bp = 9 bp can end up in a single OTU.

High precision methods may not be a good fit for a messy data set. Greedy heuristic clustering lacks accuracy and precision, but its flexibility and tolerance make it a reasonable solution for funky data sets.
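
To make the 97% arithmetic concrete, here is a small sketch. It assumes ungapped, equal-length comparison (a simplification: real OTU pickers align sequences and cluster around centroids), and all function names are hypothetical.

```python
def max_mismatches(length_bp: int, identity_percent: int) -> int:
    """Largest number of mismatched positions allowed at a given identity."""
    return length_bp * (100 - identity_percent) // 100


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def same_otu(a: str, b: str, identity_percent: int = 97) -> bool:
    """Toy clustering rule: ungapped identity at or above the threshold."""
    return hamming(a, b) <= max_mismatches(len(a), identity_percent)


# Translation table that turns every base into a different base.
SWAP = str.maketrans("ACGT", "CGTA")


def mutate_prefix(seq: str, n: int) -> str:
    """Return seq with its first n bases changed."""
    return seq[:n].translate(SWAP) + seq[n:]


reference = "ACGT" * 75                  # toy 300 bp sequence
nine_off = mutate_prefix(reference, 9)   # 9 mismatches
ten_off = mutate_prefix(reference, 10)   # 10 mismatches

print(max_mismatches(300, 97))        # 9
print(same_otu(reference, nine_off))  # True: within 97% identity
print(same_otu(reference, ten_off))   # False: just over the limit
print(reference == nine_off)          # False: exact matching splits them
```

The last line is the denoising case: any difference at all, including a 1 nt trimming offset, yields a separate feature.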

Colin

Lu_Yang · February 18, 2018, 2:41am

Hi, @Nicholas_Bokulich,

I have another concern. I have also tried trimming the longer reads to match the shorter ones (515F-806R). However, the result is not what I hoped for: the trimmed samples all have NA taxonomy. I still cannot understand why.

Another concern. To my understanding, after analysis with Deblur or DADA2 we get a feature table in which every sequence is unique, and each unique sequence is assigned to one taxon.

However, suppose we assign taxonomy to a feature sequence obtained with the longer primer set and to another obtained with the 515-806 primer set, and the two overlap exactly (the shorter one is identical to part of the longer one). Why can they not be assigned to the same OTU?

antgonza · February 18, 2018, 2:25pm

Hello,

Thought I'd jump in here; I hope these suggestions help.

Anyway, at this stage in our field it is pretty hard to know a priori the effect size of the different parts of a study, and even more so for meta-analyses; everything matters, from sample collection to wet lab and sequence processing. However, we are getting better at this, so before continuing I would suggest checking these 2 papers: Meta-analyses of studies of the human microbiota, and Tiny microbes, enormous impacts: what matters in gut microbiome studies?. At this stage, my suggestion for meta-analyses would be to not change anything in the sequence processing (even more so if the data come from different primers); thus, I suggest using deblur (with fragment-insertion for the tree) or closed-reference picking with the same length. Note that this will not guarantee that you avoid a separation due to primer, but you might; it really depends on the effect sizes in your datasets.

Now, answering your question and why, IMO, using different lengths is complicated. Imagine that you have 5 denoised sequences:

AACC
AACT
AACTT
AACCT
AACCC

Perhaps it will make sense to merge 2 and 3, but how will you merge 1 with 4 and 5? Furthermore, how can you be sure there are no other possibilities in nature for those sequences? For example, for 2 and 3, can you be sure that AACTA and AACTC will never occur, so that it is fine to merge 2 and 3?
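
A quick sketch of that ambiguity, using the five toy sequences above (numbered 1 to 5 in the order listed). The helper name `longer_matches` is hypothetical; it just lists which longer sequences start with a given shorter one.

```python
from typing import List

# The five denoised sequences from the example, in order (1 to 5).
sequences = ["AACC", "AACT", "AACTT", "AACCT", "AACCC"]


def longer_matches(short: str, candidates: List[str]) -> List[str]:
    """Return every strictly longer candidate that begins with `short`."""
    return [c for c in candidates if len(c) > len(short) and c.startswith(short)]


for index, seq in enumerate(sequences, start=1):
    matches = longer_matches(seq, sequences)
    if len(matches) == 1:
        verdict = "one candidate, merge looks possible"
    elif len(matches) > 1:
        verdict = "ambiguous, several candidates"
    else:
        verdict = "no longer sequence starts with it"
    print(index, seq, matches, verdict)
```

Sequence 2 (AACT) has a single longer match (AACTT), while sequence 1 (AACC) matches both AACCT and AACCC, so there is no principled way to decide which one its counts belong to.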

My 2 pesos.

Lu_Yang · February 19, 2018, 3:40am

Hi, @antgonza

Thanks for the explanation. It makes much more sense now.

system · March 22, 2018, 9:40am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.