Merge different datasets

iordanis · June 19, 2023, 10:33am

Good evening to the whole community,
I have a few questions, and I apologize if they are basic. I want to combine 16S rRNA data from the human gut, drawn from different datasets generated by different methods.
So here are my questions:

1. What is the best approach for combining datasets obtained via different NGS library prep protocols?
2. Does the read length have to be the same? For example, one set is 2 x 150 bp paired-end and the other is 2 x 250 bp paired-end.
3. Is there any problem with merging datasets that were produced using different primer sets, e.g. (515F and 806R) vs. (336F and 806R)?
4. Which is the best pipeline step at which to merge the datasets? I was thinking after the dada2 step.

Thank you in advance for your time. I would greatly appreciate any suggestions!

colinvwood · June 20, 2023, 5:37pm

Hello @iordanis,

There will be biases introduced by different PCR protocols, different library preparations, different sequencing runs and technologies, different 16S regions, different classifiers and underlying databases, different dada2 parameters, and probably more.

Merging after dada2 is possible, and I would say probably the earliest point that makes sense. Depending on your downstream goals, you will run into a strange situation with taxonomic classification--the furthest I would merge before this step is within a primer set because many of the classifiers are trained on only one region. See here for more info about this specific point.

One approach to controlling for read length variation across datasets is to trim the longer reads to the length of the shorter ones. I'm sure this comes with drawbacks, and it only makes sense for datasets amplified with the same primer set.
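As a rough illustration of why a common read length matters, here is a toy Python sketch. It assumes features are identified by the MD5 hash of their sequence; the reads, the `truncate_reads` and `feature_id` helpers, and the 120-base cutoff are all made up for the example and are not part of any QIIME 2 API.

```python
import hashlib


def truncate_reads(reads, trunc_len):
    """Cut every read to trunc_len bases and drop reads shorter than that."""
    return [r[:trunc_len] for r in reads if len(r) >= trunc_len]


def feature_id(seq):
    """MD5 hex digest of a sequence, used here as an illustrative feature ID."""
    return hashlib.md5(seq.encode("ascii")).hexdigest()


# Hypothetical data: the same template sequenced at two read lengths.
template = "ACGTTGCA" * 40        # 320 bases
short_run = [template[:150]]      # stands in for a 2 x 150 run
long_run = [template[:250]]       # stands in for a 2 x 250 run

# Untruncated, the same organism shows up as two different sequences.
ids_raw = {feature_id(s) for s in short_run + long_run}

# Truncated to a common length, both runs yield the same sequence.
common_len = 120  # arbitrary value chosen for the example
ids_trunc = {
    feature_id(s)
    for s in truncate_reads(short_run, common_len)
    + truncate_reads(long_run, common_len)
}

print(len(ids_raw), len(ids_trunc))
```

The point of the sketch is only that identical organisms can end up as distinct features when read lengths differ, which is what truncating to a shared length avoids.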

Disclaimer: these are just suggestions/observations, and not a recommendation as to whether this sort of merged analysis should or should not be performed, or what sort of interpretations could be drawn. I'll leave this post queued so that others with more insight can reach out.

iordanis · June 21, 2023, 8:39am

Hello @colinvwood,
Thank you very much for the response!

gregcaporaso · June 21, 2023, 4:34pm

Hi @iordanis,
A couple of points in addition to those shared by @colinvwood. It's possible to do this, but you have to be very careful about these protocol variations.

First, if you plan to move forward with this, you should add metadata columns for the different protocol variables (e.g., read length, primer set, ...). This will enable you to visualize whether these differences are leading to systematic differences in your sample collection. For example, after generating PCoA plots, color by these variables and see if they cause separation. You will also be able to include terms for these variables in models where applicable, e.g., qiime composition ancombc or qiime longitudinal linear-mixed-effects.
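One way to add those columns is a small script over the sample metadata file. The sketch below assumes a tab-separated metadata file whose first column is `sample-id`; the column names `read-length` and `primer-set`, the sample IDs, and the `add_protocol_columns` helper are hypothetical choices for illustration.

```python
import csv
import io


def add_protocol_columns(metadata_tsv, protocol_by_sample):
    """Return metadata TSV text with protocol columns appended to each row.

    protocol_by_sample maps sample ID -> {column name: value}. A sample
    missing from the mapping raises KeyError, so no row is silently skipped.
    """
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    new_cols = sorted({c for p in protocol_by_sample.values() for c in p})
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=list(reader.fieldnames) + new_cols,
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        row.update(protocol_by_sample[row["sample-id"]])
        writer.writerow(row)
    return out.getvalue()


# Hypothetical two-sample metadata, one sample from each dataset.
metadata = "sample-id\tsubject\nS1\tA\nS2\tB\n"
protocols = {
    "S1": {"read-length": "2x150", "primer-set": "515F-806R"},
    "S2": {"read-length": "2x250", "primer-set": "336F-806R"},
}
updated = add_protocol_columns(metadata, protocols)
print(updated)
```

Writing the read length as text such as `2x150` rather than a bare number keeps the column categorical, which is convenient when coloring PCoA plots by it.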

On your points 1 and 4, I agree - after DADA2. Be sure to use the same trim/trunc parameters to minimize differences.
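In QIIME 2 the merge itself would normally be done with `qiime feature-table merge` (and `qiime feature-table merge-seqs` for the representative sequences). The toy Python sketch below only illustrates the logic of merging per-dataset tables after denoising; the nested-dict table layout, the `merge_tables` helper, and the feature and sample names are assumptions made for the example.

```python
def merge_tables(*tables):
    """Merge feature tables given as {sample_id: {feature_id: count}}.

    Raises ValueError if a sample ID appears in more than one table.
    """
    merged = {}
    for table in tables:
        for sample_id, counts in table.items():
            if sample_id in merged:
                raise ValueError(
                    f"sample {sample_id!r} appears in more than one table"
                )
            merged[sample_id] = dict(counts)
    # Union of all features, with zeros filled in for absent ones.
    features = sorted({f for counts in merged.values() for f in counts})
    return {s: {f: c.get(f, 0) for f in features} for s, c in merged.items()}


# Hypothetical tables from two datasets denoised separately.
table_a = {"A1": {"asv1": 10, "asv2": 5}}
table_b = {"B1": {"asv2": 3, "asv3": 7}}
merged = merge_tables(table_a, table_b)
print(merged)
```

Only `asv2` is shared here. Features line up across datasets only when their sequences are identical, which is why matching trim/trunc parameters matters.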

On your point 2, this read length difference is a bit problematic because 2x150 reads are often not long enough to merge with the two primer sets you mentioned. So, you'll need to trim the longer reads, and you'll also need to work with reads from only one direction. You can achieve both of these steps by passing the same --p-trunc-len value to qiime dada2 denoise-single for each dataset. Given this, you would have to use only your reverse reads, since both datasets share the reverse primer (806R); the forward reads start from different primers and otherwise wouldn't overlap at all. I think you could do this by importing only your reverse reads as single-end reads. If you want to go this route, let me know and I can provide some guidance. I'll just need to know if your data are multiplexed or demultiplexed (see notes on this here).
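For demultiplexed data, one possible way to import only the reverse reads is a single-end manifest that lists just those files. The Python sketch below builds such a manifest. It assumes Illumina-style file names containing `_R2_`, sample IDs taken from the text before the first underscore, and the `sample-id` / `absolute-filepath` header used by QIIME 2's single-end manifest format (`SingleEndFastqManifestPhred33V2`); the file paths and the `reverse_read_manifest` helper are hypothetical.

```python
import csv
import io
from pathlib import PurePosixPath


def reverse_read_manifest(fastq_paths, reverse_tag="_R2_"):
    """Build single-end manifest text listing only the reverse-read files.

    Paths are assumed to be absolute already. The sample ID is the file
    name up to the first underscore (an assumption about file naming).
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample-id", "absolute-filepath"])
    for path in sorted(PurePosixPath(p) for p in fastq_paths):
        if reverse_tag in path.name:
            writer.writerow([path.name.split("_")[0], str(path)])
    return out.getvalue()


# Hypothetical demultiplexed files from two runs.
files = [
    "/data/run1/S1_L001_R1_001.fastq.gz",
    "/data/run1/S1_L001_R2_001.fastq.gz",
    "/data/run2/S2_L001_R2_001.fastq.gz",
]
manifest = reverse_read_manifest(files)
print(manifest)
```

The resulting text could be saved to a file and passed to `qiime tools import` with the type `SampleData[SequencesWithQuality]`, after which each dataset would go through `qiime dada2 denoise-single` with the same truncation length.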

On your point 3, there will be amplification biases with the different primer pairs, so that is something you'll need to be aware of and account for. Start by adding a metadata variable describing this, as I suggested above.

iordanis · June 22, 2023, 8:34am

Hello @gregcaporaso,
Thank you for the advice and for the response. My data are all demultiplexed. If I import the reverse reads as single-end sequences, I will ask here again for your guidance.

mckin · July 13, 2023, 4:24pm

This is really helpful.
