Merging data from different runs and plates

Alex_14262 · March 15, 2018, 10:28am

Hello,

I have DNA sequences that have been sequenced in two different runs, coming from two different 96 well plates. However, each plate was also sequenced twice so that in total I have: plate1-run1, plate1-run2, plate2-run1, and plate2-run2. Now I have some questions regarding the merging phase.

I have been reading around the forum for a bit and looking at the tutorials to try and figure out the best way to do this. At the moment I am waiting for DADA2 to finish, and then I was planning to merge the two runs from each plate using the 'sum' option in feature-table merge (since I will have the same sample IDs for the different runs of the same plate), and then merge the resulting tables with the 'error_on_overlapping_sample' option (as the samples between plates are different). So in the end I would have one big table for all the features.
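A minimal sketch of that two-step merge, written in plain Python rather than QIIME 2 itself. The table layout (`{sample_id: {feature_id: count}}`), the function name `merge_tables`, and the sample and feature IDs are assumptions made for illustration; only the two overlap-method names come from the post above.

```python
def merge_tables(tables, overlap_method="error_on_overlapping_sample"):
    """Merge tables shaped like {sample_id: {feature_id: count}} (hypothetical layout)."""
    if overlap_method not in ("sum", "error_on_overlapping_sample"):
        raise ValueError(f"unsupported overlap method: {overlap_method}")
    merged = {}
    for table in tables:
        for sample_id, counts in table.items():
            # In strict mode, a sample ID seen in an earlier table is an error.
            if sample_id in merged and overlap_method == "error_on_overlapping_sample":
                raise ValueError(f"sample {sample_id!r} appears in more than one table")
            target = merged.setdefault(sample_id, {})
            for feature_id, count in counts.items():
                target[feature_id] = target.get(feature_id, 0) + count
    return merged


# Made-up counts: same sample IDs within a plate, different IDs between plates.
plate1_run1 = {"p1.s1": {"f1": 5, "f2": 3}}
plate1_run2 = {"p1.s1": {"f1": 2, "f3": 4}}
plate2_run1 = {"p2.s1": {"f2": 7}}
plate2_run2 = {"p2.s1": {"f2": 1}}

# Step 1: sum the two runs of each plate.
plate1 = merge_tables([plate1_run1, plate1_run2], overlap_method="sum")
plate2 = merge_tables([plate2_run1, plate2_run2], overlap_method="sum")

# Step 2: combine plates, failing loudly if a sample ID is shared by mistake.
all_samples = merge_tables([plate1, plate2])
print(all_samples)
```

Doing the second step in strict mode is what gives the sanity check: if a sample ID unexpectedly shows up on both plates, the merge stops instead of silently adding counts.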

Another question: having looked at the data prior to DADA2, I spotted one or two outliers in terms of number of sequences per sample, so I think I should remove these. Should I do it before merging the tables? Could it affect the reliability of the data in any way (especially when it comes to adding up OTUs with the sum option)?

Just wanted to double check these facts as I wasn't sure I got it right from the different posts.

Many thanks

ebolyen · March 15, 2018, 9:40pm

Hi @Alex_14262,

I like your merging strategy. You can technically use sum and pass all four, but by doing it in two steps, you can be sure that your samples were split between the plates like you expect.

You absolutely can, you can also let rarefaction take care of it implicitly depending on what analysis you are doing.

Maybe. How much feature/OTU overlap are you expecting between the runs?
Since the counts are per-sample compositions, you could theoretically end up in this position:

Run 1
      f1  f2  f3
s1    0   10  19    = 29
s2    50  11  93    = 154
s3    22   1   0    = 23

Run 2
      f1  f2  f3
s1   100 100 190    = 390
s2    51   9  94    = 154
s3     0   1   0    = 1

Adding s1 together, you are adding a composition that sums to 29 to a composition that sums to 390, which may not be a problem after rarefaction. But because f1 isn't observed at all in Run 1's s1, you end up combining roughly (0%, 34%, 66%) with roughly (25%, 25%, 50%), which strikes me as problematic.

Adding s2 together isn't a problem as far as I know.

Run 2 of s3 kind of looks like a bad sample, so in a way you've cheated Run 1's s3 out of "half" its features.
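The arithmetic behind these three cases can be checked with a few lines of Python. This is only a sketch using the counts from the two tables above; the helper name `proportions` is made up for the example.

```python
# Counts per sample in the order (f1, f2, f3), copied from the tables above.
run1 = {"s1": [0, 10, 19], "s2": [50, 11, 93], "s3": [22, 1, 0]}
run2 = {"s1": [100, 100, 190], "s2": [51, 9, 94], "s3": [0, 1, 0]}


def proportions(counts):
    """Return each count as a percentage of the sample total, to one decimal."""
    total = sum(counts)
    return [round(100 * c / total, 1) for c in counts]


for sample in run1:
    summed = [a + b for a, b in zip(run1[sample], run2[sample])]
    print(sample)
    print("  run 1:", sum(run1[sample]), proportions(run1[sample]))
    print("  run 2:", sum(run2[sample]), proportions(run2[sample]))
    print("  sum:  ", sum(summed), proportions(summed))
```

For s1 the summed composition is dominated by the deeper run, for s2 the two runs agree closely, and for s3 the second run contributes almost nothing.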

I'm honestly not super certain how you should handle this. @Nicholas_Bokulich, any suggestions?

Another option would be to keep the runs as separate sample IDs initially, so r1.s1 and r2.s1 in the above example. Then after rarefaction, you could use group and set the mode to sum, or you could use median-ceiling/mean-ceiling, which should avoid the issue.
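To make the three grouping modes concrete, here is a rough sketch in plain Python, not the QIIME 2 implementation. The table layout, the function name `group_samples`, and the counts are assumptions; the example pretends both replicates have already been rarefied to a depth of 20.

```python
import math
from statistics import mean, median


def group_samples(table, groups, mode="sum"):
    """Collapse samples that share a group ID.

    table:  {sample_id: {feature_id: count}}  (hypothetical layout)
    groups: {sample_id: group_id}, e.g. taken from a metadata column
    """
    reducers = {
        "sum": sum,
        "mean-ceiling": lambda values: math.ceil(mean(values)),
        "median-ceiling": lambda values: math.ceil(median(values)),
    }
    reduce_counts = reducers[mode]
    members = {}
    for sample_id, counts in table.items():
        members.setdefault(groups[sample_id], []).append(counts)
    features = sorted({f for counts in table.values() for f in counts})
    return {
        group_id: {f: reduce_counts([row.get(f, 0) for row in rows]) for f in features}
        for group_id, rows in members.items()
    }


# Two replicates of the same sample, both at depth 20 (made-up counts).
rarefied = {
    "r1.s1": {"f2": 7, "f3": 13},
    "r2.s1": {"f1": 5, "f2": 5, "f3": 10},
}
groups = {"r1.s1": "s1", "r2.s1": "s1"}

print(group_samples(rarefied, groups, mode="sum"))
print(group_samples(rarefied, groups, mode="mean-ceiling"))
```

Because both replicates sit at the same depth here, neither one dominates the grouped result, which is the point of rarefying before grouping.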

Nicholas_Bokulich · March 15, 2018, 10:18pm

I like this approach, at least for initial data exploration. By keeping runs separate, you can see how much samples differ from each other on each run, e.g., is there a significant difference in alpha/beta diversity?

(you could actually use pairwise-differences in q2-longitudinal to do pairwise testing on each sample to see whether batch has an effect on duplicated samples, though that will only detect a strong, directional impact).

If there is no apparent batch effect, merge without worry. I would only judge outliers based on differences in alpha/beta diversity, not on sequence counts. I would remove those (potentially both samples, but more likely just the replicate that is an outlier from other samples) prior to merging.

I would not worry about differences in sequence count, unless any of these have very low coverage (under 1000 reads? 500? You'll need to decide what's too low, and alpha rarefaction can help). Normalization will help control differences in sequencing depth at downstream steps.

Alex_14262 · March 16, 2018, 11:14am

Hello,

Thanks @ebolyen and @Nicholas_Bokulich for the help!

I thought the only thing rarefaction did was to let you check whether you sampled the environment enough (or whether the sequences/OTUs you have are representative of the environment in a way)? How would rarefaction take care of the outliers? The only output you get is a .qzv file as far as I know anyway, so I wouldn't be able to use it later in the analysis.

The different runs of the same samples should be quite similar in terms of OTUs (that's what I expect anyway), and there shouldn't really be batch effects, because the same samples were sequenced twice only because the DNA quantity wasn't that great. Plus, the protocol used is exactly the same. However, I have one sample which has ~7 million sequences, when the rest have around 250000 in one run. Then the exact same sample in the second run has the lowest value, ~4000 sequences when the previous lowest is around 11000, so it just made me question what's going on there.

As for keeping the samples different, how does that work? I read about it in another post but I don't really know how to implement it. Do I need to create new IDs for the second run and then have a category called run which can take two values? I would imagine I then need to change the manifest, reimport the data, and run DADA2 again.

Finally, when using the group command, would I be interested in grouping along samples or features? Following the example in the previous paragraph, I would imagine I have to choose "samples" and then specify "Run" for metadata-category so that it's grouped based on runs?

Thanks

Nicholas_Bokulich · March 16, 2018, 2:39pm

You are conflating alpha rarefaction with rarefying. Alpha rarefaction would not take care of any outliers. Normalization (rarefying) will control for uneven sampling effort, so the fact that you have some samples that have many many more sequences will not be an issue. Rarefying is the form of normalization currently supported for alpha and beta diversity analyses — it is built into the core-metrics pipeline, and is set with the sampling-depth parameter. Alpha rarefaction helps you decide what sequencing depth to use.
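As a rough illustration of what rarefying does to a table (a sketch only, not the QIIME 2 code): every sample is randomly subsampled without replacement to the same depth, and samples with fewer reads than that depth are dropped. The function name `rarefy`, the table layout, and the counts are assumptions made for the example.

```python
import random
from collections import Counter


def rarefy(table, sampling_depth, seed=0):
    """Subsample each sample to sampling_depth reads without replacement.

    table: {sample_id: {feature_id: count}}  (hypothetical layout)
    Samples with fewer than sampling_depth reads are left out of the result.
    """
    rng = random.Random(seed)
    rarefied = {}
    for sample_id, counts in table.items():
        # Expand counts into one entry per read; fine for a toy example,
        # far too memory-hungry for millions of reads.
        pool = [feature for feature, n in counts.items() for _ in range(n)]
        if len(pool) < sampling_depth:
            continue
        rarefied[sample_id] = dict(Counter(rng.sample(pool, sampling_depth)))
    return rarefied


table = {
    "deep": {"f1": 7000, "f2": 3000},
    "typical": {"f1": 150, "f2": 100},
    "shallow": {"f1": 30, "f2": 10},
}
rarefied = rarefy(table, sampling_depth=100)
for sample_id, counts in rarefied.items():
    print(sample_id, sum(counts.values()), counts)
```

After this step the very deep sample and the typical one carry the same number of reads, which is why a large difference in raw sequence counts stops mattering downstream.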

There may be batch effects between run 1 and run 2. So even if you are running the same plates on each run (duplicating the samples on each plate), there could be significant batch effects that I would check for first. Yes, the fact that all the same samples are on both runs will cause any batch effects to "smooth" out. And yes, there probably are not significant batch effects. But there can be; I have seen some horrendous examples in my day. Analyze as separate samples to rule out batch effects, then merge and proceed.

Doesn't matter. Different protocols would introduce the sort of blatant batch effects that completely invalidate any sort of comparison whatsoever. So when I say batch effects, I am really only talking about the subtle sort. The sort that occurs when you sequence the same exact samples on two different runs and somehow the results are different... usually this does not occur, but sometimes things go wrong and any time you are combining/comparing data across multiple runs you always need to check just to rule out errors before you proceed. It's better than discovering an error several months down the line...

That is bizarre, particularly if you just sequenced the same library twice (i.e., samples were PCRed once, pooled, and then the pooled sample was sequenced twice, rather than samples being PCRed twice, once for each run). But it happens. Again, analyze separately and rarefy at 4000 sequences. Keep an eye on those two samples to see how they compare to the other samples of similar sample type. If all looks okay, then I would not worry about merging these samples in spite of different sampling depths.

Yes

@thermokarst has provided a neat trick here: you can use feature table group to relabel your samples. Just create a new metadata column in your metadata file with the new sample IDs and use that as your metadata-column. Each sample will be relabeled with the corresponding unique ID (and then you can turn that column into the #SampleID column in the new metadata file that you will use with that feature table).

This is much better than the complicated workaround that I provided below (but still see steps 3, 5, and 6 to learn about merging/analyzing these relabeled data):

Ouch. Yes, possibly. If you are adept with Python, there are ways to re-label the IDs in your feature table using biom. But I am not sure there is a way to do this on the command line. Probably easier than re-importing and re-denoising, you could:
1. export your feature table from run 2 to biom, convert to TSV
2. rename the sample IDs, e.g., to make s1 into s1-2 etc. If your sample IDs follow a consistent pattern it should be easy to rename with find/replace in a text editor.
3. create a copy of your metadata file and rename the IDs accordingly. Merge this with your normal metadata file into a master metadata file.
~~4. convert your feature table to biom, re-import to QIIME2~~
5. merge your feature tables from run 1 and run 2 (the relabeled table)
6. use qiime diversity core-metrics to analyze
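Step 2 of the list above (renaming sample IDs in the exported TSV) could be scripted instead of done by find/replace. The following is a sketch under assumptions: it expects the header row of the TSV to start with `#OTU ID`, and the function name `relabel_header` and the `-2` suffix are made up for the example. Check the header of your own export before relying on it.

```python
def relabel_header(tsv_text, suffix="-2", header_prefix="#OTU ID"):
    """Append suffix to every sample ID in the header row of a TSV feature table.

    Assumes the header row starts with header_prefix followed by a tab;
    comment lines and data rows are passed through unchanged.
    """
    lines = tsv_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(header_prefix + "\t"):
            fields = line.split("\t")
            lines[i] = "\t".join([fields[0]] + [sid + suffix for sid in fields[1:]])
            break
    else:
        raise ValueError("no header row found")
    return "\n".join(lines) + "\n"


# Toy table standing in for the exported run 2 feature table.
run2_tsv = (
    "# Constructed from biom file\n"
    "#OTU ID\ts1\ts2\n"
    "f1\t5\t0\n"
    "f2\t3\t9\n"
)
print(relabel_header(run2_tsv))
```

The same suffix then has to be applied to the sample IDs in the copied metadata file (step 3) so that table and metadata still match.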

Yes, group by samples

I hope that helps!

Alex_14262 · March 20, 2018, 6:34pm

Hi!

Thanks for the explanation and suggestions given!

I discussed this with my supervisor, and he thought that concatenating the two runs of the same 96-well plate prior to denoising might be more suitable. His concern is that if the merging occurs after OTU assignment, some lower-frequency sequences that are split between the two runs might not be grouped into the same OTU (as if there were an imaginary threshold below which the sequences wouldn't be grouped together), and would instead get assigned to other OTUs or get thrown away. Is that the case?

Nicholas_Bokulich · March 21, 2018, 3:14pm

If you are using dada2 or deblur (denoising), then you do not need to worry about this, and in the case of dada2 you do NOT want to merge prior to denoising. These methods derive SVs, and the feature IDs are based on the unique hash code of each sequence, so they will be merged with identical SVs from other runs that have the same parameters (primers, read length, processing pipeline).
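A small sketch of why hash-based feature IDs make per-run tables mergeable. It assumes the ID is an MD5 hex digest of the sequence string; treat that choice of hash, the function name `feature_id`, and the toy sequences as assumptions made for illustration.

```python
import hashlib
from collections import Counter


def feature_id(sequence):
    """Derive a feature ID from the sequence itself (MD5 assumed here)."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


# Toy sequence variants observed in one sample on two separate runs.
shared = "ACGTACGTGGCA"
run1 = {shared: 12, "TTGACCGTAAGC": 3}
run2 = {shared: 8}

run1_counts = Counter({feature_id(seq): n for seq, n in run1.items()})
run2_counts = Counter({feature_id(seq): n for seq, n in run2.items()})

# The same sequence hashes to the same ID in both runs, so counts add up.
merged = run1_counts + run2_counts
print(merged[feature_id(shared)])
```

The IDs only line up if both runs yield literally identical sequences, which is why the runs need the same primers, read length, and processing parameters.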

If you are using de novo OTU picking methods, then your supervisor is correct. But only if you are using OTU picking and not denoising. This does not apply to closed-reference or open-reference OTU picking (which pick OTUs against a reference database that you re-use for future batches).

Good luck!
