Deblur - merging multiple deblur analyses

Hi,

I am using Deblur for denoising and wonder whether anybody knows if it is appropriate to merge the outputs from multiple Deblur runs.

I have the situation where I have already run Deblur on 2500 sequence libraries, and now am adding some further data (600 sequence libraries).

Do I need to:

a. Re-deblur the whole lot together (2500 + 600 libraries) or
b. Deblur the new 600 and merge with the results of the previously deblur’d 2500?

Thanks in advance to anyone who can provide insight.

Christina

Hi @Christina_Adler,

Welcome!

You can just deblur the new data and merge. The deblur algorithm produces an estimate based on an upper-bound error profile, so the protocol is identical across all runs. (You could also merge if you had run DADA2 or a closed-reference OTU-picking protocol.) This does assume that you have the same read length: different read lengths won't deblur and combine well. So, if you trimmed the first 2500 to 100 bp, for instance, the new set must also be trimmed to 100 bp, even if you have 300 bp reads.
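To make the merge-and-trim logic concrete, here is a minimal Python sketch (not the QIIME 2 implementation; in practice you would use `qiime feature-table merge`). It treats each feature table as a plain dict of sample ID to per-ASV counts; all sample IDs, sequences, and lengths below are hypothetical toy values.

```python
# Toy model of merging two deblurred feature tables.
# A "table" here is {sample_id: {asv_sequence: count}}.

def check_trim_lengths(*tables):
    """All ASVs must share one trim length for a merge to be meaningful."""
    lengths = {len(asv)
               for table in tables
               for counts in table.values()
               for asv in counts}
    if len(lengths) > 1:
        raise ValueError(f"mixed trim lengths: {sorted(lengths)}")

def merge_feature_tables(table_a, table_b):
    """Merge two tables; sample IDs must not overlap between runs."""
    overlap = set(table_a) & set(table_b)
    if overlap:
        raise ValueError(f"duplicate sample IDs: {sorted(overlap)}")
    merged = dict(table_a)
    merged.update(table_b)
    return merged

run1 = {"sampleA": {"ACGTACGTAC": 12}}                      # toy 10 bp trim
run2 = {"sampleB": {"ACGTACGTAC": 5, "TTGTACGTAC": 3}}      # same trim length
check_trim_lengths(run1, run2)   # passes: one shared length
merged = merge_feature_tables(run1, run2)
print(len(merged))               # 2 samples in the merged table
```

The key point the sketch encodes: merging is just a union over samples, which is only valid when both runs produced features over the same trimmed length.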

The one caveat has to do with hashed sequences. If you have hashed identifiers, there is a small possibility that two unique sequences can produce the same hash ID. I would just double-check the hash size.
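A quick way to act on this caveat: when merging rep sets keyed by MD5 hashes, verify that no two distinct sequences share a hash ID. This is a hedged sketch, not a QIIME 2 command; the sequences are toy examples.

```python
import hashlib

def md5_id(seq):
    """MD5 hexadecimal digest of a sequence, as used for hashed feature IDs."""
    return hashlib.md5(seq.encode("ascii")).hexdigest()

def merge_rep_seqs(*rep_sets):
    """Merge iterables of sequences into {hash_id: seq}, raising on any collision."""
    merged = {}
    for rep_set in rep_sets:
        for seq in rep_set:
            h = md5_id(seq)
            if h in merged and merged[h] != seq:
                raise ValueError(f"hash collision on {h}")
            merged[h] = seq
    return merged

# Identical sequences from different runs collapse to one feature;
# a true collision (same hash, different sequence) would raise instead.
reps = merge_rep_seqs(["ACGTACGTAC", "TTGTACGTAC"], ["ACGTACGTAC"])
print(len(reps))  # 2
```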

You will likely want to merge your rep sets, though, and build a new tree. I think all the current tree-building algorithms depend on the sequences, so that step you will need to redo.

You may also want to do a new taxonomic classification. (Although I suspect there's a way to just merge the taxonomic calls; maybe try the q2-metadata plugin?)

Best,
Justine

Hi @Christina_Adler,
Just to add to @jwdebelius's answer: for the two runs to be comparable at the ASV level, they need to cover the same region and your trimming parameters need to be identical; otherwise the same sequences will be assigned as unique features due to differences in their lengths.
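This point is easy to demonstrate: the same read trimmed to two different lengths hashes to two different feature IDs, so the runs would carry it as two separate ASVs. A toy sketch (the sequence and trim lengths are made up):

```python
import hashlib

read = "ACGTACGTACGTACGTACGT"  # hypothetical raw read

# The same read trimmed with different parameters:
id_short = hashlib.md5(read[:10].encode("ascii")).hexdigest()
id_long = hashlib.md5(read[:15].encode("ascii")).hexdigest()

print(id_short != id_long)  # True: treated as two distinct features
```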


Hi,

Thanks @jwdebelius and @Mehrbod_Estaki for your replies, they are super useful.

Apologies for my delayed reply, just been off sick.

It will be run on the same 16S region/primers and I will use the same cut-offs, so fingers crossed that will all be fine, and I will check the hash size. Thanks for those pointers.

All the best

Christina


Hi,

Thanks to all, including the original poster, for the great discussion so far. It's nice confirmation for me, as I am working through a similar workflow on two datasets that were deblurred independently, and I want to merge them prior to downstream analyses.

I concur with the suggestions from @Mehrbod_Estaki and @jwdebelius regarding trim length and amplicon region, so as not to inflate ASV numbers.

But my question relates to what you mean by "hashed sequences"/"hashed identifiers". This vocabulary is new to me in the context of 16S and deblur workflows, so I just want to make sure I'm not overlooking something. I did not receive any errors in the QIIME 2 2017.12 steps of merging OTU tables and rep seqs independently, and feature numbers are consistent between the ASV tables and rep-seqs artifacts.

Thanks in advance for any insight on “hashed sequences”/“hashed identifiers” and relating to deblur/16S general workflow.

Best,

Mark R.

Hi @mcreyno2,

First, welcome to the community!

In the old days we'd use identifier numbers (e.g., OTU 12425). Now it's common, at least in QIIME, to use either the sequence itself as a name or an MD5 hash of the sequence as a hexadecimal digest (a hashed identifier), because it turns 150 or 400 characters into a fixed 32-character string. (Someone else might be able to explain the exact algorithm better, but this is the basic idea.)
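The hashed-identifier idea in a few lines of Python: `hashlib.md5` maps a sequence of any length to a fixed 32-character hexadecimal digest. The read lengths below are just illustrative.

```python
import hashlib

# MD5 turns reads of any length into a fixed-size hexadecimal ID.
for seq in ("A" * 150, "A" * 400):
    digest = hashlib.md5(seq.encode("ascii")).hexdigest()
    print(len(digest))  # 32 both times, regardless of read length
```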

It's rare that a small table will give you the merge issue, since MD5 collisions are extremely unlikely in a 128-bit hash space. MD5 is still broadly used as a checksum or other verification for many data types, so in most use cases it's not a problem. If you plan to do a big meta-analysis or you're designing a database, though, you might want to keep the original sequence as the ID, because that guarantees a unique identifier. In your case, it sounds like it was fine.
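To put a rough number on "rare": treating MD5 outputs as random values over a 128-bit space, the birthday bound gives a collision probability of about n²/2¹²⁹ for n features. A back-of-envelope sketch:

```python
# Birthday-bound estimate of an MD5 collision among n feature sequences,
# assuming hash outputs behave like uniform random 128-bit values.
n = 10**6                 # a million ASVs, far more than a typical study
p = n**2 / 2**129         # approximate collision probability
print(p < 1e-20)          # True: vanishingly small for realistic tables
```

This is why the collision caveat matters mainly for very large meta-analyses or databases, not routine studies.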

And, if you haven't updated yet, I'd recommend the 2019.1 or 2019.4 release: there are lots of shiny new features and fun architecture.

Best,
Justine
