Question about using deblur in meta analysis

anna-schrecengost · February 19, 2021, 6:28pm

Hi, I'm working on a meta-analysis of ~25 Illumina 18S rRNA amplicon datasets (all from different studies and different anoxic marine environments) and have a question about when it is appropriate to merge the data.

Since deblur runs a static error model, it should be fine to deblur all of the cleaned, trimmed, and merged sequences from all of the studies together. I am running V4 and V9 studies separately and making sure the data that I'm running together is of the same length and region of the gene. So my plan is to preprocess the reads outside of QIIME 2 and then import them all together to speed up and simplify my pipeline.

Is this reasonable? I have seen some examples in the literature of similar meta-analyses using deblur that denoise all of the studies separately and then merge them with merge-seqs (e.g. https://www.nature.com/articles/s41396-020-00814-9#MOESM3). I haven't seen examples of what I am planning to do and want to make sure everything I do is justified, especially since I am so new to bioinformatics.

Thank you!

jwdebelius · February 23, 2021, 4:08pm

Hi @anna-schrecengost,

I would recommend doing them separately, mostly for expediency. If you only have access to your local machine, it probably doesn't matter. If you have access to a server, EC2, or some other shared compute resource, it can be faster to denoise the studies in parallel with shared settings (I like snakemake for generating everything with the same parameters) and then merge. Merging first shouldn't change anything with deblur, but it may be more efficient. My one concern with deblur (as someone who admittedly doesn't work much with 18S) is checking that the post-denoising QC step works with 18S; you may need to dig into the docs further.
Personally, my recommendation is also to process in QIIME 2. (I would lose my head some days if it wasn't connected to my neck; even with notebooks, I'm not always sure exactly where things came from if my pipeline shifts slightly. YMMV.)
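
The "denoise each study with the same settings, then merge" workflow described above can be sketched as a small Python command builder. This is only an illustration under stated assumptions, not a tested pipeline: the study names, file paths, trim length, and reference file are all hypothetical, and it uses `qiime deblur denoise-other`, the action mentioned in the follow-up reply in this thread. The script only builds and prints the commands; it does not run QIIME 2.

```python
import shlex

# Hypothetical study IDs and shared settings; replace with your own.
STUDIES = ["study_a", "study_b", "study_c"]
SHARED = {"trim_length": 120, "reference": "ref/18s-reference-seqs.qza"}


def deblur_command(study, shared):
    """Build one `qiime deblur denoise-other` call for a single study."""
    return [
        "qiime", "deblur", "denoise-other",
        "--i-demultiplexed-seqs", f"{study}/filtered-seqs.qza",
        "--i-reference-seqs", shared["reference"],
        "--p-trim-length", str(shared["trim_length"]),
        "--o-table", f"{study}/table.qza",
        "--o-representative-sequences", f"{study}/rep-seqs.qza",
        "--o-stats", f"{study}/deblur-stats.qza",
    ]


def merge_commands(studies):
    """Build the two merge calls: feature tables and representative sequences."""
    merge_tables = ["qiime", "feature-table", "merge"]
    merge_seqs = ["qiime", "feature-table", "merge-seqs"]
    for study in studies:
        merge_tables += ["--i-tables", f"{study}/table.qza"]
        merge_seqs += ["--i-data", f"{study}/rep-seqs.qza"]
    merge_tables += ["--o-merged-table", "merged/table.qza"]
    merge_seqs += ["--o-merged-data", "merged/rep-seqs.qza"]
    return merge_tables, merge_seqs


# Every study gets exactly the same parameters from SHARED.
denoise_cmds = [deblur_command(s, SHARED) for s in STUDIES]
tables_cmd, seqs_cmd = merge_commands(STUDIES)

for cmd in denoise_cmds + [tables_cmd, seqs_cmd]:
    print(shlex.join(cmd))
```

Keeping the settings in one dictionary is the point of the sketch: each per-study job can run independently (for example as separate cluster jobs), while the trim length and reference cannot drift between studies.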

If you choose to combine the data with DADA2, you should process the studies separately with the same settings. DADA2 builds an error model based on the provided sequences, and wants to be run on a per-sequencing-run basis (in this case, probably per study, but possibly split further).
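
To make the per-run point concrete, here is a minimal Python sketch that groups samples by sequencing run before denoising, so that each DADA2 job learns its error model from a single run. The sample sheet, run labels, truncation lengths, and paths are invented for illustration, and the `qiime dada2 denoise-paired` call assumes raw paired-end reads that have already been split into one demultiplexed artifact per run. It only builds the commands; nothing is executed.

```python
from collections import defaultdict

# Hypothetical sample sheet: (sample_id, study, sequencing_run).
SAMPLES = [
    ("s1", "study_a", "run1"),
    ("s2", "study_a", "run1"),
    ("s3", "study_a", "run2"),
    ("s4", "study_b", "run1"),
]

# Shared settings applied to every run (placeholder values).
TRUNC_LEN_F = 200
TRUNC_LEN_R = 180


def group_by_run(samples):
    """Map each (study, run) pair to its sample IDs: one DADA2 job per group."""
    groups = defaultdict(list)
    for sample_id, study, run in samples:
        groups[(study, run)].append(sample_id)
    return dict(groups)


def dada2_command(study, run):
    """Build a `qiime dada2 denoise-paired` call for one sequencing run."""
    prefix = f"{study}/{run}"
    return [
        "qiime", "dada2", "denoise-paired",
        "--i-demultiplexed-seqs", f"{prefix}/demux.qza",
        "--p-trunc-len-f", str(TRUNC_LEN_F),
        "--p-trunc-len-r", str(TRUNC_LEN_R),
        "--o-table", f"{prefix}/table.qza",
        "--o-representative-sequences", f"{prefix}/rep-seqs.qza",
        "--o-denoising-stats", f"{prefix}/stats.qza",
    ]


groups = group_by_run(SAMPLES)
commands = {key: dada2_command(*key) for key in groups}

for key, cmd in commands.items():
    print(key, groups[key], " ".join(cmd))
```

Note that a study sequenced across two runs (study_a here) yields two separate jobs, which is the "possibly more than per-study" case.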

I would actually also recommend considering OTU clustering at a high identity against a consistent database. It's not perfect, but it lets you scaffold everything more easily.

Best,
Justine

jwdebelius · February 24, 2021, 4:20pm

One of the other awesome mods pointed out that to deblur 18S data, you specifically need to use the denoise-other function.

Best,
Justine
