Merging outputs from Q2-DADA2

nerdynella · August 7, 2017, 6:25pm

I have split my HiSeq data into smaller chunks to run as a batch/array job to save time.
Before I proceed: will it be possible to merge the generated repseq.qza and table.qza files into a single repseq.qza and a single table.qza, respectively, or alternatively to extract and merge the contents of each generated .qza file?
Thanks,
Nsa

thermokarst · August 8, 2017, 12:58pm

Hi @nerdynella!

Yes!! Check out the "FMT Tutorial" in the docs, particularly this section. One thing to note: if you split your data into, say, 4 groups, you would need to run the merge command 3 (4-1) times:

$ qiime feature-table merge \
  --i-table1 table-1.qza \
  --i-table2 table-2.qza \
  --o-merged-table merged-table.qza
$ qiime feature-table merge \
  --i-table1 merged-table.qza \
  --i-table2 table-3.qza \
  --o-merged-table merged-table.qza
$ qiime feature-table merge \
  --i-table1 merged-table.qza \
  --i-table2 table-4.qza \
  --o-merged-table merged-table.qza
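
If you have more than a handful of chunks, the repeated merge above can be scripted. This is only a minimal sketch: merge_pairwise is a hypothetical helper, not a QIIME 2 command, and QIIME_CMD is an assumption added so the loop can be dry-run with QIIME_CMD=echo. The "seqs" branch uses the merge-seq-data flags as I understand them from the FMT tutorial for this release, so please confirm them with qiime feature-table merge-seq-data --help before relying on it.

```shell
# Hypothetical helper (not part of QIIME 2): fold any number of artifacts
# into one by repeating a pairwise merge.
# Assumptions: QIIME_CMD exists only so the loop can be dry-run with
# QIIME_CMD=echo; the merge-seq-data flag names should be checked with --help.
merge_pairwise() {
  if [ "$#" -lt 4 ]; then
    echo "usage: merge_pairwise table|seqs OUT.qza IN1.qza IN2.qza [IN3.qza ...]" >&2
    return 1
  fi
  local kind="$1" out="$2"
  shift 2
  local action in1 in2 outflag
  case "$kind" in
    table) action=merge;          in1=--i-table1; in2=--i-table2; outflag=--o-merged-table ;;
    seqs)  action=merge-seq-data; in1=--i-data1;  in2=--i-data2;  outflag=--o-merged-data ;;
    *) echo "unknown kind: $kind" >&2; return 1 ;;
  esac
  local current="$1"
  shift
  local next
  for next in "$@"; do
    # Stop at the first failed merge instead of merging into a stale result.
    ${QIIME_CMD:-qiime} feature-table "$action" \
      "$in1" "$current" \
      "$in2" "$next" \
      "$outflag" "$out" || return 1
    current="$out"   # later rounds merge into the running result
  done
}
```

For the 4-group case this would be called as merge_pairwise table merged-table.qza table-1.qza table-2.qza table-3.qza table-4.qza, which issues the same three merges as the commands above. The representative sequences would be handled the same way with "seqs" as the first argument.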

Hope that helps!

EDIT: There is an open issue on the bug tracker to support variadic inputs, which would theoretically allow a method like feature-table merge to merge multiple tables at the same time.

Splitting these data might have implications for denoising, so I will ping @benjjneb (DADA2) and @wasade (deblur) to see if they have anything to say on the matter. Thanks!

benjjneb · August 8, 2017, 1:31pm

When using exact sequence variant methods it is fine to process subsets of the samples independently.

On the DADA2 side, you want each subset to have enough reads to estimate the error rates accurately, but a HiSeq run has far more than enough reads for that purpose, so splitting is A-OK.

nerdynella · August 8, 2017, 1:41pm

Excellent! Thank you @thermokarst and @benjjneb for your quick responses.
Cheers,
Nsa

wasade · August 8, 2017, 4:31pm

Hey @nerdynella, you do not need to split HiSeq data for q2-deblur. It splits internally and processes each sample using a static error model that is not subject to run-to-run variation. For context, Deblur on the American Gut dataset, which spans 15,000 samples from around 50 MiSeq runs, takes 8 hours using 10 cores.

nerdynella · August 8, 2017, 4:44pm

Thank you @wasade. How do I handle paired-end (PE) reads with Deblur?

wasade · August 8, 2017, 5:31pm

Deblur is agnostic to whether reads are joined; if you want joined reads, join them upstream of Deblur.

It is not clear whether joining reads is a benefit or a detriment in amplicon studies, and I'm not aware of an independent benchmarking study that has explored this. Recall that genus-level differentiation using naive Bayes is not great even with longer reads (Wang et al. 2007). In addition, joining greatly increases the number of errors because the reverse read is lower quality, it reduces the number of reads per sample due to quality filtering, and misassembly is possible.

nerdynella · August 8, 2017, 10:54pm

Thank you @wasade, I'll explore Deblur using my forward reads and compare the results to those from DADA2 PE.
cheers,
Nsa

nerdynella · August 9, 2017, 2:37pm

@wasade, how do I specify the number of threads for q2-deblur? Or does it automatically use all available threads?
Thanks,
Nsa

wasade · August 9, 2017, 5:11pm

You can specify the number of threads/jobs with --p-jobs-to-start.
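
As a rough illustration of where that option goes, here is a minimal sketch. run_deblur is a hypothetical wrapper, and demux.qza, the trim length of 120, and the deblur-output directory are placeholders rather than recommendations. QIIME_CMD is an assumption added only so the command can be dry-run with QIIME_CMD=echo.

```shell
# Hypothetical wrapper: run Deblur with an explicit number of parallel jobs.
# Assumptions: demux.qza, the trim length of 120, and the deblur-output
# directory are placeholders; QIIME_CMD exists only to allow a dry run.
run_deblur() {
  local jobs="${1:-1}"   # jobs to start; this wrapper falls back to 1
  ${QIIME_CMD:-qiime} deblur denoise-16S \
    --i-demultiplexed-seqs demux.qza \
    --p-trim-length 120 \
    --p-jobs-to-start "$jobs" \
    --output-dir deblur-output
}
```

Calling run_deblur 10 would start 10 jobs. Using --output-dir rather than naming each output avoids depending on which outputs a given release produces.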

Best,
Daniel

ebolyen · December 22, 2017, 5:35pm

In the new QIIME 2 2017.12 release, feature-table merge can now accept arbitrarily many tables to merge!
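
With that release, the earlier 4-table case collapses into a single call that repeats --i-tables once per table. The sketch below wraps that in a hypothetical helper, merge_all_tables, so the table list can come from a glob or a job array. The helper name and QIIME_CMD (used only for dry runs with QIIME_CMD=echo) are assumptions, not part of QIIME 2.

```shell
# Hypothetical wrapper around the variadic merge (QIIME 2 2017.12 or later).
# QIIME_CMD is an assumption for dry runs; it defaults to the real qiime.
merge_all_tables() {
  if [ "$#" -lt 3 ]; then
    echo "usage: merge_all_tables OUT.qza TABLE.qza TABLE.qza [...]" >&2
    return 1
  fi
  local out="$1"
  shift
  local n="$#" t
  # Append an "--i-tables FILE" pair for every table, then drop the
  # original bare file names from the front of the argument list.
  for t in "$@"; do
    set -- "$@" --i-tables "$t"
  done
  shift "$n"
  ${QIIME_CMD:-qiime} feature-table merge "$@" --o-merged-table "$out"
}
```

For example, merge_all_tables merged-table.qza table-1.qza table-2.qza table-3.qza table-4.qza runs one merge instead of three.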