Dada2 not picking up all samples in artifact?

zackljones · February 27, 2018, 3:01pm

Hi,

I am just getting started with Qiime2 and was able to import my demultiplexed paired-end data into an artifact. I am now running the dada2 pipeline in verbose mode as follows:

(qiime2-2018.2) jay@Workstation:~$ qiime dada2 denoise-paired --i-demultiplexed-seqs /home/jay/zack/matt_dioxin/dioxin.qza --p-trunc-len-f 295 --p-trunc-len-r 235 --p-n-threads 0 --output-dir /home/jay/zack/matt_dioxin/dada2/ --verbose
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_paired.R /tmp/tmpt1uq1mkd/forward /tmp/tmpt1uq1mkd/reverse /tmp/tmpt1uq1mkd/output.tsv.biom /tmp/tmpt1uq1mkd/filt_f /tmp/tmpt1uq1mkd/filt_r 295 235 0 0 2.0 2 consensus 1.0 0 1000000

R version 3.4.1 (2017-06-30)
Loading required package: Rcpp
DADA2 R package version: 1.6.0

Filtering ....................
Learning Error Rates
2a) Forward Reads
Initializing error rates to maximum possible estimate.
Sample 1 - 92831 reads in 39069 unique sequences.
Sample 2 - 92581 reads in 38028 unique sequences.
Sample 3 - 71930 reads in 30065 unique sequences.
Sample 4 - 105089 reads in 43336 unique sequences.
Sample 5 - 87661 reads in 28261 unique sequences.
Sample 6 - 60902 reads in 21726 unique sequences.
Sample 7 - 62298 reads in 29719 unique sequences.
Sample 8 - 53804 reads in 29833 unique sequences.
Sample 9 - 92147 reads in 36507 unique sequences.
Sample 10 - 74698 reads in 36155 unique sequences.
Sample 11 - 76370 reads in 33265 unique sequences.
Sample 12 - 73091 reads in 36547 unique sequences.
Sample 13 - 147930 reads in 31751 unique sequences.
selfConsist step 2
selfConsist step 3
selfConsist step 4
selfConsist step 5
selfConsist step 6

This all seems great, except that I have 20 samples with forward and reverse reads, so why does it show only 13 being processed? I exported my artifact and all 20 samples (40 fastq files in total) were exported, so it seems they were all included in the initial import. Is there something I am missing? Do I need to restart the dada2 command, or should I just let it finish and check the output?

Thanks,
Zack

Mehrbod_Estaki · February 27, 2018, 6:48pm

Hi @zackljones,
While I can't speak to why or how the verbose output lists those samples in particular (experts?), from a user's perspective I can say that this is totally normal and all your samples will ultimately be denoised. I've wondered about that myself in the past, but you can move ahead with your analysis assuming they've all been processed.

zackljones · February 27, 2018, 9:58pm

Hi @Mehrbod_Estaki,

Thanks for the reply. That makes me feel better, as I was not looking forward to redoing a multi-day step.

Cheers!
Zack

ebolyen · February 27, 2018, 10:02pm

Hi @zackljones and @Mehrbod_Estaki,

The reason for this is the --p-n-reads-learn parameter, which sets the number of reads used to estimate the error model.

By default this is set to 1,000,000 and once it has acquired that many reads, it stops looking for more.

In practice, this means you will see it run through a few (or even most) of your samples before it has captured enough reads, although I've seen this step complete in as few as 2 samples for very large datasets.
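To make that concrete, here is a rough sketch of the stopping rule as described above, using the forward-read counts printed in the log at the top of the thread. This is only an illustration of the behaviour, not DADA2's actual code; the function name and the list are made up for the example, and it assumes samples are consumed in order until the running total reaches the target.

```python
from typing import Sequence, Tuple

# Forward-read counts for Sample 1..13, copied from the verbose log in this thread.
READS_PER_SAMPLE = [
    92831, 92581, 71930, 105089, 87661, 60902, 62298,
    53804, 92147, 74698, 76370, 73091, 147930,
]


def samples_used_for_learning(
    read_counts: Sequence[int], n_reads_learn: int = 1_000_000
) -> Tuple[int, int]:
    """Hypothetical helper: count how many samples are read before the
    running total of reads reaches n_reads_learn (assumed stopping rule)."""
    total = 0
    used = 0
    for count in read_counts:
        total += count
        used += 1
        if total >= n_reads_learn:
            break  # enough reads collected, so no further samples are read
    return used, total


used, total = samples_used_for_learning(READS_PER_SAMPLE)
print(f"{used} samples, {total} reads")  # prints "13 samples, 1091332 reads"
```

After 12 samples the running total is 943,402 reads, still under 1,000,000, so a 13th sample is pulled in and the total reaches 1,091,332. That matches the 13 samples listed in the log, even though the artifact contains 20.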

zackljones · February 28, 2018, 3:16am

Hi @ebolyen,

Thanks for the clarification. That makes sense, as I have about 100k reads per sample.

Cheers,
Zack