Hi there,
I am running Dada2 on a set of paired-end Illumina libraries that includes soil samples (very diverse) and isolate samples (one or two species each). E.g., sample 1 came from soil, sample 2 came from an isolate, and so on.
What I found is that Dada2 filters out almost all the sequences from my isolate samples, while it keeps many sequences from the soil ones.
Command (took about 10 h):
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs biblios-33-demux.qza \
  --o-table table-biblios-pe-demux-dada2 \
  --o-representative-sequences rep-seqs-biblios-dada2 \
  --p-trunc-len-f 224 \
  --p-trunc-len-r 223 \
  --p-n-threads 14 \
  --p-n-reads-learn 1000000 \
  --p-chimera-method consensus \
  --output-dir out_dada2
**I checked the reads outside QIIME by building a histogram of read lengths across all the libraries, and both the forward and reverse trunc-len values correspond to the shortest representative read length in those libraries. So I expected not to lose reads because of truncation. Here is an example of the QIIME output for the read lengths:**
Demultiplexed sequence length summary
Forward Reads
|Total Sequences Sampled|10000|
|2%|227 nts|
|9%|227 nts|
…
|98%|234 nts|
Reverse Reads
|Total Sequences Sampled|10000|
|2%|223 nts|
|9%|223 nts|
…
|98%|230 nts|
Dada2 output
sample-id input filtered denoised merged non-chimeric
IG002 49827 41058 41058 13 13 # isolated bacteria
IG037 816626 685547 685547 256971 231045 # soil sample
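To put the drop in numbers, here is a minimal Python sketch (not part of the run above, just arithmetic on the counts in the table) that computes the percentage of reads kept at each step relative to the previous one. The helper name `retention` is made up for illustration.

```python
# Per-step read retention, using the counts copied from the stats table above.
# Step names follow the table header; `retention` is a hypothetical helper.
STEPS = ["input", "filtered", "denoised", "merged", "non-chimeric"]

stats = {
    "IG002": [49827, 41058, 41058, 13, 13],             # isolated bacteria
    "IG037": [816626, 685547, 685547, 256971, 231045],  # soil sample
}


def retention(counts):
    """Percentage of reads kept at each step, relative to the previous step."""
    return [100.0 * after / before for before, after in zip(counts, counts[1:])]


for sample, counts in stats.items():
    for step, pct in zip(STEPS[1:], retention(counts)):
        print(f"{sample} {step:>12}: {pct:6.2f}% of previous step")
```

For the isolate, filtering keeps roughly 82% of the reads, but merging keeps only about 0.03% of the denoised reads, versus about 37% for the soil sample. So the loss is concentrated at the merge step, not at filtering or truncation.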
I then thought this might be a result of the large number of identical sequences in the isolate sample and that, after downstream taxonomic classification, I would still retrieve the species I know is there (or at least the genus). It is supposed to be a Bacillus sample. Nevertheless, those 13 sequences were classified as Ochrobactrum sp. I would not say there is no Ochrobactrum there (who knows), but I am sure there is Bacillus.
I ran another test to check whether the merging step in Dada2 was the problem. As you can see in the output above, the read count falls drastically at this step. I pre-merged the reads and ran Dada2 denoise-single, and this is what I got:
sample-id input filtered denoised non-chimeric
IG002merged 49827 39928 39928 39928 # wow
IG037merged 816626 661909 661909 575996 # more reads retrieved
The downstream classification results look like this for the isolate sample:
D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;__ 39904.0 —> 99% Bacillus sp.
Still, the single-end Dada2 run took far longer (56 h, versus about 10 h for the paired-end run).
Now I don’t know why I am losing so many reads at the Dada2 merging step.
Also, I would like to ask if there is any way of running the classification with no dereplication step so my pipeline could run faster, if that makes sense…
Thanks in advance and I am learning a lot reading this forum!