DADA2 variable sequence length

I’m having an issue where I can’t filter my demultiplexed data because not all of the reads are the same length. Is this normal, or am I doing something wrong? I didn’t use the demultiplexing step in QIIME 2 (I used Trimmomatic instead) because it required a barcode for every single sequence, and I have a text file of 100,000s of sequences that all use the same barcode. I feel like I may be doing something wrong here. I’m extremely new to this type of work, or anything computational really, and QIIME is one of the first pipelines I’ve tried using, so forgive me for not knowing a whole lot.

Hi @Partisan! We need a bit more information to help you out here:

  • What version of QIIME 2 are you using?
  • Can you please provide the exact commands you are running?
  • Can you please provide any errors that are reported when running these commands, including the detailed error logs provided by QIIME 2?

Thanks!! :tada:

Hi @thermokarst!

  • I’m using qiime2-2017.7

  • Here is the exact command I tried to run:

   $ qiime dada2 denoise-paired \
   --i-demultiplexed-seqs paired-end-demux_C.qza \
   --o-table table_C \
   --o-representative-sequences rep-seqs_C \
   --p-trunc-len-f 0 \
   --p-trunc-len-r 0 \
   --p-n-reads-learn 100000 \
   --p-n-threads 0 \
   --verbose
  • Here is the full error log that I received:
> Running external command line application(s). This may print messages to stdout and/or stderr.
> The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.
> 
> Command: run_dada_paired.R /tmp/tmp1skrhjvt/forward /tmp/tmp1skrhjvt/reverse /tmp/tmp1skrhjvt/output.tsv.biom /tmp/tmp1skrhjvt/filt_f /tmp/tmp1skrhjvt/filt_r 0 0 0 0 2.0 2 consensus 1.0 0 100000
> 
> R version 3.3.1 (2016-06-21) 
> Loading required package: Rcpp
> Warning messages:
> 1: multiple methods tables found for ‘arbind’ 
> 2: multiple methods tables found for ‘acbind’ 
> 3: replacing previous import ‘IRanges::arbind’ by ‘SummarizedExperiment::arbind’ when loading ‘GenomicAlignments’ 
> 4: replacing previous import ‘IRanges::acbind’ by ‘SummarizedExperiment::acbind’ when loading ‘GenomicAlignments’ 
> 5: multiple methods tables found for ‘left’ 
> 6: multiple methods tables found for ‘right’ 
> DADA2 R package version: 1.4.0 
> 1) Filtering ...
> 2) Learning Error Rates
> Not all sequences were the same length.
> Traceback (most recent call last):
>   File "/home/anthony/miniconda3/envs/qiime2-2017.7/lib/python3.5/site-packages/q2_dada2/_denoise.py", line 179, in denoise_paired
>     run_commands([cmd])
>   File "/home/anthony/miniconda3/envs/qiime2-2017.7/lib/python3.5/site-packages/q2_dada2/_denoise.py", line 35, in run_commands
>     subprocess.run(cmd, check=True)
>   File "/home/anthony/miniconda3/envs/qiime2-2017.7/lib/python3.5/subprocess.py", line 398, in run
>     output=stdout, stderr=stderr)
> subprocess.CalledProcessError: Command '['run_dada_paired.R', '/tmp/tmp1skrhjvt/forward', '/tmp/tmp1skrhjvt/reverse', '/tmp/tmp1skrhjvt/output.tsv.biom', '/tmp/tmp1skrhjvt/filt_f', '/tmp/tmp1skrhjvt/filt_r', '0', '0', '0', '0', '2.0', '2', 'consensus', '1.0', '0', '100000']' returned non-zero exit status -9
> 
> During handling of the above exception, another exception occurred:
> 
> Traceback (most recent call last):
>   File "/home/anthony/miniconda3/envs/qiime2-2017.7/lib/python3.5/site-packages/q2cli/commands.py", line 222, in __call__
>     results = action(**arguments)
>   File "<decorator-gen-264>", line 2, in denoise_paired
>   File "/home/anthony/miniconda3/envs/qiime2-2017.7/lib/python3.5/site-packages/qiime2/sdk/action.py", line 201, in callable_wrapper
>     output_types, provenance)
>   File "/home/anthony/miniconda3/envs/qiime2-2017.7/lib/python3.5/site-packages/qiime2/sdk/action.py", line 334, in _callable_executor_
>     output_views = callable(**view_args)
>   File "/home/anthony/miniconda3/envs/qiime2-2017.7/lib/python3.5/site-packages/q2_dada2/_denoise.py", line 194, in denoise_paired
>     " and stderr to learn more." % e.returncode)
> Exception: An error was encountered while running DADA2 in R (return code -9), please inspect stdout and stderr to learn more.
> 
> Plugin error from dada2:
> 
>   An error was encountered while running DADA2 in R (return code -9),
>   please inspect stdout and stderr to learn more.
> 
> See above for debug info.

Hi @Partisan,

Can you please describe your data a bit more:

  1. how many samples do you have in your study?
  2. are these 16S sequences?

Just a note: as of QIIME 2 2017.6, the q2-dada2 plugin supports variable-length sequences. The log you attached doesn’t appear to actually be related to variable-length sequences; I think that message just happened to be the last thing emitted before DADA2 died.

Thanks!

  1. For this particular run there were 3 samples.
  2. These are not 16S sequences. I’m looking for bacterial DNA in another organism, using DNA that was already extracted from that organism.

Hi @Partisan — exit code -9 is a bit of a grab-bag; we have seen this crop up in a couple of different ways, but perhaps the most common is related to memory/disk space. If you are able to run this on a machine with more RAM, that might be useful for diagnosing; otherwise, you could try to process a subset of samples. Maybe @benjjneb has some other non-hardware-related thoughts. Thanks!
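If you have shell access to the machine where the job ran, one quick way to check whether the kernel’s out-of-memory (OOM) killer was responsible is to search the system logs. This is a general Linux diagnostic rather than anything QIIME-specific, and the exact log location varies by distribution, so treat it as a rough sketch:

   # A -9 exit status means the R worker received SIGKILL; on Linux the usual
   # sender is the kernel OOM killer, which logs to the kernel ring buffer.
   $ dmesg | grep -iE 'out of memory|killed process'
   # Depending on the logging setup, the same message may also land in syslog:
   $ grep -i 'oom' /var/log/syslog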

My guess on this one is memory, as this fails in the 2) Learning Error Rates step, which has the highest memory usage, but it's just a guess. These -9 exit codes are mysterious ones.

@Partisan how many reads are in these input files? And how much memory did this job have access to?

The text files are around 4-5 GB each (three total) and I only used 16 GB of RAM. I may have access to a larger cluster with more RAM, but I have yet to try installing QIIME on there. I guess now is as good a time as ever to try it all out.

Sorry, I should be more exact about my file sizes if I want as much help as possible. The demultiplexed file (paired-end-demux_C.qza, which was my original input for the command in my second post) is 4.2 GB. This file contains merged reads from three samples. The three merged read files are 1.1 GB, 1.4 GB, and 1.3 GB, respectively. So instead of using all three merged files to create the one paired-end-demux.qza, should I attempt to use just one sample at a time, meaning I’ll end up with three separate paired-end-demux.qza files? And if I do it that way, will I still be able to draw comparisons between all three of my samples?

I will also be attempting to install QIIME on our cluster, but in the meantime I’ll probably try this as well. I just wanted to know whether this is a valid approach, i.e., whether the results would still make sense in the end.

Assuming all three of these samples were sequenced in the same run/lane, you should denoise them together, as they will share an error model, and QIIME 2 doesn't expose a way to share the error model between invocations. Incidentally, I don't think you would actually end up using any less memory, as we analyze each sample independently within the method. So even if you split out your data for each sample, it would be the same thing that our q2-dada2 plugin does internally.

How long are your reads, and what are they of exactly? Are these amplicon sequences still?

Since it does seem like memory is probably the issue, you could experiment with setting the n-reads-learn parameter, but because we don't have a way to view the error model generated, I would be cautious of setting it too low (the default is to use a million reads).
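For example, here is a sketch of the command from earlier in this thread with the --p-n-reads-learn override simply dropped, so DADA2 falls back to its default of one million reads (everything else is unchanged from that post):

   $ qiime dada2 denoise-paired \
   --i-demultiplexed-seqs paired-end-demux_C.qza \
   --o-table table_C \
   --o-representative-sequences rep-seqs_C \
   --p-trunc-len-f 0 \
   --p-trunc-len-r 0 \
   --p-n-threads 0 \
   --verbose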

The reads are between 36 and 100 bp, and they are from a cDNA library made from RNA that was extracted from different tissue types in an oyster. The goal was to use the cDNA library and match it to bacteria. They are not amplicon sequences.

Since my last post I was able to set up QIIME 2 2017.7 on our cluster, so memory should no longer be an issue. My --p-n-reads-learn parameter was set to 100,000 rather than a million reads, but I’ve since changed that. However, I’m running into a different issue now, although I’m not sure whether this is still a QIIME issue or rather an issue with our cluster setup. The problem is that the program seems to stop doing anything, or just idle, after step 2a). There are no errors being output; it just stops after completing the forward reads. The reason I feel like it’s hanging is that it didn’t take long (2-3 hours) to get past step 2a), but after another 12 hours it still hasn’t progressed any further. I’ve run other jobs alongside the original, and the same issue happens every single time. Maybe I just need to be patient?

It is my current understanding that DADA2 only works on amplicon sequences (@benjjneb, is that correct?), which may explain the behavior you are describing.

I think you are doing things that are a little beyond QIIME 2 at this point in time as we don't really have any "upstream" analysis techniques for non-amplicon data yet. You should be able to use other analysis tools/pipelines that are better suited for your data and import the resulting "features" as a FeatureTable[Frequency], but I don't have any specific advice for that.
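For example, if an external pipeline produces a BIOM feature table, the import might look roughly like the sketch below. The file names are hypothetical, and the flag and format names are assumptions that can differ between QIIME 2 releases (releases from this era used --source-format where newer ones use --input-format):

   # external-table.biom is a placeholder for the output of a non-QIIME pipeline
   $ qiime tools import \
   --type 'FeatureTable[Frequency]' \
   --input-path external-table.biom \
   --source-format BIOMV210Format \
   --output-path external-table.qza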

Correct. DADA2 is just for amplicon data.

In rough high-level terms: the DADA2 algorithm presumes that the number of unique error-free biological sequences is much less than the number of sequencing reads. This is true for amplicon data (which come from a variety of taxa but the same genetic locus) but not for whole-genome "shotgun" data (which come from a variety of taxa and any position on the genome). Hence the algorithm is probably "blowing up" and failing to complete in any reasonable time on this data.

Edit: For later reference, that likely also explains the -9 error, which probably did come from memory usage eventually exceeding capacity.

