Plugin errors when running deblur and dada2 on large single-end 16S dataset

Byron_C_Crump · September 19, 2018, 11:50pm

I am having trouble running dada2 and deblur on a single-end 16S rRNA gene amplicon dataset. I imported sequences to a .qza file with a manifest.csv file (attached) from several MiSeq runs plus Earth Microbiome Project sequences downloaded from NCBI. I attached my single-end-demux.qzv file. These are my command lines:

qiime dada2 denoise-single --i-demultiplexed-seqs single-end-demux.qza --p-trim-left 0 --p-trunc-len 120 --p-n-threads 0 --p-chimera-method consensus --o-representative-sequences rep-seqs-dada2.qza --o-table table-dada2.qza

qiime deblur denoise-16S --i-demultiplexed-seqs single-end-demux.qza --p-trim-length 120 --p-jobs-to-start 20 --p-sample-stats --verbose --o-representative-sequences trimmed/deblur/rep-seqs.qza --o-table trimmed/deblur/table.qza --o-stats trimmed/deblur/deblur-stats.qza

These are the errors:

“Plugin error from deblur:
No sequences passed the filter. It is possible the trim_length (%d) may exceed the longest sequence, that all of the sequences are artifacts like PhiX or adapter, or that the positive reference used is not representative of the data being denoised.”

“Plugin error from dada2:
An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.
Debug info has been saved to /tmp/qiime2-q2cli-err-39b07k8_.log”

I attached the qiime2-q2cli-err-39b07k8_.log from the dada2 error, and attached the brief error output for deblur in a separate file.

I tried several things that did not fix this problem:

1. I lowered the --p-trim-length
2. I made sure the sample names did not contain unusual characters (dashes, underlines)
3. I checked the integrity of the fastq files with FastQValidator
4. I checked the integrity of the .gz fastq files with gzip -t
5. I checked the quality score format and they are all “Phred+33”
6. I made sure that my /tmp folder had plenty of memory associated with it
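
For anyone wanting to repeat the gzip -t check across every file in a manifest, here is a minimal sketch. It assumes the comma-separated manifest layout with a sample-id,absolute-filepath,direction header and literal paths (variables such as $PWD are not expanded). The function name check_manifest_gz is made up for illustration and is not part of QIIME 2.

```shell
# Run `gzip -t` on every file listed in a manifest CSV.
# Assumed columns: sample-id,absolute-filepath,direction
check_manifest_gz() {
    manifest="$1"
    bad=0
    # The `|| [ -n "$sample" ]` part keeps a final line that lacks a newline.
    while IFS=, read -r sample path direction || [ -n "$sample" ]; do
        # Skip the header row, comment lines, and blank lines.
        case "$sample" in sample-id|\#*|'') continue ;; esac
        if ! gzip -t "$path" 2>/dev/null; then
            echo "FAILED: $sample $path"
            bad=$((bad + 1))
        fi
    done < "$manifest"
    echo "$bad file(s) failed"
    # Return non-zero if any file failed the integrity test.
    [ "$bad" -eq 0 ]
}
```

Usage would look like `check_manifest_gz LTREB_qiime2_manifest_single-end.csv`; a non-zero exit status means at least one file did not pass.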

dada2 worked on a subset of the samples (one set of EMP sequences) but deblur failed on that same dataset. Do you see where I am going wrong? These commands work fine on the “moving pictures” tutorial.
LTREB_qiime2_manifest_single-end.csv (430.8 KB)
qiime2-deblur_error.txt (1.1 KB)
qiime2-q2cli-err-39b07k8_.txt (19.9 KB)
single-end-demux.qzv (342.3 KB)

thermokarst · September 20, 2018, 1:05am

Hey there @Byron_C_Crump!

You are working with an exceptionally old version of QIIME 2 (we move fast around here!). Sorry to do this to you, but can you upgrade to 2018.8 and try this again?

Also, looking at your provenance, I see demux summarize took about 3 hours to run, which is about 2.75 hrs longer than I have ever seen before - any guess what happened there? Did you background the job while it was running?

Keep us posted!

colinbrislawn · September 20, 2018, 1:12am

Hello Byron,

Now that's the kind of log files I love to see! Thanks for posting all of that.

EDIT: looks like Matt beat me to the post. I second him. Update your QIIME 2!

I think I found a clue within the single-end-demux.qzv file you uploaded, within the 'Interactive quality plot' tab. When I hover around base 120, as you mentioned in your command, I get this warning:

The plot at position 120 was generated using a random sampling of 9792 out of 274443454 sequences without replacement. This position (120) is greater than the minimum sequence length observed during subsampling (114 bases). As a result, the plot at this position is not based on data from all of the sequences, so it should be interpreted with caution when compared to plots for other positions. Outlier quality scores are not shown in box plots for clarity.

Looks like many of your reads are quite short... but deblur should be able to handle that. Don't tell anyone, but many of the seminal papers in this field were done with 90bp or shorter reads, and they still were half decent.

Out of curiosity, I looked at long reads.

The plot at position 233 was generated using a random sampling of 1876 out of 274443454 sequences without replacement. This position (233) is greater than the minimum sequence length observed during subsampling (114 bases). As a result, the plot at this position is not based on data from all of the sequences, so it should be interpreted with caution when compared to plots for other positions. Outlier quality scores are not shown in box plots for clarity.

So it looks like very few of your reads are the full length of 16S V4 (which is 250ish bp long). Hopefully the newest version of QIIME 2 will solve your problem, but if not, I bet the lengths indicate some sort of issue.

Keep in touch!
Colin

Nicholas_Bokulich · September 20, 2018, 1:19pm

An off-topic reply has been split into a new topic: Dada2 Error: package ‘Rcpp’ was installed by an R version with different internals

Please keep replies on-topic in the future.

Byron_C_Crump · September 20, 2018, 1:19pm

It is true that the EMP dataset has a lot of short reads in it, but the sequencing is deep (>25,000 reads), and I think most of the samples have enough longer reads (>120bp) for analysis. It sounds like the shorter reads are preventing dada2 and deblur from working. Is there a tool in QIIME2 to pre-screen out the shorter reads before using dada2 and deblur?

Nicholas_Bokulich · September 20, 2018, 4:06pm

The shorter reads will not prevent dada2 or deblur from working. Any reads shorter than your truncation length will be dropped automatically, so there is no need to pre-screen, just set an appropriate truncation length.
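
To see how many reads would survive a given truncation length before running either plugin, a quick sketch like the one below can help. It assumes plain 4-line FASTQ records (no wrapped sequence lines) in a gzipped file. The function name count_reads_at_least is hypothetical, and this only mimics the length cutoff, not the quality or error filtering that dada2 and deblur also apply.

```shell
# Print "<reads with length >= MINLEN><TAB><total reads>" for a gzipped FASTQ.
# Assumes standard 4-line FASTQ records.
count_reads_at_least() {
    fastq_gz="$1"
    minlen="$2"
    gzip -dc "$fastq_gz" | awk -v min="$minlen" '
        NR % 4 == 2 { total++; if (length($0) >= min) kept++ }
        END { printf "%d\t%d\n", kept, total }'
}
```

For example, `count_reads_at_least sample.fastq.gz 120` (sample.fastq.gz being a placeholder name) reports how many reads are at least 120 bases long next to the total read count.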

It seems more likely that two things are happening:

1. There is a very large number of sequences shorter than 120, so all are being dropped.
2. The remaining sequences contain too many errors and are being filtered out.

Could you please re-run qiime demux summarize with the latest release of QIIME 2 and share the result? The more recent releases include a length distribution summary that would be useful for assessing this.
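
In the meantime, a rough length distribution can also be pulled straight from a gzipped FASTQ file. This is only a sketch under the assumption of 4-line FASTQ records; read_length_histogram is an illustrative name, not a QIIME 2 command, and its output is not the same summary that demux summarize produces.

```shell
# Print a "<read length><TAB><count>" table for a gzipped FASTQ,
# sorted by read length. Assumes standard 4-line FASTQ records.
read_length_histogram() {
    gzip -dc "$1" | awk '
        NR % 4 == 2 { n[length($0)]++ }
        END { for (len in n) printf "%d\t%d\n", len, n[len] }' | sort -n
}
```

Running it on one EMP file and one MiSeq file should make it obvious how many reads fall below 120 and whether any are longer than expected.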

Byron_C_Crump · September 20, 2018, 5:37pm

Here is the .qzv file for the non-EMP samples, which account for about 40% of the total samples and 20% of the sequences. There are plenty of long, high quality reads in these samples. So while there may be a large number of sequences shorter than 120, the remaining sequences don't seem to have a lot of errors.

single-end-demux_cgrb.qzv (307.1 KB)

Nicholas_Bokulich · September 20, 2018, 5:40pm

That visualization was produced with version 2017.12. Please install the latest release (2018.8), which will show the actual length distribution in this visualization. Thanks!

Byron_C_Crump · September 20, 2018, 10:00pm

Here are the .qzv files for the whole dataset (single-end-demux.qzv), and just the non-EMP samples (single-end-demux_cgrb.qzv) produced with v2018.8. Most of the reads are >120bp. Do you think the problem is that I'm using older versions of qiime2? The team that manages software on our bioinformatics server here at Oregon State is having a tough time with installation. It is fairly easy to install on a mac, but the analyses take longer and the hard drive gets full. I prepped these visualizations on my mac.

single-end-demux_cgrb.qzv (310.3 KB)
single-end-demux.qzv (346.0 KB)

thermokarst · September 20, 2018, 10:13pm

Maybe, but more specifically, older versions of deblur and dada2.

They are welcome to seek support here --- we generally see HPC setups being pretty straightforward, but sometimes some environments have constraints that complicate things.

Regarding your demux summaries: these profiles look very strange to me. Can you provide some more details about the sample prep, target, sequencing tech, and pre-processing steps? For example, it looks like reads may have been concatenated together, which is going to cause all kinds of problems with DADA2.

Byron_C_Crump · September 20, 2018, 10:31pm

The EMP sequences were done with Illumina HiSeq single-end 150bp sequencing, demultiplexed (not sure how), and archived at NCBI, which is where I downloaded them from. The DNA was extracted with phenol-chloroform and amplified following EMP protocols.

The "cgrb" sequences were done with Illumina MiSeq paired-end 250bp sequencing and demultiplexed with Illumina software. I only included the forward reads in this analysis so they are comparable to the EMP sequences. The DNA extraction and amplification used the same procedures and the same PCR primers except that the primers were dual barcoded rather than single barcoded (which is EMP protocol I think).

thermokarst · September 21, 2018, 5:20pm

Hmm, this isn't consistent with what I see in the viz:

This viz shows reads up to 301 nts long, which looks to me like pre-joined paired-end reads, which should not be used with DADA2. This has been discussed at length on this forum.

This also doesn't quite seem to line up with the corresponding viz you provided:

This viz shows single end reads (which is okay, you can import PE data as SE), but these reads are up to 301 nts long, which seems longer than the 250 you mentioned above.

Byron_C_Crump · September 24, 2018, 11:29pm

I split the dataset into the longer CGRB reads and the shorter EMP reads. Deblur and dada2 both worked on the CGRB reads, despite the inclusion of the >250bp reads you mentioned, and both failed on the EMP reads with the same errors from my first post on this string. I used trimmomatic to remove EMP sequences <120 bp, and ran deblur and dada2 on the combined dataset. Dada2 has been running for several days and has not finished yet. Deblur failed with the same error as before. These are my command lines:

qiime dada2 denoise-single --i-demultiplexed-seqs single-end-demux.qza --p-trim-left 0 --p-trunc-len 120 --p-n-threads 0 --p-chimera-method consensus --o-representative-sequences rep-seqs-dada2.qza --o-table table-dada2.qza --o-denoising-stats stats-dada2_cgrb.qza

qiime deblur denoise-16S --i-demultiplexed-seqs single-end-demux_trim.qza --p-trim-length 120 --p-jobs-to-start 20 --p-sample-stats --verbose --o-representative-sequences deblur-rep-seqs_trim.qza --o-table deblur-table_trim.qza --o-stats deblur-stats_trim.qza

Here is the .qzv file for the combined dataset without <120bp EMP sequences

single-end-demux_trim.qzv (346.6 KB)

Here is the .qzv of the CGRB dataset

single-end-demux_cgrb.qzv (307.1 KB)

Here is the .qzv of the original EMP dataset

single-end-demux_emp.qzv (300.1 KB)

Here is the .qzv of the trimmed EMP dataset

single-end-demux_emp_trim.qzv (319.8 KB)

Do you see anything that could explain why I cannot run dada2 and deblur on the EMP dataset? All the sequences are long enough, and the quality scores look OK to me.

Byron

Nicholas_Bokulich · September 26, 2018, 9:26pm

Hi @Byron_C_Crump,

Sorry for the silence on our end — this issue is a bit of a mystery, especially because many in the community (myself included) have successfully used the EMP data in QIIME 2 analyses. I have not used the EMP data with deblur, and I downloaded from either qiita or directly from the EMP FTP, not NCBI, so these may be critical differences (e.g., if formatting the NCBI data is introducing some artifact?)

One thing worth noting: the EMP data on the FTP or qiita can come pre-deblurred. So you could just grab those data, deblur/trim your reads at the same lengths, and compare — or use something like q2-fragment-insertion to compare datasets of different lengths but equivalent processing conditions.

Now, to get to the bottom of this I have a few more questions:

Any progress on the dada2 data? If dada2 is succeeding it could help us triangulate this issue...

Have you been using trimmomatic all along? If so, could you try without trimmomatic? (Again, deblur will do the trimming for you and drop all sequences shorter than the trim length.) I wonder if trimmomatic could be the problem. I believe we have seen other issues caused by trimmomatic in the past, though I would expect those issues to become apparent at importing.

I agree, quality scores look okay and lengths are fine.

Could you possibly post the first 5 sequences or so? I just want to inspect to make sure there are not e.g., formatting issues introduced by trimmomatic or NCBI that the importer is not picking up.

Thanks!
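
One way to grab the first few sequences for posting is sketched below. It assumes gzipped FASTQ files with 4 lines per record; fastq_head is a made-up helper name.

```shell
# Print the first N records (default 5) of a gzipped FASTQ file.
# Assumes standard 4-line FASTQ records.
fastq_head() {
    fastq_gz="$1"
    n="${2:-5}"
    gzip -dc "$fastq_gz" | head -n "$((n * 4))"
}
```

Calling `fastq_head some_emp_sample.fastq.gz 5` (the file name is a placeholder) on a file from before and after trimmomatic would show the header, sequence, separator, and quality lines side by side, which is where formatting problems tend to show up.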