I am having trouble running dada2 and deblur on a single-end 16S rRNA gene amplicon dataset. I imported sequences to a .qza file with a manifest.csv file (attached) from several MiSeq runs plus Earth Microbiome Project sequences downloaded from NCBI. I attached my single-end-demux.qzv file. These are my command lines:
“Plugin error from deblur:
No sequences passed the filter. It is possible the trim_length (%d) may exceed the longest sequence, that all of the sequences are artifacts like PhiX or adapter, or that the positive reference used is not representative of the data being denoised.”
“Plugin error from dada2:
An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.
Debug info has been saved to /tmp/qiime2-q2cli-err-39b07k8_.log”
I attached the qiime2-q2cli-err-39b07k8_.log from the dada2 error, and attached the brief error output for deblur in a separate file.
I tried several things that did not fix this problem:
I lowered the --p-trim-length
I made sure the sample names did not contain unusual characters (dashes, underscores)
I checked the integrity of the fastq files with FastQValidator
I checked the integrity of the .gz fastq files with gzip -t
I checked the quality score format and they are all “Phred+33”
I made sure that my /tmp folder had plenty of memory associated with it
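For anyone repeating the quality score check above, here is a minimal sketch of one way to guess the Phred offset from the range of ASCII codes in the quality lines. It assumes plain 4-line FASTQ records (no wrapped sequences); the function names and the thresholds are illustrative heuristics, not part of QIIME 2 or any validator.

```python
import gzip

def quality_ascii_range(fastq_path, max_records=10000):
    """Return (lowest, highest) ASCII code seen in the quality lines.

    Assumes 4-line FASTQ records; reads at most max_records records.
    Returns (None, None) if no quality characters were found.
    """
    opener = gzip.open if fastq_path.endswith(".gz") else open
    lo, hi = None, None
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i >= 4 * max_records:
                break
            if i % 4 == 3:  # fourth line of each record holds the qualities
                codes = [ord(c) for c in line.rstrip("\n")]
                if codes:
                    lo = min(codes) if lo is None else min(lo, min(codes))
                    hi = max(codes) if hi is None else max(hi, max(codes))
    return lo, hi

def guess_offset(lo, hi):
    """Heuristic guess: 33, 64, or None when the range is ambiguous."""
    if lo is None or hi is None:
        return None
    if lo < 59:   # characters this low cannot occur in Phred+64 data
        return 33
    if hi > 74:   # high codes with no low codes suggest Phred+64
        return 64
    return None   # range overlaps both encodings
```

Running `guess_offset(*quality_ascii_range("sample.fastq.gz"))` on each file (the file name here is hypothetical) and looking for anything other than 33 would flag files worth a closer look.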
You are working with an exceptionally old version of QIIME 2 (we move fast around here!). Sorry to do this to you, but can you upgrade to 2018.8 and try this again?
Also, looking at your provenance, I see demux summarize took about 3 hours to run, which is about 2.75 hrs longer than I have ever seen before - any guess what happened there? Did you background the job while it was running?
I think I found a clue within the single-end-demux.qzv file you uploaded, within the ‘Interactive quality plot’ tab. When I hover around base 120, as you mentioned in your command, I get this warning:
The plot at position 120 was generated using a random sampling of 9792 out of 274443454 sequences without replacement. This position (120) is greater than the minimum sequence length observed during subsampling (114 bases). As a result, the plot at this position is not based on data from all of the sequences, so it should be interpreted with caution when compared to plots for other positions. Outlier quality scores are not shown in box plots for clarity.
Looks like many of your reads are quite short… but deblur should be able to handle that. Don’t tell anyone, but many of the seminal papers in this field were done with 90bp or shorter reads, and they still were half decent.
Out of curiosity, I looked at long reads.
The plot at position 233 was generated using a random sampling of 1876 out of 274443454 sequences without replacement. This position (233) is greater than the minimum sequence length observed during subsampling (114 bases). As a result, the plot at this position is not based on data from all of the sequences, so it should be interpreted with caution when compared to plots for other positions. Outlier quality scores are not shown in box plots for clarity.
So it looks like very few of your reads are the full length of 16S V4 (which is 250ish bp long). Hopefully the newest version of QIIME 2 will solve your problem, but if not, I bet the lengths indicate some sort of issue.
It is true that the EMP dataset has a lot of short reads in it, but the sequencing is deep (>25,000 reads), and I think most of the samples have enough longer reads (>120bp) for analysis. It sounds like the shorter reads are preventing dada2 and deblur from working. Is there a tool in QIIME2 to pre-screen out the shorter reads before using dada2 and deblur?
(Matthew Ryan Dillon)
The shorter reads will not prevent dada2 or deblur from working. Any reads shorter than your truncation length will be dropped automatically, so there is no need to pre-screen, just set an appropriate truncation length.
It seems more likely that two things are happening:
There is a very large number of sequences shorter than 120, so all are being dropped.
The remaining sequences contain too many errors and are being filtered out.
Could you please re-run qiime demux summarize with the latest release of QIIME 2 and share the result? The more recent releases include a length distribution summary that would be useful for assessing this.
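In the meantime, a rough way to check the first possibility outside QIIME 2 is to count how many reads in a FASTQ file are at least as long as the truncation length. This is only a sketch: it assumes plain 4-line FASTQ records, and `summarize_lengths` and the default of 120 are illustrative names and values, not anything dada2 or deblur exposes.

```python
import gzip
from collections import Counter

def read_lengths(fastq_path):
    """Yield the length of each read in a FASTQ file (plain or gzipped)."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each 4-line record is the sequence
                yield len(line.strip())

def summarize_lengths(fastq_path, trunc_len=120):
    """Count reads and how many are at least trunc_len bases long."""
    counts = Counter(read_lengths(fastq_path))
    total = sum(counts.values())
    kept = sum(n for length, n in counts.items() if length >= trunc_len)
    return {
        "total": total,
        "kept": kept,  # reads long enough to survive truncation at trunc_len
        "min": min(counts) if counts else None,
        "max": max(counts) if counts else None,
    }
```

Calling `summarize_lengths("some_sample.fastq.gz", trunc_len=120)` per sample (the file name is hypothetical) shows whether a sample keeps any reads at that length. It says nothing about the quality filtering that happens afterwards.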
Here is the .qzv file for the non-EMP samples, which account for about 40% of the total samples and 20% of the sequences. There are plenty of long, high quality reads in these samples. So while there may be a large number of sequences shorter than 120, the remaining sequences don't seem to have a lot of errors.
Here are the .qzv files for the whole dataset (single-end-demux.qzv), and just the non-EMP samples (single-end-demux_cgrb.qzv) produced with v2018.8. Most of the reads are >120bp. Do you think the problem is that I'm using older versions of qiime2? The team that manages software on our bioinformatics server here at Oregon State is having a tough time with installation. It is fairly easy to install on a mac, but the analyses take longer and the hard drive gets full. I prepped these visualizations on my mac.
Maybe, but more specifically, older versions of deblur and dada2.
They are welcome to seek support here — we generally see HPC setups being pretty straightforward, but sometimes some environments have constraints that complicate things.
Regarding your demux summaries: these profiles look very strange to me. Can you provide some more details about the sample prep, target, sequencing tech, and pre-processing steps? For example, it looks like reads may have been concatenated together, which would cause all kinds of problems with DADA2.
The EMP sequences were done with Illumina HiSeq single-end 150bp sequencing, demultiplexed (not sure how), and archived at NCBI, which is where I downloaded them from. The DNA was extracted with phenol-chloroform and amplified following EMP protocols.
The “cgrb” sequences were done with Illumina MiSeq paired-end 250bp sequencing and demultiplexed with Illumina software. I only included the forward reads in this analysis so they are comparable to the EMP sequences. The DNA extraction and amplification used the same procedures and the same PCR primers except that the primers were dual barcoded rather than single barcoded (which is EMP protocol I think).
I split the dataset into the longer CGRB reads and the shorter EMP reads. Deblur and dada2 both worked on the CGRB reads, despite the inclusion of the >250bp reads you mentioned, and both failed on the EMP reads with the same errors from my first post on this string. I used trimmomatic to remove EMP sequences <120 bp, and ran deblur and dada2 on the combined dataset. Dada2 has been running for several days and has not finished yet. Deblur failed with the same error as before. These are my command lines:
Sorry for the silence on our end — this issue is a bit of a mystery, especially because many in the community (myself included) have successfully used the EMP data in QIIME 2 analyses. I have not used the EMP data with deblur, and I downloaded from either qiita or directly from the EMP FTP, not NCBI, so these may be critical differences (e.g., if formatting the NCBI data is introducing some artifact?)
One thing worth noting: the EMP data on the FTP or qiita can come pre-deblurred. So you could just grab those data, deblur/trim your reads at the same lengths, and compare — or use something like q2-fragment-insertion to compare datasets of different lengths but equivalent processing conditions.
Now, to get to the bottom of this I have a few more questions:
Any progress on the dada2 data? If dada2 is succeeding it could help us triangulate this issue…
Have you been using trimmomatic all along? If so, could you try without trimmomatic? (Again, deblur will do the trimming for you and drop all sequences shorter than the trim length.) I wonder if trimmomatic could be the problem. I believe we have seen other issues caused by trimmomatic in the past, though I would expect those issues to become apparent at importing.
I agree, quality scores look okay and lengths are fine.
Could you possibly post the first 5 sequences or so? I just want to inspect to make sure there are not e.g., formatting issues introduced by trimmomatic or NCBI that the importer is not picking up.
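If it helps, something like the following would pull the first few records and flag the most obvious formatting problems. It is a minimal sketch that assumes 4-line FASTQ records; the function names and the specific checks are my own, and it is no substitute for a real validator.

```python
import gzip
from itertools import islice

def first_records(fastq_path, n=5):
    """Return up to n (header, sequence, separator, quality) tuples."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    records = []
    with opener(fastq_path, "rt") as handle:
        while len(records) < n:
            chunk = [line.rstrip("\n") for line in islice(handle, 4)]
            if len(chunk) < 4:  # end of file or truncated record
                break
            records.append(tuple(chunk))
    return records

def check_record(record):
    """Return a list of human-readable problems found in one record."""
    header, seq, plus, qual = record
    problems = []
    if not header.startswith("@"):
        problems.append("header does not start with '@'")
    if not plus.startswith("+"):
        problems.append("third line does not start with '+'")
    if len(seq) != len(qual):
        problems.append("sequence and quality lengths differ")
    if set(seq.upper()) - set("ACGTN"):
        problems.append("sequence contains characters other than ACGTN")
    return problems
```

Looping over `first_records("some_sample.fastq.gz")` (hypothetical file name) and printing each record along with `check_record(record)` gives something short enough to paste into a reply.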