Plugin errors when running deblur and dada2 on large single-end 16S dataset

Hmmm. I downloaded from NCBI because I could not get the sequences I downloaded from EMP FTP to work in qiime2. I could try the pre-deblurred files, but I don’t understand how to combine them with my own deblurred files in qiime2 before making an OTU table.

Here are my answers:

  1. dada2 is still running, and I don't know how to check its status.
  2. I only started using trimmomatic for this last attempt. I always included a trim length in my deblur commands (--p-trim-length 120), but since that didn't seem to be working I tried trimmomatic. I have since tried to trim the EMP dataset to 140bp before running deblur. This removed about half of the EMP sequences, but deblur still failed with the same error.

Here are the first 5 sequences in one of the EMP files.

@1883.2000.001.Crump.Artic.LTREB.main.lane1.NoIndex_0
TACGTAGGGTGCAGGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTCTTAAGTCAGATGTGAAAGCCCCGGGCTTAACCTGGGAACTGCGTTTGAAACTGGGAGACTTGAGTGTGGCAG
+1883.2000.001.Crump.Artic.LTREB.main.lane1.NoIndex_0
CCCFFFFFHHHHHIIIJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJIJGGFFDDDDDBDDDDEDEEEDDDDEDDD@CDDDBDDDDDDDDDDDDDBA?@CC>@5<@D::4@CCAABB><CCCACCDDD?BC?
@1883.2000.001.Crump.Artic.LTREB.main.lane1.NoIndex_1
TACAGAGGGCTCGAGCGTTAATCGGAATCACTGGGCTTAAAGCGTGCGTAGGCGGATCTTCAGGCTTGTTGTGAAATCCCACGGCTCAACCGTGGAATTGCGATGAGAACCGGAGATCTTGAGTCAGGTAGAG
+1883.2000.001.Crump.Artic.LTREB.main.lane1.NoIndex_1
BBBFDFDDHFHDHHIGHIJJJIJJJIJJIJJJJJJFIHHGGHHIJJJHHHFFCADDBDDDEDDDDDDDDDBDDCDACCDDDDBDBBBDDDD>BDDDDDDDCD<A<CDCC@A9<><@CDACCA>(4>:<>>:@A
@1883.2000.001.Crump.Artic.LTREB.main.lane1.NoIndex_2
TACATAGGGTGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGTCGTTTGTTACGTCGGATGTGAAAACCTGAGGCTCAACCTCAGGCCTGCATTCGATACGGGCAAACTAGAGTTTGGTAGGGGAGACTGG
+1883.2000.001.Crump.Artic.LTREB.main.lane1.NoIndex_2
C@CFFFFFHHGHHJGGIJJGJJHHIIJJJJJJIJJJJJJJJGGAGIIGGGIJ@EHHFFFEEEEABDDDBDDDDEEACCBCBDCB@D?CDDD?CDAC?BDDDDDEDCDCA.8?<9<???@ACCC(4>@BB@(<?BD<.8>@?
@1883.2000.001.Crump.Artic.LTREB.main.lane1.NoIndex_3
GACGAACCGTACAAACGTTATTCGGAATCACTGGGCTTAAAGGGTGCGTAGGCTGCGCGGTAAGTTGGGTGTGAAAGCCCTCGGCTCAACCGAGGAACTGCGCTCAAAACTACCGTGCTGGAGGGAGACAGAGGTGA
+1883.2000.001.Crump.Artic.LTREB.main.lane1.NoIndex_3
CCCFFFFFHH:2AEBEFEGIIJFGJIIJGIJJJJJIJIIIJJICHIJJJJJJFHGHFFDD>BBDDDDAB?BBDDC>CCD?CDDDDDDDDDBBD9@BDDCC(5<@<BC@@?C>:958892:38+29<99?CCC>2(<C
@1883.2000.001.Crump.Artic.LTREB.main.lane1.NoIndex_4
TACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTTGTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGGACTGCGTTTGAAACTACAAGACTAGAGTGTAGCAGAG
+1883.2000.001.Crump.Artic.LTREB.main.lane1.NoIndex_4
CCCFFFFFHHHFFBHGHGIIHHEIJJIGJIHJJJJIJJJIJIJJJJHFFFDDDDDBBBBBDEEDEEEDDDDDDDE:ACCCDDDDDBBDDDDDABBB@@BDCB@BD:@C@C>C:?CADDC:::4>ACCDDD

And here are the first three sequences from a CGRB file:
@M01498:88:000000000-AA3KR:1:1101:12579:1444 1:N:0:2
TACGGAGGGTGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGAGCGCGTAGGCGGTTTGTAAAGTTGGAAGTGAAATCCCGAGGCTTAACCTCGGAACTGCTTTCAAAACTCACAAACTAGAGAGTGATAGAGGATGGCGGAATTCCTAGTGTAGAGGTGAAATTCTTAGATATTAGGAGGAACACCGGTGGCGAAGGCGGCCATCTGGGTCACATCTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGAAACCCGTGTAGTCCGGCTGACTGACTCGAGAGTTATCTCGTA
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGGFGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG5=BFGGGGGGFGFGGGGGFGFGGFFGGGGGGFGFFFFFFFDEFFFFFFFFBAFFFF:B:><>>4<?:(
@M01498:88:000000000-AA3KR:1:1101:18953:1728 1:N:0:2
TACGAAGGGGGCTAGCGTTGCTCGGAATGACTGGGCGTAAAGGGCGCGTAGGCGGATGACACAGTCAGATGTGAAATTCCTGGGCTTAACCTGGGGGCTGCATTTGATACGTGTTGTCTAGAGTGAGGAAGAGGGTTGTGGAATTCCCAGTGTAGAGGTGAAATTCGTAGATATTGGGAAGAACACCGGTGGCGAAGGCGGCAACCTGGTCCTTGACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGAAACCCTGGTAGTCCGGCTGACTGACTCGAGAGTTATCTCGTA
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGFG:FFGGGGGGGGGGFGGCGGGGGGGGGGGGGGGGGGFGFGGGGFDGGG,>BBGFGGGGGGGGGGGFGFGGGG>EGGGGGGGGGGEGGFDGFFGDGGGGGGGDFGGGGGGGG9CFGGFGGGGGDGECEGEGGGCEGGDE58CEFGGGEGGGGFCCFGGGGGG=EFFGGGGGE5DFDGEGGG37@GGGGGGG?DFFFGFGGFGFFFFFFFFFGEF>90:FFBF))2:0((...:)-4<:?(
@M01498:88:000000000-AA3KR:1:1101:15641:1976 1:N:0:2
TACGAAGGGGGCTAGCGTTGCTCGGAATGACTGGGCGTAAAGGGCGCGTAGGCGGATTGGTTAGTTAGGTGTGAAATTCCTGGGCTTAACCTGGGGACTGCACTTAATACAGCCAGTCTAGAGTGGGATAGAGGGTTGTGGAATTCCCAGTGTAGAGGTGAAATTCGTAGATATTGGGAAGAACACCGGTGGCGAAGGCGGCAACCTGGATCTTGACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGAAACCCTGGTAGTCCGGCTGACTGACTCGAGAGTTATCTCGTA
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGFGGGGGFGECFEFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCFGGGGGGGGGGGGGGCFGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGFFGFGGGGGGGGCGGFFFGGFCCFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGC@FCGGGGCCEFC6CGGGGGFGGGDEGGF77CGGGGFGFGGGGGGG?

The deblurred sequences and OTU table should be importable as FeatureData[Sequence] and FeatureTable[Frequency] artifacts. I suspect the importing issues you had are because you attempted to import as a different sequence type.

q2-deblur will produce FeatureData[Sequence] and FeatureTable[Frequency] artifacts as output. You can merge these with the EMP sequences/tables using merge-seqs and merge, respectively.
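For example (file names here are placeholders, and exact parameter names can vary between QIIME 2 releases), the merges might look like:

```shell
# Merge the two FeatureTable[Frequency] artifacts:
qiime feature-table merge \
  --i-tables my-deblur-table.qza \
  --i-tables emp-deblur-table.qza \
  --o-merged-table merged-table.qza

# Merge the corresponding FeatureData[Sequence] artifacts:
qiime feature-table merge-seqs \
  --i-data my-deblur-rep-seqs.qza \
  --i-data emp-deblur-rep-seqs.qza \
  --o-merged-data merged-rep-seqs.qza
```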

That's a good sign — no news is good news while dada2 is running!

:+1: Okay, we can rule out trimmomatic!

Everything looks okay to me — the only abnormality I notice is that the NCBI headers do not look like normal Illumina headers, but I doubt that would impact deblur.

Another possibility (however outrageous): are you sure you have 16S data? EMP also sequenced 18S.

@Luke_Thompson do you have any insight into what is happening here? @Byron_C_Crump cannot process EMP sequences with deblur. Thanks!

I blasted several of the EMP sequences and they are bacteria. There are no redundant samples represented in the dataset I’m analyzing, so I think they are all 16S and not 18S. I can look deeper, but I did not see 18S sequences for samples from my study on NCBI.

I looked at the data on the EMP FTP, but I don’t see where to download the deblurred data for my study. On QIITA, the processed data is cut at 100 or 150bp and not in between. Most of the sequences are >120 but <150bp. So I’m hoping we can figure out how to run deblur on these sequences.

@Byron_C_Crump, I’m sorry to hear you’re having problems with the EMP data.

If you do want to try using the already Deblur’d EMP data using the 100bp trim length, running Deblur on your data with the same 100bp trim length, and then merging them, here’s what you can do:

  1. Download the 100bp Deblur table (emp_deblur_100bp.release1.biom) from the FTP site or the Zenodo archive.
  2. Filter the biom table for just the sample names you want, or just the metadata category in the EMP mapping file corresponding to the samples you want (e.g., study_id or empo_3).
  3. Run Deblur on your samples using a trim length of 100bp.
  4. Merge the two tables.
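Put together, the four steps above might look something like this (a hedged sketch; the mapping-file name, study_id value, and import format are assumptions to adapt to your setup):

```shell
# 1. Download the 100bp Deblur table from the FTP site or Zenodo, then import it:
qiime tools import \
  --input-path emp_deblur_100bp.release1.biom \
  --type 'FeatureTable[Frequency]' \
  --input-format BIOMV210Format \
  --output-path emp-table.qza

# 2. Keep only the samples from the study of interest (placeholder study_id):
qiime feature-table filter-samples \
  --i-table emp-table.qza \
  --m-metadata-file emp-mapping.tsv \
  --p-where "study_id='1883'" \
  --o-filtered-table emp-table-filtered.qza

# 3. Deblur your own reads at the same 100bp trim length:
qiime deblur denoise-16S \
  --i-demultiplexed-seqs my-demux.qza \
  --p-trim-length 100 \
  --p-sample-stats \
  --o-representative-sequences my-rep-seqs.qza \
  --o-table my-table.qza \
  --o-stats my-stats.qza

# 4. Merge the filtered EMP table with your own:
qiime feature-table merge \
  --i-tables emp-table-filtered.qza \
  --i-tables my-table.qza \
  --o-merged-table merged-table.qza
```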

You will lose some information by trimming the reads to 100bp instead of 120bp, but for many purposes this does not make a difference. Also, if many of the EMP studies you are interested in were sequenced with a read length of 100bp, you will lose those studies entirely. Note that there is no variation per study in sequence length coming off the sequencer. Quality trimming will decrease the length of some sequences, but we have found that most sequences in a study will not be trimmed in this process. The median read length after trimming for each study is indicated in the column “Read length (bp)” in Supplementary Table 1 from the 2017 Nature paper.

Now, I know you would prefer to have longer reads, i.e., 120bp, and this isn’t working for you. Here are a few thoughts on this:

I did not personally run Deblur on the published EMP dataset. This was done by @pitaman using an older (stand-alone, non-QIIME 2) version of Deblur. I haven’t had the need to run the QIIME 2 version of Deblur on any EMP data, so I cannot verify that it works with the data on EBI (which should be the same as the data on NCBI).

If the EMP studies you are interested in are in Qiita—which they should be, and the study_id in the mapping file and in Supplementary Table 1 will correspond to the Qiita study ID—then you can download the fastq data from Qiita. It’s possible this version of the data is different from what’s at NCBI. You might even be able to run Deblur directly on Qiita using a 120bp trim length, but I’m not sure about this. You should, however, be able to download the demux version of the data and run Deblur on that.

I hope this is helpful.


Hi all,

I downloaded the EMP dataset from the QIITA site, but I found that it won’t deblur either.

I’d like help figuring this out so I can use longer reads. This may be a lot to ask, but would anyone mind downloading the dataset and trying it on your computers? The file can be downloaded by clicking on the “(demultiplexed)” node on the 16S tree on this website:
https://qiita.ucsd.edu/study/description/1883#

For now, I’ll try Luke’s approach, but I cannot find a tool in QIIME 2 that I can use to filter the biom table by study. Do I have to install QIIME 1 for this?

Byron

Hi Byron,

I believe the QIIME 2 command you’re looking for is qiime feature-table filter-samples.
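For example, filtering an imported EMP table down to one study might look like this (the metadata file name and study_id value are placeholders):

```shell
qiime feature-table filter-samples \
  --i-table emp-table.qza \
  --m-metadata-file emp-mapping.tsv \
  --p-where "study_id='1883'" \
  --o-filtered-table emp-table-filtered.qza
```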

Luke

Thanks, Luke,

But it turns out that I cannot deblur my own sequences either (the “cgrb” sequences I referred to above). So I remain stuck and unable to analyze my data with qiime2 deblur. Dada2 ran on my sequences (but not on the EMP sequences) so I think my sequences are OK and maybe there is something wrong with deblur.

The error I get is:
“Plugin error from deblur:
No sequences passed the filter. It is possible the trim_length (%d) may exceed the longest sequence, that all of the sequences are artifacts like PhiX or adapter, or that the positive reference used is not representative of the data being denoised.”

The only one of these possibilities that seems possible is that the positive reference is incorrect. I just used the default reference (whatever came with qiime2). Was that a mistake? Do I need to figure out how to use a non-default reference?

Byron

@Byron_C_Crump,
Sorry this is still giving you trouble!

Your data are 16S, so the default positive reference should work.

No, that shouldn’t be necessary, but you can experiment with this using the denoise-other method, which allows you to input your own reference sequences.
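If you do want to experiment with an alternative reference, a denoise-other call might look like this (a sketch; the reference artifact is a placeholder you would supply as a FeatureData[Sequence] artifact):

```shell
qiime deblur denoise-other \
  --i-demultiplexed-seqs single-end-demux.qza \
  --i-reference-seqs my-reference-seqs.qza \
  --p-trim-length 120 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-stats stats.qza
```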

It looks like the sequences from that study are ~20 GB... I am happy to give this a test for you, but perhaps on a small test subset. For both the EMP sequences and the cgrb sequences, could you please:

  1. export your demultiplexed sequences
  2. use head -n 40 path-to-file.fastq > seqs.fastq to select the first 10 sequences
  3. Inspect those sequences to make sure they look okay (if not, select 10 sequences from elsewhere in the files)
  4. Import to QIIME 2
  5. Confirm that you get the same deblur error on those files
  6. Send along
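The subsetting in steps 1–3 can be checked with plain shell tools. Here is a self-contained sketch that builds a tiny stand-in fastq first (in practice you would point head at one of your exported per-sample files):

```shell
# Generate a small synthetic fastq as a stand-in for one exported sample file:
for i in $(seq 0 19); do
  printf '@read_%d\nACGTACGTACGT\n+\nIIIIIIIIIIII\n' "$i"
done > sample.fastq

# A fastq record is 4 lines, so the first 10 records are the first 40 lines:
head -n 40 sample.fastq > seqs.fastq

# Quick sanity check: every record header (lines 1, 5, 9, ...) starts with '@'
awk 'NR % 4 == 1 && $0 !~ /^@/ { exit 1 }' seqs.fastq

echo "records: $(( $(wc -l < seqs.fastq) / 4 ))"   # → records: 10
```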

Looks like QIITA can also trim at 125 nt, so I have started a 125 nt trimming/deblur workflow on the EMP dataset you linked to (it is possible for anyone with an account to submit workflows by clicking on a node in the "processing network" and clicking on the "process" button).

You could then trim the cgrb data to 125 nt, deblur locally, and merge your data.


Thanks for explaining about the processing capability in qiita! I see that you started the process and that the status is waiting. I will prepare to merge the results with my own dataset.

I will also prepare a small test dataset for you to process like you described.


Hi Nicholas,
I exported the demultiplexed EMP sequences, but am not sure what to do next. There are 2,454 .fastq.gz files. Do you want me to unzip one of them and select the first 10 sequences from it, import those 10 sequences into a new artifact and run deblur?
Byron

Yes, that would be the best way to subset if you already have demultiplexed sequences.

Good luck!

Here is the .qza of 10 sequences from the EMP dataset. dada2 and deblur failed to run on this small file. I got the same errors as I do when I run the large file. Thanks for the help!

single-end-demux_qiime2_forum.qza (5.6 KB)

I ran dada2 with this command:
qiime dada2 denoise-single --i-demultiplexed-seqs single-end-demux_qiime2_forum.qza --p-trim-left 0 --p-trunc-len 120 --p-n-threads 0 --p-chimera-method consensus --o-representative-sequences rep-seqs-dada2_qiime2_forum.qza --o-table table-dada2_qiime2_forum.qza --o-denoising-stats stats-dada2_qiime2_forum.qza

And got this error:
Plugin error from dada2:
No features remain after denoising. Try adjusting your truncation and trim parameter settings.
Debug info has been saved to /var/folders/14/2xdw_prn4_z9ds_35n20v8_80000gn/T/qiime2-q2cli-err-37q6eloj.log

I ran deblur with this command:
qiime deblur denoise-16S --i-demultiplexed-seqs single-end-demux_qiime2_forum.qza --p-trim-length 120 --p-jobs-to-start 20 --p-sample-stats --verbose --o-representative-sequences deblur-rep-seqs_qiime2_forum.qza --o-table deblur-table_qiime2_forum.qza --o-stats deblur-stats_qiime2_forum.qza

And I got this error:
Plugin error from deblur:
No sequences passed the filter. It is possible the trim_length (%d) may exceed the longest sequence, that all of the sequences are artifacts like PhiX or adapter, or that the positive reference used is not representative of the data being denoised.
See above for debug info.