Import --i-reference-taxonomy taxonomy.tsv to .qza

@Nicholas_Bokulich may contradict what I’m about to say, but from our preliminary results it would appear that classify-consensus-blast and classify-consensus-vsearch have very similar performance in terms of precision and accuracy, at least for 16S. For ITS data, classify-consensus-blast seems to perform better in simulated data sets.

Again, from our preliminary results, classify-consensus-blast can run up to 20 times faster on the 97% OTUs (but classify-consensus-vsearch may not be unacceptably slow; it still shouldn’t take more than around 10 minutes, depending on how many sequences you end up classifying).

Finally, for those classifiers, you should set maxaccepts to 1, min-consensus to 0.99, and perc-identity to 0.97. However, you may wish to vary those recommendations depending on the purpose of your analysis.
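As a concrete sketch, those settings map onto the CLI flags as below (the file names rep-seqs.qza, ref-seqs.qza, ref-taxonomy.qza, and taxonomy.qza are placeholders, not from this thread):

```shell
# Sketch: classify-consensus-blast with the suggested parameters.
# Input/output file names are placeholders; substitute your own artifacts.
qiime feature-classifier classify-consensus-blast \
  --i-query rep-seqs.qza \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --p-maxaccepts 1 \
  --p-min-consensus 0.99 \
  --p-perc-identity 0.97 \
  --o-classification taxonomy.qza
```

Swapping classify-consensus-blast for classify-consensus-vsearch takes the same parameters.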

Also, it’s worth mentioning that using V6 doesn’t stop you from using fit-classifier-naive-bayes followed by classify-sklearn. The feature-classifier tutorial shows how to train the feature classifier for alternative primer pairs. Alternatively, you can train the feature classifier on the entire 16S sequence. Our preliminary tests show that using the full sequence doesn’t affect the results much for the naive Bayes classifier.
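A hedged sketch of that train-then-classify workflow (primer sequences and file names are placeholders; substitute your actual V6 primer pair and artifacts):

```shell
# 1) Extract the V6 region from full-length reference reads.
#    FWD_PRIMER / REV_PRIMER are placeholders for your V6 primers.
qiime feature-classifier extract-reads \
  --i-sequences ref-seqs.qza \
  --p-f-primer FWD_PRIMER \
  --p-r-primer REV_PRIMER \
  --o-reads ref-seqs-v6.qza

# 2) Train the naive Bayes classifier on the extracted region
#    (or on ref-seqs.qza directly, to train on the full 16S sequence).
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs-v6.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier-v6.qza

# 3) Classify your representative sequences.
qiime feature-classifier classify-sklearn \
  --i-classifier classifier-v6.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza
```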


Thank you @jairideout @BenKaehler,

I did train the classifier (97% OTU) for V6. It works well.

My results show a much higher proportion of unnamed sequences, “Bacteria;__” (~20% of total), compared to the QIIME 1 analysis.

Between QIIME 1 and 2, the abundance patterns of the other phyla across samples look similar, only with smaller percentages because of the “Bacteria;__” fraction in QIIME 2.

Any thoughts on why the “Bacteria;__” is so high?

Thanks @Bing_Guo, that’s not something that we’ve observed before.

Do you mean that the classification was

k__Bacteria; p__; c__; o__; f__; g__; s__

or

k__Bacteria?

If it is the latter then perhaps it may be related to how the confidence parameter is used differently between the RDP classifier and the QIIME 2 classifier.

If you’re willing to share your data I would be happy to have a look at it.


Hi @BenKaehler , here is the summary of my analysis.

I used only forward sequences from the Illumina results. The steps and summary are in slide 1. I used the trained 97% OTU classifier for QIIME 2.

The total counts after filtering are quite different between QIIME 1 and 2, but the total observation numbers are close, and the taxon abundances are similar except for “Bacteria;__” at the phylum level and “Bacteria;__;__” at the class level.


Thanks @Bing_Guo, my specialty is the q2-feature-classifier step, but it appears that you’re comparing the output of the two different pipelines, not just that stage. I think one of the QIIME gurus would probably have more valid input at this point. @jairideout or @gregcaporaso?

Hi @Bing_Guo, I’m not sure why the Bacteria; __ group is so much higher abundance in QIIME 2 than QIIME 1. Are you comparing the same type of classification algorithm between the two pipelines? If you’re using RDP Classifier in QIIME 1, you should compare that against the Naive Bayes classifier in QIIME 2; if you’re using the uclust classifier in QIIME 1, you should compare that against the vsearch classifier in QIIME 2.


Hi @gregcaporaso,

I used uclust in QIIME 1 and naive Bayes in QIIME 2, so that was a mismatched comparison. I will try vsearch for QIIME 2 and post the results later.

But my other concern is that the reads per sample in QIIME 1 are much higher than the sequences per sample in QIIME 2. Is it only my samples, or is that generally the case?

Thank you.

Hello @gregcaporaso, you are absolutely right. I tried vsearch and it’s the same, although the “depth” is very different.

I’m wondering if the difference in frequency is because I used deblur instead of DADA2. Deblur is much faster than DADA2. I don’t have access to a supercomputer; I tried the Amazon cloud for a small test, and it didn’t finish one sample after a few hours. I have no idea how long it would take or how much it would cost for a project of approximately 10 million reads total; I’m worried it could come to hundreds of dollars.

Thanks

@Bing_Guo,
I primarily use DADA2 for my own analyses, and haven’t seen big differences in the number of sequences retained between QIIME 1 and QIIME 2 in those cases (for example, when running the same dataset with the two different pipelines). I would be interested to hear if the number of sequences was much different with DADA2.

Are you using DADA2 in multi-threaded mode (i.e., passing --p-n-threads with either 0, to use all available cores, or some value larger than 1 to specify the number of cores to use)? That should help speed up the run time if you can run it on a machine with multiple cores.

@benjjneb, do you have any thoughts on how long DADA2 should take to run on this dataset?


@Bing_Guo Did you try DADA2 on your dataset and find that it took a long time? If so we would like to know, as a dataset of that size (~10M reads spread across dozens(?) of samples) shouldn’t take that long to process, even on a laptop.

It’s hard to give accurate running-time estimates from just a number of reads, as running time more closely tracks the number of unique sequences in your data (which in turn depends on how diverse the underlying community is, how high your chimera rate was, how high your error rate was…), but if you turn on multithreading, I would expect that sort of dataset to complete within a day on a basic Amazon instance.

Again, let us know if that’s not the case!

Hi @benjjneb,

I did try DADA2 with one sample (84,000 reads in each of the forward and reverse files); after 5 hours it had not finished, so I Ctrl+C’d the job. I’m using a MacBook Air, which is not a powerful computing laptop.

Also, our lab is planning to invest in a new desktop for QIIME 2 computation. What would you suggest for the hardware (CPU, RAM, storage, etc.)?

Thanks a lot.

Did you use DADA2 in multi-threaded mode by passing --p-n-threads? If so, how many threads did you specify?

It's possible your job either ran out of memory, or too many threads were specified with --p-n-threads. If you end up rerunning DADA2 on your MacBook Air or on the Amazon cloud, I recommend monitoring both memory and CPU usage while the job is running to see if the system is getting overloaded.

You could also try running DADA2 on a very small subset of your sequence data, just to confirm that it can complete successfully on your computer.
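If it helps, here is a small sketch (plain Python, not a QIIME command; the file names and read count are hypothetical) of one way to take the first few thousand reads from a FASTQ file to build such a test subset. Apply it to both the forward and reverse files so the reads stay paired:

```python
import gzip
import itertools

def subsample_fastq(in_path, out_path, n_reads=1000):
    """Copy the first n_reads records (4 lines each) of a FASTQ file.

    A quick way to build a tiny test input for DADA2, just to confirm
    the pipeline completes on your machine. Handles plain or gzipped
    FASTQ based on the file extension.
    """
    opener = gzip.open if in_path.endswith(".gz") else open
    with opener(in_path, "rt") as fin, opener(out_path, "wt") as fout:
        for line in itertools.islice(fin, n_reads * 4):
            fout.write(line)
```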

This question has come up a couple of times before, see these forum topics:


That is highly unexpected; such a small dataset should complete in seconds or minutes at most...

Could you share the data from a sample that is taking so long so I can confirm (or not) that behavior? And which version of Q2 do you have? (also, do the Q2 masters have any suggestions on diagnosing comp performance issues?)

Edit: @jairideout had a couple good suggestions as well, especially confirming that dada2 can complete on a trivial dataset to rule out an installation issue.

Hi @jairideout, @benjjneb, @gregcaporaso,

I tried last night with --p-n-threads 0; the system activity monitor showed 4 threads and ~390% CPU (about 98% on each of 4 cores).

After a while, I got an error message.

$ qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza --o-table table --o-representative-sequences rep-seqs --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f 300 --p-trunc-len-r 300 --p-n-threads 0

Plugin error from dada2:

Command '['run_dada_paired.R', '/var/folders/5y/k_d_lfy57pxbg3l7j_3r9y
sh0000gn/T/tmp6rkj5go1/forward', '/var/folders/5y/k_d_lfy57pxbg3l7j_3r
9ysh0000gn/T/tmp6rkj5go1/reverse', '/var/folders/5y/k_d_lfy57pxbg3l7j_
3r9ysh0000gn/T/tmp6rkj5go1/output.tsv.biom',
'/var/folders/5y/k_d_lfy57pxbg3l7j_3r9ysh0000gn/T/tmp6rkj5go1/filt_f',
'/var/folders/5y/k_d_lfy57pxbg3l7j_3r9ysh0000gn/T/tmp6rkj5go1/filt_r',
'300', '300', '0', '0', '2.0', '2', 'consensus', '1.0', '0',
'1000000']' returned non-zero exit status 1

Debug info has been saved to /var/folders/5y/k_d_lfy57pxbg3l7j_3r9ysh0000gn/T/qiime2-q2cli-err-4vorneiu.log.

I installed QIIME 2 2017.6; I also have MacQIIME 1 installed.

Here is the qza I used.
https://drive.google.com/open?id=0B51bEY9nCyLXbm5sMm9aVXRDRlU

Thank you, everyone.

Thanks for sharing your data @Bing_Guo! I’m running the command locally now to see if I can reproduce the error. I’ll follow up here when that completes.

If the debug logfile is still around, can you post it? The file is located at /var/folders/5y/k_d_lfy57pxbg3l7j_3r9ysh0000gn/T/qiime2-q2cli-err-4vorneiu.log on your filesystem.

I was able to reproduce the error you’re seeing. It doesn’t appear to be a memory issue – the process used ~1.5GB memory on my system pretty consistently with 4 threads. Here’s the full debug output I received:

Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_paired.R /tmp/tmpcldhjrvi/forward /tmp/tmpcldhjrvi/reverse /tmp/tmpcldhjrvi/output.tsv.biom /tmp/tmpcldhjrvi/filt_f /tmp/tmpcldhjrvi/filt_r 300 300 0 0 2.0 2 consensus 1.0 0 1000000

R version 3.3.1 (2016-06-21) 
Loading required package: Rcpp
Warning messages:
1: multiple methods tables found for ‘arbind’ 
2: multiple methods tables found for ‘acbind’ 
3: replacing previous import ‘IRanges::arbind’ by ‘SummarizedExperiment::arbind’ when loading ‘GenomicAlignments’ 
4: replacing previous import ‘IRanges::acbind’ by ‘SummarizedExperiment::acbind’ when loading ‘GenomicAlignments’ 
5: multiple methods tables found for ‘left’ 
6: multiple methods tables found for ‘right’ 
DADA2 R package version: 1.4.0 
1) Filtering .
2) Learning Error Rates
2a) Forward Reads
Initializing error rates to maximum possible estimate.
Sample 1 - 72607 reads in 40326 unique sequences.
   selfConsist step 2 
   selfConsist step 3 
   selfConsist step 4 
   selfConsist step 5 
   selfConsist step 6 
   selfConsist step 7 
   selfConsist step 8 
   selfConsist step 9 
   selfConsist step 10 

Warning message:
In dada(drpsF, err = NULL, selfConsist = TRUE, multithread = multithread) :
  Self-consistency loop terminated before convergence.
2b) Reverse Reads
Initializing error rates to maximum possible estimate.
Sample 1 - 72607 reads in 40326 unique sequences.
   selfConsist step 2 
   selfConsist step 3 
   selfConsist step 4 
   selfConsist step 5 
   selfConsist step 6 
   selfConsist step 7 
   selfConsist step 8 
   selfConsist step 9 
   selfConsist step 10 

Warning message:
In dada(drpsR, err = NULL, selfConsist = TRUE, multithread = multithread) :
  Self-consistency loop terminated before convergence.

3) Denoise remaining samples 
4) Remove chimeras (method = consensus)
Error in isBimeraDenovoTable(unqs[[i]], ..., verbose = verbose) : 
  Input must be a valid sequence table.
Calls: removeBimeraDenovo -> isBimeraDenovoTable
In addition: Warning message:
In is.na(colnames(unqs[[i]])) :
  is.na() applied to non-(list or vector) of type 'NULL'
Execution halted
Traceback (most recent call last):
  File "/home/jairideout/miniconda3/envs/qiime2-2017.7/lib/python3.5/site-packages/q2_dada2/_denoise.py", line 179, in denoise_paired
    run_commands([cmd])
  File "/home/jairideout/miniconda3/envs/qiime2-2017.7/lib/python3.5/site-packages/q2_dada2/_denoise.py", line 35, in run_commands
    subprocess.run(cmd, check=True)
  File "/home/jairideout/miniconda3/envs/qiime2-2017.7/lib/python3.5/subprocess.py", line 398, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['run_dada_paired.R', '/tmp/tmpcldhjrvi/forward', '/tmp/tmpcldhjrvi/reverse', '/tmp/tmpcldhjrvi/output.tsv.biom', '/tmp/tmpcldhjrvi/filt_f', '/tmp/tmpcldhjrvi/filt_r', '300', '300', '0', '0', '2.0', '2', 'consensus', '1.0', '0', '1000000']' returned non-zero exit status 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jairideout/miniconda3/envs/qiime2-2017.7/lib/python3.5/site-packages/q2cli/commands.py", line 222, in __call__
    results = action(**arguments)
  File "<decorator-gen-138>", line 2, in denoise_paired
  File "/home/jairideout/miniconda3/envs/qiime2-2017.7/lib/python3.5/site-packages/qiime2/sdk/action.py", line 201, in callable_wrapper
    output_types, provenance)
  File "/home/jairideout/miniconda3/envs/qiime2-2017.7/lib/python3.5/site-packages/qiime2/sdk/action.py", line 334, in _callable_executor_
    output_views = callable(**view_args)
  File "/home/jairideout/miniconda3/envs/qiime2-2017.7/lib/python3.5/site-packages/q2_dada2/_denoise.py", line 194, in denoise_paired
    " and stderr to learn more." % e.returncode)
Exception: An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.

The key part of the debug log is:

4) Remove chimeras (method = consensus)
Error in isBimeraDenovoTable(unqs[[i]], ..., verbose = verbose) : 
  Input must be a valid sequence table.
Calls: removeBimeraDenovo -> isBimeraDenovoTable

This error message is being raised within the DADA2 R package. My guess is that all sequences are being filtered out at some previous step in the DADA2 processing pipeline, but after Googling around I haven’t found any info/questions regarding the error message.

@benjjneb do you have any ideas? The .qza and q2-dada2 command are posted above – it took several hours for the error to be raised when running locally with 4 threads, just a heads up.

There are a couple of problems here. First, these forward and reverse sequence files can't actually be right: they are identical to one another. This results in no sequences being merged, and the unintuitive error message above, which should be improved on the plugin side.

The second problem is that the primers are not removed: the V6 primer, with three ambiguous nucleotides (AAACTYAAAKGAATTGRCGG), is at the start of each read. Those ambiguous nucleotides appear as real variation to exact sequence variant methods like DADA2, and cause the long run times and over-inflated diversity at the end, as every biological variant gets split eight ways.
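To see where the eight-way split comes from, here is a small illustration (plain Python, not part of QIIME) expanding the IUPAC ambiguity codes in that primer:

```python
from itertools import product

# IUPAC ambiguity codes that appear in this V6 primer; each matches
# more than one base, so an un-trimmed primer multiplies every
# biological variant by the product of the per-code options.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "Y": "CT", "K": "GT", "R": "AG"}

def expansions(primer):
    """All concrete sequences an ambiguous primer can appear as."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in primer))]

# Three two-way codes (Y, K, R) -> 2**3 = 8 distinct primer sequences,
# which is exactly the 8-way split described above.
variants = expansions("AAACTYAAAKGAATTGRCGG")
```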

So two things need to happen: (1) find the right reverse files or use the forward reads alone, and (2) remove the primers before processing, or with the --p-trim-left [FWD_PRIMER_LENGTH] argument if using DADA2.
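Assuming the corrected reverse file and the 20-nt forward primer above, the fixed command might look like the following sketch (the reverse trim length, truncation lengths, and file names are placeholders taken from earlier in the thread, not a confirmed recipe):

```shell
# Sketch only: trim the 20-nt V6 forward primer from each forward read.
# Adjust --p-trim-left-r to your actual reverse primer's length.
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trim-left-f 20 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 300 \
  --p-trunc-len-r 240 \
  --p-n-threads 0 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza
```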


Thanks a lot @benjjneb

You are right. I corrected the sequence files, and trimmed the primer sequences.

In the end, for this one sample, it took 1 hour 15 minutes to run DADA2 denoise-paired.

Some more observations:

  • multi-threading did not speed things up in this case;
  • truncating the reverse sequences at 240 (from 300) resulted in a higher total number of good-quality sequences (in table.qza): 26k vs. 52k;
  • if different truncation lengths give different results, where should we draw the line so that the Q-scores are good enough without compromising too much of the forward/reverse overlap length?

Thanks again @jairideout, @benjjneb, @BenKaehler, @gregcaporaso.

Bing


I'm not sure why that would be. Even on a laptop, there should be 2-4 threads available, and when I test nthreads=1,2,4 with this data on my own Mac laptop I get the expected near-linear speedup. Dunno.

The F/R reads must overlap enough to merge them (20 nts + biological_length_variation of overlap), but provided that requirement is met, it is better to trim off the very low-quality tails that often occur in the reverse Illumina reads.
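That rule of thumb can be turned into a quick sanity check (plain Python; the 20-nt minimum comes from the post above, while the amplicon length in the example is purely hypothetical and should be replaced with your own region's length):

```python
MIN_OVERLAP = 20  # rule-of-thumb minimum, plus biological length variation

def overlap_after_truncation(trunc_len_f, trunc_len_r, amplicon_len):
    """Bases of forward/reverse overlap left after truncation.

    The forward read covers amplicon positions [0, trunc_len_f) and the
    reverse read covers [amplicon_len - trunc_len_r, amplicon_len), so
    the overlap is the sum of kept lengths minus the amplicon length.
    """
    return trunc_len_f + trunc_len_r - amplicon_len

def can_merge(trunc_len_f, trunc_len_r, amplicon_len, min_overlap=MIN_OVERLAP):
    """True if the chosen truncation lengths leave enough overlap to merge."""
    return overlap_after_truncation(trunc_len_f, trunc_len_r, amplicon_len) >= min_overlap
```

For a hypothetical 450-nt amplicon, truncating at 300/240 leaves 90 nt of overlap, which is comfortably above the minimum; truncating both reads much harder (say 240/180) would drop below it.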

Exact sequence variant methods like the DADA2 algorithm rely on repeated observations of identical sequences to call biological sequences. When sequences have a very low probability of being error-free over the whole read (as is the case when low-quality, error-rich sequence tails are kept), the algorithm will fail to identify low-frequency variants, because there are only 1 or 0 error-free reads representing those variants. This is what caused you to identify more sequence variants when trimming the reverse read to 240: the algorithm was better able to detect the low-frequency variants from the reverse data.

