q2-sidle reconstruct-counts pluging error

elsamdea · June 29, 2021, 4:17pm

Hi everyone!!

I am playing with the great q2-sidle to try to analyze five amplicons of five hipervariable regions of the 16S rRNA bacterial gen. However, when I run qiime sidle reconstruct-counts command, I get the next error message:

 qiime sidle reconstruct-counts --p-region AmpliconR1_GG --i-kmer-map /home/elsa/Documentos/Doctorado/QIIME2/SMURF-CCR/green_database/db_ampliconR1/ggenes-13-8-db-Amplicon-R1-100nt-map.qza --i-regional-alignment R1_CCR16_6samples-align-map.qza --i-regional-table /home/elsa/Documentos/Doctorado/QIIME2/SMURF-CCR/v2smurf_11062021-271140879/FASTQ_Generation_2021-06-14_14_52_34Z-427957897/CCR16_6TIPOS/table-R1-TRIM-CA-CCR16-6samples-100nt.qza --p-region AmpliconR2_GG --i-kmer-map /home/elsa/Documentos/Doctorado/QIIME2/SMURF-CCR/green_database/db_ampliconR2/ggenes-13-8-db-Amplicon-R2-100nt-map.qza --i-regional-alignment R2_CCR16_6samples-align-map.qza --i-regional-table /home/elsa/Documentos/Doctorado/QIIME2/SMURF-CCR/v2smurf_11062021-271140879/FASTQ_Generation_2021-06-14_14_52_34Z-427957897/CCR16_6TIPOS/table-R2-TRIM-CA-CCR16-6samples-100nt.qza --p-region AmpliconR4_GG --i-kmer-map /home/elsa/Documentos/Doctorado/QIIME2/SMURF-CCR/green_database/db_ampliconR4/ggenes-13-8-99-db-Amplicon-R4-100nt-map.qza --i-regional-alignment R4_CCR16_6samples-align-map.qza --i-regional-table /home/elsa/Documentos/Doctorado/QIIME2/SMURF-CCR/v2smurf_11062021-271140879/FASTQ_Generation_2021-06-14_14_52_34Z-427957897/CCR16_6TIPOS/table-R4-TRIM-CA-CCR16-6samples-100nt.qza --p-region AmpliconR5_GG --i-kmer-map /home/elsa/Documentos/Doctorado/QIIME2/SMURF-CCR/green_database/db_ampliconR5/ggenes-13-8-99-db-Amplicon-R5-100nt-map.qza --i-regional-alignment R5_CCR16_6samples-align-map.qza --i-regional-table /home/elsa/Documentos/Doctorado/QIIME2/SMURF-CCR/v2smurf_11062021-271140879/FASTQ_Generation_2021-06-14_14_52_34Z-427957897/CCR16_6TIPOS/table-R5-TRIM-CA-CCR16-6samples-100nt.qza --p-n-workers 2 --o-reconstructed-table reconstruction-6samples-noR3/CCR16_6samples-table.qza --o-reconstruction-summary reconstruction-6samples-noR3/CCR16_6samples-summary.qza --o-reconstruction-map reconstruction-6samples-noR3/CCR16_6samples-map.qza --verbose
Traceback (most recent call last):
  File "/home/elsa/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/q2cli/commands.py", line 329, in __call__
    results = action(**arguments)
  File "<decorator-gen-219>", line 2, in reconstruct_counts
  File "/home/elsa/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/qiime2/sdk/action.py", line 244, in bound_callable
    outputs = self._callable_executor_(scope, callable_args,
  File "/home/elsa/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/qiime2/sdk/action.py", line 390, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/home/elsa/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/q2_sidle/_reconstruct.py", line 101, in reconstruct_counts
    aligned_kmers = _get_unique_kmers(align_map['kmer'])
  File "/home/elsa/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/q2_sidle/_reconstruct.py", line 525, in _get_unique_kmers
    kmers = np.hstack([[a.split("@")[0] for a in kmer.split('|')]
  File "<__array_function__ internals>", line 5, in hstack
  File "/home/elsa/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/numpy/core/shape_base.py", line 346, in hstack
    return _nx.concatenate(arrs, 1)
  File "<__array_function__ internals>", line 5, in concatenate
ValueError: need at least one array to concatenate

Plugin error from sidle:

  need at least one array to concatenate

See above for debug info.

As you can see, I have indicated that I want to study five regions and I only include four in the command. The reason why I did not add the R3 region it is because I can't denoise it after trimming with dada2 as I did with the others.

I thought that maybe it could be an out of memory problem, but I reanalyzed the data in other server and still got the same error:

    qiime dada2 denoise-paired --i-demultiplexed-seqs R3-trimmed-CA-CCR16-6samples.qza --p-trim-left-f 4 --p-trim-left-r 2 --p-trunc-len-f 131 --p-trunc-len-r 136 --o-table table-R3-trimmed-CA-CCR16-6samples.qza --o-representative-sequences rep-seqs-R3-trimmed-CA-CCR16-6samples.qza --o-denoising-stats stats-R3-trimmed-CA-CCR16-6samples.qza
Plugin error from dada2:

  An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.

Debug info has been saved to /tmp/qiime2-q2cli-err-a8ocf3ae.log
(qiime2-2021.4) elsa@compostela:~/MICROBIOTA/v2smurf_11062021-271140879/FASTQ_Generation_2021-06-14_14_52_34Z-427957897/CCR16_6TIPOS$ cat /tmp/qiime2-q2cli-err-a8ocf3ae.log
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_paired.R /tmp/tmp6pzax_r6/forward /tmp/tmp6pzax_r6/reverse /tmp/tmp6pzax_r6/output.tsv.biom /tmp/tmp6pzax_r6/track.tsv /tmp/tmp6pzax_r6/filt_f /tmp/tmp6pzax_r6/filt_r 131 136 4 2 2.0 2.0 2 12 independent consensus 1.0 1 1000000

R version 4.0.3 (2020-10-10) 
Loading required package: Rcpp
DADA2: 1.18.0 / Rcpp: 1.0.6 / RcppParallel: 5.1.2 
1) Filtering The filter removed all reads: /tmp/tmp6pzax_r6/filt_f/CCR-16-1-FFPE-Gmet_S56_L001_R1_001.fastq.gz and /tmp/tmp6pzax_r6/filt_r/CCR-16-1-FFPE-Gmet_S56_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmp6pzax_r6/filt_f/CCR-16-1-FFPE-GsinMet_S57_L001_R1_001.fastq.gz and /tmp/tmp6pzax_r6/filt_r/CCR-16-1-FFPE-GsinMet_S57_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmp6pzax_r6/filt_f/CCR-16-1-FFPE-TumGrasa_S55_L001_R1_001.fastq.gz and /tmp/tmp6pzax_r6/filt_r/CCR-16-1-FFPE-TumGrasa_S55_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmp6pzax_r6/filt_f/CCR-16-1-FFPE-Tumor-A_S53_L001_R1_001.fastq.gz and /tmp/tmp6pzax_r6/filt_r/CCR-16-1-FFPE-Tumor-A_S53_L001_R2_001.fastq.gz not written.
The filter removed all reads: /tmp/tmp6pzax_r6/filt_f/CCR-16-1-FFPE-Tumor-B_S54_L001_R1_001.fastq.gz and /tmp/tmp6pzax_r6/filt_r/CCR-16-1-FFPE-Tumor-B_S54_L001_R2_001.fastq.gz not written.
Some input samples had no reads pass the filter.
xx.xxx
2) Learning Error Rates
127 total bases in 1 reads from 1 samples will be used for learning the error rates.
134 total bases in 1 reads from 1 samples will be used for learning the error rates.
3) Denoise samples .
.
4) Remove chimeras (method = consensus)
Error in isBimeraDenovoTable(unqs[[i]], ..., verbose = verbose) : 
  Input must be a valid sequence table.
Calls: removeBimeraDenovo -> isBimeraDenovoTable
Ejecución interrumpida
Traceback (most recent call last):
  File "/home/elsa/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/q2_dada2/_denoise.py", line 266, in denoise_paired
    run_commands([cmd])
  File "/home/elsa/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/q2_dada2/_denoise.py", line 36, in run_commands
    subprocess.run(cmd, check=True)
  File "/home/elsa/miniconda3/envs/qiime2-2021.4/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['run_dada_paired.R', '/tmp/tmp6pzax_r6/forward', '/tmp/tmp6pzax_r6/reverse', '/tmp/tmp6pzax_r6/output.tsv.biom', '/tmp/tmp6pzax_r6/track.tsv', '/tmp/tmp6pzax_r6/filt_f', '/tmp/tmp6pzax_r6/filt_r', '131', '136', '4', '2', '2.0', '2.0', '2', '12', 'independent', 'consensus', '1.0', '1', '1000000']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/elsa/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/q2cli/commands.py", line 329, in __call__
    results = action(**arguments)
  File "<decorator-gen-616>", line 2, in denoise_paired
  File "/home/elsa/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/qiime2/sdk/action.py", line 244, in bound_callable
    outputs = self._callable_executor_(scope, callable_args,
  File "/home/elsa/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/qiime2/sdk/action.py", line 390, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/home/elsa/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/q2_dada2/_denoise.py", line 279, in denoise_paired
    raise Exception("An error was encountered while running DADA2"
Exception: An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.

So, I think that, perphas, these two errors are interrelated.

Thank you for your help!

Elsa

jwdebelius · June 29, 2021, 8:04pm

Hi @elsamdea,

The error is telling you that there's a problem with the aligned sequences and the mapping.

could you please try running qiime metadata tabulate passing in all your alignment artifacts and sharing the results?

Best,
Justine

elsamdea · June 29, 2021, 9:46pm

Hi @jwdebelius,

You are right, there's a problem with the aligned sequences. When I run qiime metadata tabulate in the five files, I get:

It is the same message in the five, so I only share it once.

May I try to rerun the qiime sidle align-regional-kmers command again with each file?

Oh, and here it is the visualization of the map file for this region:

Thank you!

Elsa

jwdebelius · June 29, 2021, 10:14pm

Thanks @elsamdea,

Could you attach your table summary (qiime feature-table summarize) as the artifact, so I can peak at the provenance? What are your alignment parameters?

Can you show me your commands for building the database?

Best,
Justine

elsamdea · June 30, 2021, 8:35am

Sure @jwdebelius!

Here it is the table summary of the first region (I can upload the others too, if you want):

(upload://iV4Dx2hETav68NukU1cgslToxQK.png)

The commands were the following:

Download the database greengenes:

wget ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz

File Descompresion
tar xzvf gg_13_8_otus.tar.gz
Import to x

TAXONOMY
qiime tools import --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat --input-path taxonomy/99_otu_taxonomy.txt --output-path green-genes-99-otu-taxonomy.qzaImported taxonomy/99_otu_taxonomy.txt as HeaderlessTSVTaxonomyFormat to green-genes-99-otu-taxonomy.qza

REP-SEQS
qiime tools import --type 'FeatureData[Sequence]' --input-path rep_set/99_otus.fasta --output-path not-aligned-rep-seq-green-genes-99-otu.qza
Visualization

qiime feature-table tabulate-seqs --i-data not-aligned-rep-seq-green-genes-99-otu.qza --o-visualization not-aligned-rep-seq-green-genes-99-otu.qzv

Filtering the database: degenerated sequences and taxas exclusion:

qiime sidle filter-degenerate-sequences --i-sequences not-aligned-rep-seq-green-genes-99-otu.qza --p-max-degen 3 --o-filtered-sequences sidle-rep-seq-notaligned-GGenes-99-otu.qza

The original SMURF tool suggested the value of 3 in the parameter --p-max-degen (the maximum number of degenerate sequences for a sequence to be retained).

Taxas exclusion:

qiime taxa filter-seqs --i-sequences sidle-rep-seq-notaligned-GGenes-99-otu.qza --i-taxonomy taxonomy-green-genes-99-otu.qza --p-include __p --p-exclude mitochondria,chloroplast,metazoa --p-mode contains --o-filtered-sequences ggenes-13-8-db-99-otu-filtered-by-phylum-tax-exclude-mitochondria-chloroplast-metazoa.qza

Regional Database Preparation

6.1. Extract reads

`qiime feature-classifier extract-reads --i-sequences /home/elsa/Documentos/Doctorado/QIIME2/SMURF-CCR/v2smurf_11062021-271140879/FASTQ_Generation_2021-06-14_14_52_34Z-427957897/CCR16_6TIPOS/ggenes-13-8-db-99-otu-filtered-by-phylum-tax-exclude-mitochondria-chloroplast-metazoa.qza 
--p-f-primer TGGCGAACGGGTGAGTAA --p-r-primer CCGTGTCTCAGTCCCARTG --o-reads ggenes-13-8-99-filt-F1-R1-set1-primers.qza

6.2. Prepare extracted region

qiime sidle prepare-extracted-region --i-sequences silva-138-99-filt-F5-R5-set5-primers.qza --p-region "AmpliconR1_GG" --p-fwd-primer TGGCGAACGGGTGAGTAA --p-rev-primer CCGTGTCTCAGTCCCARTG --p-trim-length 100 --o-collapsed-kmers ggenes-13-8-db-Amplicon-R1-100nt-kmers.qza --o-kmer-map ggenes-13-8-db-Amplicon-R1-100nt-map.qza

Thanks!

Elsa

jwdebelius · June 30, 2021, 3:32pm

Hi @elsamdea,

I need the artifact file (the qvz), not just screenshoots to be able to trouble shoot. However, it looks like you're having a problem with denoising. The issue isn't so much Sidle as that few sequences survived denosing (you've got 176 on average which is very low) and only 14 features. So, you need to focus on working through denoising before you can run Sidle.

Best,
Justine

elsamdea · July 1, 2021, 12:29pm

Hi @jwdebelius,

Sorry, I missunderstood you.

But yes, I have some troubles with the denoising. I have tried with dada2 and deblur, but I still keep few sequences and low number of features. I have tried with only forward reads, but the problem still persists. Maybe somebody could say other suggestion?

Again, thank you so much for your help! When I fix the denoising problem, I will continue with Sidle.

Regards,

Elsa

elsamdea · July 16, 2021, 8:40am

Hi @jwdebelius,

First of all, thank you for your patience.

I followed your adviced and worked on the denoising steps. Right now, I have good results (I think) after the denoising and dada2-trim-posthoc steps. (I found out the reason after my sequence lost: I didn't realize the --p-trunc-len-f/r "value" parameter trimmed the sequences which sequence length were inferior to the "value", I though it trimmed after this position of the sequence length). Also, I found out that the --p-indels parameter in cutadapt wasn't working good with my sequences.

Well, when I got these results, I though I could go to the next level: the Sequence Reconstruction Phase. However, this is resisting me.

So, I still have troubles with the alignment steps. In the moment I try to visualize the results with the command qiime metadata tabulate, I only obtain the error message: :"Metadata must contain at least one ID".

Maybe it is a problem related to the database step? I checked the kmer and map files and they look normal. Also, I re-did this initial phase of database preparation with two different databases: silva-138 y greengenes-13-8 and neither worked. I don't know where I could be failed.

Any help will be great!!

Best,

Elsa

jwdebelius · July 16, 2021, 4:46pm

Hi @elsamdea,

Thanks for being patient with me as we work through this!

I'm glad that you figured out your demultiplexing and the environment looks good. There is definitely a problem with alignment, and I don't think this time it's due to low sequence counts. (We're making progress !)

There are human or mammalian associated samples, right and 16S sequencing? I'd expect the database to cover them, so I don't think its a database matching isssue. Could you share an alignment file, even if it fails the metadata summary? You can dm it to me if you don't want to post it here.

Best,
Justine

jwdebelius · August 13, 2021, 6:44pm

Thanks for sending me the files @elsamdea!

Looking at the Dada2 parameters, it looks like you're truncating your data at 22 forward and 22 reverse in dada2 (trim-left-f and trim-left-r arae both 22). Im not sure if you maybe passed rep seqs after denoising (the one you used has UUID 4c5c1321-fe62-4bd4-a625-ac8dd9230525; you can double check other UUIDs with qiime2 view.

But, I think it should throw an error if the sequence lengths don't match.

Best,
Justine

elsamdea · August 17, 2021, 8:34am

Hi @jwdebelius!

Thank you so much!

I am not sure if I answered your question before: Yes, I am working with human samples that had been sequencing using Illumina technology.

So, it is about dada2 instead. I am truncating at 22 into the 5' end, because in this range the quality is still low. Am I wrong about this aproximation?

After denoising I used the "trim-dada2-posthoc" to be secure that all the sequences had the same length. I am not sure if that is what you are asking about.

I have compare some sequences of the region 1 from the kmers (database) and rep-seqs (dada2 --> my samples) and there is not match between them. But I do not understand why the alignment command did not throw a bug if there were not alignments

Any suggestion about how should I proceed?

Thank you again for your time and help!!

Regards,

Elsa

jwdebelius · August 18, 2021, 3:07pm

Hi @elsamdea,

I think maybe you should relax your denosing parameters. You might even decide only to use the forward reads; that can be helpful if you're having drops in quality. I think navigating your denoising and retaining a read with at least 75-100nt for each region where at least 50% of your sequences make it through Dada2 would be a good place to start.

I'm not sure I can get the alignment to throw an error without doubling run time, which I'm trying to avoid. I can maybe get reconstruction to error sooner, would that be more helpful?

Best,
Justine

elsamdea · August 19, 2021, 11:41am

Hi @jwdebelius,

I have relaxed the dada2 parameters of trimming (setting --p-trim-left -f/-r and --p-trunc-len-f/-r to 0) and it works! After that the alignment step run!! (I made a comparation between the kmer db file and the rep-seqs dada2 file and find than the seqs did not start on the same position, so that was the error I was not awared of)...
However, I think I still need to adjust dada2... I am not feel confident with this step.

Yes, I think that is a good idea. Maybe an error message as "alignment file hasn't show any matches" or something similar?

Also, I was wondering about this: q2-sidle running out of memory at reconstruct-counts.

In this moment I am executing the reconstruction step (finally!! ) and (even though I set the --p-cores parameter to 8) it only uses one core. Maybe there is an estimation of the running time? It has been running for at least 8 hours instead.

Again, thank you so much for kindness and for responding so quickly!!

Best,

Elsa

elsamdea · August 31, 2021, 8:33am

Hi everyone!

I only want to highlight one thing about the parameter qiime sidle reconstruct-counts. It is neccesary to run the command in a highly computational environment if you are working with a huge dataset.

With "higly computational" I mean more than 64 GB of Ram memory.

Regards!

Elsa

jwdebelius · September 1, 2021, 8:00am

Hi @elsamdea,

The updated version (which I'm hoping will be released with the qiime2-2021.9(?) version should have solved this.

github.com/jwdebelius/q2-sidle

Memory refactor

jwdebelius:main ← jwdebelius:memory-refactor

opened 06:25PM - 11 Aug 21 UTC

jwdebelius

+1913 -1635

This addresses #25, where the code was failing at scale through a major refactor…. This PR: 1. Breaks the database reconstruction from the table reconstruction 2. Sets up the database reconstruction primarily through a dask distributed pipeline. With the new pipeline, I was able to run the Silva 138 database on a scale data example in ~15 minute on four core of my macbook. 3. Adds a pipeline to wrap reconstruction (database, table, and taxonomy) to make the processing easier @vitorheidrich - this is your issue, have you seen any more performance problems that need to be addressed? @mikerobeson, @boulund or @luhugerth - comments/review would be really nice Yes, it's a monster PR

I got Silva 138 with 6 regions to run on my 16GB laptop in like 30 minutes while doing other things. It's still in development, so not quite ready for relaasae, but hopefully that's an improvement for the memory issues.

Best,
Justine

elsamdea · September 1, 2021, 11:44am

Hi @jwdebelius!

That will be great!! I can't wait for it .

Best,

Elsa

system · October 2, 2021, 6:01pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.