rescript dereplicate error

devonorourke · September 30, 2020, 7:55pm

Curious if anyone has a quick suggestion on how to troubleshoot the following error. I'm trying to run rescript dereplicate to cluster a set of sequences while retaining the best taxonomy strings:

qiime rescript dereplicate \
    --i-sequences repSeqs.qza \
    --i-taxa sklearn_tax.qza \
    --p-mode 'super' \
    --p-perc-identity 0.985 \
    --p-threads 4 \
    --p-derep-prefix \
    --output-dir NB_p985

The initial VSEARCH clustering appears to run fine, but then the program crashes once it launches into the RESCRIPt section (think?). I get the following error message which is specific enough to tell me that I'm giving two bits of information when I should only be giving one... but I don't know what those two bits are!

Plugin error from rescript:

  Wrong number of items passed 2, placement implies 1

Maybe @SoilRotifer has seen this kind of message before? The full error message is below.
Thanks very much for any advice you can offer!

(rescript_2020.6) dorourke@bio653:clustseqs$ qiime rescript dereplicate --i-sequences /home/dorourke/projects/coi_diet/paper3/qiime/select_libs/reads/NHsampleOnly_nobat
ASV_repSeqs.qza --i-taxa /home/dorourke/projects/coi_diet/paper3/qiime/select_libs/tax/nbayes/tmp.raw_bigDB_NBtax.qza --p-mode 'super' --p-perc-identity 0.985 --p-threa
ds 4 --p-rank-handles 'greengenes' --p-derep-prefix --output-dir naiveBayes_clust --verbose
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --derep_prefix /tmp/qiime2-archive-taibf4fc/772fc70b-e23c-40eb-9509-27cdcbcb614e/data/dna-sequences.fasta --output /tmp/tmpz39bwqsp --uc /tmp/tmp9jo0o8
pt --qmask none --xsize --threads 4

vsearch v2.7.0_linux_x86_64, 251.9GB RAM, 16 cores
https://github.com/torognes/vsearch

Reading file /tmp/qiime2-archive-taibf4fc/772fc70b-e23c-40eb-9509-27cdcbcb614e/data/dna-sequences.fasta 100%
1659488 nt in 9134 seqs, min 181, max 200, avg 182
Sorting by length 100%
Dereplicating 100%
Sorting 100%
9127 unique sequences, avg cluster 1.0, median 1, max 2
Writing output file 100%
Writing uc file, first part 100%
Writing uc file, second part 100%
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --cluster_size /tmp/tmpz39bwqsp --id 0.985 --centroids /tmp/q2-DNAFASTAFormat-ovbrxohu --uc /tmp/tmp9jo0o8pt --qmask none --xsize --threads 4

vsearch v2.7.0_linux_x86_64, 251.9GB RAM, 16 cores
https://github.com/torognes/vsearch

Reading file /tmp/tmpz39bwqsp 100%
1658210 nt in 9127 seqs, min 181, max 200, avg 182
Sorting by abundance 100%
Counting k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 5483 Size min 1, max 65, avg 1.7
Singletons: 3874, 42.4% of seqs, 70.7% of clusters
Traceback (most recent call last):
  File "/home/dorourke/miniconda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Taxon'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dorourke/miniconda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1069, in set
    loc = self.items.get_loc(item)
  File "/home/dorourke/miniconda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Taxon'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dorourke/miniconda/envs/rescript_2020.6/lib/python3.6/site-packages/q2cli/commands.py", line 329, in __call__
    results = action(**arguments)
  File "<decorator-gen-151>", line 2, in dereplicate
  File "/home/dorourke/miniconda/envs/rescript_2020.6/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
    output_types, provenance)
  File "/home/dorourke/miniconda/envs/rescript_2020.6/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/home/dorourke/miniconda/envs/rescript_2020.6/lib/python3.6/site-packages/rescript/dereplicate.py", line 54, in dereplicate
    taxa, sequences, clustered_seqs, uc, mode=mode)
  File "/home/dorourke/miniconda/envs/rescript_2020.6/lib/python3.6/site-packages/rescript/dereplicate.py", line 116, in _dereplicate_taxa
    uc['Taxon'] = uc['seqID'].apply(lambda x: taxa.loc[x])
  File "/home/dorourke/miniconda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/frame.py", line 3487, in __setitem__
    self._set_item(key, value)
  File "/home/dorourke/miniconda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/frame.py", line 3565, in _set_item
    NDFrame._set_item(self, key, value)
  File "/home/dorourke/miniconda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/generic.py", line 3381, in _set_item
    self._data.set(key, value)
  File "/home/dorourke/miniconda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1072, in set
    self.insert(len(self.items), item, value)
  File "/home/dorourke/miniconda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1181, in insert
    block = make_block(values=value, ndim=self.ndim, placement=slice(loc, loc + 1))
  File "/home/dorourke/miniconda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 3284, in make_block
    return klass(values, ndim=ndim, placement=placement)
  File "/home/dorourke/miniconda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 2792, in __init__
    super().__init__(values, ndim=ndim, placement=placement)
  File "/home/dorourke/miniconda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 128, in __init__
    "{mgr}".format(val=len(self.values), mgr=len(self.mgr_locs))
ValueError: Wrong number of items passed 2, placement implies 1

SoilRotifer · September 30, 2020, 8:56pm

Hi @devonorourke, this might be the issue:

It appears that the label Taxon is can not be accessed.

Can you do the following sanity checks:

Can you check that the taxonomy column has a header labeled Taxon.
Does this occur when using a different setting for --p-mode, e.g. uniq, lca ?

-Thanks!

devonorourke · October 1, 2020, 1:38pm

Thanks @SoilRotifer,

I didn't end up testing your second sanity check (checking different --p-mode settings). However, in response to your first question about checking for the taxonomy column to have a Taxon header...

I exported the initial taxonomy.qza file. It had 3 column headers, one of which was Taxon. Here's a snippet of the .tsv file that was the result of the qiime tools export command:

Feature ID      Taxon   Confidence
0001497ccb3e484868f0c5d332e65cb3        Animalia;Arthropoda;Insecta;Lepidoptera;Geometridae;Plagodis;Plagodis serinaria 0.92337497915378
0002dd2a04ee401332d8d1e9ecb61057        Animalia;Arthropoda;Insecta;Diptera;Chironomidae;Cryptochironomus       0.9076573393973908
000ab0c0866f35dc9800d62b6b3351f1        Animalia;Arthropoda;Insecta;Lepidoptera;Tortricidae;Pandemis    0.9488694083711045

I was about to write you a question wondering whether or not that third column, Confidence, was something that might be problematic. One of the error messages suggested that there were 2 items passed instead of just 1, and this got me thinking that maybe RESCRIPt just wanted two columns (FeatureID and Taxon), and having that third column was giving the script the headache.

So, I deleted the Confidence column in the .tsv file, reimported it as a new .qza file, and then reran the command. Sure enough, it finished without any error, and produced a pair of taxonomy/sequence .qza files that seem perfectly reasonable.

Thus, I think I have a solution, but would you argue that this is still a potential bug? I don't think this a version incompatibility thing (though note that I created this taxonomy .qza file running QIIME v.2019.4 using qiime feature-classifier classify-sklearn). I'm pretty sure every version of classify-sklearn is going to produce that third Confidence field, right? Maybe @Nicholas_Bokulich could confirm (Hi Dr. B, hope the new setting is lovely across the pond ). Do you think it would be possible to retain whatever the Confidence value is for the resulting FeatureID retained following dereplication in the taxonomy output?

Thanks for the help!

SoilRotifer · October 1, 2020, 1:55pm

Nice catch @devonorourke!

However, I do not think rescript dereplicate was ever intended to be run on classifier output, as you are doing here. That is, we expect a basic taxonomy file with two columns. However, that being said, both are ultimately FeatureData[Taxonomy] types. Which means we should consider handling any FeatureData[Taxonomy] input (two or three columns).

-Mike

devonorourke · October 1, 2020, 1:57pm

I need a Matrix-like way to plug in to the system and learn me some Python. Sorry to raise the issue without providing a patch. Thanks again for your assistance!

SoilRotifer · October 1, 2020, 2:01pm

No worries. I'll try and look into this in the coming weeks.

devonorourke · October 1, 2020, 2:23pm

Quick question Mike - sorry if this is embarrassingly obvious to you:
once I've run qiime rescript dereplicate and get the sequence and taxonomy files, how is it I'm going to create a Feature Table from this information? Is there a rescript command that will let me pass in the sequence file, along with the original feature-table that includes those features (plus the other features that were eliminated via the dereplication process), and produce a new feature-table with just those dereplicated sequences that sums together all the groups in a given dereplication unit (?centroid?).

For example, when I run qiime vsearch dereplicate-sequences, I get both a dereplicated sequence file and feature table. Similarly, I can do this for clustering with qiime vsearch cluster-features-de-novo. I think I can run the cluster-features-de-novo application to reproduce what rescript dereplicate did (given that I clustered the sequences).

Do you know if the FeatureIDs selected in both qiime vsearch cluster-features-de-novo and qiime rescript dereplicate would be identical, assuming the same parameters are passed? If that was true, I should be able to just rerun the qiime vsearch command and get my table without worrying about anything.

Thanks

Thanks again

SoilRotifer · October 1, 2020, 2:58pm

Nope. Mainly because RESCRIPt was not intended to construct feature-tables. Though I can see how this may be useful in some cases.

If memory serves, I think both approaches should retain whichever feature-ids are fed into it. You should be okay.

devonorourke · October 1, 2020, 3:09pm

Got it, thanks.

I just tested the following to commands to compare the number of sequence features that remain...

The vsearch clustering method produced 5,105 features (OTUs? what the heck do we all clustered ASVs from DADA2 anyway?)

qiime vsearch cluster-features-de-novo \
  --i-sequences DADA2_repSeqs.qza \
  --i-table DADA2_table.qza \
  --p-perc-identity 0.985 \
  --o-clustered-table clust985_table.qza \
  --o-clustered-sequences clust985_seqs.qza

However, the command with RESCRIPt produced a different number - 5,483 features:

qiime rescript dereplicate \
  --i-sequences DADA2_repSeqs.qza \
  --i-taxa nb_taxa.qza \
  --p-mode 'super' \
  --p-derep-prefix \
  --p-perc-identity 0.985 \
  --output-dir naiveBayes_clust

Strangely, the 5,105 features in VSEARCH are not a complete subset of the 5,483 features from RESCRIPt dereplicate (there were 4,180). Thus, it appears that my task won't be quite so easy. Back to the drawing board !

One thought - would it be possible to return the .uc file by amending this line in the Python script? What I'd like to be able to do is gather the information that provides some kind of map between the resulting ?centroid? FeatureIDs and all the FeatureIDs that connected to them within my defined percent identity - that's the .uc file, right? Once I have that .uc file, I think I can take it from there in R to aggregate the original feature-table appropriately.

But, again, if I knew Python! I'd be able to probably get that going myself.... argh. ...

It's not a problem though. I ran vsearch external to QIIME and recovered 5,483 clusters - just like what happens with RESCRIPt (which, good thing, because I tried basically copying what you put in the Python script!):

vsearch --cluster_size DADA2_repSeqs.fasta \
  --id 0.985 \
  --strand plus \
  --centroids vsearch_clust985.fasta \
  --uc vsearch_clust985.uc

I'm going to try to figure out how to use this .uc file now to parse my original feature table in R.
Man, this got more involved than I had anticipated this morning!
~~ more ~~

system · November 7, 2020, 5:59pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.