A collection of sequence and taxonomy data were obtained running @BenKaehler’s fantastic get-ncbi-data
command within RESCRIPt.
- Raw sequence data collected from
get-ncbi-data
are available here - Raw taxonomy data collected from
get-ncbi-data
are available here
I had intended on filtering and dereplicating these sequences using a series of RESCRIPt commands: cull-seqs
and filter-seqs-length
to filter, and dereplicate
to … well, dereplicate.
While the filtering commands completed successfully, I received an error when trying to dereplicate (I’ve posted the complete error message at the end of this post). After running the dereplication command in VSEARCH independent of QIIME (using the exact same settings described by the QIIME command), I was able to generate the necessary output .fasta and .uc files. So it was clear that the problem wasn’t with VSEARCH, it was with dealing with what happens after VSEARCH dereplication is complete.
I believe the error message I received indicated that a FeatureID present in the sequences .qza file was missing from the taxonomy .qza file. When looking at the total number of FeatureIDs in those original files obtained from get-ncbi-taxa
, the FeatureID flagged in the error message was present in the sequence .qza file, but absent from the taxonomy file. In fact, there are 8 more FeatureIDs in the sequences file than the taxonomy file.
For @BenKaehler, I wonder if the get-ncbi-data
scripts have an internal check to ensure that FeatureIDs are present in both the _seq.qza and _tax.qza files produced? Maybe those sequences didn’t have any taxonomy string, got filtered out from the _tax.qza file, but weren’t also removed from the _seqs.qza file?
For QIIME developers broadly, could you point me in the direction of any tool in the QIIME environment that might perform such a filter? Specifically, if I feed in a pair of sequence and taxonomy artifact files, is there a way to remove those sequences that lack an associated taxonomy Feature ID (and vice versa)?
Many thanks!
vsearch v2.7.0_linux_x86_64, 125.8GB RAM, 28 cores
https://github.com/torognes/vsearch
Reading file /tmp/dro49/32300484/qiime2-archive-bidyrm9g/d351bff1-e510-450b-82fc-7e77a79682b5/data/dna-sequences.fasta 100%
1625327668 nt in 2636250 seqs, min 250, max 1600, avg 617
Sorting by length 100%
Dereplicating 100%
Sorting 100%
1366562 unique sequences, avg cluster 1.9, median 1, max 4280
Writing output file 100%
Writing uc file, first part 100%
Writing uc file, second part 100%
Traceback (most recent call last):
File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'M27461.1'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/q2cli/commands.py", line 329, in __call__
results = action(**arguments)
File "<decorator-gen-151>", line 2, in dereplicate
File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
output_types, provenance)
File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in _callable_executor_
output_views = self._callable(**view_args)
File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/rescript/dereplicate.py", line 54, in dereplicate
taxa, sequences, clustered_seqs, uc, mode=mode)
File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/rescript/dereplicate.py", line 116, in _dereplicate_taxa
uc['Taxon'] = uc['seqID'].apply(lambda x: taxa.loc[x])
File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/series.py", line 4045, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/lib.pyx", line 2228, in pandas._libs.lib.map_infer
File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/rescript/dereplicate.py", line 116, in <lambda>
uc['Taxon'] = uc['seqID'].apply(lambda x: taxa.loc[x])
File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1424, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1850, in _getitem_axis
return self._get_label(key, axis=axis)
File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 160, in _get_label
return self.obj._xs(label, axis=axis)
File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/generic.py", line 3737, in xs
loc = self.index.get_loc(key)
File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'M27461.1'
Plugin error from rescript:
'M27461.1'
See above for debug info.
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.
Command: vsearch --derep_prefix /tmp/dro49/32300484/qiime2-archive-bidyrm9g/d351bff1-e510-450b-82fc-7e77a79682b5/data/dna-sequences.fasta --output /tmp/dro49/32300484/tmpf0b39_kl --uc /tmp/dro49/32300484/tmp9nij1py2 --qmask none --xsize --threads 1