dereplication error with data collected from `get-ncbi-data` command

devonorourke · August 18, 2020, 12:50pm

A collection of sequence and taxonomy data were obtained running @BenKaehler's fantastic get-ncbi-data command within RESCRIPt.

Raw sequence data collected from get-ncbi-data are available here
Raw taxonomy data collected from get-ncbi-data are available here

I had intended on filtering and dereplicating these sequences using a series of RESCRIPt commands: cull-seqs and filter-seqs-length to filter, and dereplicate to ... well, dereplicate.

While the filtering commands completed successfully, I received an error when trying to dereplicate (I've posted the complete error message at the end of this post). After running the dereplication command in VSEARCH independent of QIIME (using the exact same settings described by the QIIME command), I was able to generate the necessary output .fasta and .uc files. So it was clear that the problem wasn't with VSEARCH, it was with dealing with what happens after VSEARCH dereplication is complete.

I believe the error message I received indicated that a FeatureID present in the sequences .qza file was missing from the taxonomy .qza file. When looking at the total number of FeatureIDs in those original files obtained from get-ncbi-taxa, the FeatureID flagged in the error message was present in the sequence .qza file, but absent from the taxonomy file. In fact, there are 8 more FeatureIDs in the sequences file than the taxonomy file.

For @BenKaehler, I wonder if the get-ncbi-data scripts have an internal check to ensure that FeatureIDs are present in both the _seq.qza and _tax.qza files produced? Maybe those sequences didn't have any taxonomy string, got filtered out from the _tax.qza file, but weren't also removed from the _seqs.qza file?

For QIIME developers broadly, could you point me in the direction of any tool in the QIIME environment that might perform such a filter? Specifically, if I feed in a pair of sequence and taxonomy artifact files, is there a way to remove those sequences that lack an associated taxonomy Feature ID (and vice versa)?

Many thanks!

vsearch v2.7.0_linux_x86_64, 125.8GB RAM, 28 cores
https://github.com/torognes/vsearch

Reading file /tmp/dro49/32300484/qiime2-archive-bidyrm9g/d351bff1-e510-450b-82fc-7e77a79682b5/data/dna-sequences.fasta 100%
1625327668 nt in 2636250 seqs, min 250, max 1600, avg 617
Sorting by length 100%
Dereplicating 100%
Sorting 100%
1366562 unique sequences, avg cluster 1.9, median 1, max 4280
Writing output file 100%
Writing uc file, first part 100%
Writing uc file, second part 100%
Traceback (most recent call last):
  File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'M27461.1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/q2cli/commands.py", line 329, in __call__
    results = action(**arguments)
  File "<decorator-gen-151>", line 2, in dereplicate
  File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
    output_types, provenance)
  File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/rescript/dereplicate.py", line 54, in dereplicate
    taxa, sequences, clustered_seqs, uc, mode=mode)
  File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/rescript/dereplicate.py", line 116, in _dereplicate_taxa
    uc['Taxon'] = uc['seqID'].apply(lambda x: taxa.loc[x])
  File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/series.py", line 4045, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2228, in pandas._libs.lib.map_infer
  File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/rescript/dereplicate.py", line 116, in <lambda>
    uc['Taxon'] = uc['seqID'].apply(lambda x: taxa.loc[x])
  File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1424, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1850, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 160, in _get_label
    return self.obj._xs(label, axis=axis)
  File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/generic.py", line 3737, in xs
    loc = self.index.get_loc(key)
  File "/scratch/dro49/conda/envs/rescript_2020.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'M27461.1'

Plugin error from rescript:

  'M27461.1'

See above for debug info.
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --derep_prefix /tmp/dro49/32300484/qiime2-archive-bidyrm9g/d351bff1-e510-450b-82fc-7e77a79682b5/data/dna-sequences.fasta --output /tmp/dro49/32300484/tmpf0b39_kl --uc /tmp/dro49/32300484/tmp9nij1py2 --qmask none --xsize --threads 1

BenKaehler · August 18, 2020, 8:31pm

Hi @devonorourke, you are correct. get-ncbi-data does not guarantee that the set of accession ids present in the sequences file will exactly match those in the taxonomy file. It does, however, print a warning, like this one:

WARNING:2020-08-17 08:26:39,783:MainProcess:The following accessions did not have valid taxids: MT251879.1, MT250343.1, DI201849.1, DI201848.1, DI201847.1, DI201846.1, DI201845.1, M27461.1.
The bad taxids were: 12908, 81077, 0.

This is a feature, because otherwise you would lose some sequences that you may wish to add valid taxonomies for (by hand). Sadly, it is also a feature of some widely-used production databases.

I have offered to @Nicholas_Bokulich to contribute a method to RESCRIPt to perform the intersection filtering.

As a work around until that's done, please export the fasta from the sequences file, delete the sequences named above, and re-import. That should, at least, confirm your suspicions about what's going wrong.

Thanks very much!

devonorourke · August 18, 2020, 9:58pm

Thanks very much @BenKaehler for the input.

It never would have occurred to me to retain sequences without taxonomic identities, but glad to know this is something to be aware of.

My workaround was to export the _tax.qza file and make a separate file listing all those FeatureIDs present:

qiime tools export --input-path _tax.qza --output-path rawtax_dir
cut -f 1 ./rawtax_dir/taxonomy.tsv > keepers.txt

I then filtered the raw _seq.qza file using that list of FeatureIDs present in _tax.qza (keepers.txt):

qiime feature-table filter-seqs \
--i-data raw_seqs.qza \
--m-metadata-file keepers.txt \
--o-filtered-data _seqs_fixed.qza

I then went forward with dereplicating the resolved _seqs_fixed.qza file and the process worked.

Cheers

Nicholas_Bokulich · August 19, 2020, 7:40am

devonorourke:

My workaround was to export the _tax.qza file and make a separate file listing all those FeatureIDs present:
qiime tools export --input-path _tax.qza --output-path rawtax_dir
cut -f 1 ./rawtax_dir/taxonomy.tsv > keepers.txt

no need to export, just use the taxonomy QZA as your metadata file on this line:

--m-metadata-file _tax.qza would do the same thing without breaking provenance.

devonorourke · August 19, 2020, 1:41pm

You can just use the .qza directly !?
Thanks for the pro tip!

Nicholas_Bokulich · August 19, 2020, 1:50pm

not any QZA, mind you. There are specific types that are transformable to Metadata... but most types can transform (either as sample or feature metadata, depending on the file), and FeatureData[Taxonomy] is a type that can be viewed as Metadata.

If you ever feel like some data type would be really useful as metadata (e.g., for filtering or tabulating) but you're not sure if it is transformable, just try using it and QIIME 2 will tell you if you've done something objectionable!

BenKaehler · August 25, 2020, 1:15am

To anyone following this thread, the behaviour of get-ncbi-data has changed. If it can't find a taxonomy that it understands for a particular sequence, it discards that sequence from the output. If you run it in --verbose mode it will print a warning if that happens.

Note that this happens very rarely, like 8 times out of the 2.6 million sequences downloaded in this example.

So the above problem shouldn't be a problem any more. (Although it's nice that @devonorourke picked up some tips from @Nicholas_Bokulich as a result.)

system · September 25, 2020, 7:15am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.