RDP Reference Database in QIIME2 format

KMaki · May 28, 2020, 8:35pm

Hello-
First, I am not sure the best place to classify this post so I apologize if this post is in the wrong spot! I have seen related posts on this subject and hoping if it can be solved may be of use to the QIIME2 community.

I am trying to format the RDP database to be a compatible format to be used with QIIME2, but I am getting a consistent error when running vsearch to classify taxonomy (qiime feature-classifier classify-consensus-vsearch). My qiime commands have been successfully executed to classify on silva and Greengenes ref databases so I don’t think it’s the vsearch script. I had tried to reformat the taxonomy and fasta files and have had this error a couple times but different identifiers were listed. This is the error:

“Plugin error from feature-classifier:
'Identifier 135 was reported in taxonomic search results, but was not present in the reference taxonomy.'
Debug info has been saved to /tmp/qiime2-q2cli-err-usbrditl.log”

Below is my workflow from the beginning:

Went to RDP resources website: RDPresources
I downloaded fasta file for unaligned 16S data that has the taxonomy included in the file (current_Bacteria_unaligned.fa.gz). Then after unzipping, I used grep to extract out the taxonomy headers and edited the file in python to match silva/Greengenes taxonomy from the QIIME2 resources.

Everything looks like it matches up with the taxonomy format for other db's in qiime2 resources and I was able to import the taxonomy as a .qza artifact without issue using below.

qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path RDP/rdp_qiime_taxonomy.txt \
--output-path rdp_16S_v16_ref-taxonomy.qza

Next I formatted the fasta file to be match the other ref databases and I also was able to import to a qiime artifact with the below script with no issues:

qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path RDP/rep_set_99_rdp.fa \
--output-path rdp_16S_v16_otus.qza

DADA2 was then used for denoising and then I ran the below script to classify features:
qiime feature-classifier classify-consensus-vsearch
--i-query merged_v67_rep_seqs_pyro_trun150.qza
--i-reference-reads rdp_16S_v16_otus.qza
--i-reference-taxonomy rdp_16S_v16_ref-taxonomy.qza
--o-classification v67_rdp_vs_dada2trun150_rdp_complete.qza

The files are too large to attach here so figshare links are below. Please let me know if there's an issue opening.
taxonomy
ref_otus

I am still getting the hang of editing in unix, but my taxonomy file looked okay when I had it in python. Any assistance would be appreciated and if we can get it formatted correctly, would be happy to share this as a resource if RDP is a database of interest for QIIME2.

Thanks-
Katherine

Nicholas_Bokulich · May 28, 2020, 9:51pm

Hi @KMaki,
First, I must say: congratulations on making it this far! Once you get the RDP database working seamlessly with QIIME 2 I'd encourage you to share the QZA files somewhere (e.g., zenodo) and link to it on the forum (in a "community contributions" topic) if you are interested. Others in the forum community have asked about how to format the RDP database files for use with QIIME 2, so if you are happy to share these you'd be helping all boats float higher!

Now to the error:

It sounds like at least one sequence ID is in the sequences but not the taxonomy. Two possibilities:

are the feature IDs numeric? If yes, this is likely the issue (they are being interpreted as numbers in one file but characters in the other), and changing to non-numeric would be an easy fix.
special characters, especially line breaks, have been reported to cause this issue in the past, see here for a diagnosis and fix:
Feature classifiier consensus vsearch - key error - #7 by Nicholas_Bokulich

I strongly suspect #2 is the issue — let me know what you find!

KMaki · June 3, 2020, 3:40pm

Thank you so much for the suggestions @Nicholas_Bokulich. I have looked at both and am not sure if I have an answer yet.
For suggestion #1, below are the headers of my OTU.fa and taxonomy files (note, these were copied into notepad so the formatting looks strange). The feature ids seems to match up, but is there a way to check if they are labeled as numeric?

head rep_set_99_rdp.fa
>S000494589
 GCGGCGTGCTACACATGCAGTCGTACGCGGTGGCACACCGAGT
 GGCGAACGGGTGCGTAACACGTGAGGAACCTACCCCGAAGTGGG
 GGATAACACCGGGAAACCGGTGCTAATACCGCATACGCTCCCCGGAC
 CGCATGGTCCAGGGAGCAAAGCCTCCGGGCGCTTCGGGACGGCCTC
 GCGGCCTATCAGCTTGTTGGTGGGGTAACGGCCCACCAAGGCGA
 CGACGGGTAGCTGGTCTGAGAGGACGATCAGCCACACTGGGACT
 GAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGC
 GCAATGGGCGAAAGCCTGACGCAGCAACGCCGCGTGGAGGACGAAG
 GCCTTCGGGTTGTAAACTCCTTTCAGCAGGGACGAAACTGACGGTACC
 TGCAGAAGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAAG
>S000632122
GACGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGTACGCGGT
GGCAACACCGAGTGGCGAACGGGTGCGTAACACGTGAGGAACCTAC
CCCGAAGTGGGGGATAACACCGGGAAACCGGTGCTAATACCGCATA
CGCTCCCCGGACCGCATGGTCCA

head rdp_qiime_taxonomy.txt
S000494589      
Bacteria;domain;Actinobacteria;phylum;Actinobacteria;class;Acidi                                        
microbidae;subclass;Acidimicrobiales;order;Acidimicrobineae;suborder;
Acidimicrobiaceae;family;Acidimicrobium;genus
S000632122      
Bacteria;domain;Actinobacteria;phylum;Actinobacteria;class;
Acidimicrobidae;subclass;Acidimicrobiales;order;Acidimicrobineae;                                  
suborder;Acidimicrobiaceae;family;Acidimicrobium;genus

For suggestion #2, I used the dos2unix code as you suggested:

dos2unix rep_set_99_rdp.fa

The file shows that it was updated, but do I need to add any flags/options? After converting the file, I imported without problem, and tried to run vsearch taxonomy classification and got the following errors. Note, I ran classification on several v-regions simultaneously and so there is one error for each job. Does the classification automatically terminate if something in the taxonomic search results is not present in reference taxonomy. Is this error saying that identifiers S002156889, S000615995, and S000269333 are missing from one or both of the files?

Plugin error from feature-classifier:
'Identifier S002156889 was reported in taxonomic search results, but was 
not present in the reference taxonomy.'
Debug info has been saved to /tmp/qiime2-q2cli-err-xjrxpvce.log

Plugin error from feature-classifier:
'Identifier S000615995 was reported in taxonomic search results, but was 
not present in the reference taxonomy.'
Debug info has been saved to /tmp/qiime2-q2cli-err-9evq16yi.log

Plugin error from feature-classifier:
'Identifier S000269333 was reported in taxonomic search results, but was 
not present in the reference taxonomy.'
Debug info has been saved to /tmp/qiime2-q2cli-err-irecsajs.log

Interestingly, if I extract the variable region from the RDP database based on primers and run the Vsearch workflow, I am able to classify taxonomy. Unfortunately, we cannot recommend this process for the IonTorrent ThermoFisher analysis workflow because ThermoFisher’s primers are proprietary and therefore cannot be inputted to extract V regions from RDP. I am curious, though, why things would run without issue on the extracted RDP database, but not on the full RDP database. Logically, if there was an missing feature issue it would make more sense for it to be on the smaller, extracted database right?

I also attached the qza files for the otus and taxonomy. I am not sure if these files or the otu/taxonomy files would give a hint to why I am getting these errors?
QZA fasta OTUs
QZA taxonomy

Thank you so much for helping me troubleshoot!
Katherine

Nicholas_Bokulich · June 3, 2020, 4:06pm

Hi @KMaki,

They do not look numeric, so assuming all follow that format that is not the issue.

No, sounds like it worked! You have a related (but new) error, so that means that the windows-style line breaks were the issue before.

That's right. This error is specifically saying that those IDs are in the sequences but not the taxonomy.

probably because those IDs are no longer included in the top matches; they may be excluded by extract-reads (e.g., if they are too short/long) or trimming off the rest of the sequence alters the kmer profile (which vsearch uses to queue up the first N hits for alignment).

This command might help you find all IDs that are unique in one file or another (no guarantees this will work, I'm sort of ad-libbing based on the file snippets you shared above):

grep '^>' rep_set_99_rdp.fa | tr -d '>' | cat - rdp_qiime_taxonomy.txt | grep -v ';' | sort | uniq -u

Then you can confirm that those IDs are really missing from one file but not the other like this (you only need to run this once or twice, just to make sure the command above worked):

id='paste-the-id-here'
for f in rep_set_99_rdp.fa rdp_qiime_taxonomy.txt
do
  grep $id $f | wc -l
done

Want to give that a spin and see how many IDs are missing?

llenzi · June 3, 2020, 6:59pm

Hi @KMaki and @Nicholas_Bokulich

Sorry to jump in, just a thing I noticed looking at the taxonomy format above. I think you can remove the domain/phylum etc ... These are only doubling the number of taxonomic levels and I am not sure if some plug in does not like more than the suggested 7 levels, that my explain why you could assign taxonomy with the extracted region and not the full 16S. I hope
@Nicholas_Bokulich may enlighten us on this. Now, my very dull question: in your qiime-formatted taxonomy file, are the ID and the respective taxonomy in the same line? I'm asking because from the distribution of the text above does not seem the case, and this may explain why the ids are not found in the reference taxonomy (sorry if this is still related to the notepad issue you mentioned!)
good luck

Nicholas_Bokulich · June 3, 2020, 7:15pm

Yes! Thanks for noting that, I glossed over this before. @KMaki those extra labels should be removed.

as long as the number of levels is even it does not matter for most (or any) plugins as far as I know (there may be exceptions in community-developed plugins but none I am aware of)

I assumed this was part of the auto-formatting applied by notepad as @KMaki mentioned — but @llenzi indeed you are correct these must be on the same line (though I don't think that's causing the issue we are seeing now, as the issue is missing IDs in the file, not missing annotations, and QIIME 2 would raise an error if importing a taxonomy with IDs and annotation on separate lines)

KMaki · June 4, 2020, 8:04pm

Thank you both so much for your help @Nicholas_Bokulich and @llenzi (you're welcome to jump in any time!) Sorry for the delay here, I was troubleshooting after I made a couple discoveries

Ok, so I did some digging and I think there is definitely a big mismatch error which is causing my issue where a large portion of the taxonomy labels matching the ids were somehow lost in the creation of the taxonomy file.

@Nicholas_Bokulich, I ran your initial code below and sent it to a file because there was a ton of output. After running a line count, I got the following result: 3196041

This made me do a word count command on my taxonomy file and then the rep_set_99_rdp.fa and the taxonomy file was MUCH smaller (by like 2 million lines haha)

So I think the error was that somehow when I was converting the taxonomy file in python and sending back to my server , a large chunk of the information was truncated.

I went back and did all of the editing in Unix, imported to a qza file without issue and was able to do the vsearch and taxonomy classification! I will have to figure out how to remove the "subclass" "suborder" etc. because this is confusing QIIME2 view and making it seem like theres several "levels".

Bacteria;domain;"Actinobacteria";phylum;Actinobacteria;class;Actinobacteridae;subclass;Actinomycetales;order;Actinomycineae;suborder;Actinomycetaceae;family;Actinomyces

Is there a way to take out the suborder and subclass with the corresponding taxonomic names in unix?

This is a small head of my taxonomy file now (the file is now formatting with tax is then tab and taxonomy)
S000494589 Bacteria;domain;"Actinobacteria";phylum;Actinobacteria;class;Acidimicrobidae;subclass;Acidimicrobiales;order;"Acidimicrobineae";suborder;Acidimicrobiaceae;family;Acidimicrobium;genus
S000632122 Bacteria;domain;"Actinobacteria";phylum;Actinobacteria;class;Acidimicrobidae;subclass;Acidimicrobiales;order;"Acidimicrobineae";suborder;Acidimicrobiaceae;family;Acidimicrobium;genus
S000632121 Bacteria;domain;"Actinobacteria";phylum;Actinobacteria;class;Acidimicrobidae;subclass;Acidimicrobiales;order;"Acidimicrobineae";suborder;Acidimicrobiaceae;family;Acidimicrobium;genus
S000541758 Bacteria;domain;"Actinobacteria";phylum;Actinobacteria;class;Acidimicrobidae;subclass;Acidimicrobiales;order;"Acidimicrobineae";suborder;Acidimicrobiaceae;family;Acidimicrobium;genus

Thank you again, I can't believe this was the issue all along

llenzi · June 5, 2020, 8:42am

Hi @KMaki,
I'm glad it worked!
In your place I probably remove the the domain, phylum, class and so on in the python script, but that because I'm not using bash to much to this (matter of preference I suppose).
Still, is certainly doable to do via bash, something like:

cat rdp_qiime_taxonomy.txt | sed s/Bacteria// > out.file

should work (sorry I have not tested), then repeat for all the taxonomy levels. I would probably remove the curly quote too just in case ...
Cheers

Matt · June 12, 2020, 1:45pm

Hi @KMaki, great post and looks like a lot of work (I've tried to follow along as I'm trying to do the same thing, but I'm not having much luck). You mentioned up the top you'd be happy to share as a resource if you get it formatted correctly (and it sounds like you were pretty close, with just the extra taxonomy classifications to be removed). If the formatting has gone well, would you be happy to share it (pretty please)?

Many thanks,

Matt

KMaki · June 15, 2020, 1:04pm

Hi Matt! Sorry for the delay here!
I haven't had a chance to remove the extra taxonomy classifications and am kinda swamped this week with a big deadline Friday but its on my "to-do" list for next week to finish formatting this. In the meantime, attached is a link to the taxonomy and fasta files at the point where I currently am in case using the files in QIIME2 format is time sensitive:
RDP OTU fasta file
RDP updated taxonomy file

If you figure out how to remove those extra taxonomy labels in the meantime and don't mind sharing the code, I would be grateful! Otherwise, will continue working on this next week. Let me know if these don't work for some reason!
Katherine

Celalokasart · June 17, 2020, 1:23pm

Thank you so much for the suggestions @Nicholas_Bokulich. I have looked at both and am not sure if I have an answer yet.

Matt · June 19, 2020, 2:00pm

Hi @KMaki,

Thanks heaps! I think I've managed to remove the extra taxonomy columns. I completed it in R, using this code:

library(tidyverse)

# import data
tax <- read_delim("taxonomy_data/rdp_qiime_taxonomy.txt", delim = ";")

# remove unwanted extra columns
tax2 <- tax %>%
  select(-domain,-phylum,-class,-subclass,-order,-suborder,-family,-genus)

# export data (in ; delimited format)
write_delim(tax2, path = "taxonomy_data/rdp_qiime_taxonomy2.txt", delim = ";")

The file it outputs is here: rdp_qiime_taxonomy2.txt

I haven't yet had a chance to try it out yet though, will try soon.

Cheers,

Matt

KMaki · June 24, 2020, 9:39pm

Thank you so much for the suggestion @Matt! Will look at this in Unix and try to run it through the workflow to see if the taxonomy looks cleaner. Really appreciate the help. Hope you have success with your RDP workflow!

Matt · June 24, 2020, 10:22pm

I thought I'd give an update on my 'progress' with this. Importing taxonomy and extracting reads has worked fine, however now I've run into a (potentially unrelated) issue with feature-classifier fit-classifier-naive-bayes, which I'm trying to work out. I'm basing my workflow off this tutorial.

Import RDP data into qza format

# rdp taxonomy strings
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path rdp_qiime_taxonomy2.txt \
--output-path rdp-ref-taxonomy.qza

# download and import KMaki's fasta file of RDP sequences
$ wget --no-check-certificate https://ndownloader.figshare.com/files/23146310?private_link=86d9b343729e4c67ea08 -O rep_set_99_rdp.fa

qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path rep_set_99_rdp.fa \
--output-path rdp_16S_otus.qza

Extract reads from the relevant region (V3-V5) out of the ref database

$ qiime feature-classifier extract-reads \
--i-sequences rdp_16S_otus.qza \
--p-f-primer CCTACGGGNGGCWGCAG  \
--p-r-primer GACTACHVGGGTATCTAATCC   \
--p-min-length 300 \
--p-max-length 600 \
--o-reads ref_seqs.qza \
--verbose \
&> 16S_training.log

Train classifier on extracted region:

$ qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ref_seqs.qza \
--i-reference-taxonomy rdp-ref-taxonomy.qza \
--o-classifier rdp_classifier_16S.qza \
--verbose \
&> 16S_classifier.log

This produces an error message regarding memory (see below). Based on these posts here, I added --p-classify–-chunk-size XXXX to the command (tried XXXX = 5000, 2000, 1000, 500, 100, 50 and 10), however it still produced the exact same error (I ran the logs through a diffchecker and they are indentical)

Here is the error message

/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2_feature_classifier/classifier.py:102: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.22.1. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
  warnings.warn(warning, UserWarning)
Traceback (most recent call last):
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/commands.py", line 328, in __call__
    results = action(**arguments)
  File "</homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/decorator.py:decorator-gen-345>", line 2, in fit_classifier_naive_bayes
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
    output_types, provenance)
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2_feature_classifier/classifier.py", line 331, in generic_fitter
    pipeline)
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2_feature_classifier/_skl.py", line 32, in fit_pipeline
    pipeline.fit(X, y)
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/pipeline.py", line 350, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/pipeline.py", line 315, in _fit
    **fit_params_steps[name])
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/joblib/memory.py", line 355, in __call__
    return self.func(*args, **kwargs)
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/pipeline.py", line 728, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 809, in fit_transform
    return self.fit(X, y).transform(X)
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 784, in transform
    X = self._get_hasher().transform(analyzer(doc) for doc in X)
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/sklearn/feature_extraction/_hash.py", line 155, in transform
    self.alternate_sign, seed=0)
  File "sklearn/feature_extraction/_hashing_fast.pyx", line 83, in sklearn.feature_extraction._hashing_fast.transform
  File "<__array_function__ internals>", line 6, in resize
  File "/homevol/matt/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 1415, in resize
    a = concatenate((a,) * n_copies)
  File "<__array_function__ internals>", line 6, in concatenate
MemoryError: Unable to allocate 8.00 GiB for an array with shape (1073741824,) and data type float64

Plugin error from feature-classifier:

  Unable to allocate 8.00 GiB for an array with shape (1073741824,) and data type float64

See above for debug info.

I'm unsure why I still get a memory error despite giving the --p-classify–-chunk-size XXXX command.

Nicholas_Bokulich · June 25, 2020, 7:42pm

Setting reads-per-batch appears to work well for limiting memory consumption by classify-sklearn because reads are processed and saved out in chunks. I have not looked into it in detail yet but classify--chunk-size does not seem to offer as much of an advantage to fit-classifier-naive-bayes because the trained classifier is still stored in memory (I think).

So the bad news is I think you will just need a machine with > 8 GB to train your classifier. You may be able to filter the data somehow to reduce the memory demand (e.g., trim to your amplicon of interest before training, filter outliers or other noise, filter unassigned taxa). @SoilRotifer and I just released a new plugin, RESCRIPt, that might help with this.

Jasmine · June 26, 2020, 7:00pm

Thank you all for posting! I have been trying to parse the exact same text in with Python3's regex module.

Asve · July 23, 2020, 1:21pm

I get the same error even with 15 GB. have you found a solution?

system · August 28, 2020, 3:21am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.