The Silva classifier I trained was used for fungal, but the species classification results were all bacteria

LiyingXie · April 22, 2021, 9:34am

Hello， everyone

I'm processing a batch of 18S fungal data, so I hope to train a suitable classifier. So I downloaded the sequences and taxonomy files of Silva on the website: Data resources — QIIME 2 2021.2.0 documentation.

Then I run it according to the tutorial on the official website as follows:

time qiime feature-classifier extract-reads
--i-sequences silva-138-99-seqs.qza
--p-f-primer CTGGTTGATCCTGCCAG
--p-r-primer ACCAGACTTGCCCTCC
--p-min-length 0
--p-max-length 500
--o-reads ref-seqs.qza

time qiime feature-classifier fit-classifier-naive-bayes
--i-reference-reads ref-seqs.qza
--i-reference-taxonomy silva-138-99-tax.qza
--o-classifier classifier.qza

time qiime feature-classifier classify-sklearn
--i-classifier classifier.qza
--i-reads rep_set.qza
--o-classification taxonomy.qza

time qiime metadata tabulate
--m-input-file taxonomy.qza
--o-visualization taxonomy.qzv

But I got a puzzling result. The primers and sequences I used in the training were all 18S fungal. But the annotation results are all bacteria.taxonomy.qzv (1.3 MB)

I am eager to know what causes this? How to solve this terrible problem?

Looking forward to your reply. Any suggestion is likable!

SoilRotifer · April 22, 2021, 4:28pm

Hi @LiyingXie,

I performed some basic sanity checking of what was actually in the reference database after retaining only eukaryotes in the reference files like so:

# visualize silva taxonomy
qiime metadata tabulate \
	--m-input-file silva-138-99-tax.qza \
	--o-visualization silva-138-99-tax.qzv

# extract amplicon region
qiime feature-classifier extract-reads \
	--i-sequences silva-138-99-seqs.qza \
	--p-f-primer CTGGTTGATCCTGCCAG \
	--p-r-primer ACCAGACTTGCCCTCC \
	--p-min-length 0 \
	--p-max-length 500 \
	--p-n-jobs 4 \
	--o-reads ref-seqs.qza

# only retain amplicon region that are eukaryotes
qiime taxa filter-seqs \
	--i-sequences ref-seqs.qza \
	--i-taxonomy silva-138-99-tax.qza \
	--p-include Eukaryota \
	--o-filtered-sequences ref-seqs-euks.qza

# visualize eukaryote reference sequences
qiime metadata tabulate \
	--m-input-file ref-seqs-euks.qza \
	--o-visualization ref-seqs-euks.qzv

# keep only eukaryotic taxonomy based on retained eukaryote sequences
qiime rescript filter-taxa \
	--i-taxonomy silva-138-99-tax.qza \
	--m-ids-to-keep-file ref-seqs-euks.qza \
	--o-filtered-taxonomy ref-seqs-euks-tax.qza

# visualize only eukaryote taxonomy
qiime metadata tabulate \
	--m-input-file ref-seqs-euks-tax.qza \
	--o-visualization ref-seqs-euks-tax.qzv

This resulted in about ~1,629 eukaryote references. Not many of which where fungal taxa (~111), at least compared to the initial unfiltered reference taxonomy (~11,791). I did this by crudely searching for the taxonomy string "cota" (i.e. Ascomycota, Basidiomycota, etc....)

This tells me that the universal eukaryote primers used are not ideal for fungi. Alternatively, using primer sequences to extract the reference sequences you need to classify fungi may not be ideal. You can try downloading the sequence alignment from SILVA and determine the alignment positions that your primers would be. Then you can use qiime rescript trim-alignment to extract the region of the alignment that your primers would target. This removes the need for primer specificity.

Note, we do not yet have a completely developed pipeline within QIIME 2 / RESCRIPt to automate downloading and processing SILVA RNA alignments. We are currently developing this in RESCRIPt in order to make this approach easier. So you'll have to initially process things on your own, or use this older alignment trimming approach, until we incorporate this into RESCRIPt.

Also, I would run BLAST on several of your sequences to see if they do indeed "hit" fungi. If they hit Bacteria, then it is likely confirming that these primers are not ideal to target fungi. In which case the generation of a different reference database may not help.

LiyingXie · April 23, 2021, 7:13am

Hello again，

SoilRotifer:

# keep only eukaryotic taxonomy based on retained eukaryote sequences
qiime rescript filter-taxa \
	--i-taxonomy silva-138-99-tax.qza \
	--m-ids-to-keep-file ref-seqs-euks.qza \
	--o-filtered-taxonomy ref-seqs-euks-tax.qza

# visualize only eukaryote taxonomy
qiime metadata tabulate \
	--m-input-file ref-seqs-euks-tax.qza \
	--o-visualization ref-seqs-euks-tax.qzv

I tried to run the command you provided. But I didn't have a good time installing RESCRIPt. I need to install it in qiime2. I'm using version 2021.2 and running on the Oracle VM VirtualBox.

Next, I follow the tutorial in the website to install (GitHub - bokulich-lab/RESCRIPt: REference Sequence annotation and CuRatIon Pipeline).

conda activate qiime2-2021.2
conda install -c conda-forge -c bioconda -c qiime2 -c defaults xmltodict

I ran these two commands and the following appeared：

Collecting package metadata (current_repodata.json): done
Solving environment: done

==> WARNING: A newer version of conda exists. <==
current version: 4.9.2
latest version: 4.10.1

Please update conda by running

$ conda update -n base -c defaults conda

Package Plan

environment location: /home/qiime2/miniconda/envs/qiime2-2021.2

added / updated specs:
- xmltodict

The following packages will be downloaded:

package                    |            build
---------------------------|-----------------
openssl-1.1.1k             |       h7f98852_0         2.1 MB  conda-forge
xmltodict-0.12.0           |             py_0          11 KB  conda-forge
------------------------------------------------------------
                                       Total:         2.1 MB

The following NEW packages will be INSTALLED:

xmltodict conda-forge/noarch::xmltodict-0.12.0-py_0

The following packages will be UPDATED:

certifi pkgs/main::certifi-2020.12.5-py36h06a~ --> conda-forge::certifi-2020.12.5-py36h5fab9bb_1
openssl pkgs/main::openssl-1.1.1j-h27cfd23_0 --> conda-forge::openssl-1.1.1k-h7f98852_0

The following packages will be SUPERSEDED by a higher-priority channel:

ca-certificates pkgs/main::ca-certificates-2021.1.19-~ --> conda-forge::ca-certificates-2020.12.5-ha878542_0

Proceed ([y]/n)? y

Downloading and Extracting Packages
xmltodict-0.12.0 | 11 KB | ###################################################################################################################### | 100%
openssl-1.1.1k | 2.1 MB | ###################################################################################################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

And then, I ran the command:

conda update -n base -c defaults conda
pip install git+https://github.com/bokulich-lab/RESCRIPt.git

After running the two commands, the following is displayed:

Collecting git+https://github.com/bokulich-lab/RESCRIPt.git
Cloning GitHub - bokulich-lab/RESCRIPt: REference Sequence annotation and CuRatIon Pipeline to /tmp/pip-req-build-kxpa30gs
Running command git clone -q GitHub - bokulich-lab/RESCRIPt: REference Sequence annotation and CuRatIon Pipeline /tmp/pip-req-build-kxpa30gs
Building wheels for collected packages: rescript
Building wheel for rescript (setup.py) ... done
Created wheel for rescript: filename=rescript-2021.4.0.dev0+6.g073ccf0-py3-none-any.whl size=157116 sha256=1d9116e47ba5170f2a2a849c4b347107279627a7f457900586111a804aaa68ee
Stored in directory: /tmp/pip-ephem-wheel-cache-2r0x4srb/wheels/db/0b/eb/f9d025609a8a1939d5ada85024251d0a18aff46b24d8170b47
Successfully built rescript
Installing collected packages: rescript
Successfully installed rescript-2021.4.0.dev0+6.g073ccf0

I thought I had rescript installed correctly, so I started processing my data. But no matter what command I type next, it will show an error. The tips are as follows:

QIIME is caching your current deployment for improved performance. This may take a few moments and should only happen once per deployment.
Traceback (most recent call last):
File "/home/qiime2/miniconda/envs/qiime2-2021.2/bin/qiime", line 11, in
sys.exit(qiime())
File "/home/qiime2/miniconda/envs/qiime2-2021.2/lib/python3.6/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/home/qiime2/miniconda/envs/qiime2-2021.2/lib/python3.6/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/qiime2/miniconda/envs/qiime2-2021.2/lib/python3.6/site-packages/click/core.py", line 1254, in invoke
cmd_name, cmd, args = self.resolve_command(ctx, args)
File "/home/qiime2/miniconda/envs/qiime2-2021.2/lib/python3.6/site-packages/click/core.py", line 1297, in resolve_command
cmd = self.get_command(ctx, cmd_name)
File "/home/qiime2/miniconda/envs/qiime2-2021.2/lib/python3.6/site-packages/q2cli/commands.py", line 100, in get_command
plugin = self._plugin_lookup[name]
File "/home/qiime2/miniconda/envs/qiime2-2021.2/lib/python3.6/site-packages/q2cli/commands.py", line 76, in _plugin_lookup
import q2cli.core.cache
File "/home/qiime2/miniconda/envs/qiime2-2021.2/lib/python3.6/site-packages/q2cli/core/cache.py", line 406, in
CACHE = DeploymentCache()
File "/home/qiime2/miniconda/envs/qiime2-2021.2/lib/python3.6/site-packages/q2cli/core/cache.py", line 61, in init
self._state = self._get_cached_state(refresh=refresh)
File "/home/qiime2/miniconda/envs/qiime2-2021.2/lib/python3.6/site-packages/q2cli/core/cache.py", line 107, in _get_cached_state
self._cache_current_state(current_requirements)
File "/home/qiime2/miniconda/envs/qiime2-2021.2/lib/python3.6/site-packages/q2cli/core/cache.py", line 200, in _cache_current_state
state = self._get_current_state()
File "/home/qiime2/miniconda/envs/qiime2-2021.2/lib/python3.6/site-packages/q2cli/core/cache.py", line 238, in _get_current_state
plugin_manager = qiime2.sdk.PluginManager()
File "/home/qiime2/miniconda/envs/qiime2-2021.2/lib/python3.6/site-packages/qiime2/sdk/plugin_manager.py", line 54, in new
self._init(add_plugins=add_plugins)
File "/home/qiime2/miniconda/envs/qiime2-2021.2/lib/python3.6/site-packages/qiime2/sdk/plugin_manager.py", line 81, in _init
plugin = entry_point.load()
File "/home/qiime2/miniconda/envs/qiime2-2021.2/lib/python3.6/site-packages/pkg_resources/init.py", line 2472, in load
return self.resolve()
File "/home/qiime2/miniconda/envs/qiime2-2021.2/lib/python3.6/site-packages/pkg_resources/init.py", line 2478, in resolve
module = import(self.module_name, fromlist=['name'], level=0)
File "/home/qiime2/miniconda/envs/qiime2-2021.2/lib/python3.6/site-packages/rescript/plugin_setup.py", line 24, in
from .get_data import get_silva_data
File "/home/qiime2/miniconda/envs/qiime2-2021.2/lib/python3.6/site-packages/rescript/get_data.py", line 19, in
from q2_types.feature_data import RNAFASTAFormat
ImportError: cannot import name 'RNAFASTAFormat'

Why? Is it because the tutorial is only available for version 2021.4 of qiime2? But I see that the official website has not been updated to 2021.4(https://s3-us-west-2.amazonaws.com/qiime2-data/distro/core/virtualbox-images.txt).

Looking forward to your reply, thank you！

SoilRotifer · April 23, 2021, 2:31pm

Hi @LiyingXie, sorry about that. I forgot to mention we've recently updated RESCRIPt to work with QIIME 2 version 2021.4 and later. You can pick from one of several prior release versions of RESCRIPt, which often match the associated QIIME 2 version.

Since you are running QIIME 2 2021.2 you can install the appropriate version of RESCRIPT via the following command:

pip install git+https://github.com/bokulich-lab/RESCRIPt.git@2021.2.0

Let us know if this works for you.

-Cheers!
-Mike

LiyingXie · April 23, 2021, 3:51pm

Hello Mike,

I have successfully installed it. Thank you very much.

SoilRotifer:

# keep only eukaryotic taxonomy based on retained eukaryote sequences
qiime rescript filter-taxa \
	--i-taxonomy silva-138-99-tax.qza \
	--m-ids-to-keep-file ref-seqs-euks.qza \
	--o-filtered-taxonomy ref-seqs-euks-tax.qza

# visualize only eukaryote taxonomy
qiime metadata tabulate \
	--m-input-file ref-seqs-euks-tax.qza \
	--o-visualization ref-seqs-euks-tax.qzv

Then, I ran the command you provided earlier. Got the classifier- euks.qza . And a comparison was made. The results are still bad.taxonomy-euks.qzv (1.4 MB)

I think there may be problems with this batch of data, such as inappropriate primers as you said.

Maybe I still need your help. I look forward to your reply. Sincerely.

SoilRotifer · April 23, 2021, 4:11pm

Great, and you're welcome!

I agree. Again, I would suggest that you manually run BLAST for a few of these sequences, just to be sure that it is not due to lack of appropriate reference sequences within the SILVA database.

Good luck!

LiyingXie · April 24, 2021, 3:34am

Hello,

I'm sorry to keep bothering you. I selected 21 sequences from the sample and uploaded them to blast for comparison.

First, I used the nucleoside collection database comparison. seqdump.txt (31 Bytes)

I checked the distance tree of results file. The results are shown in the figure below, and the one marked in red is eukaryote.

And then, I compared the same batch of data with 18S ribosomal RNA sequences (SSU) from functional type and reference material database.fungal-seqdump.txt (179.5 KB)

I checked the distance tree of results file. The results are shown in the figure below, and the one marked in red is fungal.

This is my first time using blast. I am puzzled whether this result can show that the primers selected for this batch of 18S fungi data are not suitable.

I really hope you can give me some help. Thank you.

SoilRotifer · April 24, 2021, 7:05pm

Based on these results, I would concur that these primers are not ideal for targeting fungi. Though it is harder to say with your second example, especially if the only sequences you are comparing are known fungal sequences. I'd add other known eukaryotes to confirm.

There are a variety of fungal-specific SSU and LSU primers available. Additionally, there are also ITS primers you can try. I'd search the literature, or this forum and see if those primers would work for your project.