Most features were not assigned in ITS fungal community

suzukik · September 13, 2018, 12:42pm

Hi.
I tried to get taxonomic information for an ITS fungal community using "qiime feature-classifier classify-sklearn".
I got the database from UNITE, latest release ver. 7.2 (PlutoF DOI).
I trained the classifier on this database by following this tutorial ("Training feature classifiers with q2-feature-classifier", QIIME 2 2018.6.0 documentation). The commands I used are below.

qiime tools import \
    --type 'FeatureData[Sequence]' \
    --input-path sh_refs_qiime_ver7_97_01.12.2017.fasta \
    --output-path 97_otus.qza

qiime tools import \
    --type 'FeatureData[Taxonomy]' \
    --source-format HeaderlessTSVTaxonomyFormat \
    --input-path sh_taxonomy_qiime_ver7_97_01.12.2017.txt \
    --output-path ref-taxonomy.qza

However, when I looked at the results (taxa-barplot), most sequences were not assigned at the phylum level and were shown only as "Fungi;" (please see the attached figure).
I think something is wrong.
When I searched some of the unassigned features with BLAST, several of them had high similarity to known species.

Does anyone know why so many sequences were not assigned?

Thank you for your help.

Nicholas_Bokulich · September 13, 2018, 5:17pm

Hi @suzukik,
Usually this issue is caused by using the wrong database for the sequences, e.g., using an ITS1 database on ITS2 sequences.

I am assuming that you have deeper taxonomic assignments for the sequences that did classify — the barplots you show are level 2, but I am assuming that you have, e.g., species level for some sequences. If not, there is something wrong with your classifier or input sequences.
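
One quick way to check how deep the assignments go is to count features by the number of ranks in their taxonomy string. This is only a sketch: it assumes the classification has been exported to a tab-separated file with a header row and the taxon string in the second column (the layout that qiime tools export writes to taxonomy.tsv), and the file name and rows below are made up for illustration.

```shell
# Made-up example rows standing in for an exported taxonomy.tsv
# (columns: Feature ID, Taxon, Confidence).
printf 'Feature ID\tTaxon\tConfidence\n' > example-taxonomy.tsv
printf 'f1\tk__Fungi\t0.99\n' >> example-taxonomy.tsv
printf 'f2\tk__Fungi;p__Ascomycota\t0.95\n' >> example-taxonomy.tsv
printf 'f3\tk__Fungi\t0.98\n' >> example-taxonomy.tsv
printf 'f4\tk__Fungi;p__Mortierellomycota;c__Mortierellomycetes\t0.91\n' >> example-taxonomy.tsv

# Skip the header, then count features by the number of
# semicolon-separated ranks in the taxon string.
depth_counts=$(awk -F'\t' 'NR > 1 { n = split($2, ranks, ";"); depth[n]++ }
    END { for (d in depth) print d, depth[d] }' example-taxonomy.tsv | sort -n)
echo "$depth_counts"
```

Each output line is "number of ranks, number of features". If nearly everything sits at depth 1 (kingdom only), the classifier or the input sequences deserve a closer look.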

A few ideas, in order of likelihood:

1. You are using the standard UNITE database, which is trimmed to the ITS itself (i.e., removes all flanking rRNA gene fragments). Depending on your primers and sequence length, this could be a problem. Use the "developer" version of the UNITE (included in the standard release in a subdirectory), which contains the untrimmed sequences.
2. The unclassified sequences are non-target DNA, e.g., plant or other eukaryotic ITS sequences, that are not in the UNITE database. Since you mention you blasted several of these, I'm assuming they came up as fungal species, but this is still a possibility.
3. Are your sequences in mixed orientations? That could also be an issue here.
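
For the orientation question, a rough check with standard shell tools is to count how many reads contain a short motif versus its reverse complement. This is a sketch under assumptions: the motif is a hypothetical placeholder (substitute a short conserved stretch you expect near the start of your reads), the reads are made up, and each sequence is assumed to sit on a single line.

```shell
# Reverse-complement a DNA string (plain A/C/G/T only, no IUPAC codes).
revcomp() { printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'; }

# Hypothetical placeholder motif, not taken from any real primer set.
motif="GGAAGTAAAAG"
motif_rc=$(revcomp "$motif")

# Made-up reads: two in the forward orientation, one reverse-complemented.
printf '>r1\nTTGGAAGTAAAAGTCG\n>r2\nACGGAAGTAAAAGCCA\n>r3\nCGACTTTTACTTCCAA\n' > example-reads.fasta

# Count sequence lines that contain the motif in each orientation.
forward=$(grep -v '^>' example-reads.fasta | grep -c "$motif")
reverse=$(grep -v '^>' example-reads.fasta | grep -c "$motif_rc")
echo "forward: $forward, reverse-complement: $reverse"
```

A substantial count on both sides would suggest mixed orientations.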

As a sanity check, you can also try one of the alignment-based classifiers in q2-feature-classifier to see if you get better classifications. If you get similar results (lots of unassigned), there is probably an issue with the database or your sequences (e.g., #1 or #2 above). If you get better results (few unassigned), it is probably an issue specific to the classifier (e.g., #3 above) and we can help debug further.
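
A sketch of that sanity check with classify-consensus-blast, one of the alignment-based methods in q2-feature-classifier. The reference artifact names come from the commands earlier in this thread; rep-seqs.qza and the output directory name are assumptions, so substitute your own.

```shell
# Artifact names: the two reference files are from the import commands in
# this thread; the query file and output directory are hypothetical.
QUERY=rep-seqs.qza
REF_SEQS=97_otus.qza
REF_TAX=ref-taxonomy.qza

if command -v qiime >/dev/null 2>&1; then
    # --output-dir collects every output the method produces in your release.
    qiime feature-classifier classify-consensus-blast \
        --i-query "$QUERY" \
        --i-reference-reads "$REF_SEQS" \
        --i-reference-taxonomy "$REF_TAX" \
        --output-dir blast-classification
else
    echo "qiime not found; activate your QIIME 2 environment first" >&2
fi
```

The resulting classification can then be compared with the classify-sklearn result, for example in a second barplot.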

Let us know what you find! Thank you!

suzukik · September 16, 2018, 12:30pm

Hi Nicholas_Bokulich,
Thank you very much for your comments and suggestions!
I used the primers ITS1-F_KYO1 and ITS2_KYO1 (Toju et al., 2012), which target the ITS1 region. Our samples are soil DNA, so they may contain plant root, protozoan, and bacterial DNA.
Our barplots also contain genus- and species-level taxonomy. Therefore, we hope that the problem is in the reference data.

We would like to try the developer version of the UNITE database, but we could not find it.
Could you tell us more about the developer version of UNITE?

We are sorry for the basic questions; we are beginners with fungal sequences and only have experience with 16S bacterial DNA...
We are really looking forward to hearing from you.

Best regards,
Kazuki

Nicholas_Bokulich · September 17, 2018, 1:46pm

Hi @suzukik,

I see — so many of those unclassified sequences could just be non-target (I am not familiar with the primer sequences though — they may be fungi-specific).

You could do one of two things to remove non-target DNA from your data:

1. Add a few non-fungal ITS sequences to your database, so that those sequences are assigned. This can be a small number, e.g., just a few from each phylum to get a rough classification so that you can remove these, or it can be a larger number if you are interested in seeing what types of non-fungal sequences are present. You can then filter your feature table and sequences to exclude non-fungal taxa.
2. Use the same non-fungal sequences to assemble a non-fungal ITS reference database, which you can use to exclude non-fungal sequences by filtering prior to classification.
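
The two options above could look roughly like this. It is a sketch: every file name is hypothetical, nonfungal-refs.qza stands for a non-fungal reference set you would assemble yourself, and the identity and alignment thresholds are illustrative values, not recommendations.

```shell
# Hypothetical artifact names; substitute your own.
TABLE=table.qza
SEQS=rep-seqs.qza
TAXONOMY=taxonomy.qza
NONFUNGAL_REFS=nonfungal-refs.qza

if command -v qiime >/dev/null 2>&1; then
    # Option 1: after classifying against a database that also contains a few
    # non-fungal references, keep only features labelled as Fungi.
    qiime taxa filter-table \
        --i-table "$TABLE" \
        --i-taxonomy "$TAXONOMY" \
        --p-include k__Fungi \
        --o-filtered-table fungi-table.qza
    qiime taxa filter-seqs \
        --i-sequences "$SEQS" \
        --i-taxonomy "$TAXONOMY" \
        --p-include k__Fungi \
        --o-filtered-sequences fungi-seqs.qza

    # Option 2: before classification, split sequences into those that hit
    # the non-fungal references and those that do not.
    qiime quality-control exclude-seqs \
        --i-query-sequences "$SEQS" \
        --i-reference-sequences "$NONFUNGAL_REFS" \
        --p-method blast \
        --p-perc-identity 0.97 \
        --p-perc-query-aligned 0.97 \
        --o-sequence-hits nonfungal-hits.qza \
        --o-sequence-misses candidate-fungi-seqs.qza
else
    echo "qiime not found; activate your QIIME 2 environment first" >&2
fi
```

With option 2, the "misses" are the sequences to carry forward into classification.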

All of the QIIME-compatible UNITE releases, e.g., the latest release (I just downloaded the latest to make sure), contain the following contents:

- a readme
- 3 fasta files (different OTU clustering thresholds)
- 3 taxonomy files (paired with the fasta files)
- a "developer" directory with 3 more fastas and taxonomy files: the same OTU clustering thresholds, but using untrimmed sequences.

If you are seeing something else, please make sure you are using one of the QIIME-compatible releases.
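
As a sketch, training on the developer files follows the same pattern as the import commands earlier in this thread. The "_dev" file names below are an assumption about how the developer directory is laid out, so check the names in your own download; the output names and rep-seqs.qza are hypothetical. The --source-format option is the one used in this thread with QIIME 2 2018.6; newer releases name it --input-format.

```shell
# Assumed file names inside the release's "developer" directory; check yours.
DEV_FASTA=developer/sh_refs_qiime_ver7_97_01.12.2017_dev.fasta
DEV_TAX=developer/sh_taxonomy_qiime_ver7_97_01.12.2017_dev.txt

if command -v qiime >/dev/null 2>&1; then
    qiime tools import \
        --type 'FeatureData[Sequence]' \
        --input-path "$DEV_FASTA" \
        --output-path unite-dev-seqs.qza
    qiime tools import \
        --type 'FeatureData[Taxonomy]' \
        --source-format HeaderlessTSVTaxonomyFormat \
        --input-path "$DEV_TAX" \
        --output-path unite-dev-taxonomy.qza
    # Train on the imported developer sequences, then classify.
    qiime feature-classifier fit-classifier-naive-bayes \
        --i-reference-reads unite-dev-seqs.qza \
        --i-reference-taxonomy unite-dev-taxonomy.qza \
        --o-classifier unite-dev-classifier.qza
    qiime feature-classifier classify-sklearn \
        --i-classifier unite-dev-classifier.qza \
        --i-reads rep-seqs.qza \
        --o-classification taxonomy-dev.qza
else
    echo "qiime not found; activate your QIIME 2 environment first" >&2
fi
```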

By the way, you probably saw this note in the classifier training tutorial, but I just want to make sure (since I was reminded this morning that this is an important step for training a UNITE ITS classifier).

Not a problem at all! We are here to help.

I hope this advice is helpful to you!

suzukik · September 19, 2018, 12:11pm

Hi NICHOLAS_BOKULICH,

Thank you very much for your support, and sorry for my late reply.
We used the developer version of UNITE, as you suggested.
The results are below (phylum and genus level).

It seems better than the previous one.
However, most sequences are now assigned to the phylum Mortierellomycota, and within it to a single genus rather than to diverse genera.
I am not sure whether this is correct, and I don't know how to check it.
Do you have any ideas?

Thank you very much for your help.
I am looking forward to hearing from you.

Nicholas_Bokulich · September 19, 2018, 2:59pm

Hi @suzukik,
Great! This is looking much better.

Does that classification make sense, given your sample types? I don't know anything about that genus and have not done any ITS analysis in soils, but judging from the Wikipedia entry on Mortierella, it seems feasible.

It could help to get a second opinion. You could try classifying with a different classifier (e.g., try the classify-consensus-blast method in q2-feature-classifier), or you could use NCBI blast to see what else it is similar to. For NCBI blast, do the following (as described in the metadata tutorial):

qiime metadata tabulate \
    --m-input-file taxonomy.qza \
    --m-input-file sequences.qza \
    --o-visualization taxonomy.qzv

You can then use that visualization to sort by or search for sequences assigned to Mortierella (or other taxa), and copy/paste the sequences into NCBI blast.

NCBI blast may have some different ideas... but as long as it is fairly close to that genus I would not worry (NCBI blast against the default database is also full of junk, improperly curated sequences, etc, so I would personally trust your classification more than blast). If it is wildly different, you could also look at the Mortierella sequences in your reference sequences to make sure they look normal — we have seen issues where excessively short sequences have caused classification problems, but that does not really fit the profile here.
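
To look at the Mortierella reference sequences as suggested above, one option is to list their lengths straight from the UNITE files. This is a sketch with made-up stand-in files: it assumes the taxonomy file is headerless and tab-separated (ID, then taxonomy string) and that each FASTA header holds only the ID.

```shell
# Made-up stand-ins for the UNITE taxonomy and FASTA files.
printf 'ref1\tk__Fungi;p__Mortierellomycota;g__Mortierella;s__Mortierella_sp\n' > example-ref-taxonomy.txt
printf 'ref2\tk__Fungi;p__Ascomycota;g__Aspergillus;s__Aspergillus_sp\n' >> example-ref-taxonomy.txt
printf 'ref3\tk__Fungi;p__Mortierellomycota;g__Mortierella;s__Mortierella_sp\n' >> example-ref-taxonomy.txt
printf '>ref1\nACGTACGTAC\nGTACGT\n>ref2\nACGTAC\n>ref3\nACG\n' > example-refs.fasta

# First file: remember IDs whose taxonomy string mentions the genus.
# Second file: sum sequence lengths per record (handles wrapped FASTA lines).
lengths=$(awk -F'\t' -v genus='g__Mortierella' '
    NR == FNR { if (index($2, genus)) keep[$1] = 1; next }
    /^>/ { id = substr($1, 2); next }
    id in keep { len[id] += length($0) }
    END { for (i in len) print i, len[i] }
' example-ref-taxonomy.txt example-refs.fasta | sort -k2,2n)
echo "$lengths"
```

The shortest references are listed first, which makes unusually short entries easy to spot.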

I hope that helps!

suzukik · September 21, 2018, 12:17am

Hi @Nicholas_Bokulich,

Thank you very much for the suggestions!
I had forgotten about the qiime metadata tabulate command. Yes, this command answers my question.
I checked the sequences assigned to Mortierella, and I now understand that our taxonomic classification can be trusted.
As you suggested, NCBI BLAST returns many uncultured fungus sequences, but it also reports Mortierella sequences.

Again, thank you very much for your kind support!!
