high proportion of unclassified_Fungi with classifier trained on UNITE database

Merijn_Lamers · May 1, 2023, 2:08pm

I do not know if anyone still works with this method. I tried to classify my ITS fungal reads with it, but according to the results only three taxa are present:
k__Fungi;p__unclassified_Fungi;c__unclassified_Fungi;o__unclassified_Fungi;f__unclassified_Fungi;g__unclassified_Fungi;s__Fungi_sp;
k__Fungi
and:
k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Sordariomycetes_order_Incertae_sedis;f__Plectosphaerellaceae;g__Plectosphaerella;s__Plectosphaerella_sp_CCF3811;
Over 95% of the reads belong to the totally unclassified group. This doesn't sound correct to me. What did I do wrong?
I used sh_refs_qiime_ver9_97_s_29.11.2022 as the database.

colinbrislawn · May 1, 2023, 3:44pm

Hello Merijn,

Welcome to the forums!

Did you include any positive controls with a known composition along with your other samples? We can use those as a point of comparison.

I guess it's also possible something is wrong with the database. You could try one of the unite databases I built and see if that works better.

Let me know what you find,
Colin

Merijn_Lamers · May 1, 2023, 4:13pm

Hi Colin,

Thank you for the response.

There is no positive control (at least not in vivo; one could perhaps be added in silico).

As a small test I used NCBI's online blastn tool to find out which organisms were present. There I found hits with high similarity (E = 0.0 and 99+% similar) to species that are expected to be present in the samples.

I will take a look at the database to which you refer.

Kind regards,
Merijn

Nicholas_Bokulich · May 1, 2023, 5:21pm

Hi @Merijn_Lamers ,

Only 3 different taxonomic groups is highly abnormal. I think that there are a couple of issues causing this, both due to the raw form of the database that you are using. Some filtering is needed to remove low-quality sequences that can skew classification with the naive Bayes classifier in q2-feature-classifier.

The "unclassified_fungi" are sequences found in the UNITE database (UNITE + INSDC dataset). Quite a few sequences are present that are not fully annotated (and called "unclassified_fungi" instead). I recommend filtering these out to improve classification accuracy by quite a reasonable degree, as we demonstrated in this paper:

You should also filter out any abnormally short sequences, as these have a tendency to skew classification results as well.
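As a rough illustration of the two filters described above, here is a minimal Python sketch. It is not the QIIME 2 or RESCRIPt implementation: the toy reference entries, the helper name `keep_reference`, and the 200 bp length cutoff are all hypothetical and chosen only for the example.

```python
# Hypothetical toy reference: sequence ID -> (taxonomy string, sequence).
# Real UNITE files are much larger; these three entries are made up.
reference = {
    "ref1": (
        "k__Fungi;p__Ascomycota;c__Sordariomycetes;"
        "o__Sordariomycetes_order_Incertae_sedis;f__Plectosphaerellaceae;"
        "g__Plectosphaerella;s__Plectosphaerella_sp_CCF3811",
        "ACGT" * 100,  # 400 bp
    ),
    "ref2": (
        "k__Fungi;p__unclassified_Fungi;c__unclassified_Fungi;"
        "o__unclassified_Fungi;f__unclassified_Fungi;"
        "g__unclassified_Fungi;s__Fungi_sp",
        "ACGT" * 100,  # 400 bp, but no annotation below kingdom
    ),
    "ref3": (
        "k__Fungi;p__Ascomycota;c__Sordariomycetes;"
        "o__Sordariomycetes_order_Incertae_sedis;f__Plectosphaerellaceae;"
        "g__Plectosphaerella;s__Plectosphaerella_sp_CCF3811",
        "ACGT" * 10,  # 40 bp, abnormally short
    ),
}

MIN_LENGTH = 200  # assumed cutoff; pick one that suits your reference data


def keep_reference(taxonomy, sequence, min_length=MIN_LENGTH):
    """Keep a reference only if it has an annotated phylum and is long enough."""
    ranks = [rank.strip() for rank in taxonomy.split(";") if rank.strip()]
    has_phylum = any(
        rank.startswith("p__") and "unclassified" not in rank.lower()
        for rank in ranks
    )
    return has_phylum and len(sequence) >= min_length


filtered = {
    seq_id: entry
    for seq_id, entry in reference.items()
    if keep_reference(*entry)
}
print(sorted(filtered))
```

Only the fully annotated, full-length entry survives; the unannotated and the short entries are dropped before a classifier would be trained.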

Good luck!

Merijn_Lamers · May 2, 2023, 5:11pm

Hi Colin,

I used your solver, which was better, but still only a few OTUs were classified...
I think this might have something to do with the fact that my primers amplify the complete ITS region, and some of my consensus reads are over 3.5 kb long.

But I will keep on trying!

colinbrislawn · May 2, 2023, 6:36pm

Interesting. What sequencing technology did you use? Like, are these 3.5 kilobase pair long nanopore reads, or short Illumina reads assembled into the full-length ITS region?

Merijn_Lamers · May 3, 2023, 7:40am

We are trying to investigate the fungal community in about 50 samples.
To do so we used Nanopore and forward primer FUN18S1 (5’-CCATGCATGTCTAAGTWTAA-3’) (Lord et al., 2002) and reverse primer 28S R8 Deg R (5′-TTTCAAGACGGGTCGGTTRA-3′) (Lee, 2019).
The reads were then filtered (for chimeras and length, among other things) and clustered at 97% to create OTUs.
Now I want to know who is there among those OTUs, so to speak.
I came across this method and am a little familiar with QIIME, so I thought this could be a solution.

Nicholas_Bokulich · May 3, 2023, 9:33am

Hi @Merijn_Lamers ,

Thanks for these details — this changes (almost) everything

Last time I checked UNITE contained mostly/all partial ITS sequences. The standard release is also trimmed to the ITS region I believe, i.e., the flanking SSU and LSU domains (where your primers are located) are not covered. So your amplicons are actually larger than the reference sequences (probably quite a lot larger).

Where exactly do FUN18S1 and 28S R8 Deg R bind? Do you also amplify significant portions of the SSU and LSU, or are these regions directly flanking the ITS?

So this is effectively a database problem (UNITE) or a method problem (classify-sklearn) or both, depending on how you look at it. UNITE does not cover your entire reads, and the classify-sklearn method relies on kmer profiles, which will look quite different when comparing ITS fragments (UNITE) vs. SSU+ITS+LSU.
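To make the k-mer point concrete, here is a small Python sketch. It is only an illustration, not how classify-sklearn works internally: the sequences are random stand-ins, the flank lengths are assumptions loosely based on a 3.5 kb read, and k = 7 is an arbitrary choice.

```python
import random
from collections import Counter


def kmer_profile(seq, k=7):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_fraction(query, reference, k=7):
    """Fraction of the query's k-mers (counting repeats) also present in the reference."""
    query_profile = kmer_profile(query, k)
    reference_profile = kmer_profile(reference, k)
    total = sum(query_profile.values())
    shared = sum(
        count for kmer, count in query_profile.items()
        if kmer in reference_profile
    )
    return shared / total


rng = random.Random(0)  # fixed seed so the example is repeatable


def random_dna(length):
    return "".join(rng.choice("ACGT") for _ in range(length))


# Hypothetical lengths: a 600 bp ITS-only reference, and a long read that
# carries the same ITS flanked by SSU and LSU sequence (3.5 kb in total).
its_reference = random_dna(600)
long_read = random_dna(1500) + its_reference + random_dna(1400)

print("ITS-only query vs. ITS reference:", shared_fraction(its_reference, its_reference))
print("SSU+ITS+LSU read vs. ITS reference:", round(shared_fraction(long_read, its_reference), 2))
```

Even though the long read contains the reference sequence in full, most of its k-mers come from the flanking regions, so its profile looks quite different from anything the classifier saw in an ITS-only reference.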

Solutions:

1. You need a database that covers the entire region amplified. So UNITE might not be a good option (esp. if you have substantial portions of SSU and LSU in your reads). You could build this from NCBI perhaps (and RESCRIPt could be used to assemble this if you can identify suitable keywords).
2. You could try using one of the classify-consensus-* methods in q2-feature-classifier instead of classify-sklearn. These methods use alignment (blast local or vsearch global) and you can adjust the query coverage threshold to accept hits that do not fully cover your queries (because the reference seqs are shorter than the query!). You would need to figure out a suitably low coverage threshold, keeping in mind that the UNITE database contains many fragments that are only ITS1 or ITS2.

Option 2 is certainly the easier to try first.
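A back-of-the-envelope Python sketch of why the query coverage threshold matters for the alignment-based option. The 3.5 kb read length comes from this thread; the 600 bp and 200 bp hit lengths and the three thresholds are assumptions for illustration, and the helper names are hypothetical rather than part of q2-feature-classifier.

```python
def query_coverage(aligned_query_bases, query_length):
    """Fraction of the query that is covered by the alignment."""
    return aligned_query_bases / query_length


def accepts_hit(aligned_query_bases, query_length, min_query_cov):
    """True if the hit covers at least min_query_cov of the query."""
    return query_coverage(aligned_query_bases, query_length) >= min_query_cov


READ_LENGTH = 3500  # consensus read length mentioned in this thread

# Assumed alignment lengths for two kinds of reference sequence.
hits = {
    "full ITS reference (about 600 bp)": 600,
    "ITS2-only fragment (about 200 bp)": 200,
}

for threshold in (0.8, 0.15, 0.05):  # example thresholds only
    for label, aligned in hits.items():
        coverage = query_coverage(aligned, READ_LENGTH)
        verdict = "accepted" if accepts_hit(aligned, READ_LENGTH, threshold) else "rejected"
        print(f"min coverage {threshold:.2f}: {label} covers {coverage:.2f} -> {verdict}")
```

With reads this long, a strict threshold rejects every reference hit, while a very low one also lets short ITS1-only or ITS2-only fragments through, so the threshold probably needs some trial and error.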

Yes, QIIME 2 should work, but it might take a little fiddling as the tutorials and default settings for various methods assume short-read sequencing, not long-read.

Please try out that solution and let me know what you find.

colinbrislawn · May 3, 2023, 3:39pm

Thank you for telling us more! This is a unique use case, which will be a little more challenging than the easy 16S V4 regions shown in many tutorials.

Take a look at Nick's excellent advice and let us know how this works for you. We are always interested in new primer sets and this thread is very helpful for future researchers!

Merijn_Lamers · May 5, 2023, 7:25am

Thank you both!

Yes, I already found out that the UNITE database has some flaws for classifying the large region that we amplified: a quick BLAST search with some of my sequences against the database yielded only short overlapping hits.

The primers start in the SSU and end in the LSU, which is why we have these long reads, and parts of both subunits are in each read.

Currently I am experimenting with a database created from the NCBI nt database. This seems promising. The problem is not a lack of data but rather that a lot of it is poorly or simply wrongly annotated, but that is to be expected...

Yes, I tried this and found that the UNITE database is not the perfect match for what I am trying to do.

Funny you mention that, I figured something like that might be the case when I started looking around...

I think I can achieve some useable data making use of a larger database.

Thanks again!

system · June 5, 2023, 1:25pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.