Poor taxonomy classification on species level

biojack · May 13, 2022, 6:31am

Hi

I'm using SILVA 138.1 to construct taxonomy for biome 16S analysis. I got NR99 version and used some additional filters ( remove uncultured and unidentified taxons, also taxons which not represented at species level )

I used next commands to create classificator on V4 regions

qiime feature-classifier extract-reads  --i-sequences Silva_NR99_filter.qza  --p-f-primer GTGCCAGCMGCCGCGGTAA --p-r-primer GGACTACHVGGGTWTCTAAT  --p-trunc-len 250 --p-min-length 50 --p-max-length 450 --o-reads Silva_NR99_filter_V4.qza

qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads Silva_NR99_filter_V4.qza --i-reference-taxonomy Silva_tax.qza --o-classifier classifier_Silva_V4.qza

then after running qiime2 pipeline I got not very good results. At species level I have 186/341 classified (54%). But on genera level it much better - 170/200 (85%) - but also should be a little better on practice

I tried to undetstand why I have such bad results on species level. Particularly I considered "Akkermansia" genera. So in fact my pipeline didn't recognize "Akkermansia muciniphila" but recognize only "Akkermansia".

I tried to keep only one strain "Akkermansia muciniphila" from Silva 138.1 Db. With such settings pipeline was done fine and I got "Akkermansia muciniphila" in results. But when I kept both "Akkermansia muciniphila" and "Akkermansia muciniphila ATCC BAA-835", the classification gave me only "Akkermansia" genera.

So it seems that such recognition problem is related to redundancy in some sence in sequences used for taxonomy. Also it looks that there could be some solution for reduce this redundancy - for example like removing all species except "Akkermansia muciniphila" - because "Akkermansia muciniphila" in results instead "Akkermansia muciniphila ATCC BAA-835" is much better than just "Akkermansia" on genera level

Do you have some suggestions how could such results be improved? IMO improving to 75-80% recognition on species level will be adequate result

Thank you for your attention : )

Nicholas_Bokulich · May 13, 2022, 6:59am

Hi @biojack ,

The V4 region of 16S cannot reliably differentiate most species, so this result is quite expected. Also at genus level 85-90% classification is normal. This is because of the limited information content of the short hypervariable regions vs. full 16S. For more prose and data on this, see the previous work on this starting with RDP's benchmarks, and then our own many benchmarks for the classifiers used in QIIME 2:
https://journals.asm.org/doi/10.1128/AEM.00062-07#F1

https://www.nature.com/articles/s41467-019-12669-6

Note that SILVA does not curate their taxonomy at species level. They take the info directly from NCBI. So a very large proportion are missing or do not match the genus label (for data and more discussion of this and related topics, see the next paper linked below). This is also why you see annotations with strain labels like this, because that is the raw label from the GenBank accession:

Quick note on Akkermansia: this is a good example of how database "noise" does create issues (more on this below), but Akkermansia is also a bit of a poor representative for other more populated clades in the database, since the genus only have a couple species (and I think only A. muciniphila is in SILVA if I recall correctly). Other genera have genuine problems resolving at species level because those species have very similar or identical 16S sequences (esp. when looking at only short amplicons).

Yes. We released a QIIME 2 plugin a little while ago that will let you create such databases and perform various filtering steps (and record all of these edits in QIIME 2's provenance for full traceability of what you did, so you can easily do it again ). This includes a method to edit-taxonomy, which you could use with a regular expression to trim off strain information... or filter-taxa that are missing information at specific ranks, etc. You can find a tutorial for using this with SILVA here:

We used RESCRIPt to do quite a few database and filtering benchmarks for various marker genes and dbs. You can see that these edits lead to ~75% classification accuracy at species level with SILVA. Removing redundant/confusing/mis/unannotated entries from the database really does help, and that was one original motivation for this plugin.

This is based on cross-validated classification of the known reference sequences, so the 75% classification accuracy here should not be interpreted as "this method is only accuracy 75% of the time". Rather, this is telling you how much species-level resolution you could possibly expect when studying a community with mostly well-characterized microbial composition. Mileage may vary, as some systems contain many more uncharacterized microbes than others (and hence may be poorly represented by a given database). QIIME 2's classifiers account for this by using confidence intervals to prevent the classifier from giving a species (or genus)-level classification when there are multiple possible hits at that level. So the low number of species-level hits in a real database can be related to noisy databases or underperforming methods for sure, but at least 20-25% of the lack of resolution is due to the inability to resolve individual species using very short DNA segments.

But there are other approaches that can improve this, even up to around 90% classification accuracy at species level... see the benchmarks and papers above for mcuh more description of the various methodological approaches and limitations that we have explored.

Good luck!

biojack · May 13, 2022, 8:14am

Thank you @Nicholas_Bokulich very much !

I know about RESCRIPt, I tried once to use id dereplication step, but it was not give notable effect. But I will try to learn it more preciesly.

I will try your suggestions and return with new results

Also, if Silva is not curated species level taxonomy - would be better to use some another database for species recognition - GTDB or smth else?

Nicholas_Bokulich · May 13, 2022, 10:55am

yes the impact of dereplication on accuracy is negligible, but it should improve performance (speed). Other steps, e.g., using RESCRIPt to edit the taxonomy to remove strain labels, would help a little more for specific clades.

Yes, but at a cost.

GTDB is more computer-digestible (i.e., better ability to resolve clades) in the sense that they have already worked to disentangle polyphyletic clades. But this is by introducing unofficial taxonomic labels (bypassing the usual formal steps for naming new clades). So it comes with theoretical and practical costs: at the very least you (and collaborators, etc) might find it hard to interpret the results and compare to the literature using other (e.g., the official) taxonomic naming schemes. I like GTDB but it gives some people indigestion, so these are costs to weigh.

Also, GTDB does not totally "solve" this issue, because V4 is still a short sub-region of the 16S with limited information content, so you can still only resolve species 75-80% of the time (even in GTDB). If you really want something radically different, see the third paper I linked above (Kaehler et al. 2019), this brings the accuracy rate to around 90% at species level, but it is a more complicated approach.

If you want some examples, you can see those benchmarks in the RESCRIPt paper, comparing GTDB, SILVA, NCBI Refseqs, and Greengenes (on both simulated and real data).

biojack · May 23, 2022, 12:13pm

@Nicholas_Bokulich hi

I studied your references and provided some analysis on ReSCRIPT tool

I tried Silva first and ran totally again ReSCRIPT as in instruction. I do not use dereplication step and "edit-taxonomy" step. Then I also filtered out all 'Uncultured' and 'Unidentified' species and built classifier. On qiime analysis then I got 235/375 (63%) species that's better than my privious results. So it seems that main advantage was in reducing strain name to species name and it gave benefit

Then I used GTDB database from official site and with ReSCRIPT filters I got 298/450 (66%) that even better than Silva. I should say that GTDB files preliminary filtered and there no 'uncultured' or 'unidentified' species

Last, I tried RefSeq + RDP database taken from zenodo.org resource. It looks promising but after analysis I got only 176/276 (64%) species. Percent is ok, but total recognized species not good.

So to summarize my current results:

Silva: 235/375 (63%)
GTDB: 298/450 (66%)
RefSeq: 176/276 (64%)

Also I wouldlike to ask some additional questions:

do you think that some other databases could be useful for taxonomy building and could improve results above ( databases like Unite or GreenGenes )
is there some tools to use edit-taxonomy of ReSCRIPT in authomatic way? I mean maybe some input file with renaming of taxons. Escherichia-Shigella vs Salmonella was clean example but I guess that there much more taxons that should be corrected. There could be a months of routine work. But I guess that someone probably already done such correction
I should also note that in all my investigations I done mapping of taxonomy to HMP database. I did it because I need only species which could live in human. But such mapping could give some losings. For example I lost Hungathella_A from GTDB though some species could live in human as I know. So question is - could there be a more reliable rule to keep only human biome bacteria from taxonomy?

Nicholas_Bokulich · May 23, 2022, 3:05pm

Hi @biojack

It is not too surprising that these all give quite similar results, even after some cleanup. This suggests that the lack of species-level classification is probably due to resolution issues rather than database noise (though clearly that is a small component).

No these are the main/leading 16S databases out there. UNITE is for fungal ITS. Greengenes was last updated 9 years ago (some annotations are out-of-date) and even many of the reference sequences are not annotated at species level, specifically because of the issue of low resolution of the 16S (i.e., in greengenes if the 16S sequence is identical or nearly matching between 2 species, the species annotation is removed).

There is really no way to escape the fact that short variable regions of the 16S have limited resolving power.

Yes, either by inputting a regular expression (regex) to edit based on pattern matching, or you can input a "replacement map" TSV file containing as many edits as you want to perform in a single pass. See the help docs (qiime rescript edit-taxonomy --help) for more details.

Yeah I would recommend against this approach, since it makes some assumptions that can lead to some potentially invalid results (e.g., it assumes that HMP is representative of all human populations; also, contaminants that are not usually found in humans could map to related species that are found in humans).

You might want to check out this paper that I linked to above. This would allow you to weight the classification of microbes already found in humans, without ruling out the possibility of contaminants or surprises.

Good luck!

biojack · May 24, 2022, 8:20am

Hi, @Nicholas_Bokulich Thank you for answer

I mean probably someone already constrcucted such TSV "replacement map"? Because it seems not very fast task to manually create such file from beginning. Because there could be a lot of issues. It would be very helpful at least if such file would be just a hint how most of cases could be solved.

I'm not sure that I correctly got it. My task is in taking from taxonomy (for example GTDB) only such species which could met in human biome. So I need the rule how I could correctly get that subset from GTDB. You suggested to weight human bacteria but I need list of such bacteria to know what to weight. Or Maybe you meant that I need to weight bacteria successfully mapped on HMP with "good" weight and rest with "bad" weight ?

Nicholas_Bokulich · May 24, 2022, 9:09am

I am not aware of any such document. This would be highly customized and person/project specific.

I suggest reading that paper and the associated "clawback" tutorial on this forum for more details on how to do this, if this is something that you want to try.

Subsetting only specific taxa, or manually specifying these weights is NOT something that I would recommend. So follow the tutorial to see how this can be done for different sample types.

system · June 24, 2022, 3:09pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.