I have a question about interpreting the taxonomy strings produced by consensus BLAST. For example, running classify_consensus_blast gives output like the following:
In both cases there is an assignment to the order level. However, the first taxonomic string indicates that no family-level assignment could be made above our confidence threshold, while the second indicates a confidence of 1.0 in the order-level classification, with all of the hits unclassified at the family, genus, and species levels. In this way, an ‘unclassified’ (blank) genus and species can yield 100% confidence if all of the taxonomies at that level are indeed blank.
Are these differences material? Should we interpret our confidence in the first string differently from our confidence in the second?
Hi John,
What you are observing is a peculiarity of the Greengenes database, not of classify_consensus_blast. Your interpretation of the first sequence is correct (more notes on “confidence” below). The second sequence, however, is actually being classified to species level by classify_consensus_blast: the top matches are all to Greengenes reference sequences whose family, genus, and species annotations are empty. Greengenes contains a number of annotations like this, with empty taxonomic levels wherever taxonomic affiliation could not be fully resolved. Needless to say, this creates confusion when sequences are assigned these ambiguous taxonomies! If classify_consensus_blast had failed to find a consensus taxonomy for this sequence beyond order level, the output would instead be: a6837f53649dd3ec008d38c528d43aa7 k__Bacteria; p__Chloroflexi; c__Anaerolineae; o__GCA004
In the results generated by classify_consensus_blast, “Confidence” is the fraction of top hits that match the consensus taxonomy (at whatever level is provided). So this indicates that 80% of top hits matched k__Bacteria; p__Verrucomicrobia; c__[Pedosphaerae]; o__[Pedosphaerales] for the first sequence, and 100% matched k__Bacteria; p__Chloroflexi; c__Anaerolineae; o__GCA004; f__; g__; s__ for the second sequence. Consensus is determined at each taxonomic level, descending from kingdom and stopping at the first level where agreement among the top hits falls below the minimum consensus threshold; the taxonomy is trimmed at this point. So confidence is interpreted the same way for all assignments, but it applies at different taxonomic levels.
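To make the level-by-level logic concrete, here is a minimal sketch of that consensus procedure. It is not the actual classify_consensus_blast implementation: the function name `consensus_taxonomy`, the default threshold of 0.51, and the family labels `f__A` and `f__B` in the usage note are assumptions made purely for illustration.

```python
from collections import Counter


def consensus_taxonomy(hit_taxonomies, min_consensus=0.51):
    """Return (consensus taxonomy, confidence) for a list of top-hit taxonomies.

    Illustrative only: a hypothetical re-statement of the idea, not QIIME 2 code.
    """
    if not hit_taxonomies:
        return "Unassigned", 0.0

    # Split each taxonomy into ranks; empty labels such as 'f__' are kept as-is.
    split_hits = [[rank.strip() for rank in t.split(";")] for t in hit_taxonomies]
    n_hits = len(split_hits)
    max_depth = max(len(hit) for hit in split_hits)

    consensus, confidence = [], 0.0
    for depth in range(max_depth):
        # Only hits that agree with the consensus so far can extend it.
        labels = Counter(
            hit[depth]
            for hit in split_hits
            if len(hit) > depth and hit[:depth] == consensus
        )
        if not labels:
            break
        label, count = labels.most_common(1)[0]
        fraction = count / n_hits
        if fraction < min_consensus:
            # Consensus is lost here, so the taxonomy is trimmed at this level.
            break
        consensus.append(label)
        confidence = fraction

    if not consensus:
        return "Unassigned", 0.0
    return "; ".join(consensus), confidence
```

With five made-up hits, four of which share o__[Pedosphaerales] but split evenly between two families, this sketch stops at order level with a confidence of 0.8. With hits that all carry the identical string ending in f__; g__; s__, it returns the full seven-level string with a confidence of 1.0, because the empty labels agree with each other.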
Please let me know if you have any more questions or concerns. Thanks!
Thanks Nick, this is pretty much in line with my understanding of what was happening. My question was not so much about how classify_consensus_blast arrives at the result, but rather about how to interpret the meaning of the result. It may be more of a question for the Greengenes maintainers than for the developers of the classifier.
Ultimately what I am concerned with is understanding how to treat the taxonomy strings in an analysis. I can’t think of a situation in which it would be inappropriate to truncate assignments such as: k__Bacteria; p__Chloroflexi; c__Anaerolineae; o__GCA004; f__; g__; s__
to: k__Bacteria; p__Chloroflexi; c__Anaerolineae
and then treat it the same as something that was only assigned to the class level in the first place, but my understanding may be off.
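As a minimal sketch of the truncation described above, the helper below drops trailing ranks that have a prefix but no name, and can optionally cut the string to a fixed depth. The function name `trim_empty_ranks` and its `max_depth` parameter are hypothetical, introduced here only for illustration.

```python
def trim_empty_ranks(taxonomy, max_depth=None):
    """Drop trailing empty ranks (e.g. 'f__', 'g__', 's__') from a taxonomy string.

    Hypothetical helper for illustration; max_depth optionally keeps only the
    first N ranks (3 = class level for kingdom-first strings).
    """
    ranks = [rank.strip() for rank in taxonomy.split(";")]

    # A rank is empty when nothing follows its 'x__' prefix.
    while ranks and ranks[-1].split("__", 1)[-1] == "":
        ranks.pop()

    if max_depth is not None:
        ranks = ranks[:max_depth]
    return "; ".join(ranks)
```

Note that trimming only the empty ranks leaves the string at o__GCA004; reaching the class-level string in the example requires passing `max_depth=3`, which also discards the named order.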
Thanks John. I think your question gets at the core of why it is problematic to provide ambiguous annotations in the first place: these data are a quandary to interpret and, worse, to present to others.
I think that your solution is useful from a practical standpoint, and there is nothing inappropriate with collapsing this taxon into a class-level Anaerolineae taxon. It would, however, confuse many people if you are presenting a class-level taxon side-by-side with, say, genus-level taxa, e.g., in a barplot.
From an analytical perspective, preserving the raw assignments may be informative, e.g., if k__Bacteria; p__Chloroflexi; c__Anaerolineae; o__GCA004; f__; g__; s__ but not k__Bacteria; p__Chloroflexi; c__Anaerolineae differentiates your experimental groups. If you do find such situations, you could always re-classify this sequence using a different reference database to see if you achieve better resolution.
From a classification perspective, it is important to distinguish between a classification that could be confidently made to species level vs. one that could only be made to class level. The former indicates that you have a close match to a previously described organism present in the reference database, even if that organism’s taxonomy is not fully resolved (e.g., in the case of uncultured organisms, as I think you may be dealing with). The latter means you have a sequence that matches the reference database very poorly, which could indicate A) a potentially novel organism (yay), B) an error-ridden sequence or chimera (boo), or C) that you’re using the wrong reference database!