Consensus blast taxonomy strings

Nicholas_Bokulich · May 11, 2017, 5:07pm

Hi John,
What you are observing is a peculiarity of the Greengenes database, not of classify_consensus_blast. Your interpretation of the first sequence is correct (more notes on "confidence" below). The second sequence, however, is actually being classified to species level by classify_consensus_blast: its top matches are all to Greengenes reference sequences that are annotated with empty family, genus, and species fields. Greengenes contains a number of annotations like this, with empty taxonomic levels wherever taxonomic affiliation could not be fully resolved. Needless to say, this creates confusion when sequences are assigned these ambiguous taxonomies! If classify_consensus_blast had failed to find a consensus taxonomy for this sequence beyond order level (that is, at family, genus, or species), the output would instead be:
a6837f53649dd3ec008d38c528d43aa7 k__Bacteria; p__Chloroflexi; c__Anaerolineae; o__GCA004

In the results generated by classify_consensus_blast, "Confidence" is the fraction of top hits that match the consensus taxonomy (at whatever level is provided). Here it indicates that 80% of top hits matched k__Bacteria; p__Verrucomicrobia; c__[Pedosphaerae]; o__[Pedosphaerales] for the first sequence, and 100% matched k__Bacteria; p__Chloroflexi; c__Anaerolineae; o__GCA004; f__; g__; s__ for the second sequence. Consensus is determined at each taxonomic level, descending from kingdom, and stops at the first level where agreement among the top hits no longer meets the minimum consensus threshold; the taxonomy is trimmed at that point. So confidence is interpreted the same way for all assignments, but it may refer to different taxonomic levels.
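
To make the trimming logic concrete, here is a rough Python sketch of the idea. This is not the plugin's actual code: the function name consensus_taxonomy, the "Unassigned" fallback label, the 0.51 default threshold, and the two made-up family names in the example are all assumptions for illustration only.

```python
from collections import Counter


def consensus_taxonomy(hits, min_consensus=0.51):
    """Return (consensus taxonomy string, fraction of hits supporting it).

    hits: taxonomy strings of the top hits, levels separated by ';'.
    min_consensus: assumed threshold; a level is kept only if at least
    this fraction of hits agree on the lineage down to that level.
    """
    if not hits:
        raise ValueError("at least one hit is required")

    lineages = [[level.strip() for level in hit.split(";")] for hit in hits]
    consensus, fraction = [], 0.0

    # Walk down from kingdom; stop at the first level without consensus.
    for depth in range(min(len(lineage) for lineage in lineages)):
        prefixes = Counter(tuple(lineage[:depth + 1]) for lineage in lineages)
        prefix, count = prefixes.most_common(1)[0]
        level_fraction = count / len(lineages)
        if level_fraction < min_consensus:
            break  # trim the taxonomy here
        consensus, fraction = list(prefix), level_fraction

    if not consensus:
        return "Unassigned", 0.0  # hypothetical label for "no consensus"
    return "; ".join(consensus), fraction


# Five hypothetical top hits: four agree down to order, then split at family.
order = "k__Bacteria; p__Verrucomicrobia; c__[Pedosphaerae]; o__[Pedosphaerales]"
first = [
    order + "; f__MadeUpFamilyA; g__; s__",
    order + "; f__MadeUpFamilyA; g__; s__",
    order + "; f__MadeUpFamilyB; g__; s__",
    order + "; f__MadeUpFamilyB; g__; s__",
    "k__Bacteria; p__Chloroflexi; c__Anaerolineae; o__GCA004; f__; g__; s__",
]
print(consensus_taxonomy(first))

# Five identical hits with empty family/genus/species annotations.
second = ["k__Bacteria; p__Chloroflexi; c__Anaerolineae; o__GCA004; f__; g__; s__"] * 5
print(consensus_taxonomy(second))
```

In the first case the taxonomy is trimmed at order with a fraction of 0.8. In the second, the empty f__; g__; s__ labels are treated like any other label, so they are kept with a fraction of 1.0, which is the behavior described above.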

Please let me know if you have any more questions or concerns. Thanks!

Nick