Post analysis taxonomy results

jwdebelius · August 28, 2020, 3:07pm

You're seeing 2 things here. First, the greengenes database lacks lower level annotations (genus and species most frequently) for a lot of organisms. This is partially due to limitations in our biological understanding and partially due to the information available to the classifier when the database was built. So, your second string,

Says the classifier you ran could find a genus, but the database doesn't have a genus name. (In greengenes, nameless genera are denoted g__;)

In your other example,

You couldn't classify the sequence past family level, meaning the genus isn't there.

When I present this kind of data, I typically present the first as "unspecified f. Oceanospirillales" and the second as "unclassified f. Oceanospirillales", but different people have different naming conventions. I'd keep all three sequences because there's something informative in all three of them. I tend to drop things that can't be classified or specified at higher levels - my cutoff is typically phylum - because there's a high probability that those are spurious.
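If it helps to see that convention spelled out, here's a rough Python sketch of how I'd turn a greengenes-style string into one of those labels, plus the phylum cutoff. The function names (`parse_taxonomy`, `label`, `has_phylum`) are made up for illustration and aren't part of QIIME 2, and the example strings are invented rather than the ones from your results. It assumes ranks are separated by ";" and prefixed like "k__", "p__", ..., "s__".

```python
# Rank prefixes as they appear in greengenes-style taxonomy strings.
RANK_PREFIXES = ["k", "p", "c", "o", "f", "g", "s"]


def parse_taxonomy(taxon):
    """Split a taxonomy string into (prefix, name) pairs, one per rank present."""
    levels = []
    for field in taxon.split(";"):
        field = field.strip()
        if not field:
            continue  # skip the empty piece left by a trailing ";"
        prefix, _, name = field.partition("__")
        levels.append((prefix, name))
    return levels


def label(taxon):
    """Name a feature by its deepest named rank.

    "unspecified": deeper ranks are present in the string but empty (e.g. "g__").
    "unclassified": the string simply stops at the deepest named rank.
    """
    levels = parse_taxonomy(taxon)
    named = [i for i, (_, name) in enumerate(levels) if name]
    if not named:
        return "unclassified"
    last = named[-1]
    prefix, name = levels[last]
    if last == len(RANK_PREFIXES) - 1:
        return f"{prefix}. {name}"  # named all the way down to species
    if last < len(levels) - 1:
        return f"unspecified {prefix}. {name}"
    return f"unclassified {prefix}. {name}"


def has_phylum(taxon):
    """True if the string carries a non-empty phylum name (my usual cutoff)."""
    return any(prefix == "p" and name for prefix, name in parse_taxonomy(taxon))


# Invented example strings, not taken from the original question.
examples = [
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
    "o__Oceanospirillales; f__Halomonadaceae; g__; s__",
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
    "o__Oceanospirillales; f__Halomonadaceae",
    "k__Bacteria; p__",
]

for taxon in examples:
    keep = "keep" if has_phylum(taxon) else "drop"
    print(f"{label(taxon)} ({keep})")
```

The only thing this checks is whether the deeper ranks are present but empty or missing altogether, which is the same distinction I described above. Adjust the wording of the labels to whatever convention you prefer.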

There are two options to improve your classification/ability to name. One option is to switch to Silva. This is a larger database, and it's newer. So, some of the things that are unnamed in greengenes may be named in Silva. You could also try the clawback plugin (tutorial here), which builds environment-specific (bespoke) classifiers.

I'm sure there are also several good threads that I'm not thinking of at the moment. You might check the "best of" tag and try searching for more information.

Best,
Justine