Greengenes Repeated Classifications


Since the related thread (Repeated classifications in the Greengenes taxonomy file) is already closed, I have a few follow-up questions along the same lines. I have a handful of sequences that were classified only to the “Bacteria” level of taxonomy. If I understood the response to the previous post, this is most likely due to inaccurate source annotations, sequence errors, or chimeras.

I would be much obliged if the following questions could be addressed:

1.) Is it safe to assume that sequences classified only as “Bacteria” are closely related, or is that taxonomic level too broad to assume so? Would sequences classified down to lower taxonomic levels (the family level, for example) be more safely assumed to be closely related?

2.) When exported as a CSV, the data are condensed, so all of the sequences labelled as “Bacteria” are combined into one unit. Is there any way of analyzing such a classification? For example, one option I tried was running the highest identity sequence through BLAST and using the resulting organism as a proxy for the “Bacteria” sequences. I’m not sure if this is viable, however. Would the highest identity “Bacteria” be the best representative of all sequences classified as “Bacteria”?

Thank you in advance! Let me know if these questions make sense. @wasade

It sounds like your question is different. The post you linked to was regarding replicate taxonomic annotations in the reference database, whereas it sounds like you are interested in replicate taxonomic classifications on unknown sequences.

Taxonomic replicates in reference sequences that have been clustered into OTUs are due to misannotation, low-level chimera/error, or the fact that 97% similarity does not necessarily correspond to species differences.

Replicates in classification are not related to this.

There should be no assumption that these are closely related. Anything classified only to kingdom level is most likely non-target DNA (NCBI BLAST can be used to check this) and should be removed.

Yes. But even so, I would not assume that they are closely related unless:

  1. they have the same species classification
  2. you build a dendrogram to show how closely related they are.
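To make the first condition concrete, here is a minimal Python sketch that counts how many ranks two taxonomy strings share. It assumes Greengenes-style labels such as `k__Bacteria; p__Firmicutes; ...`, where an empty `p__` means the rank was not assigned; the function names are my own, not part of QIIME 2. A shared depth of 7 would mean the same species label, which is still not proof of close relatedness.

```python
def named_ranks(taxon):
    """Return the ranks that carry a name, stopping at the first empty one."""
    ranks = []
    for rank in taxon.split(";"):
        rank = rank.strip()
        # 'p__Firmicutes' -> 'Firmicutes'; an empty 'p__' -> ''
        name = rank.split("__", 1)[-1]
        if not name:
            break
        ranks.append(rank)
    return ranks


def shared_depth(taxon_a, taxon_b):
    """Count the leading ranks on which two classifications agree."""
    depth = 0
    for a, b in zip(named_ranks(taxon_a), named_ranks(taxon_b)):
        if a != b:
            break
        depth += 1
    return depth


# Two features classified only as Bacteria agree on a single rank.
print(shared_depth("k__Bacteria; p__; c__", "k__Bacteria; p__"))

# Two Firmicutes from different classes agree on two ranks.
print(shared_depth(
    "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales",
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales",
))
```

A shared depth of 1 only says that both features are bacteria, which is why kingdom-level labels say nothing about relatedness.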

This only happens if exporting barplot data as a CSV. You could use metadata tabulate, or export to biom and merge taxonomy, to keep these features separate.

Use metadata merge (see link above) to merge taxonomy info and sequence data for each feature. This will be much more useful for examining individual sequences, e.g., to cross-check against NCBI BLAST.
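As a rough illustration of what the merged table makes possible, here is a minimal Python sketch that pulls out the features whose taxonomy stops at “Bacteria” and writes them as FASTA for a BLAST cross-check. It assumes the merged table was saved as a TSV with columns named `Feature ID`, `Taxon`, and `Sequence`, and that labels are Greengenes-style (`k__Bacteria; p__; ...`). The column names and helper names are assumptions, so adjust them to match your own export.

```python
import csv
import io


def is_kingdom_only(taxon):
    """True when the label names Bacteria and no deeper rank is assigned."""
    # 'k__Bacteria' -> 'Bacteria'; an empty 'p__' -> ''
    names = [rank.strip().split("__", 1)[-1] for rank in taxon.split(";")]
    return names[0] == "Bacteria" and not any(names[1:])


def kingdom_only_features(tsv_text):
    """Return the rows of a merged taxonomy/sequence TSV that stop at Bacteria."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        # Skip directive rows such as '#q2:types' if the export contains one.
        if row["Feature ID"].startswith("#"):
            continue
        if is_kingdom_only(row["Taxon"]):
            rows.append(row)
    return rows


def to_fasta(rows):
    """Format rows as FASTA records, one per feature."""
    return "".join(
        f">{row['Feature ID']} {row['Taxon']}\n{row['Sequence']}\n" for row in rows
    )


# Made-up example data, only to show the expected table shape.
example = (
    "Feature ID\tTaxon\tSequence\n"
    "f1\tk__Bacteria; p__; c__\tACGTACGT\n"
    "f2\tk__Bacteria; p__Firmicutes; c__Bacilli\tTTGACCAA\n"
    "f3\tk__Bacteria\tGGCATGCA\n"
)
print(to_fasta(kingdom_only_features(example)))
```

Each feature stays a separate record, so individual sequences can be pasted into NCBI BLAST one at a time instead of being lumped into a single “Bacteria” unit.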

Using the top BLAST hit as a proxy is not viable, because these sequences are not necessarily related to each other.

If you are getting large numbers of sequences that are only assigned as “Bacteria”, re-running these through BLAST is probably NOT a good solution. Most likely (in order of likelihood):

  1. These are non-target DNA, e.g., host DNA.
  2. You are not using the correct reference database.
  3. (much less likely) you need to adjust your taxonomy classifier parameters.

I would bet on #1; using metadata merge and then running a few of these sequences through NCBI BLAST can help confirm. Otherwise, check that you are using the correct reference database (e.g., don’t use V4 sequences to classify V3 sequences!).
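To gauge how large that share actually is before digging further, a small Python sketch like the following can tally the fraction of reads that belong to features classified only as “Bacteria”. It assumes you already have per-feature read totals and taxonomy labels loaded into two dictionaries keyed by feature ID; the dictionary layout, the function names, and the example numbers are all made up for illustration.

```python
def is_kingdom_only(taxon):
    """True when the label names Bacteria and no deeper rank is assigned."""
    # 'k__Bacteria' -> 'Bacteria'; an empty 'p__' -> ''
    names = [rank.strip().split("__", 1)[-1] for rank in taxon.split(";")]
    return names[0] == "Bacteria" and not any(names[1:])


def kingdom_only_fraction(counts, taxonomy):
    """Fraction of all reads that sit in features resolved only to Bacteria."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    unresolved = sum(
        n for feature_id, n in counts.items()
        # Features missing from the taxonomy table are not counted as kingdom-only.
        if is_kingdom_only(taxonomy.get(feature_id, ""))
    )
    return unresolved / total


# Hypothetical read totals per feature, summed across samples.
counts = {"f1": 30, "f2": 60, "f3": 10}
taxonomy = {
    "f1": "k__Bacteria; p__; c__",
    "f2": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "f3": "k__Bacteria",
}
print(f"{kingdom_only_fraction(counts, taxonomy):.1%} of reads are kingdom-only")
```

A high fraction points toward non-target DNA or a mismatched reference database rather than something a few BLAST lookups would resolve.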

I hope that helps!


Wow, thank you very much for the thorough answers! I’ll work on this and get back with how successful it was.

Best wishes!
