Greengenes Repeated Classifications

Nicholas_Bokulich · June 13, 2018, 4:46pm

It sounds like your question is different. The post you linked to was regarding replicate taxonomic annotations in the reference database, whereas it sounds like you are interested in replicate taxonomic classifications on unknown sequences.

Taxonomic replicates in reference sequences that have been clustered into OTUs are due to misannotation, low-level chimera/error, or the fact that 97% similarity does not necessarily correspond to species differences.

Replicates in classification are not related to this.

There should be no assumption that these are closely related. Anything classified only to kingdom level is most likely non-target DNA (NCBI BLAST can be used to check this) and should be removed.
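
As a rough illustration of how to find the features classified only to kingdom level, here is a minimal Python sketch. It assumes Greengenes-style taxonomy strings (`k__Bacteria; p__Firmicutes; ...`) already loaded into a dict keyed by feature ID. The function names and the example feature IDs are hypothetical and are not part of QIIME 2.

```python
def assigned_depth(taxon: str) -> int:
    """Count the leading ranks that carry an actual name."""
    depth = 0
    for rank in taxon.split(";"):
        rank = rank.strip()
        # Strip a Greengenes-style prefix such as "k__" or "p__" (assumed format).
        name = rank.split("__", 1)[1] if "__" in rank else rank
        if not name or name.lower() == "unassigned":
            break  # stop at the first empty or unassigned rank
        depth += 1
    return depth


def kingdom_only(taxonomy: dict[str, str]) -> list[str]:
    """Return feature IDs whose classification stops at kingdom level."""
    return [fid for fid, taxon in taxonomy.items() if assigned_depth(taxon) == 1]


# Hypothetical example data, for illustration only.
example_taxonomy = {
    "feat1": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "feat2": "k__Bacteria",
    "feat3": "k__Bacteria; p__; c__",
    "feat4": "Unassigned",
}
print(kingdom_only(example_taxonomy))
```

The resulting IDs are the candidates to spot-check against NCBI BLAST before removing them.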

Yes. But even so, I would not assume that they are closely related unless:

they have the same species classification
you build a dendrogram to show how closely related they are.

This only happens if exporting barplot data as a TSV. To keep these features separate, you could use metadata tabulate, or export to biom and merge the taxonomy.

Use metadata merge (see link above) to merge taxonomy info and sequence data for each feature. This will be much more useful for examining individual sequences, e.g., to cross-check against NCBI BLAST.
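
To make the idea concrete, here is a small sketch of what merging taxonomy and sequence data per feature amounts to. It is plain Python, not how QIIME 2 does it internally. It assumes both tables are already loaded as dicts keyed by feature ID; the function names, feature IDs, and sequences are made up for illustration.

```python
def merge_taxonomy_and_sequences(taxonomy, sequences):
    """Join taxonomy and sequence by feature ID: {id: (taxon, sequence)}."""
    merged = {}
    for fid, seq in sequences.items():
        # Features missing from the taxonomy table are labelled "Unassigned".
        merged[fid] = (taxonomy.get(fid, "Unassigned"), seq)
    return merged


def to_fasta(merged, feature_ids):
    """Write the chosen features as FASTA text, ready to paste into NCBI BLAST."""
    lines = []
    for fid in feature_ids:
        taxon, seq = merged[fid]
        lines.append(f">{fid} {taxon}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


# Hypothetical example data, for illustration only.
taxonomy = {"feat1": "k__Bacteria; p__Firmicutes", "feat2": "k__Bacteria"}
sequences = {"feat1": "ACGTACGTAC", "feat2": "TTGACCGGTA", "feat3": "GGGCCCAAAT"}

merged = merge_taxonomy_and_sequences(taxonomy, sequences)
print(to_fasta(merged, ["feat2", "feat3"]))
```

Keeping the taxonomy in the FASTA header makes it easy to compare each BLAST hit against the original classification.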

Not viable, because these sequences are not necessarily related to each other.

If you are getting large numbers of sequences that are only assigned as "Bacteria", re-running these through BLAST is probably NOT a good solution. Most likely (in order of likelihood):

1. These are non-target DNA, e.g., host DNA.
2. You are not using the correct reference database.
3. (Much less likely) You need to adjust your taxonomy classifier parameters.

I would bet on #1; using metadata merge and then running a few of these sequences through NCBI BLAST can help confirm. Otherwise, check that you are using the correct reference database (e.g., don't use V4 sequences to classify V3 sequences!).

I hope that helps!