problem with taxa summary (getting too many k__Bacteria)

SoilRotifer · August 8, 2020, 9:06pm

The issue is that Greengenes does not contain any eukaryotic reference sequences. As a result, the classifier may get confused and treat anything it cannot "figure out" (e.g. eukaryotes, viruses, fungi, whatever) as an unknown bacterium, an unknown archaeon, or an unclassified sequence. A good reference database should include outgroup / decoy / off-target taxa so that the classifier can properly inform you when those sequences are not bacteria / archaea. Then you can filter out any unwanted features.
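To get a rough sense of how many features are stuck at the kingdom level, you could tally the deepest named rank in each taxonomy string. This is only a minimal sketch: it assumes Greengenes-style labels (`k__`, `p__`, ...) separated by semicolons, and the function names and example feature IDs are made up for illustration, not part of any QIIME 2 API.

```python
from collections import Counter


def deepest_rank(taxon):
    """Return the prefix (e.g. 'k', 'p') of the deepest rank that has a name.

    Strings with no 'prefix__name' level at all (e.g. 'Unassigned')
    are reported as 'unassigned'.
    """
    deepest = "unassigned"
    for level in taxon.split(";"):
        prefix, sep, name = level.strip().partition("__")
        # Only count levels that carry an actual name after the prefix.
        if sep and name:
            deepest = prefix
    return deepest


def summarize(taxonomy):
    """Count features by the deepest rank they were classified to."""
    return Counter(deepest_rank(t) for t in taxonomy.values())


# Hypothetical feature IDs and taxonomy strings, for illustration only.
example = {
    "feat1": "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria",
    "feat2": "k__Bacteria",
    "feat3": "k__Bacteria; p__; c__",
    "feat4": "Unassigned",
}

counts = summarize(example)
print(counts)
```

If a large share of your features end up under `k`, that is consistent with off-target sequences being lumped in as unknown bacteria, although it does not prove it on its own.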

Even if you are not interested in these groups, you should always make sure your reference database includes outgroup taxa (e.g. Eukaryota). I recommend using SILVA for this reason. SILVA is also more up to date and contains far more bacterial and archaeal reference sequences. One last bit: you'll obtain far better taxonomic resolution using unclustered or minimally clustered (i.e. 99% - 100%) reference databases as opposed to ones clustered at 97%. Before you remove these sequences, I suggest that you re-classify with SILVA. Then you can filter out anything you do not need for your study, like the plastid sequences.
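Once you have the new classification, the filtering step amounts to dropping every feature whose taxonomy string matches an off-target term. Here is a minimal sketch of that logic in plain Python. The search terms (I added mitochondria and Eukaryota alongside the plastids as an assumption), the function names, and the example taxonomy strings are all illustrative, so adjust them to whatever labels your reference database actually uses.

```python
# Assumed off-target terms; matched case-insensitively as substrings.
DEFAULT_EXCLUDE = ("chloroplast", "mitochondria", "eukaryota")


def is_off_target(taxon, exclude=DEFAULT_EXCLUDE):
    """True if the taxonomy string contains any of the excluded terms."""
    lowered = taxon.lower()
    return any(term in lowered for term in exclude)


def split_features(taxonomy, exclude=DEFAULT_EXCLUDE):
    """Split a {feature_id: taxonomy} mapping into (kept, removed)."""
    kept, removed = {}, {}
    for feature_id, taxon in taxonomy.items():
        target = removed if is_off_target(taxon, exclude) else kept
        target[feature_id] = taxon
    return kept, removed


# Hypothetical feature IDs and taxonomy strings, for illustration only.
example = {
    "feat1": "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
    "feat2": "d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast",
    "feat3": "d__Eukaryota",
}

kept, removed = split_features(example)
print(sorted(kept))     # feature IDs retained
print(sorted(removed))  # feature IDs flagged as off-target
```

Keeping the removed set around (rather than just discarding it) lets you check how much of your data was off-target before you commit to the filter.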

-Best wishes.
-Mike