In my BIOM table, I have some unclassified bacterial taxa (k__Bacteria). Is this normal? I also BLASTed a few of them, and they had taxonomic information in NCBI. Here is what I ran to get the table. Please also see the attached file.
More details about my taxa information: I have 6,930 OTUs (ASVs) with about 16,000,000 reads in total, and 400 of them (2,635,985 reads, almost 16% of the reads) are assigned only to k__Bacteria. I also BLASTed some of the k__Bacteria sequences (please see the picture). For the reference I used gg_13_8_otus. Some of the NCBI BLAST hits are not even bacteria.
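For anyone wanting to reproduce the percentages above from their own exported table, here is a minimal sketch. It assumes the feature table and taxonomy have already been exported to plain Python dicts; the function names and the toy data are hypothetical, not part of QIIME 2.

```python
def assigned_ranks(taxon):
    """Split a taxonomy string and drop empty ranks such as 'p__'."""
    parts = [p.strip() for p in taxon.split(";")]
    return [p for p in parts if p and not p.endswith("__")]


def kingdom_only_share(feature_reads, taxonomy, label="k__Bacteria"):
    """Return (n_features, n_reads, fraction_of_reads) for features
    whose only assigned rank is `label`."""
    total = sum(feature_reads.values())
    hits = [f for f in feature_reads
            if assigned_ranks(taxonomy.get(f, "")) == [label]]
    reads = sum(feature_reads[f] for f in hits)
    return len(hits), reads, (reads / total if total else 0.0)


# Toy data (hypothetical feature IDs and counts).
feature_reads = {"asv1": 700, "asv2": 200, "asv3": 100}
taxonomy = {
    "asv1": "k__Bacteria; p__Firmicutes",
    "asv2": "k__Bacteria",
    "asv3": "k__Bacteria; p__; c__; o__; f__; g__; s__",
}
print(kingdom_only_share(feature_reads, taxonomy))

# The figures quoted in this thread:
print(round(400 / 6930, 3))              # share of features -> 0.058
print(round(2_635_985 / 16_000_000, 3))  # share of reads    -> 0.165
```

Note that the two shares differ a lot: roughly 6% of the features carry roughly 16% of the reads, so it is worth stating which one is meant.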
The resolution of the screenshots is a bit low and I am unable to read the output, but I assume some are Eukaryota? I would sanity-check this by classifying against the SILVA 138 reference sequences, as this is a larger reference set that also includes eukaryotes. The classifiers, and the files used to make them, are located here.
If you are interested in making your own SILVA reference files, consider making use of RESCRIPt.
Thank you so much for your reply. As for the resolution, I think it will be better if you zoom in. Yes, some of them are Eukaryota. I already removed the mitochondria and chloroplast sequences, and I am not interested in Eukaryota either, but some of the hits are bacteria (such as uncultured bacteria). My question is: if they are Eukaryota, why are they labelled as bacteria, and why did QIIME not filter them out earlier? Should I be worried about these taxa, or can I simply filter them out like the mitochondria and chloroplast sequences? Also, the number of reads for these taxa is fairly high (almost 16%).
The issue is that Greengenes does not contain any Eukaryota reference sequences. So… the classifier may get confused and label anything it cannot “figure out” (e.g. eukaryotes, viruses, fungi, whatever) as an unknown bacterium, an unknown archaeon, or an unclassified sequence. A good reference database should contain outgroup / decoy / off-target taxa so that it can properly tell you when sequences are not bacteria / archaea. Then you can filter out any unwanted features.
Even if you are not interested in these groups, you should always make sure you have outgroup taxa in your reference database (e.g. Eukaryota). I recommend using SILVA for this reason. SILVA is also more up-to-date and contains far more bacterial and archaeal reference sequences. One last bit: you’ll obtain far better taxonomic resolution using unclustered or minimally clustered (i.e. 99% to 100%) reference databases, as opposed to ones clustered at 97%. Before you remove these sequences, I suggest that you re-classify with SILVA. Then you can filter out anything you do not need for your study, like the plastid sequences.
So I ran these commands, and it took almost a day to finish (please see below). I downloaded these reference sequence and taxonomy files as my reference (see below). The taxa information is almost the same, but SILVA was better. I still have 150 features assigned only to k__Bacteria in my table (it was close to 400 before), and the remaining 250 are now classified as either Eukaryota or Archaea. And yes, there is more taxa information than before.
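A quick way to see where the previously unresolved features ended up is to tally the top-level label before and after re-classification. The sketch below assumes taxonomy strings with `k__` (Greengenes-style) or `d__` (SILVA-style) prefixes; the helper names and the toy lists are hypothetical and only illustrate the idea.

```python
from collections import Counter


def top_level(taxon):
    """Return the first rank of a taxonomy string without its prefix,
    e.g. 'd__Eukaryota; p__...' -> 'Eukaryota'."""
    first = taxon.split(";")[0].strip()
    if "__" in first:
        first = first.split("__", 1)[1]
    return first or "Unassigned"


def tally(taxa):
    """Count features per top-level label."""
    return Counter(top_level(t) for t in taxa)


# Toy example: the same three features under two reference databases.
before = ["k__Bacteria", "k__Bacteria; p__", "k__Bacteria"]
after = ["d__Bacteria", "d__Eukaryota; p__Arthropoda", "d__Archaea"]

print(tally(before))
print(tally(after))
```

With real data, the lists would be the taxonomy column restricted to the features that were assigned only to k__Bacteria in the first run.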
Hi @Javad32, it’d be better to see the barplot for this, but it looks like it worked really well. Considering the amount of data you have, that many ‘unclassified’ reads is not uncommon. Some appear to have poor BLAST scores anyway, which likely means they are simply poor reads.
Depending on what you’re after, you may or may not want to filter those. Quite often, I remove any data that does not have at least a phylum-level taxonomy.
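As an illustration of that phylum-level rule, here is a minimal sketch that works on an exported taxonomy mapping. It assumes phylum ranks are written with a `p__` prefix, as in Greengenes and the QIIME-formatted SILVA files; the function names and the toy data are hypothetical. Within QIIME 2 itself, as far as I know, the usual route is `qiime taxa filter-table` with `--p-include p__`.

```python
def has_phylum(taxon):
    """True if the taxonomy string contains a non-empty phylum rank."""
    for rank in taxon.split(";"):
        rank = rank.strip()
        # 'p__' alone is an empty placeholder, not an assignment.
        if rank.startswith("p__") and len(rank) > len("p__"):
            return True
    return False


def keep_with_phylum(taxonomy):
    """Keep only features annotated to at least phylum level."""
    return {fid: tax for fid, tax in taxonomy.items() if has_phylum(tax)}


# Toy data (hypothetical feature IDs).
taxonomy = {
    "asv1": "d__Bacteria; p__Firmicutes; c__Bacilli",
    "asv2": "d__Bacteria",
    "asv3": "k__Bacteria; p__; c__",
    "asv4": "Unassigned",
}
print(sorted(keep_with_phylum(taxonomy)))  # ['asv1']
```

The kept feature IDs can then be used to subset the table; features dropped here are the ones worth a second look with BLAST before discarding them for good.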