Problem with taxa summary (getting too many k__Bacteria)

Hello All

In my BIOM table, I have some unclassified bacterial taxa (assigned only to k__Bacteria). Is this normal? I also BLASTed a few of them, and they do have taxonomic information in NCBI. Below are the commands I ran to get the table. Please also see the attached file.


Kind regards,

Javad

16- qiime tools import --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat --input-path 99_otu_taxonomy.txt --output-path ref-taxonomy.qza

17- qiime feature-classifier extract-reads --i-sequences 99_otus.qza --p-f-primer ATTAGATACCCNGGTAG --p-r-primer CGACAGCCATGCANCACCT --p-trunc-len 280 --p-min-length 100 --p-max-length 400 --o-reads ref-seqs.qza

18- qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads ref-seqs.qza --i-reference-taxonomy ref-taxonomy.qza --o-classifier classifier.qza

19- qiime feature-classifier classify-sklearn --i-classifier classifier.qza --i-reads rep-seqs.qza --o-classification taxonomy.qza

20- qiime metadata tabulate --m-input-file taxonomy.qza --o-visualization taxonomy.qzv

21- qiime taxa barplot --i-table table.qza --i-taxonomy taxonomy.qza --m-metadata-file sample-metadata.tsv --o-visualization taxa-bar-plots.qzv

22- qiime tools export --input-path table.qza --output-path exported

23- qiime tools export --input-path taxonomy.qza --output-path exported

24- cp exported/taxonomy.tsv biom-taxonomy.tsv

25- Open biom-taxonomy.tsv and change its first line (i.e. the header) to the following, with the three fields separated by tabs:

#OTUID taxonomy confidence

26- biom add-metadata -i exported/feature-table.biom -o table-with-taxonomy.biom --observation-metadata-fp biom-taxonomy.tsv --sc-separated taxonomy

27- biom convert -i table-with-taxonomy.biom -o table.from_biom_w_taxonomy.txt --to-tsv --header-key taxonomy
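The manual header edit in step 25 can also be scripted. Below is a minimal Python sketch, assuming the exported taxonomy.tsv starts with a single header line; the function name `with_biom_header` and the sample feature ID are made up for illustration.

```python
# Header line that biom add-metadata is given in step 25 (tab-separated).
BIOM_HEADER = "#OTUID\ttaxonomy\tconfidence"


def with_biom_header(taxonomy_tsv_text):
    """Return the TSV text with its first line replaced by BIOM_HEADER."""
    lines = taxonomy_tsv_text.splitlines()
    if not lines:
        raise ValueError("taxonomy file is empty")
    lines[0] = BIOM_HEADER
    return "\n".join(lines) + "\n"


# Tiny in-memory example; the feature ID and taxon string are invented.
sample = "Feature ID\tTaxon\tConfidence\nfeat1\tk__Bacteria; p__Firmicutes\t0.98\n"
converted = with_biom_header(sample)
```

To apply it to the real files, read exported/taxonomy.tsv into a string, pass it through the function, and write the result to biom-taxonomy.tsv.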

More details about my taxa information: I have 6930 OTUs (ASVs) with about 16,000,000 reads in total, and 400 of them (2,635,985 reads, about 16% of all reads) are classified only as k__Bacteria. I also BLASTed some of the k__Bacteria sequences (please see the picture). For the reference I used gg_13_8_otus. Some of the NCBI BLAST hits are not even bacteria.

Hi @Javad32,

The resolution of the screenshots is a bit low and I am unable to read the output, but I assume some are Eukaryota? I would sanity-check this by classifying against the SILVA 138 reference sequences, as this is a larger reference set that also includes eukaryotes. The classifiers and the files used to make them are located here.

If you are interested in making your own SILVA reference files, consider making use of RESCRIPt.

-Best
-Mike

Thank you so much for your reply. The screenshots should be easier to read if you zoom in. Yes, some of them are Eukaryota. I have already removed the mitochondria and chloroplast sequences, and I am not interested in Eukaryota either, but some of the hits are bacteria (such as unculturable bacteria). My question is: if they are Eukaryota, why are they labelled as bacteria, and why did QIIME not filter them out earlier? Should I be worried about these taxa, or can I simply filter them out like the mitochondria and chloroplast sequences? Also, the number of reads for these taxa is fairly high (almost 16%).

Hi @Javad32,

The issue is that Greengenes does not contain any eukaryotic reference sequences. So... the classifier may get confused and label anything it cannot "figure out" (e.g. eukaryotes, viruses, fungi, whatever) as an unknown bacterium, an unknown archaeon, or an unclassified sequence. A good reference database should have outgroup / decoy / off-target taxa so that it can properly inform you when sequences are not bacteria / archaea. Then you can filter out any unwanted features.

Even if you are not interested in these groups, you should always make sure you have outgroup taxa in your reference database (e.g. Eukaryota). I recommend using SILVA for this reason. SILVA is also more up to date and contains far more bacterial and archaeal reference sequences. One last bit: you'll obtain far better taxonomic resolution using unclustered or minimally clustered (i.e. 99% - 100%) reference databases as opposed to 97%. Before you remove these sequences, I suggest that you re-classify with SILVA. Then you can filter out anything you do not need for your study, like the plastid sequences.
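To see how much of a table ends up in these catch-all labels, you can tally the taxonomy strings by their top-level label and count how many stop there. Here is a minimal Python sketch, assuming the taxonomy has already been loaded into a dict mapping feature ID to taxon string (for example from an exported taxonomy.tsv). The function names and the sample IDs are made up for illustration, and treating labels such as "p__" as empty placeholders is an assumption about how the reference formats missing ranks.

```python
from collections import Counter


def informative_ranks(taxon):
    """Split a taxon string on ';' and keep only ranks that carry a name."""
    ranks = []
    for part in taxon.split(";"):
        label = part.strip()
        # Skip blanks and empty placeholders such as "p__".
        if label and not label.endswith("__"):
            ranks.append(label)
    return ranks


def summarize_top_level(taxonomy):
    """Count features per top-level label, and how many have no deeper rank."""
    per_top = Counter()
    top_only = Counter()
    for taxon in taxonomy.values():
        ranks = informative_ranks(taxon)
        top = ranks[0] if ranks else "Unassigned"
        per_top[top] += 1
        if len(ranks) <= 1:
            top_only[top] += 1
    return per_top, top_only


# Invented example labels in the SILVA-style "d__; p__" format.
taxonomy = {
    "f1": "d__Bacteria",
    "f2": "d__Bacteria; p__Firmicutes",
    "f3": "d__Eukaryota",
    "f4": "d__Archaea; p__Euryarchaeota",
    "f5": "Unassigned",
}
per_top, top_only = summarize_top_level(taxonomy)
```

Running the same tally on the Greengenes and SILVA classifications side by side shows which kingdom-only features moved to Eukaryota or Archaea once outgroup references were available.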

-Best wishes.
-Mike


Dear Mike,

So I ran the commands below, and they took almost one day to finish. I downloaded the reference sequence and taxonomy files listed below to use as my reference. The taxonomic assignments are almost the same, but SILVA was better. I still have 150 features classified only as k__Bacteria in my table (it was close to 400 before), and the remaining 250 are now classified as either Eukaryota or Archaea. And yes, there is more taxonomic information than before.
Thank you,

Javad

https://data.qiime2.org/2020.6/common/silva-138-99-seqs.qza
https://data.qiime2.org/2020.6/common/silva-138-99-tax.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier.qza

qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

Hi @Javad32, it’d be better to see the barplot for this, but it looks like it worked really well. Considering the amount of data you have, that many ‘unclassified’ reads is not uncommon. Some appear to have poor BLAST scores anyway, likely meaning they are simply poor reads.

Depending on what you’re after, you may or may not want to filter those. Quite often, I remove any data that does not have at least a phylum-level taxonomy.
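As a rough illustration of that phylum-level filter, here is a minimal Python sketch that works on plain dicts rather than QIIME 2 artifacts. It assumes per-feature read totals and taxon strings are already loaded, and that phylum ranks carry a "p__" prefix as in Greengenes and the QIIME-formatted SILVA files; the function names and numbers are invented. On the artifacts themselves, `qiime taxa filter-table` with `--p-include p__` should do the equivalent job.

```python
def has_phylum(taxon):
    """True if the taxon string contains a non-empty phylum rank ('p__Name')."""
    for part in taxon.split(";"):
        label = part.strip()
        if label.startswith("p__") and len(label) > len("p__"):
            return True
    return False


def drop_without_phylum(read_counts, taxonomy):
    """Keep features with a phylum label; also return the fraction of reads dropped."""
    kept = {
        feature: reads
        for feature, reads in read_counts.items()
        if has_phylum(taxonomy.get(feature, ""))
    }
    total = sum(read_counts.values())
    dropped = total - sum(kept.values())
    fraction_dropped = dropped / total if total else 0.0
    return kept, fraction_dropped


# Invented example: three features and their total read counts.
read_counts = {"f1": 800, "f2": 150, "f3": 50}
taxonomy = {
    "f1": "d__Bacteria; p__Firmicutes",
    "f2": "d__Bacteria",
    "f3": "Unassigned",
}
kept, fraction_dropped = drop_without_phylum(read_counts, taxonomy)
```

Checking the dropped fraction before committing to the filter shows how many reads it would cost.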

-Mike

