Poor taxonomic classification of sputum samples

Hello, I have been analyzing TB sputum samples. I have trained the naive Bayes classifier on both SILVA (silva_132_97_16S.fna) and Greengenes (85_otus.fasta) and used each of these to classify my samples. Although I get up to 97% classification at the kingdom level, I barely reach 5% classification at phylum and below.

Here is the code I used:

qiime tools import --type 'FeatureData[Sequence]' --input-path SILVA_TAXONOMIC_DB/SILVA_132_QIIME_release/rep_set/rep_set_16S_only/97/silva_132_97_16S.fna --output-path silva_132_97_16S.qza

qiime tools import --type 'FeatureData[Taxonomy]' --source-format HeaderlessTSVTaxonomyFormat --input-path SILVA_TAXONOMIC_DB/SILVA_132_QIIME_release/taxonomy/16S_only/97/taxonomy_all_levels.txt --output-path ref-taxonomy.qza

Extract reads

qiime feature-classifier extract-reads --i-sequences silva_132_97_16S.qza --p-f-primer GTGYCAGCMGCCGCGGTAA --p-r-primer GGACTACNVGGGTWTCTAAT --p-trunc-len 120 --o-reads ref-seqs.qza

Train Classifier

qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads ref-seqs.qza --i-reference-taxonomy ref-taxonomy.qza --o-classifier classifier.qza

Test the classifier

qiime feature-classifier classify-sklearn --i-classifier classifier.qza --i-reads denoise_output/representative_sequences.qza --o-classification Classify_ref-taxonomy.qza

qiime metadata tabulate --m-input-file Classify_ref-taxonomy.qza --o-visualization Classify_ref-taxonomy.qzv

Bar plot of taxonomic classification

qiime taxa barplot --i-table table-dada2.qza --i-taxonomy Classify_ref-taxonomy.qza --m-metadata-file Kateetemetadata.txt --o-visualization Classify_taxa-bar-plots.qzv

Is there a way I can improve my classification?

Hi @Adrian_Muwonge,
Sorry to hear you are having trouble! 99% of the time when classification is only achieved at kingdom level, it is due to human error (e.g., using the wrong reference database for a given set of query sequences), and the other 1% of the time it is caused by technical error (usually the sequences are in mixed orientation or something is wrong with the reference sequences, causing the classifier to become very confused).

So I apologize but I have to ask some very very silly questions:

  1. Since you are using SILVA and greengenes I am guessing that you have 16S data but just want to make sure.
  2. Are GTGYCAGCMGCCGCGGTAA and GGACTACNVGGGTWTCTAAT the primers that you used to amplify your sequences? Did you truncate your sequences to 120 nt when you denoised your sequences? Because I notice that your extract-reads command is suspiciously similar to the one in the tutorial. Just want to make sure you are using the correct primers and not just plugging in the exact commands from the tutorial.
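One quick way to check the truncation question is to compare the lengths of the denoised representative sequences with the --p-trunc-len value given to extract-reads. The following is a minimal shell sketch, assuming the representative sequences have already been exported to a plain FASTA file. The function name fasta_length_counts is made up for illustration and is not part of QIIME 2.

```shell
# Print "length<TAB>count" for every sequence length found in a FASTA file.
# $1: path to a FASTA file (sequences may span one or several lines).
fasta_length_counts() {
  awk '
    /^>/ {                          # header line: close the previous record
      if (seen) counts[len]++
      len = 0; seen = 1; next
    }
    { len += length($0) }           # sequence line: add to the current length
    END {
      if (seen) counts[len]++       # close the last record
      for (l in counts) print l "\t" counts[l]
    }
  ' "$1" | sort -n
}
```

If most query sequences turn out to be much longer than the length the reference reads were truncated to, the classifier was trained on reads that do not resemble the queries.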

Why are you using the 85_otus.fasta? Did you see the note in this section of the tutorial? That file will not give useful classifications and is only used for demonstration purposes in the tutorial.

Maybe you can share the Classify_taxa-bar-plots.qzv file so we can take a look?



Hello Nicholas, indeed I am more than certain that it is human error, but I am trying to figure out where that error is coming from. Yes, I am dealing with 16S data; I was trying both databases to see if there is a difference between them. At this point it is exploratory, just to ground myself in how to do this, which would explain why the code is suspiciously similar to the tutorial. The primers are the ones used, but the denoising was truncated at 220, so this could be the reason… I will sort this out and rerun the analysis. See the file requested.

Thanks for confirming!

Again, sorry but those are the silly questions I need to ask to get started.

Yes, that is a very strong possibility. Re-run with appropriate trimming (or no trimming) and let us know if that fixes things!

Classify_taxa-bar-plots.qzv (470.3 KB)
Classify_ref-taxonomy.qzv (1.2 MB)

Hello Nicholas, there is an improvement but not a massive one; I still have about 90% not classified beyond kingdom. The attached files were generated when I classified using SILVA. I will try again with Greengenes and get back to you. Let me know what you think.



Thanks for sharing. Indeed, that really does not look too good.

As a test, could you also try one or more of the pre-trained classifiers? These should work fine for your data.

You could also try classify-consensus-vsearch instead to get a “second opinion” (this will help narrow down whether this is an issue with the sequences or with the classifier).

Are your data by any chance mixed-orientation reads? The classify-sklearn classifier cannot handle those, currently, and will result in poor classification such as you are seeing.


vsearch_taxa-bar-plots.qzv (490.5 KB)

Nicholas, even with classify-consensus-vsearch the picture is similar; of course, here we have quite a lot unclassified. I will try the pre-trained classifier before I start investigating the orientation of the reads.

Thank you


Thank you for testing that.

If vsearch gives the same classifications, then read orientation is not the issue.

It also looks like you used full-length 16S as a reference for vsearch? (if you did, good — then trimming is not the issue either). If so, the pre-trained classifiers will not help either.

At this point I think the most likely possibilities are:

  1. You have a large proportion of non-target DNA (e.g., host DNA)
  2. There is something wrong with your query sequences (e.g., adapters or other non-biological DNA in the sequences).

Check out this post, particularly the part about checking some of the unassigned sequences with NCBI blast… that will help figure out if you have some non-target DNA.
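To pick out sequences worth pasting into NCBI BLAST, something like the sketch below could help. It assumes the classification and the representative sequences have each been exported to plain files: a taxonomy TSV with a header row and the columns Feature ID, Taxon and Confidence, and a FASTA file. It treats any taxon string without a ";" as "not classified beyond kingdom", which is an assumption about how the taxonomy strings are formatted. The function name kingdom_only_fasta is hypothetical.

```shell
# Print FASTA records for features classified no deeper than kingdom.
# $1: taxonomy TSV (Feature ID <TAB> Taxon <TAB> Confidence, with header row)
# $2: FASTA file whose record IDs match the Feature ID column
kingdom_only_fasta() {
  awk -F'\t' '
    NR == FNR {                          # first file: taxonomy table
      # No ";" in the taxon means a single rank or "Unassigned" (assumption).
      if (FNR > 1 && $2 !~ /;/) keep[$1] = 1
      next
    }
    /^>/ {                               # second file: FASTA header line
      id = substr($1, 2)                 # drop the leading ">"
      sub(/ .*/, "", id)                 # drop any description after the ID
      printing = (id in keep)
    }
    printing                             # print header and sequence lines of kept records
  ' "$1" "$2"
}
```

A handful of the records it prints is usually enough to see whether the hits are host DNA, adapters, or genuine 16S.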

Let us know what you find!

Hello Nicholas, like you said in your first sentence, 99% of these things are usually human error. I found the error: the trimming I had originally done with dada2 denoise was trimming 20 bp more from the forward read (--p-trunc-len-f 220 --p-trunc-len-r 200), and looking at the QC it had better sequence quality (so I was trimming more of the good stuff). I have rerun this using the code below and I can now map 99% of all my reads to taxa as low as level 7.

time qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --o-table table-dada2 \
  --output-dir denoise_output \
  --p-trim-left-f 9 \
  --p-trim-left-r 9 \
  --p-trunc-len-f 220 \
  --p-trunc-len-r 220 \
  --p-n-threads 30 \
  --p-n-reads-learn 200000


Thank you for reporting back!

Very interesting… truncation length usually does not cause such dramatic differences (at least not a 20 nt difference, maybe a 200nt difference!). Since you have paired-end data, I suspect what occurred here is the difference in trunc length may have impacted the quality of the paired-read alignments, resulting in poor classification.
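For anyone wanting to sanity-check truncation settings against read merging, here is a rough shell sketch of the arithmetic. It assumes both reads reach their full truncation length and that the amplicon length is known; that length is not given in this thread and has to be supplied by the reader. The default minimum overlap of 12 nt is also an assumption and should be checked against the DADA2 version in use. Both function names are made up for illustration.

```shell
# Estimated overlap (nt) between the truncated forward and reverse reads.
# $1: amplicon length in nt (reader-supplied)
# $2: value of --p-trunc-len-f
# $3: value of --p-trunc-len-r
paired_overlap() {
  echo $(( $2 + $3 - $1 ))
}

# Succeeds (exit status 0) when the estimated overlap reaches the minimum.
# $4: minimum overlap in nt; defaults to 12 (assumed, verify for your version).
has_enough_overlap() {
  [ "$(paired_overlap "$1" "$2" "$3")" -ge "${4:-12}" ]
}
```

Usage would look like `has_enough_overlap <amplicon_length> 220 200 && echo ok`. This only estimates overlap length; it says nothing about the quality of the bases inside the overlap, which also affects merging.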

Very glad to hear you got this working!
