Hello, I have been analyzing TB sputum samples. I have trained the naive Bayes classifier on both SILVA (silva_132_97_16S.fna) and Greengenes (85_otus.fasta) and used each of these to classify my samples. Although I get up to 97% classification at kingdom level, I barely reach 5% classification at phylum level and below.
Hi @Adrian_Muwonge,
Sorry to hear you are having trouble! 99% of the time when classification is only achieved at kingdom level, it is due to human error (e.g., using the wrong reference database for a given set of query sequences), and the other 1% of the time it is caused by technical error (usually the sequences are in mixed orientation or something is wrong with the reference sequences, causing the classifier to become very confused).
So I apologize but I have to ask some very very silly questions:
Since you are using SILVA and greengenes I am guessing that you have 16S data but just want to make sure.
Are GTGYCAGCMGCCGCGGTAA and GGACTACNVGGGTWTCTAAT the primers that you used to amplify your sequences? Did you truncate your sequences to 120 nt when you denoised your sequences? Because I notice that your extract-reads command is suspiciously similar to the one in the tutorial. Just want to make sure you are using the correct primers and not just plugging in the exact commands from the tutorial.
Why are you using the 85_otus.fasta? Did you see the note in this section of the tutorial? That file will not give useful classifications and is only used for demonstration purposes in the tutorial.
Maybe you can share the Classify_taxa-bar-plots.qzv file so we can take a look?
Hello Nicholas, indeed I am more than certain that it is human error, but I am trying to figure out where that error is coming from. Yes, I am dealing with 16S data; I was trying to use both databases to see if there is a difference between them. At this point it is exploratory, just to ground myself in how to do this, which would explain why the code is suspiciously similar to the tutorial. The primers are the ones used. The denoising was truncated at 220, so this could be the reason. I will sort this out and rerun the analysis. See the requested file.
Hello Nicholas, there is an improvement but not a massive one: I still have about 90% not classified beyond kingdom. The attached files were generated when I classified using SILVA. I will try again with Greengenes and get back to you. Let me know what you think.
Thanks for sharing. Indeed, that really does not look too good.
As a test, could you also try one or more of the pre-trained classifiers? These should work fine for your data.
You could also try classify-consensus-vsearch instead to get a "second opinion" (this will help narrow down whether this is an issue with the sequences or with the classifier).
Are your data by any chance mixed-orientation reads? The classify-sklearn classifier cannot handle those, currently, and will result in poor classification such as you are seeing.
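One rough way to screen for mixed orientation is to compare each read, and its reverse complement, against a reference sequence that is known to be in the forward orientation. The Python sketch below does this by counting shared k-mers. It is only an illustration: the function names, the choice of k = 8, and the toy reference sequence are assumptions, and it is not how classify-sklearn or vsearch handle orientation.

```python
# Sketch: guess read orientation by k-mer sharing with a forward-oriented reference.
# All names and the k-mer size are illustrative assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (non-ACGT characters are kept as-is)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def kmers(seq, k=8):
    """Return the set of all k-mers in seq (empty if seq is shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def guess_orientation(read, reference, k=8):
    """Label a read 'forward', 'reverse', or 'unknown' relative to the reference."""
    ref_kmers = kmers(reference, k)
    fwd_hits = len(kmers(read, k) & ref_kmers)
    rev_hits = len(kmers(reverse_complement(read), k) & ref_kmers)
    if fwd_hits > rev_hits:
        return "forward"
    if rev_hits > fwd_hits:
        return "reverse"
    return "unknown"


# Toy reference fragment, for illustration only.
reference = "TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGT"
read = reference[4:40]
print(guess_orientation(read, reference))
print(guess_orientation(reverse_complement(read), reference))
```

If a substantial fraction of reads comes back as "reverse", the data are probably mixed-orientation and would need to be reoriented before classification.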
Nicholas, even with classify-consensus-vsearch the picture is similar; of course, here we have quite a lot unclassified. I will try the pre-trained classifier before I start investigating the orientation of the reads.
If vsearch gives the same classifications, then read orientation is not the issue.
It also looks like you used full-length 16S as a reference for vsearch? (if you did, good — then trimming is not the issue either). If so, the pre-trained classifiers will not help either.
At this point I think the most likely possibilities are:
You have a large proportion of non-target DNA (e.g., host DNA)
There is something wrong with your query sequences (e.g., adapters or other non-biological DNA in the sequences).
Check out this post, particularly the part about checking some of the unassigned sequences with NCBI blast... that will help figure out if you have some non-target DNA.
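To pick out which sequences to BLAST, it can help to list the features whose taxonomy stops at kingdom level. The Python sketch below assumes the taxonomy has already been exported into a mapping of feature ID to taxonomy string, in either the Greengenes style (`k__Bacteria; p__...`) or the SILVA 132 style (`D_0__Bacteria;D_1__...`). The function names and the feature IDs are hypothetical.

```python
# Sketch: find features that are classified no deeper than kingdom level.
# Function names and the example feature IDs are hypothetical.

def taxonomy_depth(taxon):
    """Count the leading ranks that carry a real name (e.g. 'k__Bacteria; p__' has depth 1)."""
    depth = 0
    for level in taxon.split(";"):
        level = level.strip()
        name = level.split("__", 1)[-1]  # drop a rank prefix such as 'k__' or 'D_0__'
        if not name or level == "Unassigned":
            break
        depth += 1
    return depth


def shallow_feature_ids(taxonomy, max_depth=1):
    """Return IDs of features whose classification depth is at most max_depth."""
    return [fid for fid, taxon in taxonomy.items()
            if taxonomy_depth(taxon) <= max_depth]


taxonomy = {
    "feature_a": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "feature_b": "k__Bacteria; p__; c__",
    "feature_c": "Unassigned",
    "feature_d": "D_0__Bacteria;D_1__Proteobacteria",
}
print(shallow_feature_ids(taxonomy))
```

The sequences for the returned IDs can then be pulled from the representative sequences and submitted to NCBI BLAST a few at a time.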
Hello Nicholas, like you said in your first sentence, 99% of these things are usually human error. I found the error: the trimming I had originally done with dada2 denoise (--p-trunc-len-f 220 --p-trunc-len-r 200) was trimming 20 bp more from the forward read, and looking at the QC, it had better sequence quality (so I was trimming more of the good stuff). I have rerun this using the code below and I can now map 99% of all my reads to taxa as low as level 7.
Very interesting... truncation length usually does not cause such dramatic differences (at least not a 20 nt difference; maybe a 200 nt difference!). Since you have paired-end data, I suspect that the difference in truncation length affected the quality of the paired-read alignments, resulting in poor classification.
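For paired-end reads, the truncation lengths determine how many bases the forward and reverse reads share when they are merged. A back-of-the-envelope check in Python is sketched below. The amplicon length of 253 nt and the minimum overlap of 12 nt are assumptions for illustration (the true amplicon length depends on the primers and organism, and the merge threshold depends on the denoiser settings), so plug in your own values.

```python
# Sketch: expected overlap between truncated forward and reverse reads.
# AMPLICON_LEN and min_overlap are assumed values, not taken from this dataset.

def expected_overlap(trunc_len_f, trunc_len_r, amplicon_len):
    """Bases covered by both reads after truncation (negative means a gap)."""
    return trunc_len_f + trunc_len_r - amplicon_len


def can_merge(trunc_len_f, trunc_len_r, amplicon_len, min_overlap=12):
    """True if the truncated reads overlap by at least min_overlap bases."""
    return expected_overlap(trunc_len_f, trunc_len_r, amplicon_len) >= min_overlap


AMPLICON_LEN = 253  # assumed amplicon length in nt

for trunc_f, trunc_r in [(220, 200), (120, 120)]:
    overlap = expected_overlap(trunc_f, trunc_r, AMPLICON_LEN)
    print(trunc_f, trunc_r, overlap, can_merge(trunc_f, trunc_r, AMPLICON_LEN))
```

Under these assumed values, truncating at 220/200 leaves plenty of overlap, whereas truncating both reads at 120 would leave a gap and the pairs could not be merged. Overlap alone does not guarantee a good merge, since low-quality bases in the overlapping region can still cause mismatches.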