Hello,
I have created a custom database using SILVA, RDP, and UNITE sequences. In it, the reads are filtered based on taxonomy and include only those with full taxonomic nomenclature (down to at least the family level). I was able to train the classifier (it ran for a number of days on my organization’s HPC) using qiime2 version 2020.2.0. Along with the combined DB (Archaea, Bacteria, and Fungi), I have also created individual Bacteria and Fungi DBs. Attachments: Blast_results.csv (3.7 KB), REpresentative_Seqs.txt (11.9 KB).
For testing, I used a mock dataset of 16S and ITS sequences. For comparison, I also used the SILVA and UNITE DBs downloaded from their websites.
Unfortunately, the SILVA and UNITE DBs identified every genus present in the mock dataset, whereas the custom database did not. It is sensitive enough to assign the correct kingdom, but it failed to identify the correct genus, and for a few representative reads it was wrong at the family level as well. The results were consistent across the curated all-in-one DB (Archaea, Bacteria, and Fungi), the Bacteria-only DB, and the Fungi-only DB. For troubleshooting I did the following:
1: I verified that reads of these genera are present in my custom database (yes, ample copies). Also, I used SILVA reads (those with good nomenclature), so they are the same ones that are present in the SILVA DB.
2: Re-trained the combined DB and tested it on the mock dataset. Unfortunately, I got the same results.
3: Created 2 BLAST DBs, one from the sequences of the combined (Archaea, Fungi, Bacteria) DB and the other from the Bacteria sequence DB, and checked whether I get correct hits for the reads that the qiime classifier identified wrongly. I did get correct hits (see Excel sheet).
4: Clustered the representative reads to see the % identity of the wrongly classified reads. They were more than 90%.
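For reference, the check in step 1 can be sketched roughly as below. This is only a minimal sketch, not the exact procedure I ran: it assumes the reference taxonomy has been exported to a tab-separated file with a header line and two columns (feature ID, taxonomy string), for example via `qiime tools export`. The function name `count_genus` and the file name in the usage comment are hypothetical, and the rank prefixes it strips (`g__`, `D_5__` and similar) are an assumption about how the merged SILVA/RDP/UNITE labels look.

```shell
# Count reference entries whose taxonomy string contains a given genus label.
# Assumption: tab-separated input, one header line, taxonomy in column 2,
# ranks separated by ";" and optionally prefixed (e.g. g__ or D_5__).
count_genus() {
    awk -F'\t' -v g="$2" '
        NR > 1 {
            n = split($2, ranks, ";")
            for (i = 1; i <= n; i++) {
                label = ranks[i]
                gsub(/^[ \t]+|[ \t]+$/, "", label)          # trim whitespace
                sub(/^[A-Za-z]+(_[0-9]+)?__/, "", label)    # drop rank prefix
                if (label == g) { count++; break }          # count each entry once
            }
        }
        END { print count + 0 }
    ' "$1"
}

# Hypothetical usage:
#   count_genus exported_taxonomy/taxonomy.tsv Bacillus
```

A count of zero for a genus from the mock dataset would point to a labeling or filtering problem in the reference rather than in the classifier itself.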
I used the same qiime2 version for testing as was used for classifying. Yet SILVA and UNITE seem to provide better results.
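To make the comparison between two databases concrete, something like the following sketch could list the reads whose genus differs between two classification outputs. It assumes both outputs were exported to tab-separated files with a header line, the feature ID in column 1 and the taxonomy string in column 2, and that the genus is the 6th semicolon-separated rank (kingdom through genus). That last point is an assumption and may not hold for every label format in a merged DB. The function name `diff_genus` and the file names in the usage comment are hypothetical.

```shell
# Print "feature_id <TAB> genus_in_A <TAB> genus_in_B" for reads whose
# genus differs between two exported classification tables.
# Assumption: the genus is the 6th ";"-separated rank in column 2.
diff_genus() {
    awk -F'\t' '
        function genus(tax,    ranks, n, label) {
            n = split(tax, ranks, ";")
            if (n < 6) return "unassigned"               # not resolved to genus
            label = ranks[6]
            gsub(/^[ \t]+|[ \t]+$/, "", label)           # trim whitespace
            sub(/^[A-Za-z]+(_[0-9]+)?__/, "", label)     # drop rank prefix
            return (label == "" ? "unassigned" : label)
        }
        FNR == 1 { next }                                # skip each header line
        NR == FNR { first[$1] = genus($2); next }        # remember genus from file A
        ($1 in first) && first[$1] != genus($2) {
            print $1 "\t" first[$1] "\t" genus($2)
        }
    ' "$1" "$2"
}

# Hypothetical usage:
#   diff_genus silva_classification.tsv custom_classification.tsv
```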
Here are the commands I used for training the classifier:
module load qiime2/2020.2.0
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads Arch_Bac_Fungi_all_Orig_1349660.qza \
  --i-reference-taxonomy Arch_Bac_Fungi_taxo_all_Orig_1349660.qza \
  --o-classifier Arch_Bac_Fungi_Orig_classifier_v0.1_rerun.qza \
  --verbose
This was in the log file when I ran it using --verbose:
/opt/conda/envs/qiime2-2020.2/lib/python3.6/site-packages/q2_feature_classifier/classifier.py:102: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.22.1. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
warnings.warn(warning, UserWarning)
I am also attaching a few CSV and TXT files that show the taxonomic classification output using the 4 databases:
Representative reads for reference
Comparison of all the results from each DB: Output_Comparison.csv (24.4 KB)
Blast results of reads which were wrongly classified by the above classifier.
Can you provide some thoughts on what may have gone wrong and how I can improve the sensitivity and specificity of this large Curated DB?
Thank you very much,
URB