Curated DB classifier results not good

URB · July 23, 2021, 8:31pm

Hello,

I have created a custom database using SILVA, RDP, and UNITE sequences. In this database, the reads are filtered based on taxonomy and include only those that have full taxonomic nomenclature (down to at least the Family level). I was able to train the classifier (it ran for a number of days on my organization’s HPC) using qiime2 version 2020.2.0. Along with the combined DB (Archaea, Bacteria, and Fungi), I have also created individual Bacteria and Fungi DBs.

Attachments: Blast_results.csv (3.7 KB), REpresentative_Seqs.txt (11.9 KB)

For testing purposes, I used a mock dataset of 16S and ITS sequences. For comparison, I also used the SILVA and UNITE DBs downloaded from their websites.

I found that the SILVA and UNITE DBs were able to identify all the genera present in the mock dataset. Unfortunately, while the custom database is sensitive enough to assign the correct kingdom, it failed to identify the correct genus, and in some cases the correct family, for a few representative reads. The results were consistent across the curated combined DB (Archaea, Bacteria, and Fungi), the Bacteria-only DB, and the Fungi-only DB. For troubleshooting I did the following:

1: I verified whether reads from these genera are present in my custom database (yes, ample copies). Also, I used SILVA reads (those with good nomenclature), so they are the same ones that are present in the SILVA DB.
2: Re-trained the combined DB and tested it on the mock dataset. Unfortunately, I got the same results.
3: Created 2 BLAST DBs, one from the sequences of the combined (Archaea, Fungi, Bacteria) DB and the other from the Bacteria sequence DB, and checked whether I get correct hits for the reads that were wrongly identified by the qiime classifier. I did get correct hits (see Excel sheet).
4: Clustered the representative reads to see the % identity of the reads that were wrongly classified. They were more than 90%.
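
The presence check in step 1 can be sketched in a few lines of Python. This is a minimal sketch, not the method actually used in the post: it assumes the reference taxonomy is available as (feature ID, taxonomy string) pairs with semicolon-delimited ranks, and the helper names `rank_names` and `count_label`, the example rows, and the prefix styles are all hypothetical.

```python
import re
from collections import Counter

# Strips rank prefixes such as "g__" or "D_5__" (assumed formats; adjust as needed).
_PREFIX = re.compile(r"^[A-Za-z]+(?:_\d+)?__")

def rank_names(taxon):
    """Split a semicolon-delimited taxonomy string into bare rank names."""
    return [_PREFIX.sub("", part.strip()) for part in taxon.split(";") if part.strip()]

def count_label(taxonomy_rows, names):
    """Count reference entries whose taxonomy contains each name of interest."""
    wanted = set(names)
    counts = Counter({name: 0 for name in names})
    for _feature_id, taxon in taxonomy_rows:
        for name in wanted.intersection(rank_names(taxon)):
            counts[name] += 1
    return counts

# Hypothetical rows standing in for the exported reference taxonomy.
rows = [
    ("ref1", "D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus"),
    ("ref2", "k__Fungi;p__Ascomycota;c__Saccharomycetes;o__Saccharomycetales;f__Saccharomycetaceae;g__Saccharomyces"),
    ("ref3", "D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus"),
]
counts = count_label(rows, ["Bacillus", "Saccharomyces", "Escherichia"])
for name, n in counts.items():
    print(name, n)
```

A count of zero for an expected genus would point to a curation gap. A non-zero count, as reported in step 1, only shows that the label exists in the reference, not that the classifier can separate it from its neighbours.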

I used the same qiime2 version for testing as was used for classifying. Yet SILVA and UNITE seem to provide better results.

Here are the commands I used for training the classifier:-

module load qiime2/2020.2.0

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads Arch_Bac_Fungi_all_Orig_1349660.qza \
  --i-reference-taxonomy Arch_Bac_Fungi_taxo_all_Orig_1349660.qza \
  --o-classifier Arch_Bac_Fungi_Orig_classifier_v0.1_rerun.qza \
  --verbose

This was in the log file when I ran it using --verbose

/opt/conda/envs/qiime2-2020.2/lib/python3.6/site-packages/q2_feature_classifier/classifier.py:102: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.22.1. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
warnings.warn(warning, UserWarning)
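
The warning ties the trained classifier to scikit-learn 0.22.1. As a rough sketch (assuming it is run inside the same environment that will later be used for classification, and with `same_release` being a hypothetical helper), the installed version can be compared against the one named in the warning:

```python
TRAINED_WITH = "0.22.1"  # version named in the warning above

def same_release(installed, trained=TRAINED_WITH):
    """Return True when the installed version string equals the training version."""
    return installed.strip() == trained.strip()

try:
    import sklearn
    found = sklearn.__version__
except ImportError:
    found = None  # scikit-learn is not available in this environment

if found is None:
    print("scikit-learn is not installed in this environment")
elif same_release(found):
    print(f"scikit-learn {found} matches the training version")
else:
    print(f"scikit-learn {found} differs from training version {TRAINED_WITH}")
```

A match here only rules out the version mismatch the warning describes. It says nothing about the quality of the reference data.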

I am also attaching a few CSV or TXT files that show the taxonomic classification output using the 4 databases:

Representative reads for reference
Comparison of all the results from each DB: Output_Comparison.csv (24.4 KB)
BLAST results of the reads which were wrongly classified by the above classifier.

Can you provide some thoughts on what may have gone wrong and how I can improve the sensitivity and specificity of this large Curated DB?

Thank you very much,

URB

timanix · July 24, 2021, 7:28am

Hi!
I am not an expert in databases, but just one quick question: is it possible that you have identical or highly similar sequences, differing only in length, that represent the same taxa but are named differently in different databases? My gut tells me that this may mess with the classifier's confidence levels (I am not sure about it, though).
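
One way to test this idea is to group reference sequences by their exact sequence and list any group that carries more than one taxonomy label. The sketch below assumes the reads and taxonomy have already been loaded into plain dictionaries keyed by feature ID; `conflicting_labels` and the example data are hypothetical. It only catches exact duplicates, so sequences that differ in length or by a few bases would need a clustering or containment step instead.

```python
from collections import defaultdict

def conflicting_labels(seqs, taxonomy):
    """Map each duplicated sequence to its distinct taxonomy labels.

    seqs: dict of feature ID -> sequence string
    taxonomy: dict of feature ID -> taxonomy string
    Only sequences that carry more than one distinct label are returned.
    """
    labels_by_seq = defaultdict(set)
    for feature_id, seq in seqs.items():
        # Compare case-insensitively so soft-masked copies are grouped together.
        labels_by_seq[seq.upper()].add(taxonomy[feature_id])
    return {seq: sorted(labels) for seq, labels in labels_by_seq.items() if len(labels) > 1}

# Hypothetical data: a1 and b7 are the same sequence under two different names.
seqs = {"a1": "ACGTACGT", "b7": "acgtacgt", "c3": "TTGACCAA"}
taxonomy = {
    "a1": "Bacteria;Firmicutes;Bacilli",
    "b7": "Bacteria;Bacillota;Bacilli",
    "c3": "Bacteria;Proteobacteria",
}
conflicts = conflicting_labels(seqs, taxonomy)
for seq, labels in conflicts.items():
    print(seq, labels)
```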

URB · July 27, 2021, 3:54pm

Hello Timur,

I used 2 databases.

The original database, which consists of reads from the SILVA and RDP databases (mind you, the RDP database reads are not similar to the SILVA ones).
The second database, which is a deduplicated version of the original. I removed duplicate reads using seqkit. There may be a few reads that share some % identity, but fully identical ones are removed.

For Bacteria, my curated DB has the same reads that the public SILVA DB has (plus RDP), yet my DB was not able to correctly classify a few reads down to the genus level. When I tried BLAST (results were shared earlier), it gave me the correct hits (except for one read where the top 2 hits are from different genera).

Secondly, during my curation step, I compared the SILVA and RDP reads based on taxonomy at the Family level. Next, I determined which families are unique to each database and which are common to both. Lastly, I took the unique-family reads from both SILVA and RDP, and the common-family reads only from SILVA. Hence there is no chance that reads from a single family come from both the RDP and SILVA DBs.
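
The family-level merge described above can be written as a small set operation. This is a sketch under stated assumptions, not the actual curation script: each database is reduced to a dictionary of feature ID to family name, and `merge_by_family` and the example IDs are hypothetical.

```python
def merge_by_family(silva, rdp):
    """Select reads so that each family comes from exactly one source.

    silva, rdp: dict of feature ID -> family name
    Returns (IDs kept from SILVA, IDs kept from RDP).
    """
    silva_families = set(silva.values())
    # SILVA supplies its own unique families plus every family shared with RDP.
    keep_silva = sorted(silva)
    # RDP supplies only the families that SILVA does not contain.
    keep_rdp = sorted(fid for fid, fam in rdp.items() if fam not in silva_families)
    return keep_silva, keep_rdp

# Hypothetical example: Bacillaceae is shared, so its RDP read is dropped.
silva = {"S1": "Bacillaceae", "S2": "Enterobacteriaceae"}
rdp = {"R1": "Bacillaceae", "R2": "Halobacteriaceae"}
keep_silva, keep_rdp = merge_by_family(silva, rdp)

# Families contributed by the two sources should not overlap.
overlap = {silva[i] for i in keep_silva} & {rdp[i] for i in keep_rdp}
print(keep_silva, keep_rdp, overlap)
```

The exact-string comparison means that spelling or formatting differences in family names between the two sources would be treated as different families, so names may need normalizing first.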

For Fungi, I used only the UNITE DB. Hence I cannot understand why I get different results if the reads and nomenclature are the same in the public UNITE DB and my curated Fungi DB.

Thank you,

URB