I have a question regarding taxonomic classification, I guess.
After further investigating the significant features from differential abundance testing, I discovered some discrepancies in the Taxon field of the classification file. For example, this is the Taxon field for one of the significant features:
D_0_Bacteria;D_1_Proteobacteria;D_2_Gammaproteobacteria;D_3_Betaproteobacteriales;D_4_Burkholderiaceae;D_5_Janthinobacterium. Feature_ID: 31e24cf494268b6c24c49953551fbf5e. Confidence: 0.993. Sequence Length: 436.
If I am not mistaken, Janthinobacterium belongs to the family Oxalobacteraceae and not Burkholderiaceae, even though both of these families belong to the order Burkholderiales. When I try to BLASTN these features, the search just keeps running for 30 minutes or more. I know BLAST is not under heavy load, because if I BLAST a Pseudomonas feature at the same time, it returns results quite quickly.
I would like to find out whether there is something wrong with my sequences, the taxonomy file, the classification process, or something else, because right now I have reservations about believing these results.
Additional information:
Workflow in short: ‘DADA2’ to get features, classify using SILVA 132 99% similarity (V3-4 regions extracted using primers), perform differential abundance testing using ‘ANCOM’.
Files used for classification: ‘silva_132_99_16S.fna’ and ‘majority_taxonomy_7_levels.txt’.
Mean sequence length: 426.
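One way to check whether the family label comes from the reference taxonomy itself (rather than from the sequences or the classifier) is to count which families the taxonomy file lists for the genus in question. The following is a minimal sketch, assuming the file is tab-separated with a reference ID in the first column and a semicolon-delimited lineage in the second, and that ranks carry `D_<n>_` style prefixes as in the Taxon string above. The helper names `split_lineage` and `families_for_genus` are hypothetical, not part of QIIME 2.

```python
import re
from collections import Counter

# Matches rank prefixes such as "D_4_" or "D_4__" (assumed format).
RANK_PREFIX = re.compile(r"^D_\d+_+")


def split_lineage(lineage):
    """Split a semicolon-delimited lineage into bare rank names."""
    return [RANK_PREFIX.sub("", part.strip())
            for part in lineage.split(";") if part.strip()]


def families_for_genus(taxonomy_lines, genus):
    """Count the family labels (rank 4) listed for a genus (rank 5)."""
    counts = Counter()
    for line in taxonomy_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        names = split_lineage(fields[1])
        if len(names) > 5 and names[5] == genus:
            counts[names[4]] += 1
    return counts


# Possible usage with the taxonomy file named in this post:
# with open("majority_taxonomy_7_levels.txt") as handle:
#     print(families_for_genus(handle, "Janthinobacterium"))
```

If every reference entry for the genus carries the same family, the label seen in the classification output most likely reflects the database rather than a problem with the sequences.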
I think the issue is likely in the database. The different taxonomic databases often disagree with each other, so the lineage you see depends on which one you are pulling the information from. There was also a major update to either Silva 132 or 138 to incorporate the GTDB naming scheme, which shook things up. I would check the Silva release notes to be sure when GTDB was included.
I would personally recommend caution with manual curation unless you make it very clear how and why you’ve made changes, just for the sake of future reproducibility.
@jwdebelius Thank you for the quick reply.
After checking out the link you provided, I found that GTDB was introduced in the 138 version of SILVA. I guess I should look into making a 138 classifier and redoing the classification. Also, I was not considering manually editing the database, as that does not seem like good practice.
Just to be sure - the 132 classification at the genus level is probably correct and could be used while ignoring the mistake at the family level?
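For reference, working at the genus level while setting aside the family label could be sketched roughly as follows. This assumes the `D_<n>_` rank prefixes seen in the Taxon string above, with `D_5` as the genus rank. The function name `genus_of` is hypothetical, and the sketch only extracts the label; it says nothing about whether that label is correct.

```python
import re

# Captures the rank number from prefixes such as "D_5_" or "D_5__" (assumed format).
RANK_PREFIX = re.compile(r"^D_(\d+)_+")


def genus_of(taxon):
    """Return the genus (rank D_5) from a Taxon string, or None if absent."""
    for part in taxon.split(";"):
        part = part.strip()
        match = RANK_PREFIX.match(part)
        if match and match.group(1) == "5":
            # Everything after the prefix is the rank name.
            return part[match.end():] or None
    return None


taxon = ("D_0_Bacteria;D_1_Proteobacteria;D_2_Gammaproteobacteria;"
         "D_3_Betaproteobacteriales;D_4_Burkholderiaceae;D_5_Janthinobacterium")
print(genus_of(taxon))  # Janthinobacterium
```

Features that were only classified down to family level return `None` here, so they can be handled separately instead of being silently merged.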
You can find information about a QIIME 2 formatted prototype of SILVA 138 here:
There will be more information about this topic soon, so keep your eyes peeled.
To echo @jwdebelius's comments, taxonomy is updated continually, so much so that there can be substantial conflicts between databases. The online version of the SILVA database allows you to compare among various taxonomies, such as GTDB.