SILVA vs. Greengenes2 2024.9 yield different taxonomic classifications (of chloroplast)

scilexenko · November 14, 2024, 8:26am

Dear Daniel,

First, thank you for the updated version of your classifier—truly impressive work.

However, I’m encountering some confusion with the taxonomic assignments. For context, I’m working with plant and soil microbiota. We typically expect chloroplast sequences in root samples, and they are rarely found in soil samples.

I used your pre-trained classifier, 2024.09.backbone.full-length.nb.qza(now including chloroplast and mitochondrial sequences), to assign taxonomy to my sequences. For comparison, I also used the standard silva-138-99-seqs.qza classifier.

To my surprise, the sequences assigned to chloroplasts d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__Chloroplast;f__Chloroplast;g__Chloroplast;__ with Silva,

were assigned to bacteria d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Marinilabiliaceae; g__JC017; s__JC017 sp004296775.

Could you shed some light on why the classifier is assigning chloroplast sequences to bacterial taxa?

Respectfully,
Ivan

feature-classifier feature-table diversity User Support #marinilabiliaceae

wasade · November 15, 2024, 10:31pm

Hi @scilexenko,

Could you provide a few of the sequences which receiving differential classification?

Best,
Daniel

scilexenko · November 18, 2024, 3:50pm

Hello Daniel,

Please find attached.

Ivan

f__Marinilabiliaceae.csv (16.6 KB)

wasade · November 19, 2024, 1:08am

Thanks!

Checking the first sequence, its first 150nt are placed within Cyanobacteria (n.b., the HTTPS certificate is expired, I've let our admin know).

BLAST of the record against nt/nr strongly suggests chloroplast.

The exact genome providing the species name, which the Naive Bayes model is using for its call is GCA_004296775.1, which appears to be a complete genome with reasonable quality. However, NCBI reports the corresponding RefSeq record as suppressed due to contamination. The underlying Genbank record can be found here.

A superficial naive check of that record for these queries, by exact match, shows no hits (selected output at the end of the message).

I'll need to examine the k-mer profile closer but cannot do that today. What I currently suspect is that the GCA_004296775.1 record should be suppressed from k-mer training, and from future releases of Greengenes2, but I would like to definitively understand what's going on here so we can examine whether other taxa are affected similarly.

Thank you again for sharing this. I hope to have a more definitive statement following further examination.

Best,
Daniel

In [2]: r = skbio.read('GCA_004296775.1_ASM429677v1_genomic.fna.gz', format='fasta')

In [3]: seqs = list(skbio.read('GCA_004296775.1_ASM429677v1_genomic.fna.gz', format='fasta'))

In [4]: len(seqs)
Out[4]: 576

In [5]: import pandas as pd

In [6]: queries = pd.read_csv('f__Marinilabiliaceae.csv')

In [7]: len(queries)
Out[7]: 40

In [8]: queries.head()
Out[8]: 
                                 id                                           Sequence                                              Taxon  Confidence
0  4a0b292ba716582f9af46694458c0b9b  TACAGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTC...  d__Bacteria; p__Bacteroidota; c__Bacteroidia; ...    0.899540
1  9e701e82d552ee66d9007fa3883b6560  GACAGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTC...  d__Bacteria; p__Bacteroidota; c__Bacteroidia; ...    0.904273
2  16bd6a23504b0bd9e7ece5e9711458e1  GACAGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTC...  d__Bacteria; p__Bacteroidota; c__Bacteroidia; ...    0.950112
3  0a3cf58d4ca062c13d42c9db4ebcbc53  GACAGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTC...  d__Bacteria; p__Bacteroidota; c__Bacteroidia; ...    0.801568
4  e1b2b6ae4af6686429c21926cc0314a0  TACAGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTC...  d__Bacteria; p__Bacteroidota; c__Bacteroidia; ...    0.923335

In [10]: for s in queries['Sequence']:
    ...:     for r in seqs:
    ...:         if s in str(r):
    ...:             print("hit")
    ...:             break
    ...: 

In [11]: for s in queries['Sequence']:
    ...:     for r in seqs:
    ...:         if s[:150] in str(r):
    ...:             print("hit")
    ...:             break
    ...: 

In [16]: for s in queries['Sequence']:
    ...:     for r in seqs:
    ...:         if s[:150] in str(skbio.DNA(r).reverse_complement()):
    ...:             print("hit")
    ...:             break
    ...:

scilexenko · November 19, 2024, 8:08am

Thank you for your examination ! I appreciate your efforts to investigate this issue thoroughly.

In the meantime, I’d like to know your suggestion for how to proceed. Do you think it would be better to exclude the Marinilabiliaceae-assigned sequences from further analysis?

Thank you for keeping me informed, and I look forward to hearing from you once you’ve had the opportunity to examine this further.

Respectfully,
Ivan

wasade · November 19, 2024, 4:38pm

Hi Ivan,

You're quite welcome, and thank you for flagging this.

Given that the genome associated with JC017 sp004296775 is a suppressed RefSeq record, it was isolated from a very different environment (seawater), a simple BLAST of one of the ASVs strongly supports chloroplast, and separate evidence Naive Bayes classification supports chloroplast, I think a pragmatic solution would be to remove records which are assigned to that genus. I don't have evidence to suggest the family is unusual, but this genome record appears to be the naming member of that genus in GTDB. There are resolution limits with 16S too so exclusion at genus would not be unreasonable.

Best,
Daniel