Trouble training sk classifier to identify particular protist

anapopov · April 27, 2023, 10:02pm

Hi,

I am having a bit of trouble with the classifier identifying a protist (Dientamoeba). It is either identified only to Kingdom or Phylum, or it is misclassified to the wrong class/family/genus. I've tried both the pre-trained classifier (which, I realized, doesn't seem to have this organism in its reference data) and a custom-trained classifier (to which I added Dientamoeba along with several other protists).

I saw a similar question previously posted (Issues with training classifier to PR2 database - #2 by Nicholas_Bokulich) but unfortunately it doesn't resolve my issue.

The classifier does well with other representative protists, including the others for which I added sequence data to the classifier; many are assigned to genus or species level. The reference sequences most closely related to Dientamoeba are at 80% sequence identity or below in the amplicon region. Could this problem be related to inaccurate taxonomies in the reference data? Or to too few related taxa in the training dataset? I'm wondering if there is a way to help it along.

(details below)

Thanks! ~Ana

Details: I am using the 2021.11 Singularity build of QIIME 2. I appended 15 sequences/taxonomies to the SILVA 138 NR99 data; extracted the amplicon region (V4-V5 variable regions) using the "qiime feature-classifier extract-reads" command with primers; verified that the sequence for my organism of interest is correct; and trained the classifier. But when I test it with the same (trimmed) reference sequence as input, I get the wrong class/family, or just the Kingdom or Phylum.
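For readers who want to see what the primer-based extraction step does conceptually, here is a minimal pure-Python sketch. It is not how `extract-reads` is implemented: it only accepts exact primer matches (with IUPAC degenerate bases), whereas the real command is, as far as I know, more tolerant of primer mismatches. The primers and the sequence below are made-up toy values, not the V4-V5 primers used in this thread.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also covers the degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a DNA string (IUPAC codes allowed)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a (possibly degenerate) primer into a regex fragment."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_amplicon(seq, f_primer, r_primer):
    """Return the region between the primers (primers trimmed), or None.

    Exact matching only; this is a simplification for illustration.
    """
    pattern = re.compile(
        primer_regex(f_primer)
        + "(.+?)"
        + primer_regex(reverse_complement(r_primer))
    )
    match = pattern.search(seq.upper())
    return match.group(1) if match else None


# Toy example: hypothetical primers and a hypothetical reference sequence.
toy_reference = "AAA" + "GTGCCAG" + "TTTTACGT" + "ATTGACGG" + "CCC"
print(extract_amplicon(toy_reference, "GTGYCAG", "CCGTCAAT"))  # TTTTACGT
```

A reference whose primer sites differ from the primers would return `None` here, which is one way a sequence can silently drop out of an extracted reference set.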

SoilRotifer · April 28, 2023, 2:14pm

Hi @anapopov,

Short answer... it could be all of the above.

When you say that you appended to the SILVA NR database, how did you prepare the database prior to training the classifier? Did you use an approach similar to what is outlined here, or something else?

I ask because longer / full-length sequences can often differentiate among taxa where the extracted amplicon region cannot. That is, the targeted amplicon region may contain identical sequence across disparate taxa, so you lose the taxonomic resolution needed to tell some taxa apart. I suspect that is what is happening here.

What happens when you run qiime rescript dereplicate ..., using the --p-mode lca option, on the extracted V4V5 region? Do you still observe the taxonomic groups of interest in the output? If not, then this means that there are identical sequences with differing taxonomies in the reference database. Compare with the --p-mode uniq option, which keeps replicate sequences in the file only if their taxonomies differ.
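To make the difference between the two modes concrete, here is a small pure-Python sketch of the idea. It is only an illustration of the concept, not RESCRIPt's implementation, and the records are toy data: the two lineages are borrowed from this thread, but the premise that they share an identical amplicon is hypothetical.

```python
from collections import defaultdict


def lca(taxonomies):
    """Longest shared rank prefix of several ';'-delimited taxonomy strings."""
    split = [[rank.strip() for rank in t.split(";")] for t in taxonomies]
    shared = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return "; ".join(shared)


def dereplicate(records, mode="uniq"):
    """Collapse identical sequences from (sequence, taxonomy) pairs.

    mode="uniq": keep one copy per distinct (sequence, taxonomy) pair.
    mode="lca":  keep one copy per sequence, labelled with the shared
                 prefix of all taxonomies attached to that sequence.
    """
    by_seq = defaultdict(list)
    for seq, tax in records:
        if tax not in by_seq[seq]:
            by_seq[seq].append(tax)

    out = []
    for seq, taxa in by_seq.items():
        if mode == "lca":
            out.append((seq, lca(taxa)))
        elif mode == "uniq":
            out.extend((seq, tax) for tax in taxa)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return out


# Toy records (hypothetical amplicons).
records = [
    ("ACGTACGT", "d__Eukaryota; p__Parabasalia; c__Tritrichomonadea; g__Dientamoeba"),
    ("ACGTACGT", "d__Eukaryota; p__Parabasalia; c__Cristamonadea; g__Toy_genus"),
    ("TTGGCCAA", "d__Eukaryota; p__Toy_phylum; g__Toy_genus"),
    ("TTGGCCAA", "d__Eukaryota; p__Toy_phylum; g__Toy_genus"),
]

for seq, tax in dereplicate(records, mode="lca"):
    print(seq, tax)
```

In this toy case the `lca` output labels the shared amplicon only as `d__Eukaryota; p__Parabasalia`, so the genus disappears from the reference, while `uniq` keeps both conflicting labels. A group of interest that vanishes under `lca` is the signal to look for.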

anapopov · May 3, 2023, 5:52pm

Thanks for the response, Mike!

I didn’t use rescript, but I did more or less follow the steps outlined in the link under ‘SILVA compilation pipeline’. (I did try rescript, but had trouble setting it up – I hit the max allowed files on the cluster while setting up the conda environment.. Any plan for a newer containerized qiime release with rescript? Would be great).

I added the full-length sequences for ~15 organisms, but I did run a sequence search of just the amplicon regions to ensure there are no duplicates in the SILVA v138 reference set. The closest amplicon match for Dientamoeba is ~85% - those taxa it is misclassified for. That makes me wonder if it is an issue with the taxonomies..

I tried using the full-length classifier as you suggest, but unfortunately it had worse overall performance for the protozoa, and was no better for Dientamoeba. Perhaps the poly-dT/low-complexity stretches in Dientamoeba are problematic (sequence pasted at the bottom). Some of these look like indels in alignments.

If you have any other ideas of what to try, please let me know.. maybe to remove the batch it is misclassified for, or to add a few similar dummy sequences, just to see if the classifier behaviour changes..?

Dientamoeba 18S V4+V5:
TGCAAGTTTGCTCCCATATTGTTGTAGTTAAAACGCTCGTAGTCTGAATTATTTTAATTTAAATTTTTTAAATTAAAATTTAGTTTTTATTTTATAAAAACGTTCACTGTGGAACAAATCAGAACGCTTAAAGTAATTTTCTTTATTGAATGATTTAGCGCAGTATGAAATTTTTACCTTTTAAATTTTAATTAATTTAACAAGTAATATCAAAGAGAATAATCGGGGATAGATCTATTTCATGGCGAACAGCGAAATGTTTTGACCCATGAGAGAGAAACGAAGGCGAAAGCATCTATCAAGTGTATTTCTATCGATCAAGGGCGAGAGTAGGAGTATCCAACCGGATCAGAGACCCGGGTAGTTCCTACCTTAAACTATGCCGACAAGGTTTTGTTTTTTTTAATAAAAGCAGTACCATAGGAGAAATCATAGTTCATGGGCTCTGGGGGAACTACGACCGCAAGGCTGAAACTTGAA
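As a quick way to quantify the low-complexity stretches mentioned above, the following sketch reports homopolymer runs and overall A+T fraction for a sequence. The function names and the `min_len=5` threshold are arbitrary choices for illustration, and the snippet is only an excerpt of the sequence posted above; paste the full sequence to check the whole amplicon.

```python
import re


def homopolymer_runs(seq, min_len=5):
    """Return (start, base, length) for every run of a single base
    that is at least min_len long (0-based start positions)."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)))
        for m in pattern.finditer(seq.upper())
    ]


def at_fraction(seq):
    """Fraction of bases that are A or T."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq)


# Excerpt copied from the Dientamoeba sequence in the post above.
excerpt = "GTCTGAATTATTTTAATTTAAATTTTTTAAATTAAAATTTAGTTTTTATTTTATAAAAACG"

for start, base, length in homopolymer_runs(excerpt):
    print(f"run of {length} x {base} at position {start}")
print(f"A+T fraction: {at_fraction(excerpt):.2f}")
```

Running the same check on the closest reference sequences would show whether Dientamoeba is unusual in this respect or whether its neighbours are similarly AT-rich.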

SoilRotifer · May 3, 2023, 8:16pm

You can follow this approach to add plugins to docker containers.

So, I just ran the sequence you provided:

... against the full-length SILVA 138.1 reference database, as processed via RESCRIPt, with feature-classifier classify-sklearn and I was able to obtain the following assignment:

d__Eukaryota;p__Parabasalia;c__Tritrichomonadea;o__Tritrichomonadea;f__Tritrichomonadea;g__Dientamoeba;s__Dientamoeba_fragilis

I also obtained the same result using ACT. Is this what you were hoping to see?

-Mike

anapopov · May 3, 2023, 10:23pm

Yes!
That's odd. I extracted the sequences/taxonomies from the SILVA 138 qza files linked on the qiime2 site, and Dientamoeba wasn't there at all. I will try downloading from SILVA directly and see what happens.

ACT doesn't work for the protozoa Entamoeba dispar and Entamoeba coli, so I was trying to see if I could customize the qiime classifier.

Will rescript work with the 2021.11 version? I will have a go at setting it up.

Thanks Mike.

Ana~

anapopov · May 3, 2023, 10:44pm

(just as FYI, I was getting one of the below taxonomic assignments:
d__Eukaryota; p__Parabasalia;
d__Eukaryota; p__Parabasalia; c__Cristamonadea; o__Cristamonadea; f__Cristamonadea; )
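When comparing many assignments like these against an expected lineage, a small helper can report how deep each assignment goes and where it first diverges. This is a sketch with hypothetical function names; it assumes the `d__`/`p__`/... rank prefixes seen in the strings in this thread.

```python
RANK_NAMES = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_taxonomy(taxonomy):
    """Split 'd__X; p__Y; ...' into [(prefix, name), ...], skipping empty fields."""
    ranks = []
    for field in taxonomy.split(";"):
        prefix, _, name = field.strip().partition("__")
        if name:
            ranks.append((prefix, name))
    return ranks


def deepest_rank(taxonomy):
    """Name of the deepest assigned rank, or None if nothing is assigned."""
    ranks = parse_taxonomy(taxonomy)
    if not ranks:
        return None
    prefix = ranks[-1][0]
    return RANK_NAMES.get(prefix, prefix)


def first_disagreement(expected, observed):
    """Rank at which observed first differs from expected.

    Returns None when observed is merely a truncated (shallower)
    version of expected, or identical to it.
    """
    for exp, obs in zip(parse_taxonomy(expected), parse_taxonomy(observed)):
        if exp != obs:
            return RANK_NAMES.get(exp[0], exp[0])
    return None


expected = ("d__Eukaryota;p__Parabasalia;c__Tritrichomonadea;o__Tritrichomonadea;"
            "f__Tritrichomonadea;g__Dientamoeba;s__Dientamoeba_fragilis")
observed = [
    "d__Eukaryota; p__Parabasalia;",
    "d__Eukaryota; p__Parabasalia; c__Cristamonadea; o__Cristamonadea; f__Cristamonadea; ",
]

for tax in observed:
    print(deepest_rank(tax), first_disagreement(expected, tax))
```

For the two assignments above, the first is a shallow but consistent call (stops at phylum, no disagreement), while the second diverges from the expected lineage at class level.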

SoilRotifer · May 8, 2023, 5:20pm

I think we may need to update the SILVA release used to build the pre-curated reference files. I think those files are still generated from SILVA 138, but if you run rescript get-silva-data directly, it will pull SILVA 138.1, which has some corrections.

The current release is for 2023.2, but you can download and install older versions of RESCRIPt, like 2021.11, here.