UNITE classifiers

colinbrislawn · January 17, 2022, 1:12am

I'm building a workflow to build pre-trained classifiers for the UNITE database, and thought other folks on the forums might be interested in the output. I wanted to get your feedback before I published a draft.

(Please @mention anyone who I missed!)

I noticed that UNITE ships three clustering levels

97
99
dynamic

and four taxa scopes

"" Fungi, Includes singletons set as RefS (in dynamic files)
"s_" Fungi, Includes global and 97% singletons
"all_" All Euks, Includes singletons set as RefS (in dynamic files).
"s_all_" All Euks, Includes global and 97% singletons.

For the taxa scopes, does "s__" mean that it includes singletons, or does it mean something in addition to that?
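
To make the size of the job concrete, here is a minimal sketch that enumerates every combination of the clustering levels and taxa scopes listed above. The labels come straight from those lists; the names `CLUSTER_LEVELS`, `TAXA_SCOPES`, and `classifier_variants` are made up for illustration and are not part of UNITE or QIIME 2.

```python
from itertools import product

# Labels copied from the lists above; this layout is illustrative only.
CLUSTER_LEVELS = ["97", "99", "dynamic"]
TAXA_SCOPES = {
    "": "Fungi, singletons set as RefS (in dynamic files)",
    "s_": "Fungi, global and 97% singletons",
    "all_": "All Euks, singletons set as RefS (in dynamic files)",
    "s_all_": "All Euks, global and 97% singletons",
}


def classifier_variants():
    """Return every (clustering level, scope prefix) pair."""
    # Iterating over a dict yields its keys, i.e. the scope prefixes.
    return list(product(CLUSTER_LEVELS, TAXA_SCOPES))


for level, prefix in classifier_variants():
    print(f"{level:>7}  {prefix or '(none)':<7}  {TAXA_SCOPES[prefix]}")
```

Three levels times four scopes gives twelve candidate classifiers per UNITE release, which is why narrowing the set matters.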

Sydney, I noticed that when you shipped the v8.0 classifiers, only 99% and dynamic were included. Is this because these are preferable to the 97% databases, which might be best suited to the old 97% similarity definition of OTUs?

For the four taxa scopes, would "s__" usually be preferable to the 'normal' ones?

Finally, I read on the forums that the Qiime devs and Unite devs were interested in teaming up and shipping these pre-trained classifiers in a more official context. I wanted to check in and see if there was an 'official' release planned, or if a community contribution like this could be helpful.

Thanks!

Nicholas_Bokulich · January 17, 2022, 10:15am

Hi @colinbrislawn ,

Regarding your question about different OTU clustering levels for UNITE:

We recently tested several UNITE versions; you can see the results here:

Clustering does not make that much of an impact but I would discourage you from using the 97%... you do lose some taxonomic information. 99% or dynamic would be best. It is probably not worth the trouble of training multiple, as the differences are quite minor.

Many sequences in the database are unannotated or partially annotated (e.g., not annotated at order level or below). We see large accuracy improvements from removing these unannotated sequences, whereas the effect of the clustering threshold is relatively minor.
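
A minimal sketch of that kind of filter, under stated assumptions: taxonomy strings are semicolon-separated with rank prefixes such as `o__`, and a missing annotation is either absent or labeled `unidentified`. The helper names and the sequence IDs below are hypothetical, and this is not the exact procedure used in the tests mentioned above.

```python
# Assumed placeholder labels for a missing annotation.
UNANNOTATED = {"", "unidentified"}


def rank_label(taxonomy, rank):
    """Return the label for a one-letter rank (e.g. 'o'), or '' if absent."""
    prefix = rank + "__"
    for field in taxonomy.split(";"):
        field = field.strip()
        if field.startswith(prefix):
            return field[len(prefix):]
    return ""


def is_annotated(taxonomy, rank="o"):
    """True if the given rank carries a real label."""
    return rank_label(taxonomy, rank).lower() not in UNANNOTATED


def filter_taxonomy(records, rank="o"):
    """Keep only the records annotated at the given rank."""
    return {seq_id: tax for seq_id, tax in records.items()
            if is_annotated(tax, rank)}


# Made-up IDs, for illustration only.
EXAMPLE = {
    "SH_A": "k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;"
            "f__Nectriaceae;g__Fusarium;s__Fusarium_poae",
    "SH_B": "k__Fungi;p__Ascomycota;c__unidentified;o__unidentified;"
            "f__unidentified;g__unidentified;s__unidentified",
    "SH_C": "k__Fungi",
}

kept = filter_taxonomy(EXAMPLE, rank="o")
print(sorted(kept))
```

The matching sequences would need to be filtered by the same IDs before training, so that reads and taxonomy stay in sync.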

This forum? Do you have a link? I do not believe that this has moved forward, but there are many Q2 users who are analyzing ITS data, so it would be a good idea.

colinbrislawn · January 17, 2022, 2:01pm

I've attached the Snakemake report, if folks are interested. report.html (1.3 MB)

I'll cross-link the developer discussion after work today.

I look forward to hearing from more people about this. Let me know how I can help!

colinbrislawn · January 19, 2022, 12:20am

I have made this repo public: Releases · colinbrislawn/unite-train · GitHub

Feedback welcome!

Nicholas_Bokulich · January 19, 2022, 8:15am

Hey @colinbrislawn ,
Thanks for putting this together and sharing!

I discussed with @thermokarst on the side, and I recommend adding your workflow to the pretrained classifiers workflows here:

These are the workflows that the Caporaso lab runs to generate the pre-trained classifiers that are released with each QIIME 2 release on the data-resources page. Adding your workflow there (incl. in the Makefile) will "add it to the queue", and then these files can be updated with each Q2 release.

I notice that you download everything (all OTU thresholds, fungi and "all", and s__/non-s__). If you want these added to the data-resources page I am not sure that we will want pre-trained classifiers for everything, as it is a lot to maintain and some versions (e.g., 97% OTUs) are clearly not needed. I recommend focusing, and users with very specific needs can go back to the source (after all, we do not want to replace UNITE and their great efforts in providing pre-formatted files for the QIIME community!)

Now for some less urgent but more long-term ideas/issues (these are issues whether you add it to the data resources page or keep it separate):

UNITE fortunately has a regular release cycle. So this will require some maintenance to keep updated so that the latest versions are shared.
It would be great to add a method for downloading UNITE data in RESCRIPt. Then the version and citation information could be recorded directly in provenance, and users would have more flexibility (e.g., to specify versions and filter/edit the taxonomy to remove missing annotations or extract subregions prior to training a classifier). It might also reduce the maintenance burden a little. This has been on my mind for a long time, but it looks like an issue was never opened, so I opened one this morning, if you are interested.

colinbrislawn · January 22, 2022, 5:26pm

Thank you for your guidance!

I really like the idea of folding this into the primary production pipeline, as that's much more sustainable over the long term. Can you give me some advice on how to start?

My mental model of this pipeline looks like this:
Databases in the wild - The Internet (tm)
method for downloading - rescript/get_data.py:get_silva_data
weighted databases - GitHub - BenKaehler/readytowear: Ready-made Taxonomic Weights Repository
make+slurm pipeline for training - GitHub - caporaso-lab/pretrained-feature-classifiers
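
For the last stage, the training step can be sketched as three QIIME 2 CLI calls: two imports and one fit. The sketch below only builds and prints the commands, it does not run them. It assumes the naive Bayes trainer from q2-feature-classifier and a headerless TSV taxonomy file; the file names and the `training_commands` helper are hypothetical, and this is not necessarily how either pipeline above does it.

```python
import shlex


def training_commands(fasta, taxonomy_tsv, out_prefix):
    """Build the QIIME 2 CLI commands to import references and train."""
    seqs = f"{out_prefix}-seqs.qza"
    tax = f"{out_prefix}-tax.qza"
    classifier = f"{out_prefix}-classifier.qza"
    return [
        # Import the reference sequences.
        ["qiime", "tools", "import",
         "--type", "FeatureData[Sequence]",
         "--input-path", fasta,
         "--output-path", seqs],
        # Import the taxonomy (assumed to be a headerless two-column TSV).
        ["qiime", "tools", "import",
         "--type", "FeatureData[Taxonomy]",
         "--input-format", "HeaderlessTSVTaxonomyFormat",
         "--input-path", taxonomy_tsv,
         "--output-path", tax],
        # Train the naive Bayes classifier on the imported artifacts.
        ["qiime", "feature-classifier", "fit-classifier-naive-bayes",
         "--i-reference-reads", seqs,
         "--i-reference-taxonomy", tax,
         "--o-classifier", classifier],
    ]


# Hypothetical input file names, for illustration only.
for cmd in training_commands("unite_refs.fasta", "unite_taxonomy.txt", "unite"):
    print(shlex.join(cmd))
```

Each inner list could be passed to `subprocess.run(cmd, check=True)` inside an activated QIIME 2 environment.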

Looks like the pretrained-feature-classifiers pipeline was just refactored to use rescript in March 2021, and only does so for SILVA right now. Greengenes still uses a manual download.

It also looks like pretrained-feature-classifiers pipeline does not always use rescript because weights are not supported at the moment.

So... where should I start? Sounds like adding functionality to rescript would be good, as it's upstream in the pipeline. I'm also thinking about how to deploy and test things locally, and I'm guessing rescript would be a good place to start because it does not have the specific back-end tooling of the pretrained-feature-classifiers pipeline.

Nicholas_Bokulich · January 22, 2022, 5:56pm

Hey @colinbrislawn ,
Your mental model sounds spot on.

You can also ignore the weights part for now — this would not impact ITS classifiers (as we currently do not have good standardized databases for obtaining global ITS weights, unlike for 16S).

This would be awesome! You can check out how get_silva_data is set up in RESCRIPt, and a get_unite_data action could probably work the same way — though the PlutoF API would allow querying the UNITE database in a more dynamic way, so this would be a better option (though probably more work to set up).

Pls take a look at the source code and shoot me an email if you have any questions — we could discuss more directly if this looks like something that you would like to tackle.

Sydney_Morgan · January 29, 2022, 10:16pm

Hi @colinbrislawn , sorry for the late response. The simple answer to your question is that it took quite a while for me to make each classifier, so I only made the ones that were most relevant to my research. Definitely not a bad idea to make the 97% classifier available; I just never made one, so I couldn't upload it!

colinbrislawn · January 30, 2022, 1:15am

Thank you for the update, Sydney!

Can you tell me more about these 4 file types? I called them 'taxa scopes,' but I'm not sure that's the best way to describe them...

What's in these files?

"" Fungi, Includes singletons set as RefS (in dynamic files)
"s_" Fungi, Includes global and 97% singletons
"all_" All Euks, Includes singletons set as RefS (in dynamic files).
"s_all_" All Euks, Includes global and 97% singletons.

For the taxa scopes, does "s__" mean that it includes singletons, or does it mean something in addition to that?

Thanks!

PS. I have a prerelease of UNITE v8.3 trained on qiime2-2021.11 on GitHub. Let me know if you are comfortable with that, or if you or members of the unite team would like to release something instead. This is my first time doing this and I appreciate any advice you have to offer.