Choosing UNITE release for classifier training

mdziurzynski · November 1, 2018, 11:30pm

Hello,

I am trying to assign taxonomy to my ITS2 sequences using the UNITE database. I would like to train my own classifier, but I don't know which version of the UNITE database I should use:

the database labeled "Includes singletons set as RefS (in dynamic files)." (sh_qiime_release_01.12.2017.zip)
or
the database labeled "Includes global and 97% singletons." (sh_qiime_release_s_01.12.2017.zip)

I understand that I should use the developer version and avoid trimming the database to my primer sequences, but I am still struggling to understand which of the two versions above I should use.

Which of them do you use and why?

I would gladly welcome any help.

Nicholas_Bokulich · November 2, 2018, 2:22pm

Hi @mdziurzynski,
Those are just different OTU clustering thresholds of the same database. This is more of a UNITE question than a QIIME 2 question — see this paper for more description of those thresholds and the dynamic clustering.

mdziurzynski · November 3, 2018, 12:56pm

Dear @Nicholas_Bokulich,
Thank you very much for your response and for pointing me to that paper; it helped a lot. Just to be sure, please correct me if I am wrong: the second version of the UNITE database is twice as large because it includes SHs that contain only one sequence (i.e., singletons).

This was my problem: I wasn't able to understand the difference between RefS-based singletons and global and 97% singletons.
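
The size difference between the two releases can be checked directly by counting the FASTA records in each reference file. Below is a minimal Python sketch of that check. The function names and the toy records are made up for illustration and are not real UNITE data; it assumes only that each sequence starts with a `>` header line, as in standard FASTA.

```python
from typing import Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def size_ratio(n_larger: int, n_smaller: int) -> float:
    """Return how many times larger one release is than the other."""
    if n_smaller == 0:
        raise ValueError("the smaller release contains no sequences")
    return n_larger / n_smaller


# Toy stand-ins for the two releases (hypothetical records, not UNITE data).
without_singletons = [">seq1", "ACGT", ">seq2", "GGCC"]
with_singletons = [">seq1", "ACGT", ">seq2", "GGCC", ">seq3", "TTAA", ">seq4", "CCGG"]

n_small = count_fasta_records(without_singletons)
n_large = count_fasta_records(with_singletons)
print(n_small, n_large, size_ratio(n_large, n_small))
```

For the real releases, an open file handle can be passed in place of the toy lists, e.g. `with open(path) as fh: count_fasta_records(fh)`, where `path` points at the reference FASTA extracted from each zip archive. This only shows how much larger one release is, not why; the reason still has to come from the UNITE documentation.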

Once again, thanks a lot for your help!

Nicholas_Bokulich · November 3, 2018, 4:46pm

You should ask the UNITE developers; I do not know the details. This page suggests otherwise:

Following Kõljalg et al. (2013), each terminal fungal taxon for which two or more ITS sequences are available is referred to as a species hypothesis (SH). One sequence is chosen to represent each SH; these sequences are called representative sequences (RepS) when chosen automatically by the computer and reference sequences (RefS) when those choices are overridden (or confirmed) by users with expert knowledge of the taxon at hand.