UNITE database use for classification

evepyrenees · April 14, 2026, 8:59pm

Hello!

I am working with ITS1 data and have used a workflow borrowing heavily from the Langille lab’s microbiome helper, with a few fungal modifications including extracting the ITS1 region (Microbiome Helper 2 Marker gene workflow · LangilleLab/microbiome_helper Wiki · GitHub).

For step 3, Assign taxonomy to ASVs, I trained and used a classifier based on the most recent reference available on UNITE (PlutoF DOI) using qiime feature-classifier fit-classifier-naive-bayes and qiime feature-classifier classify-sklearn. I now have a taxonomy.tsv that appears to match what I expected to see based on my samples.

My question involves the different types of feature IDs provided. I have a mix of representative sequences (SH0016447.10FU_KX515298_reps) and reference sequences (SH0016832.10FU_MG593539_refs). From what I understand, these are artifacts of how UNITE selects sequences to represent species hypotheses, and including both can inflate diversity. What is the common practice for mitigating this? I have read that I can use qiime collapse to group by species hypothesis, but there seem to be a lot of caveats, and I am not confident about how to do this while keeping taxonomic information and abundance data.
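For readers who want to inspect these IDs programmatically, here is a minimal Python sketch that splits an ID into its species hypothesis code, accession, and reps/refs suffix, then groups IDs by species hypothesis. The layout `<SH code>_<accession>_<reps|refs>` is an assumption inferred only from the two IDs quoted above, and `parse_unite_id` and `group_by_sh` are hypothetical helper names, not part of QIIME 2 or UNITE.

```python
import re
from collections import defaultdict

# Assumed layout, inferred only from the two example IDs in this post:
#   <SH code>_<accession>_<reps|refs>
UNITE_ID = re.compile(
    r"^(?P<sh>SH\d+\.\d+FU)_(?P<accession>.+)_(?P<kind>reps|refs)$"
)


def parse_unite_id(feature_id):
    """Split an ID into SH code, accession, and reps/refs suffix."""
    match = UNITE_ID.match(feature_id)
    if match is None:
        raise ValueError(f"ID does not follow the assumed layout: {feature_id}")
    return match.groupdict()


def group_by_sh(feature_ids):
    """Group IDs that share the same species hypothesis (SH) code."""
    groups = defaultdict(list)
    for feature_id in feature_ids:
        groups[parse_unite_id(feature_id)["sh"]].append(feature_id)
    return dict(groups)


ids = ["SH0016447.10FU_KX515298_reps", "SH0016832.10FU_MG593539_refs"]
for feature_id in ids:
    print(parse_unite_id(feature_id))
print(group_by_sh(ids))
```

Grouping on the SH code alone ignores the reps/refs suffix, which is one way to check whether both kinds of ID ever point at the same species hypothesis in a given table.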

Any insight would be extremely appreciated.

Thank you very much.

colinbrislawn · April 22, 2026, 4:15pm

Hello @evepyrenees,

I can help a little with UNITE.

https://doi.org/10.15156/BIO/3301242

That DOI matches the UNITE database version with an "S".

While I don't know exactly what the _s_ means, I think it means 'singletons'.

No S:

Includes singletons set as RefS (in dynamic files).

With S:

Includes global and 3% distance singletons.

I tried using the _s_ and non-s versions when I trained a skl classifier for the UNITE database and found that including singletons uniformly reduces the F-score by a few percent.

So, I stopped using the S version of the database for my pretrained UNITE classifiers.

Let us know what you try next, or if you are able to talk to the UNITE devs and learn more!

Like, the _s_ database must be useful for something! Please let me know if you find it!

Quick note:

It depends on how ASVs are made, and when in the pipeline they are made.

If you are working with DADA2 ASVs, for example, then the ASVs are made before taxonomy assignment and classification. The ASVs are unchanged, so diversity will be unchanged.

If you are working with closed-reference OTU counting, then changing the database would also change the OTUs, and thus change the diversity too!
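To make the ASV case concrete, here is a small Python sketch using made-up counts and made-up taxon labels (none of this is real data, and `collapse` and `richness` are hypothetical helpers, not QIIME 2 functions). It shows that ASV-level richness does not depend on which database supplied the labels, while a table collapsed by taxon label does, and that collapsing keeps abundances by summing them.

```python
from collections import defaultdict

# Hypothetical ASV counts per sample (made-up numbers for illustration).
asv_counts = {
    "asv1": {"s1": 10, "s2": 0},
    "asv2": {"s1": 5, "s2": 7},
    "asv3": {"s1": 0, "s2": 3},
}

# Hypothetical labels for the same ASVs from two different reference databases.
taxonomy_a = {"asv1": "taxon_A", "asv2": "taxon_A", "asv3": "taxon_B"}
taxonomy_b = {"asv1": "taxon_A", "asv2": "taxon_C", "asv3": "taxon_B"}


def collapse(counts, taxonomy):
    """Sum counts of all features that share a taxon label."""
    collapsed = defaultdict(lambda: defaultdict(int))
    for feature, per_sample in counts.items():
        for sample, n in per_sample.items():
            collapsed[taxonomy[feature]][sample] += n
    return {taxon: dict(samples) for taxon, samples in collapsed.items()}


def richness(counts, sample):
    """Number of features observed (count > 0) in one sample."""
    return sum(1 for per_sample in counts.values() if per_sample.get(sample, 0) > 0)


# ASV-level richness is computed without looking at taxonomy at all.
print("ASV richness in s1:", richness(asv_counts, "s1"))

# Collapsed richness depends on the labels, so it changes with the database.
print("Collapsed (db A) richness in s1:", richness(collapse(asv_counts, taxonomy_a), "s1"))
print("Collapsed (db B) richness in s1:", richness(collapse(asv_counts, taxonomy_b), "s1"))
print(collapse(asv_counts, taxonomy_a))
```

The per-sample totals are the same before and after collapsing, so abundance is preserved; only the number of rows changes.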