Differences between the 2022 and 2023 releases of UNITE

Hi @colinbrislawn
I have been using UNITE v9 classifiers based on the 2022-10 release and recently used your pre-trained classifiers based on the 2023-07 release. For reasons I cannot explain at the moment, I get very different taxonomic classifications for the same ITS1F-ITS2 amplicon with the two classifiers (2022 and 2023 release). To rule out my own mistakes, I compared several classifiers (pretrained, or trained myself on the dynamic_all reference sequences) using a q2-amplicon-2024.2 installation in conda:
A: pretrained from Github for 2022 and 2023 release
B: trained by qiime rescript evaluate-fit-classifier for 2022 and 2023 release
C: trained by qiime feature-classifier fit-classifier-naive-bayes for 2022 and 2023 release
(B and C give identical results for the 2022 data, but differ slightly for the 2023 data, not shown)

The results are the same for A, B, and C: taxonomic classification is good with the 2022-based classifiers, but not with the 2023-based ones.
I have attached taxa-barplot qzv's using the A and C classifiers as examples. I am aware that some of the pretrained classifiers are from the developer's section of UNITE, but I don't think this is the issue here.

I am aware that you are not part of the UNITE team, but I would like to sort out q2-related issues first.
I would appreciate any comments on these observations.
Best,

taxa-barplot_A_2022.qzv (1.1 MB)
taxa-barplot_A_2023.qzv (589.5 KB)
taxa-barplot_C_2022.qzv (1.1 MB)
taxa-barplot_C_2023.qzv (589.5 KB)

Thank you for bringing this to my attention.

I have some other commitments, then I'll look into this. I appreciate your detailed testing across multiple training strategies.

What's the expected composition for this sample?

Is the newest 2024 version more similar to the good 2022 version, or the bad 2023 version?

Thank you for your quick reply.
The expected composition is certainly something like the results obtained with the 2022 UNITE release, fungal endo- and epiphytes in and on various tree leaves.

And thank you for pointing out the UNITE 2024 release from a few days ago. I missed it because no qiime2 release files are available yet on the resources page at UNITE. I will try to find out how to get the most recent reference sequences and taxonomies, or will contact UNITE staff...

Best,

Hi @arwqiime,

I noticed that you are downloading and importing the UNITE data manually. At least that is what it looks like given the provenance information. You can make use of qiime rescript get-unite-data ... to make your life much easier. It'll also reduce any potential mismatches between the taxonomy and sequence files. See the tutorial here.

Speaking of which, it appears that you are using the UNITE dev files for 2022 and the regular files for 2023. This might be part of the inconsistency you are observing. I suggest using RESCRIPt to fetch these files for you. RESCRIPt is also part of the base QIIME 2 "amplicon" distribution for 2024.2.

I am not too surprised that you'd see differences between database releases, as there are always corrections and taxonomy change updates. That is, removing poor reference data and including better reference data. Perhaps use RESCRIPt to fetch your data and compare the results again?

Hello @SoilRotifer

Yes, I chose manual import in order to compare classifiers as shown above (and to be in line with most pre-trained classifiers, too). As far as I know, there is no way to load previous releases of UNITE v9 (e.g. the 2022-10 release) with rescript get-unite-data (except by changing the code, as you pointed out in another post).
However, I do have a classifier built this way from the 2023-07 release, and its taxa-barplot looks very similar to the one from the classifier built on manually loaded sequences and taxonomy (but strange compared to the 2022-10 release). Here is the taxa-barplot visualization.

taxa-barplot_A_2023_rescript.qzv (598.3 KB)

This is true for the pretrained classifiers from Colin's Github page, but not for my own-trained classifiers.
To my knowledge, all these 2022-10 based pretrained classifiers used the developer version, according to the provenance information. But they are quite comparable to classifiers based on the non-developer version (at least using the same ITS region)!

If you could advise me how to load the UNITE v9 2022-10 release via rescript get-unite-data, I would be ready to train a classifier and run the analyses with it, too. Or would you be willing to share such a classifier with me?

Best regards,

I've investigated this a little bit. There are many changing variables, so following engineering principles, I will isolate as many variables as possible to identify the source of difference. Let's see if I can replicate your findings!

First, let me establish priors and compare them with yours:

  • Methods A, B, and C, should be identical, as they all call the same underlying software. Differences between methods reflect different defaults or bugs within these respective pipelines.
  • Different versions of Qiime2 should produce similar or identical results.
  • Different versions of UNITE will produce different results because the database has been updated/improved.

Does this make sense?

I inspected graphs for 2022 vs 2023.
2023 has far more unclassified reads for both A and C.
This is bad. Let's look for a difference.

That's correct; I use the _dev version in the copies I distribute on GitHub.

I checked the provenance to compare input data:

A_2023 does not use the _dev fasta file!

This is not my file!

Not from my GitHub!
(Note that I'm importing using MixedCaseDNA... and using the _dev.fasta file)


Looks like some files got mixed up leading to confusion. This happens to me a lot, which is why I started using electronic lab notebooks like Jupyter and Rmarkdown.

I find these notebooks essential when hunting for inconsistencies and fixing bugs.

:mag: :bug: :notebook:

Hi @colinbrislawn
Thank you very much for your time to look at my data.
Your 'engineer's' approach makes sense, definitely!

You are right, I mixed it up. I could verify it in my electronic notebook, sorry!
I have seen that there is a MixedCaseDNAFASTAFormat in the import plugin; I will re-train my classifiers using this option.

I have already started to re-analyze my data and included published ITS1 data for benchmarking, and I will now retrain my own classifiers like you did (should then be identical to yours, I just want to repeat it once on my side).

I will look at the results and report it here once I am back in the office.
Best regards,

Hi @colinbrislawn
I have worked on my dataset in the past few days, and I would like to share some results with you.

In order to keep it simple, I just used two of your pretrained classifiers of UNITE v9.0:

  1. unite_ver9_dynamic_all_25.07.2023-Q2-2024.2.qza ("1" in the qzv files below)
  2. unite_ver9_dynamic_all_29.11.2022-Q2-2023.5.qza ("2" in the qzv files below)

Then I took two sets of ITS1 sequences:

  1. mock dataset: from 10.1111/1755-0998.12760 (ITS1F-ITS2 primers; subset of the published data with SRA IDs SRR5838503 to SRR5838532, only the R1 reads, no read joining) ("mock" in the qzv files below)
  2. my own data (ITS1F-ITS2 primers) ("leaves" in the qzv files below)

The two classifiers and two datasets finally resulted in four taxa-barplots:

taxa-barplot-1-mock.qzv (387.4 KB)
taxa-barplot-2-mock.qzv (380.5 KB)
taxa-barplot-1-leaves.qzv (550.5 KB)
taxa-barplot-2-leaves.qzv (1.3 MB)

While the taxonomic classification of mock data was quite comparable between the 2023 and 2022 releases of UNITE 9.0, the taxonomic classification of the leaves data was very different between the two UNITE releases.
I am not sure whether this could be due to some differences in the import of taxonomy tables before preparing the classifier. I have realized that the UNITE packages contain headerless and header-containing TSV taxonomy files. Before providing more results with my own classifiers (where I removed the header line before importing with HeaderlessTSVTaxonomyFormat), I would like to know your opinion here.

$ head sh_taxonomy_qiime_ver9_dynamic_all_16.10.2022_dev.txt 
SH0901949.09FU_UDB03390183_reps	k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Chaetothyriales;f__Herpotrichiellaceae;g__unidentified;s__Herpotrichiellaceae_sp

$ head sh_taxonomy_qiime_ver9_dynamic_all_25.07.2023_dev.txt 
Feature ID	Taxon
SH1031489.09FU_UDB01372703_reps	k__Fungi;p__Ascomycota;c__Ascomycota_cls_Incertae_sedis;o__Ascomycota_ord_Incertae_sedis;f__Ascomycota_fam_Incertae_sedis;g__Ascomycota_gen_Incertae_sedis;s__Ascomycota_sp;sh__SH1031489.09FU

If the header line of the taxonomy table is not a problem (e.g. if the corresponding sequences and taxa are matched using the UNITE ID), I was thinking about a biological difference between the mock and leaves data. The mock data were made from 19 different fungal taxa, all of which are known to UNITE, while the leaves data are from endophytic fungi, and I don't know how many of them are actually represented in UNITE. But this would be the next step for me to investigate.

I would very much appreciate your comments (and I tried not to mix up your classifiers :frowning: )
Best regards,

Thank you for sharing these updated results. I can only investigate so much on a volunteer project, but hopefully I can look a little more and get you started.

Mock 1 and Mock 2 look similar except...

  • mock 1 (from 25.07.2023) has many more unassigned reads compared to mock 2 (29.11.2022).
  • I think this is due to differences in the Unite database, not the Qiime2 pipeline, but I have not benchmarked this.

Comparing leaves 1 and 2

  • leaves 1 has FAR more unassigned reads compared to leaves 2
  • This is the exact same pattern as the simpler mock communities, just with a stronger result.
  • Perhaps something is wrong with my file! unite_ver9_dynamic_all_25.07.2023-Q2-2024.2.qza

Unite v10 just came out, if you want to try that: Releases · colinbrislawn/unite-train · GitHub

The variable we need to isolate is database version vs Qiime2 version.

As long as this is a UNITE issue and not a Qiime2 bug or regression, we can simply point out this problem when we tell reviewers why we are not using the newest Unite database.

This is why trying the newest Unite with the same qiime2-2024.2 release is so interesting!

Keep in touch,
Colin

Changing file format, like removing a header, should not change results at all.

If it does, it's a bug.

Proving that results are identical with and without headers might be the perfect place to start! Otherwise this is another variable you have to track during benchmarking.
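One way to check this at the file level, before any training, is to parse both taxonomy files and confirm that the ID-to-taxon mapping is identical. The following is only a minimal Python sketch, assuming a two-column tab-separated file whose optional header row is `Feature ID<TAB>Taxon` (as in the 2023 excerpt quoted earlier). The `read_taxonomy` helper and the toy IDs are made up for illustration, and the check says nothing about the trained classifier itself.

```python
HEADER = ("Feature ID", "Taxon")  # header row seen in the 2023 taxonomy file


def read_taxonomy(text):
    """Parse a two-column taxonomy TSV into {feature_id: taxon}.

    A leading header row, if present, is skipped.
    """
    taxonomy = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if tuple(fields[:2]) == HEADER:
            continue  # skip the optional header
        taxonomy[fields[0]] = fields[1]
    return taxonomy


# Toy data (hypothetical IDs), once with and once without the header row.
with_header = (
    "Feature ID\tTaxon\n"
    "SH_toy_1\tk__Fungi;p__Ascomycota\n"
    "SH_toy_2\tk__Fungi;p__Basidiomycota\n"
)
without_header = (
    "SH_toy_1\tk__Fungi;p__Ascomycota\n"
    "SH_toy_2\tk__Fungi;p__Basidiomycota\n"
)

same = read_taxonomy(with_header) == read_taxonomy(without_header)
print("identical after parsing:", same)
```

If the two mappings match, the header can be ruled out as a variable at the input stage; any remaining difference would have to come from a later step.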

I love to remove confounding variables! :axe: :chipmunk:

Hi @colinbrislawn

Yes, I have been informed by a UNITE team member that v10 qiime2 data files are available.
I already had prepared a classifier this week, but did not include it in my recent post in order to keep things simple.
But here are the two taxa-barplot files from my leaves data using my own trained U10 classifier (id 7) and using your pretrained classifier (id 8).
taxa-barplot-7-leaves.qzv (544.1 KB)
taxa-barplot-8-leaves.qzv (543.4 KB)
I consider both results as identical, showing an extremely high proportion of unassigned reads.

I decided to go back to q2-2022.11, which was the previous qiime2 release that I used until February, 2024. I downloaded your pretrained classifier unite_ver9_dynamic_all_29.11.2022-Q2-2022.11.qza and trained the most recent Unite 10 developer release with q2-2022.11 (did not use your UNITE 10 pretrained one, which was built with q2-amplicon-2024.2).

This barplot used your U9-2022 pretrained classifier:
taxa-barplot-U9-2022_dyn_all_dev_q2-2022.11.qzv (1.3 MB)

This barplot used my new U10-2024 trained classifier by q2-2022.11:
taxa-barplot-U10-2024_dyn_all_dev_q2-2022.11.qzv (2.1 MB)

Wow!!!
Of course, the fine details of the taxonomic classification differ between the UNITE 9 and UNITE 10 releases, but the proportion of unassigned reads is much smaller in the q2-2022.11 based analyses.
Although I had to redo the preparation steps (cutadapt, dada2) in q2-2022.11, the number of raw features after dada2 is virtually identical: 2,961 features.

Do you have a suggestion how to proceed from here? Are there any parameters in the most recent q2-amplicon-2024.2 release that I missed, or should consider?

Best regards,

PS: I was quite surprised by the high ITS1 diversity in my leaves data, but I have to admit that we did not clean the leaf surfaces of the greenhouse-grown trees. A quick inspection of the raw fastq data by kraken2 (using PlusPFP-16 RefSeq indices), however, did confirm that there is considerable sequence diversity in the samples.

Interesting! This looks pretty similar to me, with some unknown phyla rolling up to unknown kingdom. :person_shrugging:

Perhaps we should find a way to put a number on how similar these taxonomy results are. Perhaps 'fraction unassigned at kingdom'... A much more powerful metric is the Nearest Taxon Index (NTI, a.k.a. MNTI, MNTD, BMNTD), but that calculation is more involved and uses a tree.
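The simpler metric, 'fraction unassigned at kingdom', could be computed roughly as follows. This is a sketch under assumptions, not a QIIME 2 API: it takes per-feature read counts and a feature-to-taxon mapping as plain dictionaries, and assumes unclassified features are labelled `Unassigned` or carry an empty `k__` rank. The function name and toy numbers are hypothetical.

```python
def fraction_unassigned(counts, taxonomy):
    """Return the fraction of reads with no kingdom-level assignment.

    counts:   {feature_id: read_count}
    taxonomy: {feature_id: "k__...;p__...;..."} (missing IDs count as unassigned)
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    unassigned = 0
    for feature_id, n_reads in counts.items():
        taxon = taxonomy.get(feature_id, "Unassigned")
        kingdom = taxon.split(";")[0].strip()
        if kingdom in ("Unassigned", "k__", ""):
            unassigned += n_reads
    return unassigned / total


# Toy example: 30 of 100 reads belong to an unassigned feature.
counts = {"f1": 60, "f2": 30, "f3": 10}
taxonomy = {
    "f1": "k__Fungi;p__Ascomycota",
    "f2": "Unassigned",
    "f3": "k__Fungi",
}
print(fraction_unassigned(counts, taxonomy))
```

Computing this one number per classifier would turn "looks pretty similar" into something that can be compared across UNITE releases and QIIME 2 versions.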

I appreciate your willingness to take the time to share your results and talk me through your conclusions. I know database validation is hard and I think you are doing a remarkably good job.

Depends!

  • If you are looking for a challenge, try beta NTI! :brain:
  • The plugin RESCRIPt, to which Mike and I are both contributors, provides elegant ways to parse and filter taxonomy levels in databases and is easy to use :sparkles:
  • I'm convinced by our discussion here, so this may be good enough :+1: :rocket:

Hi @colinbrislawn and @SoilRotifer

Thank you for your comments and for mentioning the nearest taxon index. I will have a look at it, e.g. when comparing with a completely independent taxonomic classification such as kraken2. I have also used q2 rescript evaluate-composition before in order to test how different the classification results are between two classifiers, or even between different ITS regions.

Have you had the chance to look at the q2 version issue? I am now back at the q2-2022.11 release, and I don't know how to solve the classification issue with the most recent q2 release.

A short follow-up question on RESCRIPt: does rescript get-unite-data load the ITSx-trimmed or the developer version of the most recent UNITE releases? There is a recommendation in the q2 docs to use full reference sequences instead of the trimmed version. RESCRIPt is now part of q2-amplicon-2024.2, but I had compatibility issues in the past when using artifacts created with newer q2 versions (I believe from q2-2023.5 on) due to the new provenance architecture. I will probably not be able to use RESCRIPt-created artifacts from q2-amplicon-2024.2 within a q2-2022.11 environment.

Best regards,

RESCRIPt will download the:

This seems to work well for us. I've not been able to reach the UNITE devs for clarification. (If you hear from them please post on the forums!)

I conclude that changes in UNITE cause slight changes in results.

(Did I miss something? To prevent confounding variables, only the version of Qiime2 can change during this test.)

EDIT: Correct, scikit-learn NB classifiers must be trained and used on the same version of Qiime2.
EDIT: The same version of Scikit-learn must be used to classify as was used to train.
Different versions of Qiime2 might have matching Scikit-learn versions, or not!
To ensure compatibility, I train and test with the same version of Qiime2.

However, you can take a dada2-rep-seqs.qza file and assign taxonomy with any version of Qiime2. So you can try out the newest classifier in the newest Qiime on ancient DADA2 data. :hourglass:

I wanted to see if changes to the version of Qiime2 changed the results of the skl classifier. So I tested one database with three versions of Qiime.

Please inspect the included provenance to see if I'm missing anything:

unite_ver9_dynamic_29.11.2022-Q2-2022.11.qzv (334.9 KB)
unite_ver9_dynamic_29.11.2022-Q2-2023.2.qzv (334.9 KB)
unite_ver9_dynamic_29.11.2022-Q2-2023.5.qzv (335.3 KB)

Do you mean the unite version issue?

Hi @colinbrislawn
I have looked at your test results, and these small tests are very similar to each other. This is also what I found earlier in this thread using mock read data from just 19 taxa.

When I use my real dataset, the results are quite different. I am trying to find out why the same classification pipeline, which gives good results on mock data, does not seem to work on real data.

You are perfectly right that I should try to narrow down the differences to single variables. What I have shown in my last post is that the taxonomic results are very comparable between UNITE 9 and UNITE 10 reference classifiers using q2-2022.11:

Now I used only one UNITE 10 reference data (developer, dynamic, all=eukaryotes) and varied the two steps necessary for this analyses:

  • qiime feature-classifier fit-classifier-naive-bayes ('fcnb' using q2-2022.11 or q2-2024.2)
  • qiime feature-classifier classify-sklearn ('cs' using q2-2022.11 or q2-2024.2)

A combination of 2x2 variables gives four taxa-barplot results. First a short overview of the taxa-barplots at level 7; the qzv files follow below:

Here are the taxa-barplot files (sorry for the inconsistent filenames, but the forum system seems to replace the filenames of previously uploaded files with the original name):

(trained q2-2022.11 / classified q2-2022.11)
taxa-barplot_U10_fcnb-q2-2022-11_cs-q2-2022-11.qzv (1.7 MB)

(trained q2-2022.11 / classified q2-2024.2)
taxa-barplot_U10-q2-2022-11_q2-2024-2.qzv (1.7 MB)

(trained q2-2024.2 / classified q2-2022.11)
taxa-barplot_U10_fcnb-q2-2024-2_cs-q2-2022-11.qzv (1.7 MB)

(trained q2-2024.2 / classified q2-2024.2)
taxa-barplot-7-leaves.qzv (544.1 KB)
(I get the same results when using Colin's pretrained U10 classifier, see taxa-barplot-8-leaves.qzv above)

Concerning the re-use of classifiers:

There is a warning while training the classifiers:

UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.24.1. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)

My understanding was that if different qiime2 versions use the same scikit-learn version for training classifiers, a classifier should be usable in all q2 releases containing that scikit-learn version. Both q2-2022.11 and q2-amplicon-2024.2 use scikit-learn version 0.24.1.
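The rule from the warning (same scikit-learn version for training and classifying) can be written down as a small guard. This is a minimal sketch, not part of QIIME 2: the helper names are hypothetical, and the installed version is looked up with the standard-library `importlib.metadata`, which returns nothing useful if scikit-learn is not installed in the active environment.

```python
from importlib import metadata


def installed_sklearn_version():
    """Return the installed scikit-learn version string, or None if absent."""
    try:
        return metadata.version("scikit-learn")
    except metadata.PackageNotFoundError:
        return None


def classifier_is_usable(trained_with, installed):
    """Trust a classifier only when both scikit-learn versions match exactly."""
    return installed is not None and trained_with == installed


trained_with = "0.24.1"  # version quoted in the UserWarning

# Same version in both environments: the classifier can be reused.
print(classifier_is_usable(trained_with, "0.24.1"))

# Check against whatever is installed in the current environment.
print(classifier_is_usable(trained_with, installed_sklearn_version()))
```

Running the second check inside each conda environment before classify-sklearn would make a version mismatch visible up front rather than through unreliable results.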

Edit: Artifact version compatibility:

As I found out here, classifiers trained in q2-amplicon-2024.2 can be used by classify-sklearn in q2-2022.11 :slight_smile: There seems to be no downgrade incompatibility for this step.
I tried another action: I took dada2 tables created in q2-2024.2 and ran a sample filter in q2-2022.11 (q2 feature-table filter-samples), and got the previously reported error:
table-filtered-2.qza was created by 'QIIME 2024.2.0'. The currently installed framework cannot interpret archive version '6'.
So there is some, but not complete, downgrade incompatibility. This is not an issue, I just wanted to clarify my sentence above. :slight_smile:

Sorry for this long post, and thank you for your willingness to provide valuable tips on a voluntary basis! I appreciate this very much!
Best regards,

Thank you for taking the time to explain this. I feel I was rushing.

I also made a mistake by over-extrapolating from results of my mock data. Untested taxa may have different results, which is totally expected!

That's right. I re-release unite-train with each Qiime2 version so it's easy for people, though it's only needed when the skl version changes. I will fix this comment!


That 2x2 blocked design of training and classifying is perfect.
There has got to be a bug.
I will look into this and report back.

Found it:


The input manifest is different for that one graph.

For the 3 'good' files, the input to cutadapt is the same.
For the other one it's different.

  • demultiplexed_sequences:"28e7533a-dbc5-41f3-b493-fd69279ac1ec"
    This ID is consistent in the working files

The broken file was run a day earlier, on April 3rd, 2024.
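This kind of provenance comparison can also be scripted. The sketch below is an assumption-laden illustration, not an official QIIME 2 tool: it treats a .qza/.qzv as a zip archive and assumes ancestor artifacts are recorded as folders under `<root-uuid>/provenance/artifacts/<ancestor-uuid>/`. The toy archives, the `root-a`/`root-b` names, and the second UUID-like label are made up so the example runs on its own.

```python
import os
import tempfile
import zipfile


def ancestor_uuids(archive_path):
    """Collect ancestor UUIDs recorded in an archive's provenance folder."""
    uuids = set()
    with zipfile.ZipFile(archive_path) as zf:
        for name in zf.namelist():
            parts = name.split("/")
            # Assumed layout: <root>/provenance/artifacts/<ancestor-uuid>/...
            if (len(parts) >= 4 and parts[1] == "provenance"
                    and parts[2] == "artifacts" and parts[3]):
                uuids.add(parts[3])
    return uuids


def differing_ancestors(path_a, path_b):
    """Return (only in a, only in b) ancestor UUID sets."""
    a, b = ancestor_uuids(path_a), ancestor_uuids(path_b)
    return a - b, b - a


def write_toy_archive(path, root, ancestors):
    """Build a tiny stand-in archive with the assumed provenance layout."""
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr(f"{root}/metadata.yaml", "toy\n")
        for uuid in ancestors:
            zf.writestr(f"{root}/provenance/artifacts/{uuid}/metadata.yaml", "toy\n")


tmp = tempfile.mkdtemp()
good = os.path.join(tmp, "good.qzv")
bad = os.path.join(tmp, "bad.qzv")
write_toy_archive(good, "root-a", ["28e7533a-dbc5-41f3-b493-fd69279ac1ec"])
write_toy_archive(bad, "root-b", ["hypothetical-other-import"])

only_good, only_bad = differing_ancestors(good, bad)
print("only in good:", sorted(only_good))
print("only in bad:", sorted(only_bad))
```

On real visualizations, a non-empty difference would point at the step where two analyses stopped sharing the same inputs.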

Hi @colinbrislawn
This observation about the initial data import is really surprising, thank you so much for finding it.
The March artifact was created by my student on a different machine (same OS, same q2 version), and she used the SingleEndFastqManifestPhred33V2 format with a two-column tab-separated manifest file.

When we observed the bad classification results, I transferred her dada2 output to my machine and started the different tests, using her pipeline and my own. I was using SingleEndFastqManifestPhred33 with a comma-separated three-column manifest file (I have been used to it for years, but maybe I should change to the newer version).
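To keep both imports on one manifest layout, the legacy file can be converted mechanically. A minimal sketch, assuming the legacy header is `sample-id,absolute-filepath,direction` and the V2 header is `sample-id<TAB>absolute-filepath`; these column names are recalled from the QIIME 2 import documentation and should be checked against the installed release. The sample names and paths are invented.

```python
import csv
import io


def legacy_manifest_to_v2(legacy_csv_text):
    """Convert a legacy single-end CSV manifest to the tab-separated V2 layout.

    Only rows whose direction is 'forward' are kept.
    """
    reader = csv.DictReader(io.StringIO(legacy_csv_text))
    lines = ["sample-id\tabsolute-filepath"]
    for row in reader:
        if row["direction"].strip() != "forward":
            continue
        lines.append(f"{row['sample-id']}\t{row['absolute-filepath']}")
    return "\n".join(lines) + "\n"


# Toy legacy manifest (hypothetical samples and paths).
legacy = (
    "sample-id,absolute-filepath,direction\n"
    "leaf1,/data/leaf1_R1.fastq.gz,forward\n"
    "leaf2,/data/leaf2_R1.fastq.gz,forward\n"
)
v2 = legacy_manifest_to_v2(legacy)
print(v2, end="")
```

With a single manifest source, the import format stops being one of the variables that differ between two people's runs.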

I was not worried by these import differences, as the overlap of the dada2 representative sequences was perfect.

This Venn diagram is based on the md5 hash ids, and I was assuming that the two representative sequences created by my student and me were identical.
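The overlap check behind such a Venn diagram can be reproduced in a few lines. This sketch assumes the feature IDs are MD5 hashes of the representative sequences (the default hashed IDs produced by DADA2 in QIIME 2); the sequences and helper names are toy values. Note that a perfect overlap of IDs shows the sequences are the same, but says nothing about per-sample counts or about which reads went into each run.

```python
import hashlib


def feature_id(sequence):
    """MD5 hex digest of a sequence, used here as its feature ID."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


def compare_runs(seqs_a, seqs_b):
    """Count shared and run-specific feature IDs between two sequence sets."""
    ids_a = {feature_id(s) for s in seqs_a}
    ids_b = {feature_id(s) for s in seqs_b}
    return {
        "shared": len(ids_a & ids_b),
        "only_a": len(ids_a - ids_b),
        "only_b": len(ids_b - ids_a),
    }


# Toy representative sequences from two runs (hypothetical data).
run_student = ["ACGTACGT", "TTGACCA", "GGGCATT"]
run_mine = ["ACGTACGT", "TTGACCA", "GGGCATT"]

overlap = compare_runs(run_student, run_mine)
print(overlap)
```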

But the good news is: I have reanalyzed the dataset in parallel (using the two manifest formats) from the very beginning (import > cutadapt > dada2 > classification), and I do get the good taxonomic classifications.

I do not know what could have caused these problems, and I don't know whether any such problems are known when changing machines, but I can imagine that there are many possibilities.
But let's consider this issue as solved!

Again, many thanks for your great support!
Best regards,
