Unexpected taxonomic assignment results from two classifiers trained on nearly identical sequences

mammerlin · January 8, 2025, 1:51pm

Hi,
I got unexpected taxonomic assignment results from two classifiers trained on nearly identical sequences. Even I found the way to solve the problem (which is setting the --p-read-orientation same in q2-feature-classifier classify-sklearn), I really want to know the reason caused the unexpected results.

I used two classifier which were trained using silva 138 and 138.1, respectively.

1.DB-silva138 was downloaded from QIIME2 webiste's data resource (Data resources — QIIME 2 2024.10.1 documentation).

Silva 138 SSURef NR99 full-length sequences (MD5: de8886bb2c059b1e8752255d271f3010)
Silva 138 SSURef NR99 full-length taxonomy (MD5: f12d5b78bf4b1519721fe52803581c3d)

After downloaded DB-silva138, I also processed using RESCRIPt. Here is the provenance of DB-silva138 classifier:

qiime rescript get-silva-data \
  --p-version 138 \
  --p-target SSURef_NR99 \
  --p-include-species-labels \
  --p-download-sequences \
  --o-silva-taxonomy silva-taxonomy-0.qza \
  --o-silva-sequences silva-sequences-0.qza

qiime rescript cull-seqs \
  --i-sequences silva-sequences-0.qza \
  --p-num-degenerates 5 \
  --p-homopolymer-length 8 \
  --o-clean-sequences clean-sequences-0.qza

qiime rescript filter-seqs-length-by-taxon \
  --i-sequences clean-sequences-0.qza \
  --i-taxonomy silva-taxonomy-0.qza \
  --p-labels Archaea Bacteria Eukaryota \
  --p-min-lens 900 1200 1400 \
  --o-filtered-seqs filtered-seqs-0.qza \
  --o-discarded-seqs XX_discarded_seqs

qiime rescript dereplicate \
  --i-sequences filtered-seqs-0.qza \
  --i-taxa silva-taxonomy-0.qza \
  --p-mode uniq \
  --p-perc-identity 1.0 \
  --p-threads 1 \
  --p-rank-handles silva \
  --p-no-derep-prefix \
  --o-dereplicated-sequences dereplicated-sequences-0.qza \
  --o-dereplicated-taxa dereplicated-taxa-0.qza

qiime rescript cull-seqs \
  --i-sequences dereplicated-sequences-0.qza \
  --p-num-degenerates 5 \
  --p-homopolymer-length 8 \
  --p-n-jobs 1 \
  --o-clean-sequences clean-sequences-1.qza

qiime rescript filter-seqs-length-by-taxon \
  --i-sequences clean-sequences-1.qza \
  --i-taxonomy dereplicated-taxa-0.qza \
  --p-labels Archaea Bacteria Eukaryota \
  --p-min-lens 900 1200 1400 \
  --o-filtered-seqs filtered-seqs-1.qza \
  --o-discarded-seqs XX_discarded_seqs

qiime rescript dereplicate \
  --i-sequences filtered-seqs-1.qza \
  --i-taxa dereplicated-taxa-0.qza \
  --p-mode uniq \
  --p-perc-identity 1.0 \
  --p-threads 1 \
  --p-rank-handles domain phylum class order family genus species \
  --p-no-derep-prefix \
  --o-dereplicated-sequences dereplicated-sequences-1.qza \
  --o-dereplicated-taxa dereplicated-taxa-1.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads dereplicated-sequences-1.qza \
  --i-reference-taxonomy dereplicated-taxa-1.qza \
  --p-classify--alpha 0.001 \
  --p-classify--chunk-size 20000 \
  --p-classify--class-prior null \
  --p-no-classify--fit-prior \
  --p-no-feat-ext--alternate-sign \
  --p-feat-ext--analyzer char_wb \
  --p-no-feat-ext--binary \
  --p-feat-ext--decode-error strict \
  --p-feat-ext--encoding utf-8 \
  --p-feat-ext--input content \
  --p-feat-ext--lowercase \
  --p-feat-ext--n-features 8192 \
  --p-feat-ext--ngram-range '[7, 7]' \
  --p-feat-ext--norm l2 \
  --p-feat-ext--preprocessor null \
  --p-feat-ext--stop-words null \
  --p-feat-ext--strip-accents null \
  --p-feat-ext--token-pattern '(?u)\b\w\w+\b' \
  --p-feat-ext--tokenizer null \
  --p-no-verbose \
  --o-classifier classifier-0.qza

In this provenance, the steps of cull-seqs, filter-seqs-length-by-taxon and dereplicate were performed twice. I believe the first execution was carried out by the QIIME2 team before I downloaded the data.

2.DB-silva138.1 was downloaded followed by RESCRIPt tutorial (Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt), and here is the provenance of DB-silva138.1 classifier

qiime rescript get-silva-data \
  --p-version 138.1 \
  --p-target SSURef_NR99 \
  --p-no-include-species-labels \
  --p-rank-propagation \
  --p-download-sequences \
  --o-silva-sequences silva-sequences-0.qza \
  --o-silva-taxonomy silva-taxonomy-0.qza

qiime rescript reverse-transcribe \
  --i-rna-sequences silva-sequences-0.qza \
  --o-dna-sequences dna-sequences-0.qza

qiime rescript cull-seqs \
  --i-sequences dna-sequences-0.qza \
  --p-num-degenerates 5 \
  --p-homopolymer-length 8 \
  --p-n-jobs 1 \
  --o-clean-sequences clean-sequences-0.qza

qiime rescript filter-seqs-length-by-taxon \
  --i-sequences clean-sequences-0.qza \
  --i-taxonomy silva-taxonomy-0.qza \
  --p-labels Archaea Bacteria Eukaryota \
  --p-min-lens 900 1200 1400 \
  --o-filtered-seqs filtered-seqs-0.qza \
  --o-discarded-seqs XX_discarded_seqs

qiime rescript dereplicate \
  --i-sequences filtered-seqs-0.qza \
  --i-taxa silva-taxonomy-0.qza \
  --p-mode uniq \
  --p-perc-identity 1.0 \
  --p-threads 1 \
  --p-rank-handles domain phylum class order family genus species \
  --p-no-derep-prefix \
  --o-dereplicated-taxa dereplicated-taxa-0.qza \
  --o-dereplicated-sequences dereplicated-sequences-0.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads dereplicated-sequences-0.qza \
  --i-reference-taxonomy dereplicated-taxa-0.qza \
  --p-classify--alpha 0.001 \
  --p-classify--chunk-size 20000 \
  --p-classify--class-prior null \
  --p-no-classify--fit-prior \
  --p-no-feat-ext--alternate-sign \
  --p-feat-ext--analyzer char_wb \
  --p-no-feat-ext--binary \
  --p-feat-ext--decode-error strict \
  --p-feat-ext--encoding utf-8 \
  --p-feat-ext--input content \
  --p-feat-ext--lowercase \
  --p-feat-ext--n-features 8192 \
  --p-feat-ext--ngram-range '[7, 7]' \
  --p-feat-ext--norm l2 \
  --p-feat-ext--preprocessor null \
  --p-feat-ext--stop-words null \
  --p-feat-ext--strip-accents null \
  --p-feat-ext--token-pattern '(?u)\b\w\w+\b' \
  --p-feat-ext--tokenizer null \
  --p-no-verbose \
  --o-classifier classifier-0.qza

3.I used the same ASV representative sequences for taxnomic assignment with two classifier. DB-silva138 classifier returned only archaeal results, whereas the DB-silva138.1 classifier provided 100% of the expected results.

(1) This is the classification results using DB-silva138 classifier
DB-silva138_autoOri.qzv (1.3 MB)

qiime feature-classifier classify-sklearn \
--i-classifier DB-silva138-classifier.qza \
--i-reads rep-seqs_ccs.qza \
--p-n-jobs 50 \
--p-read-orientation 'auto' \
--o-classification DBsilav138_autoOri.qza

(2) This is the classification results using DB-silva138.1 classifier
DB-silva138.1_autoOri.qzv (1.2 MB)

qiime feature-classifier classify-sklearn \
--i-classifier DBsilva138.1-classifier.qza \
--i-reads rep-seqs_ccs_srr9089357.qza \
--p-n-jobs 50 \
--p-read-orientation 'auto' \
--o-classification DBsilva138.1_autoOri.qza

I confimed the ASV representative sequences have the same orientation with DB-silva138 and DB-silva138.1 using usearch program. The usearch result confirmed the ASV sequences are " 100 plus (100.0%), 0 minus (0.0%), 0 undet. (0.0%)"
The DB-silva138 dataset contains 436,680 sequences, while the DB-silva138.1 dataset contains 435,518 sequences.
I also confirmed that 430,035 sequences (over 98%) overlap between two datasets, and the sequences are identical.

When I set the parameter --p-read-orientation 'same' used with DB-silva138 classifier, I was able to obtain the same expected result from DB-silva138.1 classifer.

My question is: considering that the two classifiers were trained on nearly identical reference sequences, why do the classification results differ so drastically when --p-read-orientation is set to 'auto'?

The reason I want to understand this issue is that it directly impacts future analyses, specifically whether I need to adjust the --p-read-orientation setting during taxonomic assignments.

SoilRotifer · January 8, 2025, 2:48pm

Hi @mammerlin,

The following forum threads should address why you are observing this outcome:

If you know that your reads are in the same orientation as the classifier, then I agree, it is best to simply set --p-read-orientation 'same'.

-Cheers!

mammerlin · January 8, 2025, 4:49pm

Hi @SoilRotifer
Thank you for your response.

I did search the forum for similar topics, which helped me resolve the issue by setting --p-read-orientation 'same'. However, this still does not address my main question.

Unless it is necessary to check the orientation of ASV sequences every time before performing taxonomic assignment—which I don’t believe is a standard or general procedure.

As I mentioned, the two classifiers were trained on nearly identical reference datasets. Why does one classifier (DB-silva138.1) provide the expected results with the --p-read-orientation 'auto' parameter, while the other does not?

Additionally, I have confirmed that both datasets contain approximately 5% of sequences with reverse strands. I must emphasize that this applies to both datasets.

SoilRotifer · January 8, 2025, 6:03pm

As highlighted in the threads I linked above the answer is:

From the first post:

The default for this is --p-read-orientation auto . For sklearn to work properly the reads must be in the same orientation as the reference database / classifier. Sometimes bad sequences can trick the orientation detector to map the reads as the reverse compliment, thus returning poor taxonomic assignment.

and from the second post:

The classifier determines the orientation of reads (relative to the reference database) by polling the first 100 or so seqs.

Given what is mentioned in the second post is likely why there are differences when using auto. Note "bad" can also simply mean that the sequences (which might actually be good), but are flipped, tricking the orientation detector.

Does this help?

Many users have information on how their sequence data are generated from the sequencing facility, and know what to provide for --p-read-orientation. Most of the time same and auto work. But sadly... as you can se, not always.

Not all sequencing facilities follow the same sequencing procedure. Some may provide the R1 and R2 reads flipped, or mixed. Though this is not as common, but it does occur. This is why we provide the --p-read-orientation option, so users can force the read direction that works best given their data. Though it is more of an issue when reads are randomly flipped in the output. In which case, users may need to other tools to orient their reads prior analysis.

mammerlin · January 9, 2025, 2:33am

Hi @SoilRotifer
Thank you for your kind response. However, the two postes did not answer my question.

Let me clarify my question again:
I performed qiime feature-classifier classify-sklearn on the same ASV set using two classifiers. The ASV sequences have the same orientation with respect to both classifiers.

I have already confirmed that the two reference datasets used to train the classifiers contain almost identical sequences (>98%).
I also verified that both reference datasets used to train the classifiers contain approximately 5% reverse-strand sequences.

Do the two reference datasets have mixed-orientation sequences? Yes, both of them do.

Given that the two classifiers are trained on nearly identical datasets, why do they fail to produce the same results?

SoilRotifer · January 9, 2025, 3:39am

Hi @mammerlin,

I re-read your post again, and I think I quite forgot about the part comparing the 138.1 and 138 reverence databases. I apologize for that, not sure how I missed it!

So to make sure I am following your question I'll rephrase for my understanding...

You appear to be getting reasonable results with SILVA 138.1 when using --p-read-orientation 'auto'. But you only obtain equivalent results with SILVA 138 when using --p-read-orientation 'same', and not auto Thus, you are asking why the difference in classification results on the same ASVs given the impact of --p-read-orientation?

Looking at the provenance graphs for the two ...autoOri.qzv files make me think something is wrong. They do not look quite right, and are quite different from one another in terms of the portion of the graph that describes database generation, they should be the same between the two. So you are comparing apples to oranges here. Given your provenance the reference databases have been processed differently, (they do not appear to generally match the files on the data resources page) and with different versions of QIIME 2 throughout the provenance. One starts out at 2020.2 and the other starts with 2024.5. Which is odd as RESCRIPt was not 'officially' around until 2020.6 for making the data files (though I think it was possible to install RESCRIPt into the older versions of QIIME 2 at the time). Anyway, that is neither here nor there...

I'd remake the 138 and 138.1 databases using the same exact approach, using with the latest version of QIIME 2 (2024.10), and see if you observe the same output. We'll be able to confirm this as all of the RESCRIPt generated reference database information will be copied to your analysis provenance. So, the recorded steps should identical for that portion of the graph.

I'd like to investigate this further. Would you be willing to DM me links to the ASVs you're trying to classify? It's difficult to help without looking at the data and process it from our end.

mammerlin · January 9, 2025, 4:02am

Hi @SoilRotifer ,
I really appreciate your patience.

Yes, this is exactly what I am asking.

I will DM you the ASVs I used. Thank you very much.

SoilRotifer · January 9, 2025, 11:41pm

Hi @mammerlin,

Thank you for sharing the ASV files with me. From what I can tell, there was in issue with the classifiers you were using. I remade three classifiers using qiime2-amplicon-2024.10 for SILVA 138, 138.1, and 138.2.

Below are some example commands I ran to classify your ASVs for each SILVA classifier:

qiime feature-classifier classify-sklearn \
    --i-classifier q2-2024.10-silva-138.2/silva_full_classifier.qza \
    --i-reads b54f5f9bba993d1b9cc0548f3d7378e737c046ff.qza \
   --p-read-orientation 'same' \
    --o-classification silva_138.2_classified_same.qza

qiime metadata tabulate \
    --m-input-file silva_138.2_classified_same.qza \
    --o-visualization silva_138.2_classified_same.qzv

Here are the taxonomy visualizations using the default --p-read-orientation 'auto':
silva_138_classified.qzv (1.2 MB)
silva_138.1_classified.qzv (1.2 MB)
silva_138.2_classified.qzv (1.2 MB)

Here are the taxonomy visualizations using the default --p-read-orientation 'same':
silva_138_classified_same.qzv (1.2 MB)
silva_138.1_classified_same.qzv (1.2 MB)
silva_138.2_classified_same.qzv (1.2 MB)

As you can see everything returns identical taxonomic assignments except for minor taxonomic updates between versions. I did not have time to run through all the curation steps to while making the SILVA classifiers... so I just used RESCRIPt to import an dereplicate the full NR99 SILVA database prior to training the classifier. Then used the commands above to tabulate the taxonomic assignments.

As you can see the provenance for the SILVA reference database is identical across all the classification runs except for the version pulled and the --p-read-orientation setting.

So, I think these files show that the inconstant reference database curation was the problem. If you remake the SILVA database from scratch, however you'd like to curate it, and then use that for your analysis, should fix the apparent --p-read-orientation issue .

Let us know if this was helpful. Also, keep us posted as to what you find.

-Mike

mammerlin · January 10, 2025, 8:27am

@SoilRotifer
Thank you for confirming.
In my initial analysis, I used SILVA138, which was downloaded from Data resources — QIIME 2 2024.10.1 documentation, and then repeated the steps qiime rescript cull-seqs, qiime rescript filter-seqs-length-by-taxon, and qiime rescript dereplicate. However, these three steps had already been performed once in the pre-formatted SILVA138. I repeated them again.

Could it be possible that the repeated processing influenced something, even though these processes did not affect the sequence numbers themselves?

Currently, I plan to train the SILVA138 classifier using the pre-formatted SILVA 138 dataset downloaded from Data resources — QIIME 2 2024.10.1 documentation, followed by training the classifier, which I believe is the method most people would use.

Afterward, I will use this classifier to perform taxonomic classification on my ASVs.

I will keep providing updates on the results.

SoilRotifer · January 10, 2025, 1:58pm

Hi @mammerlin,

I've not tested this out myself, but I think this may be the issue. All I can determine is that the provenance information of the files from the Data Resources page do not match the provenance in the files you shared. But yes, this would be my guess too.

We provide these files as a convenience so people can more easily get started with there analysis. Quite a few users do make use of RESCRIPt to curate the SILVA and reference database files. There is no one right way to curate these files, so feel free to skip or re-arrange some steps as needed. The key is to make sure that the same classifier is used when comparing multiple data sets.

Please do! Good luck with your analyses!

mammerlin · January 14, 2025, 3:02am

Hi @SoilRotifer
I trained the classifier using the pre-formatted SILVA138 dataset downloaded directly from the Data resources — QIIME 2 2024.10.1 documentation.

You can download the trained classifier from the following link:
Download SILVA-138-99-tax-classifier.qza.

However, when using this classifier, I still obtained unexpected taxonomy results with the parameter --p-read-orientation 'auto'. Here is the classification result:
CCS-silva-138-99-tax.qzv (1.2 MB)

Is it possible that there is an issue with the pre-formatted SILVA138 dataset from the website, and could it need replacement?

Nicholas_Bokulich · January 14, 2025, 3:50pm

Hi @mammerlin ,

Just to jump in with my two cents:

No, this does not suggest any issue with the pre-formatted SILVA dataset, per se.

On the one hand, this is why the read-orientation parameter exists — the auto-detect method is not perfect and can mis-interpret, especially if the input data or reference data contain any noisy reads. If you know the typical read orientation of your reads, then you can just set that orientation with each run.

On the other hand, no reference database is perfect, but how much curation is required depends on the needs of individual researchers. SILVA does contain some noisy sequences, and the pre-formatted database provided on the QIIME 2 website has only undergone some basic filtering — it's up to researchers to decide on the level and type of pre-processing that is necessary before using a database. This is why we have provided various curation tools in RESCRIPt, so that you can do further cleaning.

The version provided works for most users without issue, but it seems like your data + that version of the database is causing the orientation detector to misfire. This is unfortunate but does not necessarily imply that there is an issue with the database itself, as this is an isolated case.

I also see another point here that could explain why you are seeing some unexpected results: it looks like you are using PacBio CCS data. For full-length 16S? SILVA does contain some fragmentary 16S reference sequences (one reason why we recommend filtering by length in that tutorial) and depending on the primer pair that you are using, portions of your read might even be longer than some reference sequences. This would lead to detection of kmers that might only be present in some reference sequences but not others, which might lead to the confused orientation. You could run extract-reads as in the tutorial (or the similar action in RESCRIPt) but with your primer pair and check how many reference sequences remain...

I hope that helps!

SoilRotifer · January 14, 2025, 8:54pm

Hi @mammerlin,

I just wanted to let you know that that a few of us are having an offline discussion about this. We'll keep you posted.

-Mike

mammerlin · January 15, 2025, 7:30am

Hi @SoilRotifer and @Nicholas_Bokulich,
Thank you for your response and patience.
I look forward to hearing further updates.

SoilRotifer · January 15, 2025, 3:27pm

HI @mammerlin,

I did get the same taxonomic classifications as you did using the older 138 SILVA database. But when I used SILVA v138.2, processed similarly as the 138 (i.e. cull-seqs, filter-seqs-length-by-taxon, then dereplicate), I obtained valid hits, as before.

CCS-silva-138.2-classified.qzv (1.3 MB)

Note: there is only one minor difference within the provenance between how I processed and what is provided on the data resources page: I manually ran reverse-transcribe after pulling the database and then ran cull-seqs... which I did not need to do as cull-seqs will do the reverse transcribing for you. But I was trying out a few other things at the time. Other than that the curation is identical.

I also trimmed your reads like so for the 138 and 138.2:

qiime feature-classifier extract-reads \
    --i-sequences CCS.qza \
    --p-f-primer CCTACGGGNGGCWGCAG \
    --p-r-primer CGGTGTGTACAAGGCCCGGGAACG \
    --o-reads CCS-V3V8-trimmed.qza

and classified them:

qiime feature-classifier classify-sklearn \
    --i-classifier silva-138.2-classifier.qza \
    --i-reads CCS-V3V8-trimmed.qza  \
    --o-classification CCS-V3V8-trimmed-classified.qza

CCS-V3V8-trimmed-classified.qzv (1.3 MB)

Again, for the 138.0 db the classification was erroneous. But for 138.2 the classification seemed sane. This tells me that it is a combination of what reference reads are available within the reference database, used to assess orientation, and how these reference reads are curated are contributing to what we see.

I find it interesting that skipping the cull-seqs, filter-seqs-length-by-taxon steps seemed to result in a good classification for 138.0, but when applying them they result in erroneous classifications with respect to your PacBio data. I've not observed this before, but then again every data set is different. This does not appear to be the case for 138.2 ( I assume 138.1 will be identical as the only thing different between 138.1 and 138.2 is the taxon labels).

This is why we stress that users should be cognizant of their data and what form it is in. For example... knowing the read direction of your data and that of the reference database, data type (i.e. illumina, pacbio, etc...), among other things. As implied in my observations above database curation is sometimes not trivial. As with any analysis, any decisions made during the curation process can alter taxonomic classification results.

Again, if you know the data being generated are going to be in the 5'-3' direction then it is good practice to be explicit and set --p-read-orientation same, from a reproducibility standpoint anyway. I often do this myself.

Otherwise, if your PacBio output is generating reads in mixed orientation, you can try running rescript orient-seqs on your ASVs. Then you can set --p-read-orientation same during classification. This should help your reads being oriented to the reference database, and you should be good to go. But again, not everything is perfect, and there may be orientation issues there too. I do not have much experience with PacBio data, so you may need to experiment a little.

My recommendation would be to keep things simple, use the latest SILVA 138.2 database, and set --p-read-orientation same.

mammerlin · January 16, 2025, 4:12am

Hi @SoilRotifer,
I truly appreciate the detailed and thorough evaluation you provided.

I couldn’t agree with you more—this would definitely be my first choice for my other analyses.

system · February 16, 2025, 10:12am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.