Hi,
I got unexpected taxonomic assignment results from two classifiers trained on nearly identical sequences. Even I found the way to solve the problem (which is setting the --p-read-orientation same
in q2-feature-classifier classify-sklearn), I really want to know the reason caused the unexpected results.
I used two classifier which were trained using silva 138 and 138.1, respectively.
1.DB-silva138 was downloaded from QIIME2 webiste's data resource (Data resources — QIIME 2 2024.10.1 documentation).
- Silva 138 SSURef NR99 full-length sequences (MD5:
de8886bb2c059b1e8752255d271f3010
) - Silva 138 SSURef NR99 full-length taxonomy (MD5:
f12d5b78bf4b1519721fe52803581c3d
)
After downloaded DB-silva138, I also processed using RESCRIPt. Here is the provenance of DB-silva138 classifier:
qiime rescript get-silva-data \
--p-version 138 \
--p-target SSURef_NR99 \
--p-include-species-labels \
--p-download-sequences \
--o-silva-taxonomy silva-taxonomy-0.qza \
--o-silva-sequences silva-sequences-0.qza
qiime rescript cull-seqs \
--i-sequences silva-sequences-0.qza \
--p-num-degenerates 5 \
--p-homopolymer-length 8 \
--o-clean-sequences clean-sequences-0.qza
qiime rescript filter-seqs-length-by-taxon \
--i-sequences clean-sequences-0.qza \
--i-taxonomy silva-taxonomy-0.qza \
--p-labels Archaea Bacteria Eukaryota \
--p-min-lens 900 1200 1400 \
--o-filtered-seqs filtered-seqs-0.qza \
--o-discarded-seqs XX_discarded_seqs
qiime rescript dereplicate \
--i-sequences filtered-seqs-0.qza \
--i-taxa silva-taxonomy-0.qza \
--p-mode uniq \
--p-perc-identity 1.0 \
--p-threads 1 \
--p-rank-handles silva \
--p-no-derep-prefix \
--o-dereplicated-sequences dereplicated-sequences-0.qza \
--o-dereplicated-taxa dereplicated-taxa-0.qza
qiime rescript cull-seqs \
--i-sequences dereplicated-sequences-0.qza \
--p-num-degenerates 5 \
--p-homopolymer-length 8 \
--p-n-jobs 1 \
--o-clean-sequences clean-sequences-1.qza
qiime rescript filter-seqs-length-by-taxon \
--i-sequences clean-sequences-1.qza \
--i-taxonomy dereplicated-taxa-0.qza \
--p-labels Archaea Bacteria Eukaryota \
--p-min-lens 900 1200 1400 \
--o-filtered-seqs filtered-seqs-1.qza \
--o-discarded-seqs XX_discarded_seqs
qiime rescript dereplicate \
--i-sequences filtered-seqs-1.qza \
--i-taxa dereplicated-taxa-0.qza \
--p-mode uniq \
--p-perc-identity 1.0 \
--p-threads 1 \
--p-rank-handles domain phylum class order family genus species \
--p-no-derep-prefix \
--o-dereplicated-sequences dereplicated-sequences-1.qza \
--o-dereplicated-taxa dereplicated-taxa-1.qza
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads dereplicated-sequences-1.qza \
--i-reference-taxonomy dereplicated-taxa-1.qza \
--p-classify--alpha 0.001 \
--p-classify--chunk-size 20000 \
--p-classify--class-prior null \
--p-no-classify--fit-prior \
--p-no-feat-ext--alternate-sign \
--p-feat-ext--analyzer char_wb \
--p-no-feat-ext--binary \
--p-feat-ext--decode-error strict \
--p-feat-ext--encoding utf-8 \
--p-feat-ext--input content \
--p-feat-ext--lowercase \
--p-feat-ext--n-features 8192 \
--p-feat-ext--ngram-range '[7, 7]' \
--p-feat-ext--norm l2 \
--p-feat-ext--preprocessor null \
--p-feat-ext--stop-words null \
--p-feat-ext--strip-accents null \
--p-feat-ext--token-pattern '(?u)\b\w\w+\b' \
--p-feat-ext--tokenizer null \
--p-no-verbose \
--o-classifier classifier-0.qza
In this provenance, the steps of cull-seqs, filter-seqs-length-by-taxon and dereplicate were performed twice. I believe the first execution was carried out by the QIIME2 team before I downloaded the data.
2.DB-silva138.1 was downloaded followed by RESCRIPt tutorial (Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt), and here is the provenance of DB-silva138.1 classifier
qiime rescript get-silva-data \
--p-version 138.1 \
--p-target SSURef_NR99 \
--p-no-include-species-labels \
--p-rank-propagation \
--p-download-sequences \
--o-silva-sequences silva-sequences-0.qza \
--o-silva-taxonomy silva-taxonomy-0.qza
qiime rescript reverse-transcribe \
--i-rna-sequences silva-sequences-0.qza \
--o-dna-sequences dna-sequences-0.qza
qiime rescript cull-seqs \
--i-sequences dna-sequences-0.qza \
--p-num-degenerates 5 \
--p-homopolymer-length 8 \
--p-n-jobs 1 \
--o-clean-sequences clean-sequences-0.qza
qiime rescript filter-seqs-length-by-taxon \
--i-sequences clean-sequences-0.qza \
--i-taxonomy silva-taxonomy-0.qza \
--p-labels Archaea Bacteria Eukaryota \
--p-min-lens 900 1200 1400 \
--o-filtered-seqs filtered-seqs-0.qza \
--o-discarded-seqs XX_discarded_seqs
qiime rescript dereplicate \
--i-sequences filtered-seqs-0.qza \
--i-taxa silva-taxonomy-0.qza \
--p-mode uniq \
--p-perc-identity 1.0 \
--p-threads 1 \
--p-rank-handles domain phylum class order family genus species \
--p-no-derep-prefix \
--o-dereplicated-taxa dereplicated-taxa-0.qza \
--o-dereplicated-sequences dereplicated-sequences-0.qza
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads dereplicated-sequences-0.qza \
--i-reference-taxonomy dereplicated-taxa-0.qza \
--p-classify--alpha 0.001 \
--p-classify--chunk-size 20000 \
--p-classify--class-prior null \
--p-no-classify--fit-prior \
--p-no-feat-ext--alternate-sign \
--p-feat-ext--analyzer char_wb \
--p-no-feat-ext--binary \
--p-feat-ext--decode-error strict \
--p-feat-ext--encoding utf-8 \
--p-feat-ext--input content \
--p-feat-ext--lowercase \
--p-feat-ext--n-features 8192 \
--p-feat-ext--ngram-range '[7, 7]' \
--p-feat-ext--norm l2 \
--p-feat-ext--preprocessor null \
--p-feat-ext--stop-words null \
--p-feat-ext--strip-accents null \
--p-feat-ext--token-pattern '(?u)\b\w\w+\b' \
--p-feat-ext--tokenizer null \
--p-no-verbose \
--o-classifier classifier-0.qza
3.I used the same ASV representative sequences for taxnomic assignment with two classifier. DB-silva138 classifier returned only archaeal results, whereas the DB-silva138.1 classifier provided 100% of the expected results.
(1) This is the classification results using DB-silva138 classifier
DB-silva138_autoOri.qzv (1.3 MB)
qiime feature-classifier classify-sklearn \
--i-classifier DB-silva138-classifier.qza \
--i-reads rep-seqs_ccs.qza \
--p-n-jobs 50 \
--p-read-orientation 'auto' \
--o-classification DBsilav138_autoOri.qza
(2) This is the classification results using DB-silva138.1 classifier
DB-silva138.1_autoOri.qzv (1.2 MB)
qiime feature-classifier classify-sklearn \
--i-classifier DBsilva138.1-classifier.qza \
--i-reads rep-seqs_ccs_srr9089357.qza \
--p-n-jobs 50 \
--p-read-orientation 'auto' \
--o-classification DBsilva138.1_autoOri.qza
-
I confimed the ASV representative sequences have the same orientation with DB-silva138 and DB-silva138.1 using usearch program. The usearch result confirmed the ASV sequences are " 100 plus (100.0%), 0 minus (0.0%), 0 undet. (0.0%)"
-
The DB-silva138 dataset contains 436,680 sequences, while the DB-silva138.1 dataset contains 435,518 sequences.
I also confirmed that 430,035 sequences (over 98%) overlap between two datasets, and the sequences are identical.
When I set the parameter --p-read-orientation 'same'
used with DB-silva138 classifier, I was able to obtain the same expected result from DB-silva138.1 classifer.
My question is: considering that the two classifiers were trained on nearly identical reference sequences, why do the classification results differ so drastically when --p-read-orientation
is set to 'auto'?
The reason I want to understand this issue is that it directly impacts future analyses, specifically whether I need to adjust the --p-read-orientation
setting during taxonomic assignments.