Hi!
I have built a rbcl classfier just like the instructions in the tutorial from Mike Robeson (Using RESCRIPt's 'extract-seq-segments' to extract reference sequences without PCR primer pairs. - #11 by John).
My code:
conda activate qiime2-amplicon-2024.2
qiime rescript get-ncbi-data \
--p-query "txid2836[ORGN] AND (rbcl[All Fields] OR rubisco[All Fields] OR ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit[All Fields]) NOT environmental sample[Filter] NOT environmental samples[Filter] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]" \
--p-ranks kingdom phylum class order family genus species \
--p-rank-propagation \
--p-n-jobs 5 \
--o-sequences NCBI-diatoms-rbcL-ref-seqs.qza \
--o-taxonomy NCBI-diatoms-rbcL-ref-tax.qza \
--verbose
# dereplicate sequences
qiime rescript dereplicate \
--i-sequences NCBI-diatoms-rbcL-ref-seqs.qza \
--i-taxa NCBI-diatoms-rbcL-ref-tax.qza \
--p-mode 'uniq' \
--p-threads 8 \
--o-dereplicated-sequences NCBI-diatoms-rbcL-ref-seqs-derep.qza \
--o-dereplicated-taxa NCBI-diatoms-rbcL-ref-tax-derep.qza
# visualise DB without any changes
qiime feature-table tabulate-seqs \
--i-data NCBI-diatoms-rbcL-ref-seqs.qza \
--o-visualization NCBI-diatoms-rbcL-ref-seqs.qzv
# visualise the DB after derep
qiime feature-table tabulate-seqs \
--i-data NCBI-diatoms-rbcL-ref-seqs-derep.qza \
--o-visualization NCBI-diatoms-rbcL-ref-seqs-derep.qzv
# making initial reference sequence pool (reference sequences of the exact amplicon to be blasted against the rest)
# rev primer form as for PCR and sequencing
qiime feature-classifier extract-reads \
--i-sequences NCBI-diatoms-rbcL-ref-seqs-derep.qza \
--p-f-primer AGGWGAAGYWAAAGGTTCWTAYTTAAA \
--p-r-primer CCTTCTAAYTTACCWACWACWG \
--p-n-jobs 8 \
--o-reads NCBI-diatoms-rbcL-ref-primer-segments.qza
qiime feature-table tabulate-seqs \
--i-data NCBI-diatoms-rbcL-ref-primer-segments.qza \
--o-visualization NCBI-diatoms-rbcL-ref-primer-segments.qzv
# cleaning the initial reference sequence pool
qiime rescript dereplicate \
--i-sequences NCBI-diatoms-rbcL-ref-primer-segments.qza \
--i-taxa NCBI-diatoms-rbcL-ref-tax-derep.qza \
--p-mode 'uniq' \
--p-threads 8 \
--o-dereplicated-sequences NCBI-diatoms-rbcL-ref-primer-segments-derep.qza \
--o-dereplicated-taxa NCBI-diatoms-rbcL-ref-tax-primer-segments-derep.qza
qiime rescript cull-seqs \
--i-sequences NCBI-diatoms-rbcL-ref-primer-segments-derep.qza \
--p-n-jobs 8 \
--p-num-degenerates 1 \
--p-homopolymer-length 8 \
--o-clean-sequences NCBI-diatoms-rbcL-ref-primer-segments-derep-cull.qza
qiime feature-table tabulate-seqs \
--i-data NCBI-diatoms-rbcL-ref-primer-segments-derep-cull.qza \
--o-visualization NCBI-diatoms-rbcL-ref-primer-segments-derep-cull.qzv
# First iteration of query sequence segment extraction
qiime rescript extract-seq-segments \
--i-input-sequences NCBI-diatoms-rbcL-ref-seqs-derep.qza \
--i-reference-segment-sequences NCBI-diatoms-rbcL-ref-primer-segments-derep-cull.qza \
--p-perc-identity 0.7 \
--p-min-seq-len 10 \
--p-threads 8 \
--o-extracted-sequence-segments NCBI-diatoms-rbcL-extracted-sequences-01.qza \
--o-unmatched-sequences NCBI-diatoms-rbcL-unmatched-sequences-01.qza \
--verbose
#cleaning the first iteration
qiime rescript dereplicate \
--i-sequences NCBI-diatoms-rbcL-extracted-sequences-01.qza \
--i-taxa NCBI-diatoms-rbcL-ref-tax-derep.qza \
--p-mode 'uniq' \
--p-threads 8 \
--o-dereplicated-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-01.qza \
--o-dereplicated-taxa NCBI-diatoms-rbcL-extracted-tax-segments-derep-01.qza
qiime rescript cull-seqs \
--i-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-01.qza \
--p-n-jobs 8 \
--p-num-degenerates 1 \
--p-homopolymer-length 8 \
--o-clean-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-01.qza
qiime feature-table tabulate-seqs \
--i-data NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-01.qza \
--o-visualization NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-01.qzv
# Second iteration
qiime rescript extract-seq-segments \
--i-input-sequences NCBI-diatoms-rbcL-ref-seqs-derep.qza \
--i-reference-segment-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-01.qza \
--p-perc-identity 0.7 \
--p-min-seq-len 10 \
--p-threads 8 \
--o-extracted-sequence-segments NCBI-diatoms-rbcL-extracted-seq-segments-02.qza \
--o-unmatched-sequences NCBI-diatoms-rbcL-unmatched-sequences-02.qza \
--verbose
# cleaning the second iteration
qiime rescript dereplicate \
--i-sequences NCBI-diatoms-rbcL-extracted-seq-segments-02.qza \
--i-taxa NCBI-diatoms-rbcL-ref-tax-derep.qza \
--p-mode 'uniq' \
--p-threads 8 \
--o-dereplicated-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-02.qza \
--o-dereplicated-taxa NCBI-diatoms-rbcL-extracted-tax-segments-derep-02.qza
qiime rescript cull-seqs \
--i-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-02.qza \
--p-n-jobs 8 \
--p-num-degenerates 1 \
--p-homopolymer-length 8 \
--o-clean-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-02.qza
qiime feature-table tabulate-seqs \
--i-data NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-02.qza \
--o-visualization NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-02.qzv
# filter by length after being done with matching
qiime rescript filter-seqs-length \
--i-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-02.qza \
--p-global-min 189 \
--p-global-max 264 \
--o-filtered-seqs NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-keep-02.qza \
--o-discarded-seqs NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-discard-02.qza
qiime rescript filter-taxa \
--i-taxonomy NCBI-diatoms-rbcL-ref-tax-derep.qza \
--m-ids-to-keep-file NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-keep-02.qza \
--o-filtered-taxonomy NCBI-diatoms-rbcL-extracted-tax-segments-derep-cull-keep-02.qza
# look
qiime metadata tabulate \
--m-input-file NCBI-diatoms-rbcL-extracted-tax-segments-derep-cull-keep-02.qza \
--o-visualization NCBI-diatoms-rbcL-extracted-tax-segments-derep-cull-keep-02.qzv
qiime rescript evaluate-taxonomy \
--i-taxonomies NCBI-diatoms-rbcL-extracted-tax-segments-derep-cull-keep-02.qza \
--o-taxonomy-stats NCBI-diatoms-rbcL-extracted-tax-segments-derep-cull-keep-eval-02.qzv
qiime rescript evaluate-seqs \
--i-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-keep-02.qza \
--p-kmer-lengths 16 8 4 2 \
--o-visualization NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-keep-eval-02.qzv
# build a classifier
qiime rescript evaluate-fit-classifier \
--i-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-keep-02.qza \
--i-taxonomy NCBI-diatoms-rbcL-extracted-tax-segments-derep-cull-keep-02.qza \
--p-n-jobs 2 \
--o-classifier NCBI-diatoms-rbcL-refseqs-classifier.qza \
--o-evaluation NCBI-diatoms-rbcL-refseqs-classifier-evaluation.qzv \
--o-observed-taxonomy NCBI-diatoms-rbcL-refseqs-predicted-taxonomy.qza
#vizualizing the taxonomy
qiime metadata tabulate \
--m-input-file NCBI-diatoms-rbcL-refseqs-predicted-taxonomy.qza \
--o-visualization NCBI-diatoms-rbcL-refseqs-predicted-taxonomy.qzv
qiime rescript evaluate-taxonomy \
--i-taxonomies NCBI-diatoms-rbcL-extracted-tax-segments-derep-cull-keep-02.qza NCBI-diatoms-rbcL-refseqs-predicted-taxonomy.qza \
--p-labels ref-taxonomy predicted-taxonomy \
--o-taxonomy-stats NCBI-diatoms-rbcL-ref-taxonomy-evaluation.qzv
I wonder why the classifier is misclassifying one sequence that is part of its reference pool. The sequence in question is KP757863.1. It is part of the last reference sequence pool used to build the classifier (NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-keep-02.qza). It belongs to P. linea, while the classifier classifies it as P. americana with a 0.75 confidence.
I have the same sequence in my samples and the same wrong classification occurs.
Do you have any recommendations for changing the classifier parameters to get a better classification?
Thank you in advance!
Blanka