Issue with qiime rescript evaluate-fit-classifier - wrong classification

Blanka_Roje · July 23, 2024, 3:55pm

Hi!

I have built a rbcl classfier just like the instructions in the tutorial from Mike Robeson (Using RESCRIPt's 'extract-seq-segments' to extract reference sequences without PCR primer pairs. - #11 by John).

My code:

conda activate qiime2-amplicon-2024.2

qiime rescript get-ncbi-data \
    --p-query "txid2836[ORGN] AND (rbcl[All Fields] OR rubisco[All Fields] OR ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit[All Fields]) NOT environmental sample[Filter] NOT environmental samples[Filter] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]" \
    --p-ranks  kingdom phylum class order family genus species \
    --p-rank-propagation \
    --p-n-jobs 5 \
    --o-sequences NCBI-diatoms-rbcL-ref-seqs.qza \
    --o-taxonomy NCBI-diatoms-rbcL-ref-tax.qza \
    --verbose

# dereplicate sequences
qiime rescript dereplicate \
    --i-sequences NCBI-diatoms-rbcL-ref-seqs.qza \
    --i-taxa NCBI-diatoms-rbcL-ref-tax.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences NCBI-diatoms-rbcL-ref-seqs-derep.qza \
    --o-dereplicated-taxa NCBI-diatoms-rbcL-ref-tax-derep.qza

# visualise DB without any changes

    qiime feature-table tabulate-seqs \
        --i-data NCBI-diatoms-rbcL-ref-seqs.qza \
        --o-visualization NCBI-diatoms-rbcL-ref-seqs.qzv

# visualise the DB after derep

    qiime feature-table tabulate-seqs \
    --i-data NCBI-diatoms-rbcL-ref-seqs-derep.qza \
    --o-visualization NCBI-diatoms-rbcL-ref-seqs-derep.qzv


# making initial reference sequence pool (reference sequences of the exact amplicon to be blasted against the rest)
# rev primer form as for PCR and sequencing 
qiime feature-classifier extract-reads \
   --i-sequences NCBI-diatoms-rbcL-ref-seqs-derep.qza \
   --p-f-primer  AGGWGAAGYWAAAGGTTCWTAYTTAAA \
   --p-r-primer  CCTTCTAAYTTACCWACWACWG \
   --p-n-jobs 8 \
   --o-reads NCBI-diatoms-rbcL-ref-primer-segments.qza

qiime feature-table tabulate-seqs \
    --i-data NCBI-diatoms-rbcL-ref-primer-segments.qza \
    --o-visualization NCBI-diatoms-rbcL-ref-primer-segments.qzv


# cleaning the initial reference sequence pool

qiime rescript dereplicate \
    --i-sequences NCBI-diatoms-rbcL-ref-primer-segments.qza \
    --i-taxa NCBI-diatoms-rbcL-ref-tax-derep.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences NCBI-diatoms-rbcL-ref-primer-segments-derep.qza \
    --o-dereplicated-taxa NCBI-diatoms-rbcL-ref-tax-primer-segments-derep.qza
    
qiime rescript cull-seqs \
    --i-sequences NCBI-diatoms-rbcL-ref-primer-segments-derep.qza  \
    --p-n-jobs 8 \
    --p-num-degenerates 1 \
    --p-homopolymer-length 8 \
    --o-clean-sequences NCBI-diatoms-rbcL-ref-primer-segments-derep-cull.qza

qiime feature-table tabulate-seqs \
  --i-data NCBI-diatoms-rbcL-ref-primer-segments-derep-cull.qza \
  --o-visualization NCBI-diatoms-rbcL-ref-primer-segments-derep-cull.qzv


# First iteration of query sequence segment extraction

    qiime rescript extract-seq-segments \
        --i-input-sequences NCBI-diatoms-rbcL-ref-seqs-derep.qza \
        --i-reference-segment-sequences NCBI-diatoms-rbcL-ref-primer-segments-derep-cull.qza \
        --p-perc-identity 0.7 \
        --p-min-seq-len 10 \
        --p-threads 8 \
        --o-extracted-sequence-segments NCBI-diatoms-rbcL-extracted-sequences-01.qza \
        --o-unmatched-sequences NCBI-diatoms-rbcL-unmatched-sequences-01.qza \
        --verbose

#cleaning the first iteration

    qiime rescript dereplicate \
        --i-sequences NCBI-diatoms-rbcL-extracted-sequences-01.qza \
        --i-taxa NCBI-diatoms-rbcL-ref-tax-derep.qza \
        --p-mode 'uniq' \
        --p-threads 8 \
        --o-dereplicated-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-01.qza \
        --o-dereplicated-taxa NCBI-diatoms-rbcL-extracted-tax-segments-derep-01.qza

    qiime rescript cull-seqs \
        --i-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-01.qza \
        --p-n-jobs 8 \
        --p-num-degenerates 1 \
        --p-homopolymer-length 8 \
        --o-clean-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-01.qza

    qiime feature-table tabulate-seqs \
    --i-data NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-01.qza \
    --o-visualization NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-01.qzv


# Second iteration

    qiime rescript extract-seq-segments \
        --i-input-sequences NCBI-diatoms-rbcL-ref-seqs-derep.qza \
        --i-reference-segment-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-01.qza \
        --p-perc-identity 0.7 \
        --p-min-seq-len 10 \
        --p-threads 8 \
        --o-extracted-sequence-segments NCBI-diatoms-rbcL-extracted-seq-segments-02.qza \
        --o-unmatched-sequences NCBI-diatoms-rbcL-unmatched-sequences-02.qza \
        --verbose

# cleaning the second iteration

    qiime rescript dereplicate \
        --i-sequences NCBI-diatoms-rbcL-extracted-seq-segments-02.qza \
        --i-taxa NCBI-diatoms-rbcL-ref-tax-derep.qza \
        --p-mode 'uniq' \
        --p-threads 8 \
        --o-dereplicated-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-02.qza \
        --o-dereplicated-taxa NCBI-diatoms-rbcL-extracted-tax-segments-derep-02.qza

    qiime rescript cull-seqs \
        --i-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-02.qza \
        --p-n-jobs 8 \
        --p-num-degenerates 1 \
        --p-homopolymer-length 8 \
        --o-clean-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-02.qza

    qiime feature-table tabulate-seqs \
    --i-data NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-02.qza \
    --o-visualization NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-02.qzv


# filter by length after being done with matching


qiime rescript filter-seqs-length \
    --i-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-02.qza \
    --p-global-min 189 \
    --p-global-max 264 \
    --o-filtered-seqs NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-keep-02.qza \
    --o-discarded-seqs NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-discard-02.qza

qiime rescript filter-taxa \
    --i-taxonomy NCBI-diatoms-rbcL-ref-tax-derep.qza \
    --m-ids-to-keep-file NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-keep-02.qza \
    --o-filtered-taxonomy NCBI-diatoms-rbcL-extracted-tax-segments-derep-cull-keep-02.qza

# look

qiime metadata tabulate \
    --m-input-file NCBI-diatoms-rbcL-extracted-tax-segments-derep-cull-keep-02.qza \
    --o-visualization NCBI-diatoms-rbcL-extracted-tax-segments-derep-cull-keep-02.qzv
    
qiime rescript evaluate-taxonomy \
    --i-taxonomies NCBI-diatoms-rbcL-extracted-tax-segments-derep-cull-keep-02.qza \
    --o-taxonomy-stats NCBI-diatoms-rbcL-extracted-tax-segments-derep-cull-keep-eval-02.qzv

qiime rescript evaluate-seqs \
    --i-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-keep-02.qza \
    --p-kmer-lengths 16 8 4 2 \
    --o-visualization NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-keep-eval-02.qzv


# build a classifier

qiime rescript evaluate-fit-classifier \
    --i-sequences NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-keep-02.qza  \
    --i-taxonomy NCBI-diatoms-rbcL-extracted-tax-segments-derep-cull-keep-02.qza \
    --p-n-jobs 2 \
    --o-classifier NCBI-diatoms-rbcL-refseqs-classifier.qza \
    --o-evaluation NCBI-diatoms-rbcL-refseqs-classifier-evaluation.qzv \
    --o-observed-taxonomy NCBI-diatoms-rbcL-refseqs-predicted-taxonomy.qza


#vizualizing the taxonomy
qiime metadata tabulate \
  --m-input-file NCBI-diatoms-rbcL-refseqs-predicted-taxonomy.qza \
  --o-visualization NCBI-diatoms-rbcL-refseqs-predicted-taxonomy.qzv


qiime rescript evaluate-taxonomy \
  --i-taxonomies NCBI-diatoms-rbcL-extracted-tax-segments-derep-cull-keep-02.qza NCBI-diatoms-rbcL-refseqs-predicted-taxonomy.qza \
  --p-labels ref-taxonomy predicted-taxonomy \
  --o-taxonomy-stats NCBI-diatoms-rbcL-ref-taxonomy-evaluation.qzv

I wonder why the classifier is misclassifying one sequence that is part of its reference pool. The sequence in question is KP757863.1. It is part of the last reference sequence pool used to build the classifier (NCBI-diatoms-rbcL-extracted-seq-segments-derep-cull-keep-02.qza). It belongs to P. linea, while the classifier classifies it as P. americana with a 0.75 confidence.

I have the same sequence in my samples and the same wrong classification occurs.

Do you have any recommendations for changing the classifier parameters to get a better classification?

Thank you in advance!
Blanka

SoilRotifer · July 23, 2024, 7:18pm

Hi @Blanka_Roje,

Well, let's try to figure this out... Your pipeline looks sane to me. Good work!

A few questions come to mind:

It appears that large portions of the sequences for these taxa are quite similar. If there are multiple unique sequence records that differ by only a few base pairs, yet have the same (or sometimes different) taxonomy ... it could be a reference database bias issue. That is, if the classifier is hitting one set of reference taxa more than the other (because there are more entries), it will classify to that taxa. Though I am assuming there are multiple unique sequences for one taxa over the other, which I could be wrong. You'll want to confirm this.
a) Remember, the classifier is as only as good as the reference data you feed it.
b) Perhaps run qiime metadata tabulate ...on your reference taxonomy file and see how many references you have for each of these taxa?
How well do you trust the taxonomic label of the reference sequences?
a) remember the taxonomy is not curated by GenBank, but the original submitter of the sequence record. Thus, some could be erroneous. Perhaps there is a good phylogeny out there that can help?
When you use the BLAST website to query P. linea does it return a mix of taxa as top hits?
a) assuming you are using megablast

If anyone else on the forum as any thoughts, please feel free to jump in!

-Mike

Blanka_Roje · July 24, 2024, 2:50pm

hi @SoilRotifer ,

thank you for your help!

Indeed, the two sequences are the same in the amplicon segment. I don't know how I missed this when I blasted the sequence (also answer to your third question - yes).

I have one sequence of P. linea and two of P. americana in the reference database, I suppose that is why it assigned it to P. americana.

Is there a way to make the classifier in this case assign to genus level only?

Thank you again!
Blanka

SoilRotifer · July 24, 2024, 3:48pm

You'll have to do this prior to training the classifier. There are two ways:

When running rescript get-ncbi-data you can set the taxonomic ranks you'd like to extract via the --p-ranks option. The default is to pull kingdom phylum class order family genus species. In your case you can use: --p-ranks kingdom phylum class order family genus
Perhaps more easily, you can simply make use of rescript edit-taxonomy. That is, your command to remove the species labels may look something like this:

qiime rescript edit-taxonomy  \
    --i-taxonomy NCBI-diatoms-rbcL-ref-tax.qza \
    --p-use-regex \
    --p-search-strings  's__.*' \
    --p-replacement-strings ''  \
    --o-edited-taxonomy NCBI-diatoms-rbcL-ref-tax-genus-only.qza

Then use NCBI-diatoms-rbcL-ref-tax-genus-only.qza for all your downstream curation and classifier training steps. I hope I got the regex correct, but you can play around with it.

-Cheers!

system · August 24, 2024, 9:49pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.