QIIME2 Naive Bayesian Classifier has different outputs (V3-V4)


I have built a new naive bayesian classifier using RESCRPt for the V3-V4 variable regions using the custom primers provided to us from our sequencing centre and we are using the SILVA138 reference. We are trying to validate their findings by recreating the classifier they have though are getting very different results.

To put into context - we are using the same clean.fasta files the only different is the aligner.

They directed us towards the marker gene sets on the QIIME2 help page so I mainly have two key questions;

  1. I used the RESCRIPt tutorial to make a custom V3-V4 classifier - here I think in the first step to get the whole database and then go through the steps to then make a specific classifier. Should I be using V4 specific references instead?
  2. The custom classifier I made has come up with strange OTUs such as a high number of Chloroplasts and Unclassified which is not in their OTU table - am I missing a clean up step somewhere?
  1. Get SILVA database:
    qiime rescript get-silva-data
    --p-version '138'
    --p-target 'SSURef_NR99'
    --o-silva-sequences outputs/silva-138-ssu-nr99-seqs.qza
    --o-silva-taxonomy outputs/silva-138-ssu-nr99-tax.qza

  2. “Culling” low-quality sequences with cull-seqs:
    qiime rescript cull-seqs
    --i-sequences outputs/silva-138-ssu-nr99-seqs.qza
    --o-clean-sequences outputs/silva-138-ssu-nr99-seqs-cleaned.qza

  3. Filtering sequences by length and taxonomy:
    qiime rescript filter-seqs-length-by-taxon
    --i-sequences outputs/silva-138-ssu-nr99-seqs-cleaned.qza
    --i-taxonomy outputs/silva-138-ssu-nr99-tax.qza
    --p-labels Archaea Bacteria Eukaryota
    --p-min-lens 900 1200 1400
    --o-filtered-seqs outputs/silva-138-ssu-nr99-seqs-filt.qza
    --o-discarded-seqs outputs/silva-138-ssu-nr99-seqs-discard.qza

  4. Dereplicating in uniq mode:
    qiime rescript dereplicate
    --i-sequences outputs/silva-138-ssu-nr99-seqs-filt.qza
    --i-taxa outputs/silva-138-ssu-nr99-tax.qza
    --p-rank-handles 'silva'
    --p-mode 'uniq'
    --o-dereplicated-sequences outputs/silva-138-ssu-nr99-seqs-derep-uniq.qza
    --o-dereplicated-taxa outputs/silva-138-ssu-nr99-tax-derep-uniq.qza

  5. Make a classifier for full length:
    qiime feature-classifier fit-classifier-naive-bayes
    --i-reference-reads outputs/silva-138-ssu-nr99-seqs-derep-uniq.qza
    --i-reference-taxonomy outputs/silva-138-ssu-nr99-tax-derep-uniq.qza
    --o-classifier silva-138-ssu-nr99-classifier.qza

  6. Make amplicon-region specific classifier:
    qiime feature-classifier extract-reads
    --i-sequences outputs/silva-138-ssu-nr99-seqs-derep-uniq.qza
    --p-f-primer CCTAYGGGRBGCASCAG
    --p-n-jobs 2
    --p-read-orientation 'forward'
    --o-reads silva138-nr99-seqs-16S-V3-V4.qza

qiime rescript dereplicate
--i-sequences silva138-nr99-seqs-16S-V3-V4.qza
--i-taxa outputs/silva-138-ssu-nr99-tax-derep-uniq.qza
--p-rank-handles 'silva'
--p-mode 'uniq'
--o-dereplicated-sequences outputs/silva-138-nr99-seqs-16S-V3-V4-uniq.qza
--o-dereplicated-taxa outputs/silva-138-nr99-tax-16S-V3-V4-derep-uniq.qza

qiime feature-classifier fit-classifier-naive-bayes
--i-reference-reads outputs/silva-138-nr99-seqs-16S-V3-V4-uniq.qza
--i-reference-taxonomy outputs/silva-138-nr99-tax-16S-V3-V4-derep-uniq.qza
--o-classifier silva-138-nr99-16S-V3-V4-classifier.qza

I have my code above

Thank you so much for your help!


Looks like you trained the classifier based on the right region since you used primers from the library.

Probably they performed additional cleaning step after taxonomy annotation.
You can remove those sequences by filtering feature tables:

qiime taxa filter-table \
  --i-table feature-table.qza \
  --i-taxonomy taxonomy.qza \
  --p-include p__ \
  --p-exclude mitochondria,chloroplast \
  --o-filtered-table filtered-table.qza

This one command will:

  • remove any sequence that is not assigned to at least phylum level (such as Bacteria, Archaea, Unassigned - they all do not have "p__" in the annotation).
  • remove organelles.

You can modify it if necessary to change the behavior.

If you also want to get rid of those sequences in your rep-seq file, you can use:

qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --i-table filtered-table.qza \
  --o-filtered-data filtered-rep-seqs.qza
1 Like

Ah brilliant - that makes sense!

I will give that a go now and then update the post!

If possible I wanted to ask another question - the read counts from the sequencing company are almost 10X higher if not more

I have posted a screen grab of both tables
This first is my output and the second is from the sequencing company

I would check the stats you got after the denoising step (did you use dada2?). How many reads are retained as a % of the input reads?

1 Like

Hi there,

So I just checked this and I have over 90% retained as input reads after DADA2 step.

Because I am using clean fasta files I have not conducted any trimming as shown below

qiime dada2 denoise-single
--i-demultiplexed-seqs artifacts/Reads.qza
--p-trunc-len 0
--p-trim-left 0
--p-n-threads 4
--o-table artifacts/Feature_Table.qza
--o-representative-sequences artifacts/Feature_Sequences.qza
--o-denoising-stats artifacts/Feature_Statistics.qza

That's a little bit strange. Based on the screenshots, you have less reads in your feature table. As I understand, you shared a screenshot of collapsed to the genus data? Did you remove primers from the reads before Dada2? Primers in the sequence may influence taxonomy annotation.

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.