Taxonomic assignment using EzBioCloud

Hello, I'm quite new to QIIME 2, but I can already see what a wonderful tool it is!

I'm currently working on a project involving metabarcoding analysis of bacterial communities in fish, with a focus on Tenacibaculum, as it makes the fish sick and kills them.
Taxonomy assignment with SILVA v138 wasn't satisfactory at the species level, so we first tested assignment with EzBioCloud on a few sequences, on the website, like so:

We are very pleased with the accuracy of the assignment, so I trained a classifier using the EzBioCloud database:

DIR=/home1/datawork/sdarinot/02_Tenaci/01_my_analysis/EZbiocloud

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path $DIR/ezbiocloud_qiime_full.fasta \
  --output-path $DIR/ezbio.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path $DIR/ezbiocloud_id_taxonomy.txt \
  --output-path $DIR/ref-taxonomy.qza

qiime feature-classifier extract-reads \
  --i-sequences $DIR/ezbio.qza \
  --p-f-primer GTGYCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACNVGGGTWTCTAAT \
  --p-min-length 100 \
  --p-max-length 400 \
  --o-reads $DIR/ref-seqs.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads $DIR/ref-seqs.qza \
  --i-reference-taxonomy $DIR/ref-taxonomy.qza \
  --p-classify--chunk-size 5000 \
  --o-classifier $DIR/EZbiocloud_classifier.qza

And then I used this newly trained classifier to assign taxonomy to my sequences:

DIR=/home1/datawork/sdarinot/02_Tenaci/01_my_analysis/EZbiocloud

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path $DIR/seq_Tenaci.fna \
  --output-path $DIR/seq_Tenaci.qza

qiime feature-classifier classify-sklearn \
  --i-classifier $DIR/EZbiocloud_classifier.qza \
  --i-reads $DIR/seq_Tenaci.qza \
  --o-classification $DIR/taxonomy.qza

But when we look at the results for the same sequences in the taxonomy.qza file, they are different and less accurate:

Why is there such a difference in the accuracy of the assignment?
Does it have to do with the way I trained the classifier? If so, what can I change to get the same result as on the website?

I would be grateful for any leads.
Sophie

Hi @SophieD,

Do you know how the taxonomy assignment algorithm works within EzBioCloud? I've not used this tool before. I suspect it is similar to BLAST?

Many tools simply take the top BLAST hit against a given reference database. However, the top hit is not always correct: it may have been sorted to the top arbitrarily, with hundreds or thousands of equally likely hits listed below it. For example, many organisms have exactly the same sequence over a given sequenced region and cannot be disambiguated. The classifier produced by fit-classifier-naive-bayes takes this into account and will return the lowest common ancestor (LCA) when multiple taxa have identical sequences.
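To make the LCA idea concrete, here is a minimal sketch in shell and awk. It is not how q2-feature-classifier is implemented; `lca_of_lineages` is a hypothetical helper name, and the two lineages are illustrative strings rather than entries copied from any database. It assumes one semicolon-delimited lineage per line and prints the deepest run of ranks that all of them share.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a QIIME 2 or EzBioCloud command).
# Reads one semicolon-delimited lineage per line on stdin and prints
# the longest leading run of ranks shared by every lineage.
lca_of_lineages() {
  awk -F';' '
    NR == 1 {
      # The first lineage is the initial candidate ancestor.
      depth = NF
      for (i = 1; i <= NF; i++) shared[i] = $i
      next
    }
    {
      # Shrink the shared depth to the first rank where this lineage differs.
      if (NF < depth) depth = NF
      for (i = 1; i <= depth; i++) {
        if ($i != shared[i]) { depth = i - 1; break }
      }
    }
    END {
      out = ""
      for (i = 1; i <= depth; i++) out = out (i > 1 ? ";" : "") shared[i]
      print out
    }
  '
}

# Two equally good hits that differ only at the species rank (illustrative).
printf '%s\n' \
  'Bacteria;Bacteroidetes;Flavobacteriia;Flavobacteriales;Flavobacteriaceae;Tenacibaculum;Tenacibaculum maritimum' \
  'Bacteria;Bacteroidetes;Flavobacteriia;Flavobacteriales;Flavobacteriaceae;Tenacibaculum;Tenacibaculum dicentrarchi' \
  | lca_of_lineages
# prints: Bacteria;Bacteroidetes;Flavobacteriia;Flavobacteriales;Flavobacteriaceae;Tenacibaculum
```

A top-hit approach would report whichever of the two species happened to be listed first, while the shared-prefix approach stops at the genus.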

For example, see this thread:

I might also add that it is very difficult to obtain species-level classifications from short amplicon reads. There are even cases in which full-length 16S rRNA gene sequences cannot disambiguate between species or genera! :scream:

-Mike


Hi @SoilRotifer, thank you for your reply.

EzBioCloud uses the VSEARCH program to assign taxonomy. To quote EzBioCloud's website:

"Dereplicated sequences are then subjected to taxonomic assignment. We use VSEARCH program (Rognes et al. 2016) to search and calculate sequence similarities of the query NGS reads against the EzBioCloud 16S database. 97% 16S similarity is used as the cutoff for species-level identification. Other sequence similarity cut-offs are used for genus or higher taxonomic ranks.

  • x = sequence similarity to reference sequences; species (x ≥ 97%), genus (97> x ≥94.5%), family (94.5> x ≥86.5%), order (86.5> x ≥82%), class (82> x ≥78.5%), and phylum (78.5> x ≥75%). Cutoff values are taken from Yarza et al. (2014).

To reduce computation and accuracy, we built different versions of reference 16S databases that match various regions of 16S sequences. For example, full-length version (V1-V9) is used for PacBio ccs data whereas the V3-V4 version is used for MiSeq 250 bp paired-end sequencing data."
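As I read the quoted cutoffs, the rank reported for a read depends only on its similarity to the best-matching reference. A small sketch of that lookup, assuming the thresholds exactly as quoted above; `rank_for_identity` is a hypothetical helper name, and the "unassigned" label for anything below 75% is my own assumption, since the quote does not say what happens there:

```shell
#!/usr/bin/env bash
# Hypothetical helper: maps a percent similarity to the taxonomic rank
# implied by the cutoffs quoted above (Yarza et al. 2014).
rank_for_identity() {
  awk -v x="$1" 'BEGIN {
    x += 0  # force numeric comparison
    if      (x >= 97)   print "species"
    else if (x >= 94.5) print "genus"
    else if (x >= 86.5) print "family"
    else if (x >= 82)   print "order"
    else if (x >= 78.5) print "class"
    else if (x >= 75)   print "phylum"
    else                print "unassigned"  # assumption: not stated in the quote
  }'
}

rank_for_identity 98.2   # prints: species
rank_for_identity 95     # prints: genus
```

This is a different decision rule from the naive Bayes classifier, which may explain part of the difference between the website results and my taxonomy.qza.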

Thank you for your explanation of q2-feature-classifier; I now have a better understanding of how it works.

