Hello, I'm quite new to qiime2 but I can already see what a wonderful tool it is!
I'm currently working on a project involving metabarcoding analysis of the bacterial community in fish, with a focus on Tenacibaculum, as it makes the fish sick and die.
Assigning taxonomy with SILVA v138 wasn't satisfying at the species level, so we first tested the assignment with EzBioCloud on a few sequences, on the website, like so:
Do you know how the taxonomy assignment algorithm works within EzBioCloud? I've not used this tool before. I suspect it is similar to BLAST?
Many tools simply take the top BLAST hit to a given reference database. However, the top hit is not always correct: it may have been sorted to the top arbitrarily, even though hundreds or thousands of equally likely hits are listed below it. For example, many organisms have exactly the same sequence over a given sequenced region and cannot be disambiguated. A classifier trained with fit-classifier-naive-bayes takes this into account and will return the lowest common ancestor (LCA) when multiple taxa have identical sequences.
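To make the difference between "top hit" and LCA concrete, here is a minimal Python sketch. It is not how q2-feature-classifier is implemented; it only illustrates the idea of collapsing tied best hits to their shared ranks. The hit list, the identity values, and the function names are all made up for illustration.

```python
def top_hits(hits):
    """Keep every hit that ties for the best percent identity."""
    best = max(identity for _, identity in hits)
    return [taxonomy for taxonomy, identity in hits if identity == best]


def lowest_common_ancestor(taxonomies):
    """Return the leading ranks shared by all semicolon-delimited taxonomy strings."""
    split = [[rank.strip() for rank in t.split(";")] for t in taxonomies]
    shared = []
    for ranks in zip(*split):
        # Stop at the first rank where the tied hits disagree.
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return "; ".join(shared)


# Hypothetical search results: (taxonomy, percent identity). Invented data.
prefix = "d__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Flavobacteriales; f__Flavobacteriaceae"
hits = [
    (prefix + "; g__Tenacibaculum; s__species_A", 100.0),
    (prefix + "; g__Tenacibaculum; s__species_B", 100.0),
    (prefix + "; g__Polaribacter; s__species_C", 98.5),
]

naive = hits[0][0]                                    # "top hit" approach
consensus = lowest_common_ancestor(top_hits(hits))    # LCA of tied best hits
print(naive)
print(consensus)
```

With these invented hits, the top-hit approach reports species_A only because it happens to be listed first, while the LCA approach stops at the genus, which is the deepest rank the tied hits agree on.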
For example see this thread:
I might also add that it is very difficult to get species-level classifications from short amplicon reads. There are even cases in which full-length 16S rRNA gene sequences cannot disambiguate between species or genera!
EZbiocloud uses the VSEARCH program to assign taxonomy. To quote EZbiocloud's website:
"Dereplicated sequences are then subjected to taxonomic assignment. We use VSEARCH program (Rognes et al. 2016) to search and calculate sequence similarities of the query NGS reads against the EzBioCloud 16S database. 97% 16S similarity is used as the cutoff for species-level identification. Other sequence similarity cut-offs are used for genus or higher taxonomic ranks.
x = sequence similarity to reference sequences; species (x ≥ 97%), genus (97> x ≥94.5%), family (94.5> x ≥86.5%), order (86.5> x ≥82%), class (82> x ≥78.5%), and phylum (78.5> x ≥75%). Cutoff values are taken from Yarza et al. (2014).
To reduce computation and accuracy, we built different versions of reference 16S databases that match various regions of 16S sequences. For example, full-length version (V1-V9) is used for PacBio ccs data whereas the V3-V4 version is used for MiSeq 250 bp paired-end sequencing data."
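As a rough illustration of the cutoffs quoted above, here is a small Python sketch that maps a percent similarity to the deepest rank it would support. This is only my reading of the quoted thresholds, not EzBioCloud's actual code; the names `CUTOFFS` and `rank_for_similarity` are hypothetical.

```python
# Minimum percent similarity per rank, as quoted above (Yarza et al. 2014).
# Ordered from the most specific rank to the least specific.
CUTOFFS = [
    ("species", 97.0),
    ("genus", 94.5),
    ("family", 86.5),
    ("order", 82.0),
    ("class", 78.5),
    ("phylum", 75.0),
]


def rank_for_similarity(similarity):
    """Return the deepest rank whose cutoff the similarity meets, or None."""
    for rank, minimum in CUTOFFS:
        if similarity >= minimum:
            return rank
    return None  # below 75%: no rank assigned under these cutoffs


for value in (99.2, 96.9, 90.0, 74.9):
    print(value, rank_for_similarity(value))
```

Note that a fixed 97% cutoff says nothing about ties: two species whose sequences both match the query at 97% or above would still need to be resolved in some way, which is the situation described earlier in the thread.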
Thank you for your explanation of q2-feature-classifier; I now have a better understanding of how it works.