EZbiocloud classifier


I would like to create a classifier from the EZbiocloud database. I downloaded the sequence and taxonomy files, built the classifier, and ran it on my dataset (16S, V3-V4 region). Out of 4389 ASVs, 4348 were classified as:

Bacteria;Proteobacteria;Deltaproteobacteria;Desulfobacterales;Desulfobacteraceae;Desulfamplus;Desulfobacterium niacini

For human stool samples this is certainly not what it should look like. With the Silva138 classifier, built following the RESCRIPt tutorial, the results for these data look completely normal.

I downloaded the rep-seqs from https://data.qiime2.org/2022.2/tutorials/training-feature-classifiers/rep-seqs.qza and with these sequences the EZbiocloud classifier worked just fine (results comparable to Silva138). For both databases, the V3-V4 region was extracted using the same primer sequences.

I also took a small subset of reads from ref-seqs_EZ_V3V4.qza (after V3-V4 extraction) and compared them to the ASV sequences in my dataset; the start and end align well.



QIIME version: QIIME2/2021.8

Commands to make the classifier:

qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path ezbiocloud_qiime_full.fasta \
--output-path ezbio.qza

qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path ezbiocloud_id_taxonomy.txt \
--output-path ref-taxonomy.qza

qiime feature-classifier extract-reads \
--i-sequences ezbio.qza \
--p-f-primer CCTACGGGNGGCWGCAG \
--o-reads ref-seqs_EZ_V3V4.qza

qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ref-seqs_EZ_V3V4.qza \
--i-reference-taxonomy ref-taxonomy.qza \
--o-classifier classifier_EZ_V3V4.qza

Do you have any idea why this EZbiocloud classifier does not work with my dataset? It works with the test rep-seqs (V4 sequences, I guess), and my dataset itself looks fine and gives normal results with Silva138.

Thank you in advance!!

Hi @rahel_park,

Welcome back to the :qiime2: forum!

I'm certainly no expert in this area, but @Nicholas_Bokulich provided some suggestions that might be useful here!

The issue is probably a "junk" sequence that is leading to incorrect assignment. Search for some old discussions of "hot spring meta-genome" classifications using SILVA.

Here is one red flag:

Min and max lengths should be set to the expected range. Very short non-target hits in SILVA led to the "hot spring meta-genome" issue mentioned above.
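
As a rough sketch of what setting the length range could look like (untested against your data): the small wrapper below re-runs extract-reads with explicit --p-min-length and --p-max-length. The reverse primer does not appear in the command you posted, so it is passed in as an argument. The 350/500 defaults and the output file name are placeholders made up for illustration, not recommended values, so please replace them with the range you expect for your amplicon.

```shell
# Sketch only: re-run extract-reads with explicit length bounds.
# Usage: extract_v3v4 <reverse-primer> [min-length] [max-length]
# Assumptions (not from the thread): the 350/500 defaults and the
# output name ref-seqs_EZ_V3V4_lenfilt.qza are illustrative placeholders.
# Set DRY_RUN=1 to print the command instead of running it.
extract_v3v4() (
  r_primer=$1
  min_len=${2:-350}
  max_len=${3:-500}

  # Build the command as positional parameters so quoting stays intact.
  set -- qiime feature-classifier extract-reads \
    --i-sequences ezbio.qza \
    --p-f-primer CCTACGGGNGGCWGCAG \
    --p-r-primer "$r_primer" \
    --p-min-length "$min_len" \
    --p-max-length "$max_len" \
    --o-reads ref-seqs_EZ_V3V4_lenfilt.qza

  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
)
```

Running it once with DRY_RUN=1 is a cheap way to confirm the primers and bounds are what you intended before the real extraction starts.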

In this case, since the target is V3-V4, the issue might actually be the max length (the default should be 400, though this depends on the version used). Most reference sequences are probably being filtered out, leaving only Desulfobacterium niacini. This is something you should check on your end.
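
One possible way to check this, sketched here under the assumption that you export the artifacts to FASTA first (qiime tools export writes a dna-sequences.fasta for FeatureData[Sequence]): count the records and summarize their lengths. The function name fasta_len_summary and the exported_ref directory are hypothetical names chosen for this example.

```shell
# Sketch only: summarize record count and sequence lengths in a FASTA file.
# Handles sequences wrapped over several lines.
#
# Example (not run here; directory name is hypothetical):
#   qiime tools export --input-path ref-seqs_EZ_V3V4.qza --output-path exported_ref
#   fasta_len_summary exported_ref/dna-sequences.fasta
fasta_len_summary() {
  awk '
    # Fold the record that just ended into the running statistics.
    function add_record() {
      if (!in_record) return
      n++
      total += len
      if (n == 1 || len < min) min = len
      if (n == 1 || len > max) max = len
    }
    /^>/ { add_record(); in_record = 1; len = 0; next }
    { gsub(/[ \t\r]/, ""); len += length($0) }
    END {
      add_record()
      if (n == 0) { print "n=0"; exit }
      printf "n=%d min=%d max=%d mean=%.1f\n", n, min, max, total / n
    }
  ' "$1"
}
```

If the extracted reference has far fewer records than the full database, or its length range does not overlap the lengths of your ASVs, that would point to the length or primer settings rather than to the database itself.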

RESCRIPt could be used here for more sensitive length filtering (e.g., if different length ranges are needed for different clades), and also for testing and validating the database.

Hope this helps!

Cheers :lizard:

Hi @lizgehret

Thank you for the reply. I will try adding the length parameters and see if this fixes the problem. The weird thing for me was that the classifier worked with another dataset, but that dataset does have shorter amplicons, so hopefully I can fix the issue.

Thanks for the tips!