Trying to train a classifier with my own data

Hi everyone!

I have performed a single-end analysis using only the R1 reads from my sequencing run, because a quality problem with the reverse reads prevented my sequences from merging.

I used the following dataset for the taxonomic classification (rep-seqs file and table)
table-single.qzv (454.6 KB)
rep-seqs-single.qzv (318.0 KB)

Initially, I used the pre-trained classifier provided by QIIME 2 (SILVA 138, 99%).

Then, I attempted to train a classifier with my own data following the tutorial "Training feature classifiers with q2-feature-classifier" using the SILVA 138.99 files for Feature[sequence] and Feature[taxonomy] provided on the resource page. However, the result was that it only identified one species, and the rest remained unassigned.

I also made the same attempt with version 132 of SILVA at 99% identity, and almost all fragments were classified as unassigned.
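To put a number on "almost all unassigned", the exported taxonomy can be tallied. This is only a minimal sketch: it assumes the taxonomy artifact has already been exported to a tab-separated file with a header row and the taxon label in the second column (Feature ID, Taxon, Confidence), and the function name `count_unassigned` is hypothetical.

```shell
# Hypothetical helper: report how many features carry the label "Unassigned"
# in a tab-separated taxonomy file (header row, taxon label in column 2).
count_unassigned() {
  awk -F'\t' '
    NR > 1 { total++; if ($2 == "Unassigned") unassigned++ }
    END    { printf "%d of %d features unassigned\n", unassigned, total }
  ' "$1"
}
```

Running it on the exports from two classifiers (for example `count_unassigned taxonomy.tsv`) gives a quick way to compare attempts before looking at bar plots.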

Then I tried the new QIIME 2 plugin, Greengenes2, following the instructions in the section for regions other than V4. I'm not sure whether this section is optimized, but I was able to generate the .qzv file for the taxonomy. However, when I tried to generate the taxa bar plots, I encountered the following error.
taxonomy.gg2.1.qzv (1.2 MB)

![Screenshot from 2023-05-26 12-38-58|580x500](upload://mvoGtEkG3jjN8L346qKIlv1KzM7.png)

> Traceback (most recent call last):
  File "/home/julia/anaconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/q2cli/", line 352, in __call__
    results = action(**arguments)
  File "<decorator-gen-485>", line 2, in barplot
  File "/home/julia/anaconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/qiime2/sdk/", line 234, in bound_callable
    outputs = self._callable_executor_(scope, callable_args,
  File "/home/julia/anaconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/qiime2/sdk/", line 443, in _callable_executor_
    ret_val = self._callable(output_dir=temp_dir, **view_args)
  File "/home/julia/anaconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/q2_taxa/", line 41, in barplot
    collapsed_tables = _extract_to_level(taxonomy, table)
  File "/home/julia/anaconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/q2_taxa/", line 42, in _extract_to_level
    collapsed_table = _collapse_table(table, taxonomy, level, max_obs_lvl)
  File "/home/julia/anaconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/q2_taxa/", line 19, in _collapse_table
    raise ValueError('Feature IDs found in the table are missing from the '
ValueError: Feature IDs found in the table are missing from the taxonomy: {'0ea89af03489cd0042e244da95717306', 'd31d618b00510a936ad75bf753bd8111', 'b3b318ac62506795e8a41ee289431779', '60fc710df6345a9b2b7f31ac1500231c', '0640579fca999d04069119566d6bea23', '5cfc2e63284826be749078d590de2e8a', '7006270b2117277ace9e3eb44fed625d', '567c2a18b33de5a199fcc707626b04c7', '8b49a8e2010e6d41a7a18c1ce2e9b0ba', '487405dc03bc3775979159b5c5a54af4',
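
The ValueError above says that some feature IDs in the table have no row in the taxonomy. One way to see how many IDs are affected is to compare the two ID lists directly. This is a rough sketch, assuming the feature IDs of the table and of the taxonomy have each been written to a plain-text file with one ID per line (for example, from the first column of the exported TSV files); the function name `missing_from_taxonomy` is hypothetical.

```shell
# Hypothetical helper: print feature IDs listed in the first file that are
# absent from the second file. Each file holds one feature ID per line.
missing_from_taxonomy() {
  table_ids=$1
  taxonomy_ids=$2
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
  grep -Fxv -f "$taxonomy_ids" "$table_ids"
}
```

Piping the result through `wc -l` shows whether only a handful of features are missing or nearly all of them, which helps narrow down where the table and the taxonomy diverged.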

I understand that I have just given you a lot of information at once. I'm not sure whether it would be better to open a separate post for the issues I am having with classification using my own classifier or using Greengenes2.

I also followed the classifier-training tutorial with the older Greengenes 13_5 release, and these were the results:

In the attempts where I trained a classifier, I used this code to extract the reads:

qiime feature-classifier extract-reads \
  --i-sequences ****** \
  --p-r-primer CTGSTGCVNCCCGTAGG \
  --p-max-length 500 \
  --o-reads ****

Thanks a lot
Warmest regards,

Hello Julia,

I'm confused. I see a lot of Firmicutes:

What did you expect to see in your positive control and what did you see instead?


I apologize, please ignore the mention of Firmicutes. It was a fragment of text that accidentally got mixed in, referring to a classification attempt that I ultimately decided not to include.

OK, that's fine.

Let us slow down. Have some tea with me :tea:

How did you evaluate these results? If you choose to use something 'better' what problems will this new method (hopefully!) solve?

You mentioned you have some positive controls. How did those turn out?

Let's also zoom out. :telescope:

There are lots of questions you may ask, and taxonomy only answers a few of them.
What is the central biological question you are investigating?

Hello @colinbrislawn thank you for your response.

I have not really evaluated my results yet, because so far the only attempt that yielded a sufficient number of OTUs was the first one, with the pre-trained classifier. As I have read on the QIIME 2 website, a classifier is more accurate when you train it with your own data, so I am treating that result as a reference rather than as definitive.
With Greengenes2, I get an error message right away, so I cannot proceed.

With the outdated version of Greengenes, the classifier can only tell that the sequences are bacterial (it stops at level 2 and assigns Bacteria to 98% of them).
When I train the classifier with SILVA 138 at 99%, it only detects streptomycetes.

These results indicate that there is a step that is not being executed properly, probably in the classifier training part, and my current priority is to be able to perform the training correctly.
The biological question I am investigating is whether the microbiota varies in transgene carriers, and how sex and a treatment influence it.


Yes, that makes sense. Because you have unique primers, the naive Bayes (sklearn) classifier would work best after customization. You could also try a top-hit LCA classifier like vsearch or blast, which may benefit from customization but does not require it.

We should have mentioned this earlier, but there's a new plugin for building custom databases. :point_down:

This sounds great! You can still detect specific microbes that serve as biomarkers for this change, then give them taxonomy labels later. And measuring the magnitude and significance of microbiome variation does not use taxonomy at all.


Hello @colinbrislawn, thanks for your response,

It should perform better with a trained classifier, but in my case, it has worked better when I used the pretrained classifier. I followed the tutorial, using the following data resources that have been previously processed with RESCRIPt:

Silva 138 SSURef NR99 full-length sequences
Silva 138 SSURef NR99 full-length taxonomy

If classification works correctly with the pretrained classifier, but only one species is detected when I train a classifier with my primers, then I must be doing something wrong, and I would like to identify the error.

The code I have used is as follows:

qiime feature-classifier extract-reads \
  --i-sequences silva-138-99-seqs.qza \
  --p-r-primer CTGSTGCVNCCCGTAGG \
  --p-max-length 500 \
  --o-reads silva-138-ref-seqs.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva-138-ref-seqs.qza \
  --i-reference-taxonomy silva-138-99-tax.qza \
  --o-classifier classifier-99-silva-138.qza

qiime feature-classifier classify-sklearn \
  --i-classifier classifier-99-silva-138.qza \
  --i-reads rep-seqs-single.qza \
  --o-classification taxonomy-silva138-99.qza

qiime metadata tabulate \
  --m-input-file taxonomy-silva138-99.qza \
  --o-visualization taxonomy-silva138-99.qzv

qiime taxa barplot \
  --i-table table-single.qza \
  --i-taxonomy taxonomy-silva138-99.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization taxa-bar-plots-138.qzv

Additionally, I'm unsure whether you have had a chance to review the error I mentioned regarding the Greengenes2 plugin.


My suggestion is to use RESCRIPt instead of that training tutorial, as RESCRIPt includes extra details and steps that will help us understand which parts of the pipeline are working well, and which steps are causing the problems.

I suspect the problem is within the extract-reads step, and RESCRIPt should tell us more.

This is a good idea; threads with a single issue tend to be answered faster. Include your full GG2 pipeline.

Hi @VerheulJulia, I'd like to add to @colinbrislawn's suggestions...

  1. Can you provide a reference for the primers used?

  2. You can supplement the extract-reads step with this approach, as referenced in the SILVA tutorial at this step.



Hi @SoilRotifer @colinbrislawn

Apologies for the delay, as I had to attend to other matters and set aside this analysis. However, I have been able to complete it this week. The sequencing company had initially provided me with incorrect primers, which resulted in a mismatch between the reference sequences and the region of my sequences whenever I attempted to create a classifier. Fortunately, I managed to obtain the correct primers, and now the classifier training is functioning perfectly. I have also tested using RESCRIPt and achieved favorable results. Thank you very much for your assistance!
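
Since the cause turned out to be primers that did not match the sequenced region, a quick check of a primer against the reference sequences may catch this kind of problem early. What follows is only a rough sketch, not what extract-reads does internally: it assumes an unwrapped FASTA file (one sequence per line), looks for exact matches only, and all three function names are hypothetical.

```shell
# Hypothetical helpers for checking a degenerate primer against a FASTA file.

# Expand IUPAC ambiguity codes into a regular expression.
iupac_to_regex() {
  printf '%s\n' "$1" | sed \
    -e 's/R/[AG]/g'  -e 's/Y/[CT]/g'  -e 's/S/[CG]/g'  -e 's/W/[AT]/g' \
    -e 's/K/[GT]/g'  -e 's/M/[AC]/g'  -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
    -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Reverse-complement a primer, keeping IUPAC ambiguity codes consistent.
reverse_complement() {
  printf '%s\n' "$1" |
    awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }' |
    tr 'ACGTRYSWKMBDHVN' 'TGCAYRSWMKVHDBN'
}

# Count sequences that contain the primer in either orientation.
# Assumes one sequence per line; header lines start with ">".
count_primer_hits() {
  primer=$1
  fasta=$2
  fwd=$(iupac_to_regex "$primer")
  rc=$(iupac_to_regex "$(reverse_complement "$primer")")
  grep -v '^>' "$fasta" | grep -c -i -E "$fwd|$rc"
}
```

For example, `count_primer_hits CTGSTGCVNCCCGTAGG reference.fasta` (with `reference.fasta` standing in for an exported reference file) reports how many reference sequences contain that primer. A count near zero suggests the primer does not belong to the region at all, although a low count can also reflect mismatches that this exact-match check does not tolerate.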


Hi @VerheulJulia, thank you for letting us know everything worked out. :smiley: :pray:
