Trained classifier returns fewer features after taxonomy assignment

Hi QIIME Team,

I am running a 16S V3V4 (341F/805R) analysis with q2cli version 2019.7.0 on AWS. The reads come from a 2x300 bp MiSeq run, and I trimmed them with q2-cutadapt prior to DADA2 denoising. The feature table shows 1255 features. I then trained a classifier on the gg_13_8 99% OTUs as instructed, adjusting the parameters as below:

  1. Import
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 99_otus.fasta \
  --output-path 99_otus.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path 99_otu_taxonomy.txt \
  --output-path ref-taxonomy.qza

  2. Extract Ref
qiime feature-classifier extract-reads \
  --i-sequences 99_otus.qza \
  --p-f-primer CCTACGGGNGGCWGCAG \
  --p-r-primer GACTACHVGGGTATCTAATCC \
  --p-trunc-len 430 \
  --p-min-length 300 \
  --p-max-length 500 \
  --o-reads ref-seqs.qza \
  --verbose

I chose the parameters based on the length distribution in the feature table report. You can see the details here: trim-rep-seqs1.qzv (425.3 KB) trim-table1.qzv (464.6 KB)

  3. Training
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier gg-13-8-99-otu-v3v4-illumina-classifier.qza \
  --verbose

I then tested the classifier on my dataset, but it returned only 775 assigned features (out of a total of 1255): metadata_342-805.tsv (106.5 KB). However, when I used the ready-to-go classifier from your website (Greengenes 13_8 99% OTUs from the 515F/806R region of sequences), it gave me all 1225 assigned features: metadata_515-806.tsv (178.0 KB).

So, my questions are:

  1. I know that it is better to use a classifier trained for my own data, but why do I get fewer assigned features with my trained classifier? Is there anything wrong with the parameters I used when extracting the reference reads?
  2. Originally, I wanted to test my classifier on the mockrobiota datasets linked from your website, but I can only find datasets optimized for V4, not V3V4. Do you have any suggestions on how I can evaluate whether my classifier is performing well?

Thank you.

Welcome to the forum @r.kendar!

Could you please upload these as QZVs? (Use qiime metadata tabulate to convert each one to a QZV.)

That will allow me to check the provenance.
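
For reference, the conversion could look like the sketch below. The function name and the file names in the usage comment are hypothetical placeholders for the taxonomy artifacts written by your classification step; only `qiime metadata tabulate` and its `--m-input-file` / `--o-visualization` options come from QIIME 2 itself.

```shell
#!/usr/bin/env bash
# Convert each taxonomy artifact (.qza) into a .qzv that carries provenance.
# Usage (hypothetical file names):
#   tabulate_to_qzv taxonomy-v3v4.qza taxonomy-v4.qza
tabulate_to_qzv() {
  local qza
  for qza in "$@"; do
    # taxonomy-v3v4.qza -> taxonomy-v3v4.qzv
    qiime metadata tabulate \
      --m-input-file "$qza" \
      --o-visualization "${qza%.qza}.qzv" || return 1
  done
}
```

The resulting .qzv files can be attached here directly.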

The classifier should not output a different number of features; it should report even unclassified sequences as such. This is why I would like to see the QZVs: to inspect the provenance in more detail.

The extraction parameters should not affect this either; you are trimming the reference sequences, not your query sequences.
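
In the meantime, one quick way to see which feature IDs are absent from the smaller result is to compare the first column of the two TSV exports. This is only a sketch: the helper names are made up, and it assumes the feature ID sits in the first tab-separated column, with one header row and optional rows starting with `#`.

```shell
#!/usr/bin/env bash
# Print the unique feature IDs (column 1) of a TSV export, sorted for comm.
# Assumption: row 1 is a header; rows starting with '#' are directives/comments.
feature_ids() {
  awk -F'\t' 'NR > 1 && $1 !~ /^#/ && $1 != "" { print $1 }' "$1" | LC_ALL=C sort -u
}

# List feature IDs that appear in the first TSV but not in the second.
missing_features() {
  # comm -23 keeps only the lines unique to the first sorted input.
  LC_ALL=C comm -23 <(feature_ids "$1") <(feature_ids "$2")
}

# Example with the two files attached above:
#   missing_features metadata_515-806.tsv metadata_342-805.tsv | wc -l
```

If the missing IDs are all present in your representative sequences, the difference arose somewhere between the feature table and the classification step, which is what the provenance should reveal.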

You are correct: mockrobiota does not have any V3V4 datasets. It could certainly use some :smile:

However, you could still test your classifier on a V4 dataset, since the 515f/806r primers used for those V4 datasets amplify a region that is entirely covered by the V3V4 primers you are using. Well, almost: it looks like your reverse primer is 2 nt longer than the commonly used 806r primer, which I think is what those V4 mock communities use. If you are using full V4 mock community sequences (i.e. if you are able to merge the reads), you might need to trim off those 2 nt, but they probably won't make a big difference.
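
If you do want to trim those 2 nt, a minimal sketch with plain awk is below. The function name is hypothetical, and it assumes an unwrapped FASTA (one sequence per line) in which the extra bases sit at the 3' end of forward-oriented merged reads; please verify that orientation against the primer sequences before relying on it.

```shell
#!/usr/bin/env bash
# Trim N nucleotides (default 2) from the 3' end of every sequence in a FASTA file.
# Assumption: sequences are not line-wrapped; header lines pass through unchanged.
trim_3prime() {
  local fasta=$1 n=${2:-2}
  awk -v n="$n" '
    /^>/ { print; next }            # header line: print as-is
    {
      len = length($0) - n          # number of bases to keep
      if (len < 0) len = 0          # sequences shorter than n become empty
      print substr($0, 1, len)
    }
  ' "$fasta"
}

# Usage (hypothetical file names):
#   trim_3prime mock-v4-seqs.fasta 2 > mock-v4-seqs-trimmed.fasta
```

The trimmed FASTA can then be imported as FeatureData[Sequence] in the same way as the reference sequences.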
