Trained classifier returns fewer features after taxonomy assignment

r.kendar · September 29, 2019, 6:42am

Hi QIIME Team,

I am running a 16S V3V4 (341F/805R region) analysis with q2cli version 2019.7.0 on AWS. The reads come from a 2x300 bp MiSeq run, and I trimmed them with q2-cutadapt prior to DADA2 denoising. The feature table reports 1255 features. I then trained a classifier on the gg_13_8 99% OTUs as instructed, adjusting the parameters as shown below:

Import

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 99_otus.fasta \
  --output-path 99_otus.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path 99_otu_taxonomy.txt \
  --output-path ref-taxonomy.qza

Extract Ref

qiime feature-classifier extract-reads \
  --i-sequences 99_otus.qza \
  --p-f-primer CCTACGGGNGGCWGCAG \
  --p-r-primer GACTACHVGGGTATCTAATCC \
  --p-trunc-len 430 \
  --p-min-length 300 \
  --p-max-length 500 \
  --o-reads ref-seqs.qza \
  --verbose

I chose the parameters based on the length distribution in the feature table report. You can see the details here: trim-rep-seqs1.qzv (425.3 KB), trim-table1.qzv (464.6 KB)

Training

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier gg-13-8-99-otu-v3v4-illumina-classifier.qza \
  --verbose

I then tested the classifier on my dataset, but it returned only 775 assigned features (out of a total of 1255): metadata_342-805.tsv (106.5 KB). However, when I used the ready-made classifier from your website (Greengenes 13_8 99% OTUs from the 515F/806R region of sequences), it gave me all 1225 assigned features: metadata_515-806.tsv (178.0 KB).

So, my questions are:

I know that it is better to use a classifier trained for my own data, but why do I get fewer assigned features with my trained classifier? Is there anything wrong with the parameters I used when extracting the reference reads?
Originally, I wanted to test my classifier on the mockrobiota datasets linked from your website, but I can only find datasets optimized for V4, not V3V4. Do you have any suggestions for how I can evaluate whether my classifier is performing well?

Thank you.

Nicholas_Bokulich · September 30, 2019, 4:14pm

Welcome to the forum @r.kendar!

Could you please upload these as QZVs? (Use qiime metadata tabulate to convert them to QZVs.)

That will allow me to check the provenance.
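
For reference, a minimal sketch of that conversion. The artifact name taxonomy.qza is only a placeholder (an assumption, not a file from this thread); substitute the FeatureData[Taxonomy] artifact produced by each classifier. The guard simply prints the command when qiime or the input file is not available.

```shell
# Placeholder name (assumption): replace with your own taxonomy artifact.
TAXONOMY=taxonomy.qza
# Derive the visualization name by swapping the .qza extension for .qzv.
VIZ="${TAXONOMY%.qza}.qzv"

if command -v qiime >/dev/null 2>&1 && [ -f "$TAXONOMY" ]; then
  # Tabulate the taxonomy artifact into a QZV that keeps its provenance.
  qiime metadata tabulate \
    --m-input-file "$TAXONOMY" \
    --o-visualization "$VIZ"
else
  echo "would run: qiime metadata tabulate --m-input-file $TAXONOMY --o-visualization $VIZ"
fi
```

Running this once per classifier output gives one QZV per taxonomy result, which can then be attached to the thread.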

The classifier should not output a different number of features — it should even report unclassified sequences as such. This is why I would like to see the QZVs; to inspect the provenance in some more detail.
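
While waiting on the QZVs, one way to see which features are absent from the smaller result is to compare the feature IDs in the two exported TSV files. The sketch below assumes each file is tab-separated, has the feature ID in the first column, a single header row, and possibly a "#q2:types" directive line; the helper names feature_ids and missing_ids are made up for illustration.

```shell
# Print the unique feature IDs (first column) of a tab-separated taxonomy
# export, skipping the header row and any line whose ID starts with "#".
feature_ids() {
  awk -F '\t' 'NR > 1 && $1 !~ /^#/ && $1 != "" { print $1 }' "$1" | sort -u
}

# Print the IDs that appear in file $1 but not in file $2.
missing_ids() {
  a=$(mktemp)
  b=$(mktemp)
  feature_ids "$1" > "$a"
  feature_ids "$2" > "$b"
  # comm -23 keeps lines unique to the first (sorted) input.
  comm -23 "$a" "$b"
  rm -f "$a" "$b"
}
```

For example, `missing_ids metadata_515-806.tsv metadata_342-805.tsv | wc -l` would count the features present in the first file but missing from the second. If the missing IDs are present in the representative sequences, the discrepancy most likely arose somewhere other than the classifier itself.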

The extraction parameters should not affect this either; you are trimming the reference sequences, not your query sequences.

You are correct: mockrobiota does not have any V3V4 datasets. It could certainly use some.

However, you could still test your classifier on a V4 dataset, since the 515f/806r primers used for those V4 datasets amplify a region that is almost entirely covered by the V3V4 primers you are using. The one caveat is that your reverse primer looks to be 2 nt longer than the commonly used 806r primer, which I think is what those V4 mock communities use. If you are using full-length V4 mock sequences (i.e., if you are able to merge the reads), you might need to trim off those 2 nt, though they probably won't make a big difference.
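
If trimming those 2 nt turns out to be worthwhile, a rough sketch on an exported FASTA file could look like the following. It assumes one sequence per line (unwrapped FASTA) and that the sequences are oriented so the reverse-primer end is at the end of each line; trim_3prime is a made-up helper name, not a QIIME 2 command.

```shell
# Remove the last N nucleotides from every sequence line of an unwrapped
# FASTA stream read from stdin; header lines (starting with ">") pass through.
trim_3prime() {
  awk -v n="$1" '/^>/ { print; next } { print substr($0, 1, length($0) - n) }'
}
```

Usage would be along the lines of `trim_3prime 2 < mock-v4.fasta > mock-v4-trimmed.fasta`, where both file names are placeholders. The trimmed file can then be re-imported as FeatureData[Sequence] with qiime tools import, as in the import step earlier in this thread.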
