Confidence Intervals greater than 1 in custom classifier

Hello,

This is my first time posting; please let me know if I should format my question differently.

**I suspect that I am not properly extracting reads as I attempt to create a custom SILVA classifier. Some of the taxonomic classifications have confidence values slightly greater than 1.**

Here are the commands I use to import and extract sequences:

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path SILVA_132_QIIME_release/rep_set/rep_set_16S_only/99/silva_132_99_16S.fna \
  --output-path 99_otus.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path SILVA_132_QIIME_release/taxonomy/16S_only/99/taxonomy_all_levels.txt \
  --output-path ref-taxonomy.qza

qiime feature-classifier extract-reads \
  --i-sequences 99_otus.qza \
  --p-f-primer AGAGTTTGATCMTGGCTCAG \
  --p-r-primer ACTCCTACGGGAGGCAGC \
  --p-trunc-len 325 \
  --p-trim-left 20 \
  --p-min-length 300 \
  --p-max-length 400 \
  --o-reads extracted_ref-seqs.qza

**Again, I suspect I am not using appropriate parameters in the above “feature-classifier extract-reads” command. Perhaps the trim, trunc, or length parameters?**

Here are the commands I used for training and testing the classifier, just in case:

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads extracted_ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier silva_16s_v1-v2_custom_nb-classifier.qza

qiime feature-classifier classify-sklearn \
  --i-classifier silva_16s_v1-v2_custom_nb-classifier.qza \
  --i-reads merged_SeqRun1and2_rep-seqs.qza \
  --o-classification taxonomy_silva_custom_v1-v2.qza

qiime metadata tabulate \
  --m-input-file merged_SeqRun1and2_rep-seqs.qza \
  --m-input-file taxonomy_silva_custom_v1-v2.qza \
  --o-visualization taxonomy_silva_custom_v1-v2.qzv

Background information; sequencing:

  • Ion Torrent PGM
  • V1 - V2 (27F and 355R)
  • Forward reads only
  • Ion Torrent adaptor- and barcode-tagged reads, already demultiplexed by the sequencing facility
  • Sequences still contain the 27F and 355R primers

Background information; denoising:

qiime dada2 denoise-pyro \
  --i-demultiplexed-seqs SeqRun1_imported.qza \
  --p-trim-left 20 \
  --p-trunc-len 325 \
  --o-representative-sequences SeqRun1_rep-seqs-dada.qza \
  --o-table SeqRun1_table-dada2.qza \
  --o-denoising-stats SeqRun1_stats-dada2.qza

I use the same exact parameters to denoise “SeqRun2”. The two sequence runs are different samples (not replicates).

I trim off the 20 nucleotide forward primer. I truncate at position 325 in both denoising runs, because (1) this removes the 18 nucleotide reverse primer sequence, and (2) both runs show a sequence quality drop off at that common position.
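As a quick sanity check on those numbers, here is a small shell sketch. The variable names are only illustrative, the primer strings are copied from the extract-reads command above, and the final length relies on DADA2's documented behavior that reads filtered with both settings end up with length truncLen minus trimLeft.

```shell
# Primer sequences copied from the extract-reads command above.
fwd_primer="AGAGTTTGATCMTGGCTCAG"   # 27F
rev_primer="ACTCCTACGGGAGGCAGC"     # sequence given for 355R
trim_left=20                        # --p-trim-left in denoise-pyro
trunc_len=325                       # --p-trunc-len in denoise-pyro

# ${#var} expands to the length of the string in characters.
echo "forward primer length: ${#fwd_primer}"
echo "reverse primer length: ${#rev_primer}"

# Length of the denoised query reads, assuming truncation is applied
# to the raw read and the first trim_left bases are then removed.
query_len=$((trunc_len - trim_left))
echo "denoised read length: ${query_len}"
```

If that assumption holds, the denoised query reads start immediately after the forward primer and are shorter than the truncation position by the trimmed amount.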

Background information; merging and grouping:

I run “feature-table merge” and “feature-table merge-seqs” to merge the feature-tables and rep seqs from both sequence/denoising runs.

Above, I stated that the “two sequence runs are different samples (not replicates)”. This is true, but I did have a few same-sample replicates within each of the two runs.

  • For example, in SeqRun1, I have microbiomeSample1, microbiomeSample1_again, microbiomeSample2, microbiomeSample3…etc.

  • Then, in SeqRun2, I have microbiomeSample101, microbiomesample102, microbiomesample103, microbiomesample103_again.

I run “feature-table group” to group these replicates. The inputs include the merged feature table and a custom metadata file to facilitate groupings.

Hi @Jasmine,
I apologize for the delay. Have you made any progress on troubleshooting?

Could you please share taxonomy_silva_custom_v1-v2.qzv so that I can take a look?

Note that in general naive Bayes classifiers are good at classifying but poor at estimating probabilities, so the “confidence” scores should not be taken too seriously.
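If it helps while troubleshooting, here is a minimal sketch for listing the features whose confidence exceeds 1. It assumes the classification has been exported with qiime tools export, and that the resulting taxonomy.tsv is tab-separated with one header line and the confidence in the third column. The function name list_overconfident and the output directory are only illustrative.

```shell
# Print feature ID and confidence for rows whose confidence is greater than 1.
# Assumes a tab-separated file with one header line and confidence in column 3.
list_overconfident() {
  awk -F'\t' 'NR > 1 && ($3 + 0) > 1 { print $1 "\t" $3 }' "$1"
}

# Example usage (the export directory name is an assumption):
#   qiime tools export \
#     --input-path taxonomy_silva_custom_v1-v2.qza \
#     --output-path exported-taxonomy
#   list_overconfident exported-taxonomy/taxonomy.tsv
```

Seeing how many features are affected, and by how much they exceed 1, should make it easier to judge whether this is a real problem.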

I see one small problem. extract-reads already trims off the primers, so setting --p-trim-left 20 in that action is unnecessary if you are using the same-named parameter in dada2 denoise-pyro only to remove the forward primer. As it stands, your reference sequences are trimmed by an extra 20 nt relative to the query sequences, which could affect classification.

That probably does not explain confidences > 1, but it is probably impacting your results.
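For concreteness, the extract-reads call without the extra trimming might look like the sketch below. Everything except the removed --p-trim-left is copied unchanged from your post, and I have not checked whether the remaining length parameters are appropriate. The wrapper function name and the check for qiime on the PATH are my own additions, there only so the snippet is safe to paste.

```shell
# Same extract-reads call as in the original post, minus --p-trim-left,
# so the reference reads start right after the forward primer.
extract_ref_reads() {
  qiime feature-classifier extract-reads \
    --i-sequences 99_otus.qza \
    --p-f-primer AGAGTTTGATCMTGGCTCAG \
    --p-r-primer ACTCCTACGGGAGGCAGC \
    --p-trunc-len 325 \
    --p-min-length 300 \
    --p-max-length 400 \
    --o-reads extracted_ref-seqs.qza
}

# Only run the command when QIIME 2 is available in the current environment.
if command -v qiime >/dev/null 2>&1; then
  extract_ref_reads
else
  echo "qiime not found on PATH; activate your QIIME 2 environment first"
fi
```

After re-extracting, the classifier would need to be retrained with fit-classifier-naive-bayes before rerunning classify-sklearn.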