Confidence Intervals greater than 1 in custom classifier

Jasmine · May 19, 2020, 5:47pm

Hello,

This is my first time posting; please let me know if I should format my question differently.

**I suspect that I am not properly extracting reads as I attempt to create a custom Silva classifier. Some of the taxonomic classifications have confidence intervals slightly greater than 1.**

Here are the commands I use to import and extract sequences:

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path SILVA_132_QIIME_release/rep_set/rep_set_16S_only/99/silva_132_99_16S.fna \
  --output-path 99_otus.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path SILVA_132_QIIME_release/taxonomy/16S_only/99/taxonomy_all_levels.txt \
  --output-path ref-taxonomy.qza

qiime feature-classifier extract-reads \
  --i-sequences 99_otus.qza \
  --p-f-primer AGAGTTTGATCMTGGCTCAG \
  --p-r-primer ACTCCTACGGGAGGCAGC \
  --p-trunc-len 325 \
  --p-trim-left 20 \
  --p-min-length 300 \
  --p-max-length 400 \
  --o-reads extracted_ref-seqs.qza

**Again, I suspect I am not using appropriate parameters in the above "feature-classifier extract-reads" command. Perhaps the trim, trunc, or length parameters?**

Here are the commands I used for training and testing the classifier, just in case:

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads extracted_ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier silva_16s_v1-v2_custom_nb-classifier.qza

qiime feature-classifier classify-sklearn \
  --i-classifier silva_16s_v1-v2_custom_nb-classifier.qza \
  --i-reads merged_SeqRun1and2_rep-seqs.qza \
  --o-classification taxonomy_silva_custom_v1-v2.qza

qiime metadata tabulate \
  --m-input-file merged_SeqRun1and2_rep-seqs.qza \
  --m-input-file taxonomy_silva_custom_v1-v2.qza \
  --o-visualization taxonomy_silva_custom_v1-v2.qzv

Background information; sequencing:

Ion Torrent PGM
V1 - V2 (27F and 355R)
Forward reads only
Ion Torrent adaptor- and barcode-tagged; already demultiplexed by the sequencing facility
Sequences still contain the 27F and 355R primers

Background information; denoising:

qiime dada2 denoise-pyro \
  --i-demultiplexed-seqs SeqRun1_imported.qza \
  --p-trim-left 20 \
  --p-trunc-len 325 \
  --o-representative-sequences SeqRun1_rep-seqs-dada.qza \
  --o-table SeqRun1_table-dada2.qza \
  --o-denoising-stats SeqRun1_stats-dada2.qza

I use exactly the same parameters to denoise "SeqRun2". The two sequence runs are different samples (not replicates).

I trim off the 20-nucleotide forward primer. I truncate at position 325 in both denoising runs because (1) this removes the 18-nucleotide reverse primer sequence, and (2) both runs show a sequence quality drop-off at that common position.
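The read length these two settings produce can be worked out with a little arithmetic. The following is a minimal sketch, assuming (as in DADA2's filtering step) that the truncation position is counted from the start of the raw read, so the trimmed primer comes out of the 325 nt that are kept, and that reads shorter than the truncation length are discarded. The function name and the raw read lengths are hypothetical.

```python
def denoised_read_length(raw_length, trim_left, trunc_len):
    """Length left after truncating at trunc_len and removing trim_left bases
    from the 5' end.

    Assumption: reads shorter than trunc_len are discarded (returns None).
    """
    if raw_length < trunc_len:
        return None
    return trunc_len - trim_left


forward_primer = "AGAGTTTGATCMTGGCTCAG"  # 27F, from the commands above

# Hypothetical raw read of 343 nt, long enough to survive truncation.
kept = denoised_read_length(raw_length=343,
                            trim_left=len(forward_primer),
                            trunc_len=325)
print(kept)  # 325 - 20 = 305

# Hypothetical raw read of 300 nt, too short to reach position 325.
print(denoised_read_length(raw_length=300, trim_left=20, trunc_len=325))
```

Under these assumptions every denoised sequence is 305 nt long, which is the length the reference reads would ideally match.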

Background information, merging and grouping:

I run "feature-table merge" and "feature-table merge-seqs" to merge the feature-tables and rep seqs from both sequence/denoising runs.

Above, I stated that the "two sequence runs are different samples (not replicates)". This is true, but I did have a few same-sample replicates within each of the two runs.

For example, in SeqRun1, I have microbiomeSample1, microbiomeSample1_again, microbiomeSample2, microbiomeSample3...etc.
Then, in SeqRun2, I have microbiomeSample101, microbiomesample102, microbiomesample103, microbiomesample103_again.

I run "feature-table group" to group these replicates. The inputs include the merged feature table and a custom metadata file to facilitate groupings.

Nicholas_Bokulich · May 29, 2020, 2:07pm

Hi @Jasmine,
I apologize for the delay. Have you made any progress on troubleshooting?

Could you please share this QZV (taxonomy_silva_custom_v1-v2.qzv) so that I can take a look?

Note that in general naive Bayes classifiers are good at classifying but poor at estimating probabilities, so the "confidence" scores should not be taken too seriously.

I see one little problem. extract-reads already trims off the primers, so setting --p-trim-left 20 in that action is unnecessary if the same-named parameter in dada2 denoise-pyro is only being used to trim off the forward primer. As it stands, your reference sequences are being trimmed more than the query seqs, which could affect classification.

That probably does not explain confidences > 1, but it is probably impacting your results.
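To make the mismatch concrete, here is a toy sketch in Python. It assumes, as described above, that extract-reads removes the primer first and then applies the extra 20 nt trim; the `region` sequence is made up purely for illustration, and reverse-primer handling and truncation are left out.

```python
FORWARD_PRIMER = "AGAGTTTGATCMTGGCTCAG"  # 27F, from the thread
TRIM_LEFT = 20

# Hypothetical amplified region that follows the primer (made-up bases).
region = "ACGT" * 15  # 60 nt

# Query read: still starts with the primer, so trimming 20 nt during
# denoising removes exactly the primer.
query = (FORWARD_PRIMER + region)[TRIM_LEFT:]

# Reference read: the primer is already removed by extraction, so trimming
# 20 nt more cuts into the region itself.
reference_after_extract = region
reference = reference_after_extract[TRIM_LEFT:]

offset = len(query) - len(reference)
print(offset)  # the reference starts 20 nt further in than the query
```

In this toy case the query covers the whole region while the reference is missing its first 20 nt, which is the kind of offset that setting the reference trim to 0 avoids.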

Jasmine · June 26, 2020, 4:14pm

Hello,

I am sorry for the late response; I decided to confirm data uploads with my PI first.

Thank you for the suggestion to set --p-trim-left 0 instead. I tried this and it does seem to help a bit with overall classifications! I also omitted the other 3 'length parameter' settings.

I still have features with confidence intervals slightly > 1 though. I'm attaching an example qzv file here.

The attached qzv file was classified with a custom v1-v2 Greengenes reference, but all the feature classifier steps are the same as those I'd specified for Silva earlier: 08b_01q3_ggCustomv1-v2Classifications.qzv (1.3 MB)

Thanks again,
Jasmine

Nicholas_Bokulich · June 30, 2020, 7:30pm

Hi @Jasmine,
The confidence scores look very marginally above 1.0, so this looks like a rounding error. These scores come directly from scikit-learn, so it is an upstream error that would need to be fixed in scikit-learn.

I think it is safe to ignore/round down for your purposes, given (1) the small magnitude of this rounding error, (2) the fact that these confidence scores are not used anywhere downstream (so the rounding error would not propagate), and (3) the fact that these probability scores are used only as a rough estimate of confidence anyway.
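If the values above 1.0 get in the way (for example when plotting or filtering), rounding down can be done in a line or two. This is a minimal sketch, not part of q2-feature-classifier: the `clip_confidence` helper, the feature names, and the example values are all hypothetical, and it assumes the confidence scores have been read in as strings or floats from an exported taxonomy table.

```python
def clip_confidence(value, upper=1.0):
    """Return the confidence as a float, capped at `upper`."""
    return min(float(value), upper)


# Hypothetical (feature ID, confidence) pairs; the first value is just
# above 1.0, as a floating-point rounding error would produce.
rows = [
    ("feature_a", "1.0000000000000002"),
    ("feature_b", "0.87"),
]

clipped = [(feature_id, clip_confidence(conf)) for feature_id, conf in rows]
for feature_id, conf in clipped:
    print(feature_id, conf)
```

Values at or below 1.0 pass through unchanged, so the cap only touches the scores affected by the rounding error.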

Or maybe it's just q2-feature-classifier's way of telling you that it's really confident about its classifications.

Let me know if you have any more questions or concerns!
