qiime2 feature-classifier classify-sklearn

Angela · November 29, 2021, 2:30pm

Hi everyone!

I am using qiime2 to try taxonomic classification of an amplicon sequencing ITS dataset.

I trained the classifier:

(qiime2-2020.11) [root@hpml350g8 4gen21]# qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads sh_refs_qiime_ver8_dynamic_all_04.02.2020_dev.qza --i-reference-taxonomy sh_taxonomy_qiime_ver8_dynamic_all_04.02.2020_dev.qza --o-classifier classifier_qiime_ver8_dynamic_all_04.02.2020_dev.qza

Then I did classification with sklearn:

(qiime2-2020.11) [root@hpml350g8 all_samples_BF2]# qiime feature-classifier classify-sklearn --i-classifier classifier_qiime_ver8_dynamic_all_04.02.2020_dev.qza --i-reads all_samples_rep-seqs-dn-99_BF2.qza --o-classification all_samples_taxonomy_BF2.qza

I noticed that sometimes, depending on the number of reads, the plugin “qiime feature-classifier classify-sklearn” produces an output where most of the features are “unidentified” and the few remaining features are classified in only two alternative ways.

In particular, I get very different results when I use reads from different dada2 denoising runs. One set of denoised reads, with a total of 1585260 reads (for 90 samples) and 1157 different features, was classified as follows

while a second set, where I had a total of 1282485 reads (for the same 90 samples) and 1135 different features, was classified as follows

In the two classifications shown here, features with the same ID (that are exactly the same) are classified totally differently.
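
One way to pin down exactly which shared feature IDs disagree is to compare the two exported taxonomy tables. The sketch below assumes each classification was exported with `qiime tools export` to a tab-separated `taxonomy.tsv` whose first two columns are the feature ID and the taxon. The function name `diff_taxonomy` and the directory names in the usage comment are hypothetical.

```shell
# diff_taxonomy A.tsv B.tsv
# Print feature IDs present in both tables whose Taxon strings differ.
# Assumes tab-separated files with one header line: Feature ID, Taxon, ...
diff_taxonomy() {
  awk -F'\t' '
    FNR == 1  { next }                    # skip the header line of each file
    NR == FNR { taxon[$1] = $2; next }    # first file: remember taxon per ID
    ($1 in taxon) && taxon[$1] != $2 {    # second file: shared ID, other label
      print $1 "\t" taxon[$1] "\t" $2
    }
  ' "$1" "$2"
}

# Hypothetical usage, after exporting both classifications:
#   diff_taxonomy run_a/taxonomy.tsv run_b/taxonomy.tsv
```

Each output line holds the feature ID, its taxon in the first table, and its taxon in the second, so a long list points at a systematic difference rather than a handful of borderline calls.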

I really can’t understand where the problem is.

Thanks to everyone that could help me!
Angela

colinbrislawn · November 30, 2021, 8:25pm

Hello Angela,

We have a mystery on our hands!

That's not supposed to happen! The same classifier should always report the same value for the same ASVs. Of course, using a different classifier with --i-classifier will give you different results.

Would you be willing to help us replicate this bug? If you could post the full command you ran along with the input files, we can see how it works on our systems.

Colin

P.S. Welcome to the forums!

Angela · December 8, 2021, 6:13pm

Hello Colin,
Thank you for your answer. It’s wonderful for me to be in the forum!

I attach the link to my file. I hope it will work.

https://www.dropbox.com/s/akda5ki0yihzwet/reads_131720_trimmed.qza?dl=0

And here follow two series of commands:

This gave me a lot of “unidentified” ASVs:

(qiime2-2020.11) [root@hpml350g8 7dic21]# qiime dada2 denoise-paired --i-demultiplexed-seqs reads_131720_trimmed.qza --o-table otu_table_131720 --o-representative-sequences otus_131720 --o-denoising-stats dada2_stats_131720 --p-trunc-len-f 264 --p-trunc-len-r 170

(qiime2-2020.11) [root@hpml350g8 7dic21]# qiime vsearch cluster-features-de-novo --i-table otu_table_131720.qza --i-sequences otus_131720.qza --p-perc-identity 0.99 --o-clustered-table table-dn-99_131720.qza --o-clustered-sequences rep_seqs-dn-99_131720.qza

(qiime2-2020.11) [root@hpml350g8 7dic21]# qiime feature-classifier classify-sklearn --i-classifier classifier_qiime_ver8_dynamic_all_04.02.2020_dev.qza --i-reads rep_seqs-dn-99_131720.qza --o-classification taxonomy_131720.qza

This gave me a good classification:

(qiime2-2020.11) [root@hpml350g8 7dic21]# qiime dada2 denoise-paired --i-demultiplexed-seqs reads_131720_trimmed.qza --o-table otu_table_131720_2 --o-representative-sequences otus_131720_2 --o-denoising-stats dada2_stats_131720_2 --p-trunc-len-f 264 --p-trunc-len-r 198

(qiime2-2020.11) [root@hpml350g8 7dic21]# qiime vsearch cluster-features-de-novo --i-table otu_table_131720_2.qza --i-sequences otus_131720_2.qza --p-perc-identity 0.99 --o-clustered-table table-dn-99_131720_2.qza --o-clustered-sequences rep_seqs-dn-99_131720_2.qza

(qiime2-2020.11) [root@hpml350g8 7dic21]# qiime feature-classifier classify-sklearn --i-classifier classifier_qiime_ver8_dynamic_all_04.02.2020_dev.qza --i-reads rep_seqs-dn-99_131720_2.qza --o-classification taxonomy_131720_2.qza

Thank you in advance for your help
Angela

colinbrislawn · December 8, 2021, 6:56pm

Hello again Angela,

Ah, this is becoming more clear. While your --i-classifier is the same, changes in your dada2 --p-trunc-len-r will cause changes in truncation, which can cause changes in joining, that propagate downstream.
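
To see why the truncation length matters for joining: the longest amplicon a read pair can span is the sum of the two truncated lengths minus the overlap needed to merge them. The arithmetic below assumes a 12-nucleotide minimum overlap, which is the default in DADA2's pair merging; the value used by a given q2-dada2 version should be checked. The function name is hypothetical.

```shell
# Longest mergeable amplicon = forward length + reverse length - minimum overlap.
# The default overlap of 12 nt is an assumption; pass a third argument to change it.
max_merged_len() {
  echo $(( $1 + $2 - ${3:-12} ))
}

max_merged_len 264 170   # truncation used in the first run, prints 422
max_merged_len 264 198   # truncation used in the second run, prints 450
```

Under that assumption, amplicons longer than the first limit cannot be joined when the reverse reads are truncated at 170, so the two runs can end up with different sets of sequences.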

Have you compared your two --o-denoising-stats files to see how many reads were able to join, and how long, on average, were the ones that did?

Colin

Angela · December 10, 2021, 9:05am

Hello Colin,
I compared the two --o-denoising-stats files. After dada2 denoising and filtering I obtained the following numbers of non-chimeric sequences. I see that if I truncate my reverse reads at 170 I obtain more sequences, maybe because this way I keep only the better-quality part of the reverse reads.

264_170
sample-id   input     filtered   denoised   merged    non-chimeric
#q2:types   numeric   numeric    numeric    numeric   numeric
OLW13       62082     36641      36627      36525     31960
OLW17       116803    77743      77413      76852     64831
OLW20       32907     8060       7995       7774      7254

264_198
sample-id   input     filtered   denoised   merged    non-chimeric
#q2:types   numeric   numeric    numeric    numeric   numeric
OLW13       62082     32592      32580      32489     28430
OLW17       116803    68917      68601      67936     57580
OLW20       32907     3073       2996       2858      2684
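
The share of input reads that survives to the end can be read off these tables directly. As a rough sketch, the function below computes the non-chimeric percentage per sample from a tab-separated stats table (for example the `stats.tsv` written by `qiime tools export`); it finds the `input` and `non-chimeric` columns by header name, so extra columns are ignored. The function name `retention` is hypothetical, and the sample rows are two of the 264_170 counts quoted above.

```shell
# retention STATS.tsv
# Print "sample-id <TAB> percent of input reads kept as non-chimeric".
retention() {
  awk -F'\t' '
    NR == 1      { for (i = 1; i <= NF; i++) col[$i] = i; next }  # map header names
    /^#q2:types/ { next }                                         # skip the types row
    { printf "%s\t%.1f\n", $1, 100 * $(col["non-chimeric"]) / $(col["input"]) }
  ' "$1"
}

# Counts from the 264_170 run in this thread, written to a temporary file.
stats=$(mktemp)
printf 'sample-id\tinput\tfiltered\tdenoised\tmerged\tnon-chimeric\n' > "$stats"
printf '#q2:types\tnumeric\tnumeric\tnumeric\tnumeric\tnumeric\n' >> "$stats"
printf 'OLW13\t62082\t36641\t36627\t36525\t31960\n' >> "$stats"
printf 'OLW20\t32907\t8060\t7995\t7774\t7254\n' >> "$stats"
retention "$stats"
rm -f "$stats"
```

Running the same function on both stats files gives a per-sample comparison of the two truncation settings.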

Truncation at 170 also gives me sequences that are, on average, longer than those obtained with truncation at 198:

Sequence length statistics by --p-trunc-len-r value

Statistic   170       198
count       49        42
min         268       268
max         415       443
mean        361.939   357.833
range       147       175
std         385.624   395.042

So I think that truncation at 170 is the best for me, but this way I can’t get a good classification.

Most of all, what I can’t understand is why, with the two alternative truncations, I obtain features with the same ID that are classified differently.

Thanks,

Angela

Nicholas_Bokulich · December 10, 2021, 11:40am

Hi @Angela and @colinbrislawn ,

This is not a bug; it is most likely due to mixed read orientations. Trimming to different lengths is probably changing the inclusion or order of the sequences that are used for read orientation prediction. See here for an explanation:

Some of your reads are being poorly classified no matter the direction, so it could also be that there are many non-target reads in your samples that are interfering with the orientation detector (since the orientation is chosen based on match to a reference sequence).

You can also specify the read orientation if this is known and you do not want classify-sklearn to choose for you.

Angela · December 13, 2021, 2:39pm

Hi @Nicholas_Bokulich,

it worked!

I added --p-read-orientation same to the classification step and finally classification made sense, regardless of the setting for truncation during denoising and of the number of sequences I had.
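
For reference, a minimal sketch of the classification step with the orientation fixed, using the file names from earlier in this thread. `--p-read-orientation` accepts `same`, `reverse-complement`, or `auto`; `same` is only appropriate when the reads are already in the same orientation as the reference sequences. The wrapper function name is hypothetical.

```shell
# classify_with_orientation READS.qza OUT.qza [ORIENTATION]
# Run classify-sklearn with the read orientation set explicitly
# instead of letting the plugin detect it (ORIENTATION defaults to "same").
classify_with_orientation() {
  qiime feature-classifier classify-sklearn \
    --i-classifier classifier_qiime_ver8_dynamic_all_04.02.2020_dev.qza \
    --i-reads "$1" \
    --p-read-orientation "${3:-same}" \
    --o-classification "$2"
}

# Usage with the files from this thread:
#   classify_with_orientation rep_seqs-dn-99_131720.qza taxonomy_131720.qza
```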

My problem is solved! Thank you both for your help,

Angela

system · January 13, 2022, 8:39pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.