Train classifier on single-end sequence without reverse primer

I want to train a classifier for taxonomy assignment. Unfortunately, the only primer information I have is the LinkerPrimerSequence, TATGGTAATTGTGTGYCAGCMGCCGCGGTAA. The reads are single-end, 150 bp long, from the V4 region.
Following the tutorial, I need to extract reads before training the classifier. Here I want to use my own primer sequence as the forward primer and the reverse primer sequence (806R) provided by the tutorial:

qiime feature-classifier extract-reads \
  --i-sequences gg_13_8_otus/rep_set/99_otus.qza \
  --p-trunc-len 150 \
  --o-reads trunc-99-otus-seqs.qza

Then I train the classifier:

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads trunc-99-otus-seqs.qza \
  --i-reference-taxonomy ref-99-taxonomy.qza \
  --o-classifier classifier.qza

Did I follow the correct path?



Hi @lindd,
It looks like you are using the 515f/806r primer pair. So I would recommend just using the pre-trained V4 classifiers available here. If you do want to train your own classifier, you can just use the primer sequences used in the tutorial.
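
For reference, here is a minimal sketch of what extract-reads could look like with the tutorial's 515F/806R primers. The file names and the 150 nt truncation length are taken from the question above, so adjust them to your own data. The guard around the call is an addition for illustration only, so that the snippet does nothing harmful on a machine without QIIME 2 or without the reference file.

```shell
# Primer pair used in the QIIME 2 feature-classifier tutorial (515F/806R).
F_PRIMER="GTGCCAGCMGCCGCGGTAA"
R_PRIMER="GGACTACHVGGGTWTCTAAT"

# Reference sequences and output name as given in the question above.
REF_SEQS="gg_13_8_otus/rep_set/99_otus.qza"
OUT_READS="trunc-99-otus-seqs.qza"

# Extract the region between the primers and truncate it to the read length.
run_extract_reads() {
  qiime feature-classifier extract-reads \
    --i-sequences "$REF_SEQS" \
    --p-f-primer "$F_PRIMER" \
    --p-r-primer "$R_PRIMER" \
    --p-trunc-len 150 \
    --o-reads "$OUT_READS"
}

# Only run when QIIME 2 and the reference file are actually available.
if command -v qiime >/dev/null 2>&1 && [ -f "$REF_SEQS" ]; then
  run_extract_reads
else
  echo "qiime or $REF_SEQS not found; skipping extract-reads"
fi
```

The resulting artifact can then be passed to fit-classifier-naive-bayes as --i-reference-reads.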

You should not train a classifier with the linkerprimer sequence that you listed there — that contains non-biological DNA and hence will result in very few extracted reads.
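
To make the point concrete, here is a small sketch that splits the construct into its non-biological prefix and the actual primer. It assumes the first 12 bases (TATGGTAATT plus GT) are a pad and linker, as in the commonly used 515F construct; that split is an assumption, so check it against your own sequencing protocol.

```shell
# Linker-primer construct quoted in the question above.
LINKER_PRIMER="TATGGTAATTGTGTGYCAGCMGCCGCGGTAA"

# Assumption: the first 12 bases are a non-biological pad + linker
# (TATGGTAATT + GT) that is not present in 16S reference sequences.
PAD_LINKER="TATGGTAATTGT"

# Remove the prefix with POSIX parameter expansion; the remainder is the
# part that can match a reference sequence.
FWD_PRIMER="${LINKER_PRIMER#"$PAD_LINKER"}"

echo "$FWD_PRIMER"
```

What remains, GTGYCAGCMGCCGCGGTAA, differs from the tutorial's forward primer only at one degenerate base (Y in place of C).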

I hope that helps!

Thanks @Nicholas_Bokulich. Just wanted to post an update. I ran the trimming and then trained the classifier. The accuracy looks a little better than with the classifier trained on the whole V4 region or on full-length sequences: some OTUs have higher confidence and are classified to more specific taxa. But I think this slight difference may not change the statistical results much.



Hi @lindd,

That’s pretty consistent with my own findings, particularly vs. full 16S. I find that extracting the primers helps, but trimming to the amplicon size has much less of an effect, so I usually just suggest that users use the pre-trained V4 classifier if those are the primers they are using.

Unless you know the correct assignment, deeper classification is not necessarily better… How are you measuring accuracy? Are you testing this on simulated or artificial/mock communities where you know the true composition of the sample?

Thanks for following up! Glad to hear our results are aligning…

Hi @Nicholas_Bokulich,
You are right. I don’t know the ground truth for these sequences, so I cannot claim whether the classification is improved. The classifiers trained on the trimmed sequences and on the V4 region do not differ much in the analysis, so I will just pick either of them :slightly_smiling_face:

Thanks again for your prompt and helpful response.


