different results from self-trained full-length and v4 classifiers

Hi all,

I trained two classifiers from a custom database containing hundreds of 16S rRNA genes, one on the full-length sequences and one on the V4 region only, and classified my dataset with each of them. The results from the two classifiers turned out to be very different from each other. Is that normal?

Here are the commands I used to train the classifiers, followed by the assignment results:
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path pathogen.seqs.fasta \
  --output-path pathogen.seqs.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path taxonomy.pathogen.txt \
  --output-path taxonomy.pathogen.qza

qiime feature-classifier extract-reads \
  --i-sequences pathogen.seqs.qza \
  --p-f-primer GTGCCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACVSGGGTATCTAAT \
  --p-trunc-len 273 \
  --o-reads ref-seqs-v4.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs-v4.qza \
  --i-reference-taxonomy taxonomy.pathogen.qza \
  --o-classifier classifier-v4.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads pathogen.seqs.qza \
  --i-reference-taxonomy taxonomy.pathogen.qza \
  --o-classifier classifier-t.qza

taxa-bar.pathogen.v4.qzv (329.5 KB) taxa-bar.pathogen.t.qzv (324.5 KB)

Thanks in advance!

Xiaolan

Good morning Xiaolan,

Welcome to the QIIME 2 forums!

Yes, that sounds normal. Different databases should provide different taxonomy results.

Which two databases did you use? Have you tried using a ‘standard’ database like SILVA?

Colin

Morning @colinbrislawn,

Thank you for your reply!
Actually, we built the database ourselves. It contains hundreds of pathogen 16S rRNA sequences, and we use it to investigate potential pathogens in our samples. I then trained a full-length classifier and a V4-only classifier (based on our primers) from it to see whether there is any difference.

Yes, I also tried a classifier trained on SILVA, but it may not be the best choice for our goal.

Thanks a lot!

Xiaolan

Hi @Xiaolan_Lin,
As @colinbrislawn noted, differences are expected between classifiers trained on full-length 16S vs. variable regions. The size of your differences may be unexpected, but I see one major issue that should be addressed before evaluating them:

You have a very large number of unclassified reads. This appears to be due to the way you have structured the taxonomy annotations, see this topic for more details:

Once that’s fixed you will be able to better evaluate how different these classifiers are performing.
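
One quick way to look at how the taxonomy strings are structured is to tally how many ranks each one contains. This is only a minimal sketch: it assumes a headerless two-column TSV (sequence ID, then a semicolon-delimited taxonomy string), like the file imported above with HeaderlessTSVTaxonomyFormat, and the helper name `rank_depths` is hypothetical.

```shell
# Hypothetical helper: tally the number of semicolon-delimited ranks per
# taxonomy string in a headerless two-column TSV (ID <tab> taxonomy).
# Output: one line per depth, formatted as "<number of ranks> <number of entries>".
rank_depths() {
  awk -F'\t' '
    { n = split($2, ranks, ";"); count[n]++ }
    END { for (n in count) print n, count[n] }
  ' "$1" | sort -n
}

# Example call (file name taken from the import command in this thread):
#   rank_depths taxonomy.pathogen.txt
```

If the depths vary a lot between entries, or every entry has a depth of 1, the annotations are probably worth a closer look before retraining.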

Something else to evaluate is how many sequences are being lost during read extraction. If your primers have poor coverage, you could be losing many reference sequences in the process. Low coverage (lots of sequences lost after read extraction) will impact classification results… it will make a low-quality classifier!
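
A rough way to check this is to count the reference sequences before and after read extraction. The sketch below assumes both artifacts have already been exported to FASTA with `qiime tools export`; the export directory names and the helper names `count_seqs` and `retained_pct` are hypothetical.

```shell
# Export both artifacts first; each export directory should then contain
# a dna-sequences.fasta file (directory names here are placeholders):
#   qiime tools export --input-path pathogen.seqs.qza --output-path full-export
#   qiime tools export --input-path ref-seqs-v4.qza --output-path v4-export

# Hypothetical helper: count FASTA records (header lines start with '>').
count_seqs() {
  grep -c '^>' "$1"
}

# Hypothetical helper: percentage of reference sequences that survived
# read extraction. Arguments: full-length FASTA, extracted-reads FASTA.
retained_pct() (
  total=$(count_seqs "$1")
  kept=$(count_seqs "$2")
  awk -v k="$kept" -v t="$total" \
    'BEGIN { printf "%.1f\n", (t > 0) ? 100 * k / t : 0 }'
)

# Example call:
#   retained_pct full-export/dna-sequences.fasta v4-export/dna-sequences.fasta
```

A low percentage would suggest that the primers are missing a large share of the reference sequences, so the V4 classifier was trained on far fewer taxa than the full-length one.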

One issue to be aware of when making a custom database is that you want to have “outgroups” in your data; otherwise the classifier could be prone to false-positive identification because you are overfitting to a restricted set of possible taxa. SILVA may not be the best fit for your goal, e.g., because misannotated and unannotated sequences are a known issue in this and most reference databases, but making a custom database is a more challenging task than first meets the eye! You are probably aware, but I just want to point this out for others following along.

Hi @Nicholas_Bokulich,

Thank you so much! Your answer is so detailed and helpful! I really appreciate it!

Xiaolan
