Training classifiers: performance of full-length vs. extract-reads

Thank you @Mehrbod_Estaki for the question and @Nicholas_Bokulich for the explanation.

So does this mean that the full-length 16S taxonomy classifiers are “safe” to use as classifiers for specific variable regions (e.g., the V4V5 region), instead of the classifier trained on SILVA 132 99% OTUs from the 515F/806R region?
Would the results be vastly different (barring mixed-orientation biases, as observed in #3975)?

Hi @drish_k,

Absolutely safe, and more info on these considerations is given in the training classifiers tutorial at qiime2.org

Performance is slightly better with the trimmed classifier; see the q2-feature-classifier paper.

Not vastly better; if you see dramatic differences, then something probably went wrong. There are several discussions on this forum of issues stemming from extract-reads, e.g., primers mispriming or failing to hit (because the wrong primers were used, or because they include some non-biological sequence).
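For reference, the extract-reads workflow described in that tutorial looks roughly like this for the 515F/806R pair (a sketch; ref-seqs.qza and ref-taxonomy.qza are placeholder names for an imported reference such as SILVA):

```
# Trim the full-length reference down to the region amplified by 515F/806R
qiime feature-classifier extract-reads \
  --i-sequences ref-seqs.qza \
  --p-f-primer GTGYCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACNVGGGTWTCTAAT \
  --o-reads ref-seqs-515-806.qza

# Train a Naive Bayes classifier on the extracted reads
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs-515-806.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier-515-806.qza
```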

Thanks for sharing the ref, @Nicholas_Bokulich.

I think I understand and agree with the major conclusions, but I’m confused as to why I see a difference between the full-length classifier and the classifier trained on the V4 region when classifying mock community sequences, with the latter yielding the more accurate results.
For example, I’ve attached a file here with three sequences; the first two are from mock communities and the third is from a test sample.
test_mock.qza (5.4 KB)

Below are the taxonomic classification results:

full-length classifier (silva-132-99-nb-classifier.qza) #false negatives

| Feature ID | Taxon | Confidence |
| --- | --- | --- |
| e1aee885aa820fc1bbc8eea6c27cdc3d | Unassigned | 0.570863742 |
| 79f37fee0660e917bd1debe546718bad | Unassigned | 0.502492478 |
| 5c4eda70ac981b577563232a58c14b61 | D_0__Eukaryota;D_1__Archaeplastida;D_2__Chloroplastida;D_3__Charophyta;D_10__Solanales;D_11__Capsicum | 0.744176858 |

classifier trained on V4 region (silva-132-99-515-806-nb-classifier.qza) #true positives

| Feature ID | Taxon | Confidence |
| --- | --- | --- |
| e1aee885aa820fc1bbc8eea6c27cdc3d | D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Lactobacillales;D_4__Lactobacillaceae;D_5__Lactobacillus | 0.999999825 |
| 79f37fee0660e917bd1debe546718bad | D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Enterobacteriales;D_4__Enterobacteriaceae;D_5__Enterobacter;D_6__Enterobacter sp. UCD-UG_FMILLET | 0.99430363 |
| 5c4eda70ac981b577563232a58c14b61 | D_0__Eukaryota | 0.863647222 |

classification with SINA (v1.2.11) #actual classification for the sequences

| sequence_identifier | identity (%) | lca_tax_slv |
| --- | --- | --- |
| e1aee885aa820fc1bbc8eea6c27cdc3d | 100 | Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus; |
| 79f37fee0660e917bd1debe546718bad | 100 | Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Salmonella; |
| 5c4eda70ac981b577563232a58c14b61 | 61.0478 | Unclassified; |

Shouldn’t I be getting relatively similar results from the first two analyses?
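For context, both QIIME 2 results above came from classify-sklearn, with only the classifier artifact differing between runs (a sketch; the output names are placeholders):

```
# Classify the three test sequences with the full-length classifier
qiime feature-classifier classify-sklearn \
  --i-classifier silva-132-99-nb-classifier.qza \
  --i-reads test_mock.qza \
  --o-classification taxonomy-full-length.qza

# Repeat with the V4 (515F/806R) classifier
qiime feature-classifier classify-sklearn \
  --i-classifier silva-132-99-515-806-nb-classifier.qza \
  --i-reads test_mock.qza \
  --o-classification taxonomy-v4.qza
```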

Thanks for the detailed response!

There is a reference in that paper that previously made this comparison; I think this is the one: https://www.ncbi.nlm.nih.gov/pubmed/22170427

In general, though, the differences between full-length and trimmed classifiers are pretty subtle: the trimmed classifier usually gives better species-level classifications (or, in other cases, does worse if there is an error during trimming, e.g., the primers are missing from some reference sequences).

The differences that you are seeing are quite drastic, much greater than I’ve seen before in my own comparisons (as in the first reference I shared).

The explanation: it seems quite likely that the full-length classifier contains non-target DNA. SILVA contains both 16S and 18S, and it looks like that is what you used, so inclusion of the 18S sequences could be confusing the classifier: there are 18S sequences with kmer frequencies similar to those of the target 16S sequences. Did you intend to use the combined SILVA 16S + 18S reference? If not, try re-running this test with the 16S-only dataset.

Some SILVA releases (like all reference databases!) also contain misannotated sequences, which will throw off the classifier for the same reason: blurred kmer profiles. Trimming to the amplified region excludes much of that non-target DNA, effectively denoising the kmer profiles used for training; using the full-length sequences does not.
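If a 16S-only reference is what you want, one option is to filter the reference by taxonomy before training (a sketch, assuming the SILVA sequences and taxonomy are imported as the placeholder artifacts ref-seqs.qza and ref-taxonomy.qza):

```
# Keep only bacterial and archaeal (16S) reference sequences;
# records annotated as Eukaryota (18S) are dropped
qiime taxa filter-seqs \
  --i-sequences ref-seqs.qza \
  --i-taxonomy ref-taxonomy.qza \
  --p-include D_0__Bacteria,D_0__Archaea \
  --o-filtered-sequences ref-seqs-16S-only.qza
```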

So I suppose I should add a caveat to my answer to your original question:

Yes, usually it is safe, but there are exceptions (technical issues, including non-target DNA present in the database; it depends on the database, and it depends on the marker gene!). Since you have mock communities, it is a good opportunity to “check” (and optimize) your methods before choosing the best path… and you can also use the mock communities from mockrobiota to see how well this generalizes to samples with different communities.
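One concrete way to run that check is q2-quality-control, which can compare the observed mock composition against the known composition (a sketch; both inputs are relative-frequency feature tables collapsed to matching taxonomic labels, and the file names are placeholders):

```
# Compare observed vs. expected mock community composition
qiime quality-control evaluate-composition \
  --i-expected-features mock-expected.qza \
  --i-observed-features mock-observed.qza \
  --o-visualization mock-comparison.qzv
```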

Thanks again for the detailed explanation! This is very helpful!

> it seems quite likely that the full-length classifier contains non-target DNA. SILVA contains both 16S and 18S, and it looks like that is what you used…

…ok, this makes sense!
So, the primers used for this marine microbiome dataset, based on this study, are supposed to capture 16S plus a few 18S sequences, which is why I want to use an all-inclusive database (SILVA 132/rep_set_all/99/silva132_99.fna) to annotate everything at once, rather than classifying reads separately with the 16S-only and 18S-only databases and then choosing between the results based on confidence scores.

[To clarify, I had already tried generating a primer-trimmed classifier, but the results were identical to the false-negative annotations from the full-length classifier, so I didn’t include them in my previous post.]

This is what we’ve recommended in the past, but it definitely does not seem to be working for you. It could be a quirk of the mock community you are using, the amplicon target, or the latest SILVA release. Training separate 16S and 18S classifiers should improve performance, and since you have a mock community to verify the results, this is the ideal approach.

You could also check out classify-consensus-vsearch as an alternative method.
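A sketch of what that could look like for your three test sequences (the reference artifact names are placeholders, and depending on your QIIME 2 version there may be an additional search-results output):

```
# Alignment-based consensus taxonomy assignment with VSEARCH
qiime feature-classifier classify-consensus-vsearch \
  --i-query test_mock.qza \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classification taxonomy-vsearch.qza
```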

Alright, that was going to be my next step: using classify-consensus-vsearch instead of a pre-trained NB classifier. I’ve seen a couple of other posts on the forum where that worked better for people.

Thanks @Nicholas_Bokulich, this was immensely helpful! I’ll keep you posted on how this goes.
