Poor classification of MOCK community

Robin_Mjelle · April 14, 2020, 4:38pm

I have performed 16S seq on this mock community:
https://www.lgcstandards-atcc.org/products/all/MSA-2002.aspx

When using Kraken, I am able to find 19 of the 20 species within the community. However, when using Qiime2 and the four Naive Bayes classifiers trained on Silva and Greengenes, I am only able to identify a few of the species.
Any idea how to improve accuracy?

Nicholas_Bokulich · April 14, 2020, 4:57pm

Hi @Robin_Mjelle,

What primer set are you using? You will want to make sure that the classifier is trained on the same subdomain for optimal accuracy.

You can reduce the --p-confidence setting to increase recall at the expense of precision (in other words, you will get more species-level classifications)

These classifiers are designed to be quite conservative, i.e., reduce false-positives at the expense of species-level classification. So seeing mostly genus-level classifications on 16S fragments with default settings is quite normal, and is designed that way to avoid errors.

Note that mock communities are generally made up of well-characterized, easy-to-classify species, so adjusting confidence parameters to achieve perfect species-level classification on mock communities is usually quite achievable... in this paper we showed that we could get perfect classification of some mock communities with any classification method by adjusting the parameters one way... but those settings often proved fatal for classification of more difficult-to-classify species elsewhere on the tree of life. This is how we landed on the conservative default settings in q2-feature-classifier.

So feel free to adjust confidence — we even have recommendations in that paper linked above for improving recall — but be aware that better species-level classification on a mock community is sometimes an indicator of an "overly confident" classifier that can make mistakes!

Robin_Mjelle · April 14, 2020, 5:06pm

Hi,
these are the primers, however, I am not sure how to train using these specific primers:

"16S Amplicon PCR Forward Primer: “5'TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG”

16S Amplicon PCR Reverse Primer: 5' GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC"

Is this how I train? Could I use taxonomy_silva-132-99.qza for training?
My reads are 300nts long.

qiime feature-classifier extract-reads
--i-sequences taxonomy_silva-132-99.qza
--p-f-primer TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG
--p-r-primer GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC
--p-trunc-len 120
--p-min-length 100
--p-max-length 400
--o-reads ref-seqs.qza

Nicholas_Bokulich · April 14, 2020, 5:11pm

Looks like you are targeting 341-805 (V3-V4) — so this could explain part of your issue (the V4 pre-trained classifiers will perform poorly since they only cover half of your amplicons; the full-length classifiers are sub-optimal; a V3-V4 classifier would be best)

Those are the adapters + primers, so contain some non-biological sequence that you want to exclude when extracting reads for training a primer. So your example command would not work — you would not extract any reads because the adapter portion would not match anything.

I have just the thing for you! See this topic:

You can follow those directions to train your own V3-V4 classifier... and please let me know what you find!

Robin_Mjelle · April 14, 2020, 5:33pm

Stupid question: Where do I find the 99_otus.fasta

Nicholas_Bokulich · April 14, 2020, 5:37pm

download the full gg_13_8 release (check out the data resources page at qiime2.org, the full database is linked below the pre-trained classifiers that you used)
unzip
the file will be located here: gg_13_8_otus/rep_set/99_otus.fasta

Those instructions could be used with any 16S reference database... so you could use SILVA or Greengenes or GTDB, or the best comparison to Kraken would be to use the same exact reference database that was used to train Kraken.

Robin_Mjelle · April 15, 2020, 1:56pm

Hi,
I am still not able to classify very well the mock community. Several wrong genera are identified.
Here is my entire pipeline, I used this reference file:

https://greengenes.secondgenome.com/?prefix=downloads/greengenes_database/gg_12_10/:

Make classifyer

qiime tools import
--input-path gg_12_10.fasta
--output-path ./gg_12_10.fasta.qza
--type 'FeatureData[Sequence]'

qiime tools import
--input-path gg_12_10_taxonomy.txt
--output-path ./gg_12_10_taxonomy.gza
--type 'FeatureData[Taxonomy]'
--input-format HeaderlessTSVTaxonomyFormat

qiime feature-classifier extract-reads
--i-sequences ./gg_12_10.fasta.qza
--p-f-primer CCTACGGGNGGCWGCAG
--p-r-primer GACTACHVGGGTATCTAATCC
--p-trunc-len 450
--p-min-length 100
--p-max-length 600
--o-reads ./gg_12_10_341_805.qza

qiime feature-classifier fit-classifier-naive-bayes
--i-reference-reads ./gg_12_10_341_805.qza
--i-reference-taxonomy ./gg_12_10_taxonomy.gza.gza
--o-classifier ./gg_12_10_341_805_classifer.qza

Run it

qiime feature-classifier classify-sklearn
--i-classifier gg_12_10_341_805_classifer.qza
--i-reads dada2_rep_set.qza
--p-confidence 0
--o-classification taxonomy_gg_12_10_341_805_confidence0.qza

qiime metadata tabulate
--m-input-file taxonomy_gg_12_10_341_805_confidence0.qza
--o-visualization taxonomy_gg_12_10_341_805_confidence0.qzv

qiime taxa barplot
--i-table ./table_2k.qza
--i-taxonomy ./taxonomy_gg_12_10_341_805_confidence0.qza
--m-metadata-file ./metadata.tsv
--o-visualization ./taxa_barplot_taxonomy_gg_12_10_341_805_confidence0.qzv

Nicholas_Bokulich · April 15, 2020, 3:04pm

Hi @Robin_Mjelle,
A few issues here:

gg_12_10 is ancient, 13_8 is also quite old but contains various updates to taxonomy, etc. I recommend using 13_8 at the very least, but trying out the latest release of SILVA or GTDB would be better. The best thing, if you are comparing this against kraken or other methods, is to use the exact same reference database for all methods, so that you know that differences are due to the method, not the database.

Zero confidence is akin to top-hit blast; the classifier will just grab the closest hit without considering other similar hits. This is good for testing but not the most accurate method (results may vary: it depends on the contents of the mock community and the reference database but in general confidence=0 is a poor choice)

This is beginning to sound quite suspicious and I wonder if some technical error is afoot. Can you tell me a little more about your sequencing protocol. Do you know if the reads are in mixed orientations? Could you share a QZV barplot of your results?

Thanks!

Robin_Mjelle · April 15, 2020, 4:49pm

Thanks,
18 of the 20 species were detected using https://gtdb.ecogenomic.org/

Nicholas_Bokulich · April 15, 2020, 4:51pm

Excellent! That's more in line with expected performance. What species were missing and what false positives did you get? Note that GTDB has an altered taxonomy (to better reflect the molecular phylogeny), so any species you are missing just might have been renamed... you can search for the missing species on the GTDB site to see if they were renamed.

Robin_Mjelle · April 15, 2020, 4:59pm

Some false positives. I can send the .csv file if you are interested.

Staphylococcus aureus and Staphylococcus epidermidis seem to be missing.

Staphylococcus argenteu, Staphylococcus caprae and Staphylococcus simiae were detected instead

Nicholas_Bokulich · April 15, 2020, 6:34pm

Yep, I seem to recall those two being quite similar

Looking at GTDB, it looks like a number of S. aureus sequences were renamed S. argenteus in GTDB, so this could be a nomenclature issue. The others are probably misclassification (e.g., due to sequencing error on the other Staph isolates)

SoilRotifer · April 15, 2020, 9:36pm

Just as an FYI,

If you try it, let me know how well it works.

Note, you may have to retrain the classifier for the latest version of QIIME 2. If so, just use the base sequence and taxonomy files provided.

-Mike

Robin_Mjelle · April 16, 2020, 8:46am

A question regarding the databases. Does GTDB contain everything that is within Silva and GG?

Robin_Mjelle · April 16, 2020, 9:04am

Btw:
I used the file "bac120_ssu.fna". Should I instead use "ssu.fna"?
What is the main difference between the two?

https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/

Nicholas_Bokulich · April 16, 2020, 2:59pm

No I believe the source data are all slightly different. GTDB's sequences are from Genbank refseq. SILVA's are from EMBL. Greengenes I am not sure, probably Genbank as of 7 years ago.

BUT GTDB should have all the same taxonomic groups (even if some of them are re-named).

No clue — I suggest reading the information on the GTDB site, there is probably a README with that information.