18S classifier using SILVA database and EMB primers

Joseph_Sevigny · February 20, 2017, 4:18pm

Hello all. This post is following up on an idea/suggestion forum here BLAST parser for assign_taxonomy.

I am constructing some classifiers for 18S amplicon data. I originally came here to get some support but figured things out via troubleshooting. Some of this may be useful to other folks using 18S. I still have some questions and thoughts for discussion so I figured I would post in ‘User support’. For the most part I am following https://docs.qiime2.org/2017.2/tutorials/feature-classifier/
Questions are at the bottom of my workflow if anyone is interested to discuss.

MY WORKFLOW:

DATABASE

https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_128_release.tgz
###OTUS
SILVA_128_QIIME_release/rep_set/rep_set_18S_only/99/99_otus_18S.fasta
###TAXONOMY
SILVA_128_QIIME_release/taxonomy/18S_only/99/majority_taxonomy_all_levels.txt

IMPORT OTUS

qiime tools import
–type FeatureData[Sequence]
–input-path SILVA_128_QIIME_release/rep_set/rep_set_18S_only/99/99_otus_18S.fasta
–output-path 18S_SILVA128_99_otus.qza

IMPORT TAXONOMY

qiime tools import
–type FeatureData[Taxonomy]
–input-path SILVA_128_QIIME_release/taxonomy/18S_only/99/majority_taxonomy_all_levels.txt
–output-path 18S_SILVA128_99_taxonomy.qza

EXTRACT EMBP variable region.

###http://www.earthmicrobiome.org/protocols-and-standards/18s/
qiime feature-classifier extract-reads
–i-sequences 18S_SILVA128_99_otus.qza
–p-f-primer GTACACACCGCCCGTC
–p-r-primer TGATCCTTCTGCAGGTTCACCTAC
–p-length 150
–o-reads ref-seqs_18S_99_SILVA128_RL150.qza

TRAIN CLASSIFIER

qiime feature-classifier fit-classifier-naive-bayes
–i-reference-reads ref-seqs_18S_99_SILVA128_RL150.qza
–i-reference-taxonomy 18S_SILVA128_99_taxonomy.qza
–o-classifier 18S_EMB_SILVA128_classifier.qza

#QUESTIONS/DISCUSSION

What exactly is happening during the ‘slicing’ step. I chose 150 for an average read length. Does this mean I should train a new classifier for 250 bp reads? With 125 bp reads? Is this creating sequences kind of like kmers to try and capture all possible reads?
Noticing there are no degenerate nucleotide codes in this primer set, how much variability does this method allow when finding the primer sites (sorry if this is documented somewhere). I found that less than half of the sequences in the database have the exact forward primer (just a quick grep). I think these primers are in need of refurbishing considering the number sequences we have gained since 2006 (is that when these were made?).
How do the taxonomy viewers use the tabular format exactly. I know the ‘levels’ are separated by ‘;’. Do they also use the ‘D_9’ etc. or is that arbitrary and can be removed to clean up visualization. i.e. does the script solely use ‘;’ to divide the levels.
I would be happy to share these classifiers with the community when I polish them up. I have also completed some classifiers with the complete 18S sequences without primer trimming.
I have compared the results to taxonomy generated with alternative classifiers. They are different in several respects. Which is better, I do not know. I am continuing to test other classifier parameters and I have also written a taxonomy assignment with blast using custom scripts and importing into qiime2.
If anyone has any other databases let me know I would be happy to help format them and construct taxonomy.txt files and later classifiers for qiime2 import.

Cheers,
Joseph Sevigny

jairideout · February 23, 2017, 5:03pm

Thanks for sharing your classifier training workflow @Joseph_Sevigny! I reached out to a couple of developers involved in feature classification, they’ll follow up here shortly to answer your questions.

BenKaehler · February 23, 2017, 8:39pm

Hello Joseph,

Thanks very much for your questions and offers of support. I will answer your questions inline.

What exactly is happening during the 'slicing' step. I chose 150 for an average read length.

The read extraction step will extract the region between the two primers if they match with a combined identity of 80% (by default, this can be tuned via the --p-identity parameter). The match is performed via a local alignment that is naïvely extended to be semi-global. The read is then trimmed to the desired length from the 5' end (or not trimmed if the --p-length parameter is omitted).

Does this mean I should train a new classifier for 250 bp reads? With 125 bp reads?

Yes. The literature on whether trimming is necessary is not unanimous and we are performing further testing, but at this stage we are recommending trimming to your actual read lengths on an experiment-wise basis.

Is this creating sequences kind of like kmers to try and capture all possible reads?

No, kmer extraction happens when you train the classifier. Read extraction extracts ordinary sequences. @jairideout, what's the best method for users to peek at what is in an Artifact? That would be instructive in this instance.

Noticing there are no degenerate nucleotide codes in this primer set, how much variability does this method allow when finding the primer sites (sorry if this is documented somewhere). I found that less than half of the sequences in the database have the exact forward primer (just a quick grep). I think these primers are in need of refurbishing considering the number sequences we have gained since 2006 (is that when these were made?).

Sorry, it's not documented yet. The amount of variability can be tuned via the --p-identity parameter, as mentioned above. I don't know when these primers were made, sorry.

How do the taxonomy viewers use the tabular format exactly. I know the 'levels' are separated by ';'. Do they also use the 'D_9' etc. or is that arbitrary and can be removed to clean up visualization. i.e. does the script solely use ';' to divide the levels.

@jairideout, is this ok for the viewers? The feature-classifier only cares about the ';' separators, anything else is arbitrary.

I would be happy to share these classifiers with the community when I polish them up. I have also completed some classifiers with the complete 18S sequences without primer trimming.

Thank you. Just so you know, QIIME 2 is still in alpha and the format is liable to change between versions. Specifically, there are some security issues for the classifiers that we intend to tidy up at some point. Also, users have to have the same version of scikit-learn installed to use a classifier trained on another system.

I have compared the results to taxonomy generated with alternative classifiers. They are different in several respects. Which is better, I do not know. I am continuing to test other classifier parameters and I have also written a taxonomy assignment with blast using custom scripts and importing into qiime2.

Thanks again, @Nicholas_Bokulich and I are actively testing the classifiers against several other methods. We hope to be able to make some conclusions on this matter soon.

If anyone has any other databases let me know I would be happy to help format them and construct taxonomy.txt files and later classifiers for qiime2 import.

Thanks very much. I have had some initial communication with @SGMcCalla about other data sets, but I haven't had a chance to follow it up yet. She may be interested in this offer. I will look into it too when I get a moment.

Thanks,
Ben

jairideout · February 24, 2017, 4:14pm

@jairideout, what's the best method for users to peek at what is in an Artifact? That would be instructive in this instance.

To see some basic information about an artifact, such as its semantic type, UUID, and the format of its underlying data, you can use qiime tools peek from the command line, or qiime2.Artifact.peek if you're using the QIIME 2 Python API.

To see the actual data stored in the artifact, you can use qiime tools export from the command line, or qiime2.Artifact.export_data using the Python API.

@jairideout, is this ok for the viewers? The feature-classifier only cares about the ';' separators, anything else is arbitrary.

This should be fine. The taxonomy visualizers only expect the taxonomy annotations to be separated by semicolons. The taxonomy annotations aren't required to have the same number of levels across all annotations either. Besides requiring the semicolon separator, the content of the taxonomy annotations is arbitrary.

ebolyen · November 16, 2017, 5:05pm

An off-topic reply has been split into a new topic: Taxonomy import for SILVA majority_taxonomy_all_levels.txt

Please keep replies on-topic in the future.