Thanks so much for this. Enormously helpful and much appreciated. Any difference between ver_0.01 and ver_0.02?
Also, mostly out of curiosity and interest to learn, is training the classifier directly using the primer sequences on QIIME2 not advisable? I saw your comment about retaining more sequences and was hoping you could expand on it a little.
Great question! I apologize for the lack of documentation. I should remove the older version, and will likely do so soon. For the most part, the only change is that I truncated the species labels. The reason is that you may have (nearly) identical sequences that point to very slightly different species label annotations, such as:
So, if your sequence is similar to these, you’d think it should be classified as s__Clostridioides_difficile. This will not be the case, as the specific species strings are different. What the classifier may actually return is the upper-level taxonomy g__Clostridioides.
This is not the fault of the classifier per se, but a problem of annotation that negatively affects the classifier. Because of this, I decided to keep only the first two words (i.e. Clostridioides and difficile) of the “species” string. This should not affect the “no-species” labeled versions. In fact, I think the file sizes stay the same for those (there are no species labels to begin with). The files with updated species labels should be slightly smaller in version 0.02, as the species labels are shorter. Some of the “species” labels are very long.
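To make the truncation concrete, here is a minimal Python sketch of the idea. It is not the code used to build the v 0.02 files; the function name, the semicolon-separated lineage format, and the strain-style suffixes in the example are assumptions for illustration only.

```python
def truncate_species_label(taxonomy: str, n_words: int = 2) -> str:
    """Keep only the first n_words of the species-level ("s__") label.

    Assumes a semicolon-separated lineage with rank prefixes such as
    "g__" and "s__", and underscores between the words of a label.
    """
    ranks = [rank.strip() for rank in taxonomy.split(";")]
    truncated = []
    for rank in ranks:
        if rank.startswith("s__"):
            # Split the label into words and keep the first n_words.
            words = [w for w in rank[3:].split("_") if w]
            rank = "s__" + "_".join(words[:n_words])
        truncated.append(rank)
    return "; ".join(truncated)


# Hypothetical labels that differ only in a trailing strain-like suffix.
a = "g__Clostridioides; s__Clostridioides_difficile_strainA"
b = "g__Clostridioides; s__Clostridioides_difficile_strainB"

# After truncation both lineages carry the same species label.
print(truncate_species_label(a))
print(truncate_species_label(a) == truncate_species_label(b))
```

Lineages without an s__ entry pass through unchanged, which is consistent with the “no-species” files not being affected.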
I also reworked some of the steps and code, so that you can run more of the steps within the QIIME 2 environment (the prior code had you jumping back and forth between QIIME 1 and QIIME 2). In general, use the latest version, or follow the steps outlined in the pipeline of my original post.
I hope this clarifies things!
Got it! Thank you!
Thoughts on my question regarding advantages of using primer locations as opposed to primer sequences to train the classifier? I saw your comment about it retaining more sequences and was wondering if you could expand on it more.
Sorry, I may not have framed that question properly before.
I prefer to extract the amplicon region from an alignment rather than a primer search because of one simple issue: the reality that a primer can, and often does, amplify DNA that it was not intended to amplify. This is a question of primer specificity:
“… defined as the frequency with which a mis-priming event occurs. Primers with mediocre to poor specificity tend to produce PCR products with extra unrelated and undesirable amplicons”.
That is, I often like to know exactly where my “off-target amplicons” are coming from. Retaining as much of the reference data set as possible in my classifier, even those sequences that may not be a great “string match” between the primer and a given reference sequence, enables me to classify more of these off-targets, i.e. fewer sequences are returned as “unclassified”. This is my worry with using a primer match (even with adjustments to the mismatch parameters). However, that being said…
- The caveat of using alignment positions is that it assumes you have a well-curated and trustworthy alignment, in which those alignment positions are meaningful for the group(s) under investigation! That is, your gene can be globally aligned.
- Depending on the amplicon under study, e.g. fungal ITS, using a primer search to extract your region of interest is your only recourse, as some marker genes are very difficult (nay impossible!) to globally align.
In a nutshell, for the 16S rRNA gene data, using the curated alignment to extract the amplicon region of interest saves me from having to run BLAST (or another tool) on most of my “unclassified” sequences, as my off-targets will be readily classified.
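As a toy illustration of the difference between the two approaches, here is a small Python sketch. The sequences, primers, and column positions are made up, and the primer search uses exact matching only (real tools allow some mismatches), so treat it as a cartoon of the idea rather than how any particular tool works.

```python
from typing import Optional


def extract_by_alignment(aligned_seq: str, start: int, end: int) -> str:
    """Slice alignment columns [start, end) and drop gap characters."""
    return aligned_seq[start:end].replace("-", "").replace(".", "")


def extract_by_primers(seq: str, fwd: str, rev_rc: str) -> Optional[str]:
    """Return the region between an exact forward-primer match and an exact
    match of the reverse-complemented reverse primer, or None if either
    primer site is not found."""
    i = seq.find(fwd)
    if i == -1:
        return None
    j = seq.find(rev_rc, i + len(fwd))
    if j == -1:
        return None
    return seq[i + len(fwd):j]


# Hypothetical aligned references (equal length, "-" marks a gap).
ref_a = "GGACGTCCCA-TTTGAAA"  # primer sites match exactly
ref_b = "GGACCTCCCAGTTTGAAA"  # one mismatch inside the forward primer site

fwd, rev_rc = "ACGT", "TTGA"  # made-up primers
start, end = 6, 12            # alignment columns spanning the amplicon

for name, aligned in [("ref_a", ref_a), ("ref_b", ref_b)]:
    ungapped = aligned.replace("-", "")
    print(name,
          extract_by_alignment(aligned, start, end),
          extract_by_primers(ungapped, fwd, rev_rc))
```

In this toy case the alignment-based slice returns a region for both references, while the exact primer search returns nothing for ref_b, so ref_b would be missing from a classifier built from primer-matched reads.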
-Does this help?
Yes, this is great. Thank you!
@SoilRotifer Thank you for providing the qiime compatible files for silva release 138. I’m going to be using qiime 2020.2, and so I see from another user’s post that I’ll have to train the classifier, using the sequence and taxonomy .qza files you provided. I used the primers 515/806R for my sequencing. If I use the files in the EMP_V4_515f-806r folder, would I then start in at the ‘Train the classifier’ step, as explained here (i.e. I could skip the ‘Extract reference reads’ step)? Thank you!
Correct, you can skip right to the training step.
Thank you for this! I’m pretty new to all of this, so please excuse the ‘stupid questions’.
I used the 515f and 926r primer set for V4-V5. If I take the qza files in the full-length folder for the ref seqs and taxonomy you have provided and run through the classifier training instructions, replacing the primer set in the instructions with 515f and 926r, I should end up with a classifier trained on the 138 SILVA release, extracted with the same primers used for amplification. Re-training a classifier this way will:
- fix the scikit-learn version issues (trained and used on 0.22.1)
- end up with a classifier trained on the same region as the query sequences
- using your files that do not include species names will circumvent known issues with species-level taxonomy in this release, but will only classify to the genus level
- from what I can tell, the main difference between v 0.01 and v 0.02 is that in the latter you truncated the species labels for consistency. This shouldn’t affect the non-species files? You mention that you tweaked the code a little and recommend using the latest version. Is this recommendation so that the version used matches the pipeline version for reproducibility? Or are there other differences between the two versions?
Thank you very much - looking forward to getting this running!
Looks like you have done some great detective work! Yes to everything! You can use the approach you linked to or the pipeline I outlined in the original post of this thread. Either should be sufficient.
For version 0.02, yes, I simply truncated the species labels to the first two words in a given string. Nothing else should differ. This change should not have an effect on the files in either version that only go to genus level. I’ll likely remove the v 0.01 files soon. Yes, I would use the files in the v 0.02 folders if you can.
-Let us know how it goes!
Thank you for sharing your pipeline and these classifiers! I also took the full-length sequences with species labels (ver0.02) and trained a V4-V5 (515f and 926r) classifier from them.
#Extract reference reads:
#Train the classifier:
qiime feature-classifier fit-classifier-naive-bayes
A closer look at the taxa barplots (taxa-bar-plots_16S-V4V5-35.qzv (2.7 MB)) shows that I have many weirdly named phyla in my samples (10bav-F6, LCP-89, AncK6, WPS-2, DTB120, FCPU426, MBNT15, NKB15, Sva0485, WS1, CK-2C2-2, RCP2-54, TA06, WS2, GN01, PAUC34f, NB1-j, SAR324_clade(Marine_group_B), WS4 and others).
Have I done something wrong or it just needs some additional filtering? I will filter out chloroplast and mitochondria sequences.
You’re welcome @kbitenieks!
The phyla names you see are normal. You can investigate this on the SILVA Taxonomy browser. If you click on Bacteria you’ll see all sorts of odd names. Many of these groups (at the phylum level and lower) are still being defined and are considered to be at the candidate (i.e. Candidatus) status.
The field of bacterial taxonomy is undergoing many changes as genomic data are increasingly used to aid taxonomic identification. Many of these taxa have no cultured type specimen and are only defined by genome sequence (or other) data. This has caused quite a bit of debate in the field of bacterial taxonomy. Anyway, this has resulted in many Candidate Phyla, and other proposed groupings.
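On the chloroplast and mitochondria filtering mentioned in the question, here is a minimal Python sketch of the logic: drop any feature whose taxonomy string contains one of the excluded terms, and leave the candidate phyla alone. The feature IDs and lineages are invented for illustration, and within QIIME 2 itself you would normally do this with the taxa plugin's filtering actions rather than by hand.

```python
from typing import Dict, Iterable


def drop_organelle_hits(
    taxonomy: Dict[str, str],
    exclude: Iterable[str] = ("chloroplast", "mitochondria"),
) -> Dict[str, str]:
    """Return only the features whose lineage contains none of the
    excluded terms (case-insensitive substring match)."""
    terms = [term.lower() for term in exclude]
    return {
        feature_id: lineage
        for feature_id, lineage in taxonomy.items()
        if not any(term in lineage.lower() for term in terms)
    }


# Hypothetical feature IDs and lineages, for illustration only.
assignments = {
    "feat1": "d__Bacteria; p__WPS-2",
    "feat2": "d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast",
    "feat3": "d__Bacteria; p__Proteobacteria; o__Rickettsiales; f__Mitochondria",
}

kept = drop_organelle_hits(assignments)
print(sorted(kept))
```

Note that an oddly named phylum such as WPS-2 is kept: only the organelle-derived lineages are removed.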
Thanks for uploading this classifier here. I am trying to make my own classifier, but it seems very memory-intensive and is not possible at the moment. In the meantime I tried to use your pre-trained classifier for the 341F-805R primer set, but I get an error that I couldn’t solve based on the previous discussions in the forum. I would appreciate any help.
Here is the error:
qiime feature-classifier classify-sklearn --i-classifier classifier-consensus.qza --i-reads repseqs.qza --o-classification taxonomy
Plugin error from feature-classifier:
** The scikit-learn version (0.21.2) used to generate this artifact does not match the current version of scikit-learn installed (0.23.1). Please retrain your classifier for your current deployment to prevent data-corruption errors.**
Debug info has been saved to /tmp/qiime2-q2cli-err-w3yust9p.log
I am using qiime2-2020.6, and it seems that the version of the plugin I am using is newer than the one that was used for training this classifier?
Sadly, if you would like to run the classifier in 2020.6, then you’ll have to re-train it for that version of QIIME 2, as the scikit-learn version often changes between QIIME 2 releases. The best option in your case might be to make use of qiime2-2020.2 for the classification step.
See this post for more details:
Finally, check out the initial post of this thread; it has been updated to redirect you to a new plugin. This should make it much easier for you to make a SILVA classifier.
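If it helps to see the version issue spelled out, here is a small Python sketch of the kind of check behind that error message. It assumes a plain exact-match comparison of version strings, which is what the error text suggests; the function names are hypothetical and this is not QIIME 2's actual code.

```python
from importlib import metadata
from typing import Optional


def installed_sklearn_version() -> Optional[str]:
    """Return the installed scikit-learn version, or None if it is absent."""
    try:
        return metadata.version("scikit-learn")
    except metadata.PackageNotFoundError:
        return None


def can_use_classifier(trained_with: str, installed: Optional[str]) -> bool:
    """Assumed rule: the classifier is usable only when the scikit-learn
    version it was trained with equals the installed version exactly."""
    return installed is not None and trained_with.strip() == installed.strip()


# Versions taken from the error message above.
print(can_use_classifier("0.21.2", "0.23.1"))

# Check against whatever is installed in the active environment.
print(can_use_classifier("0.21.2", installed_sklearn_version()))
```

Running the second check inside each QIIME 2 environment you have installed is a quick way to see which one, if any, can load a given pre-trained classifier without re-training.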
When training the classifier, should one use the Moving Pictures data, or their own data? And if their own, should it be a subset of their own data or the entire data set?
You train the classifiers with standard reference data, typically from a curated database, e.g. SILVA, or of your own making. For example, you can use the input files here to construct your own classifier. For more details on how to do this, take a look at the RESCRIPt tutorial linked at the top of this thread.
Finally, you can simply make use of the pre-made classifiers.
Thanks Mike. It is working now.