@SoilRotifer thanks for the classifier.
Using these classifiers(V3V4), I noted that some taxa have the same name at lower levels:
Is it right to call them unclassified order__Clostridia_UCG-014, order__RF39, family__UCG-010 respectively?
What is the difference between these and taxa which are assigned ;__ at lower taxa levels. Eg. d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;__
To clarify, we do not curate the taxonomy, we are simply parsing the files as provided by SILVA. As for the upper-level taxonomy being propagated downward… this is intended by me. The reason for this is that not all reference sequences have taxonomic annotations for each rank. When a particular rank is missing, in this case the family level, the taxonomic rank above is propagated downward to a lower rank position until it finds a rank at a lower level, then it propagates that rank downwards towards, and so on… to the genus or species level. For example:
Note that the o__ rank is propagated down to f__, but we found rank information for g__ and s__.
This is propagation of rank information is a convention followed by other research groups and tool developers too. This makes it easier to meet the requirements of various taxonomy classifiers (e.g. some tools require that all reference sequences have the same number of ranks).
While this might be technically correct, this would make parsing taxonomy onerous. Downstream analyses may fail as there would be two o__ levels. This is why we prepend each rank with o__, f__ , etc… to provide some level of unique rank-level information.
Basically consider these annotations as saying: “We are using the name Clostridia_UCG-014 to fill the f__ slot, and again for the g__ slot.” This is how these taxonomy annotations should be interpreted, e.g. there are many cases of s__gut_metagenome, which is not a legitimate taxonomic rank in its own right. So, the annotation gut_metagenome is being used to fill-in the s__ slot.
This is not perfect, but what we have to work with… Curating taxonomy is hard work, which is why we greatly appreciate those that do it!
As for the ;__, there are a few answers to this on the forum, but here is one explanation:
As for the upper-level taxonomy being propagated downward… this is intended by me…When a particular rank is missing, in this case the family level, the taxonomic rank above is propagated downward to a lower rank position until it finds a rank at a lower level, then it propagates that rank downwards towards, and so on… to the genus or species level.
Right, I just realized that I did not provide enough details for one particular case… I forgot that you had mentioned the use of the V3V4 (amplicon specific) classifier. When these are made, any short sequence that became identical in sequence (after extraction of the amplicon region), but had different taxonomic annotation, will effectively have their taxonomy truncated to the lowest common ancestor (LCA). See the pipeline of the original post of this thread.
So, in this case, the lower rank information would not be available. But the base reference files do contain all the taxonomic ranks as I outlined earlier. This changes when we are extracting the amplicon region, as we are more likely to have identical sequences over various shorted regions of the gene.
Thanks so much for this. Enormously helpful and much appreciated. Any difference between ver_0.01 and ver_0.02?
Also, mostly out of curiosity and interest to learn, is training the classifier directly using the primer sequences on QIIME2 not advisable? I saw your comment about retaining more sequences and was hoping you could expand on it a little.
Great question! I apologize for the lack of documentation. I should remove the older version, and will likely do so soon. But for the most part I had truncated the species labels. The reason for this, is, you may have (nearly) identical sequences that point to very slightly different species label annotations, such as:
So, if your sequence is similar to these, you’d think it should be classified as s__Clostridioides_difficile. This will not be the case, as the specific species strings are different. What the classifier may actually return is the upper-level taxonomy g__Clostridioides.
This is not the fault of the classifier per se, but a problem of annotation which negatively affects the classifier. Because of this, I decided to only return the first two words (i.e. Clostridioides and difficile) of the “species” string. This should not affect the “no-species” labeled versions. In fact, I think the file sizes stay the same for those (there are no species labels to begin with). The files with updated species labels, should be slightly smaller in version 0.02 as the species labels are shorter. Some of the “species” labels are very long.
I also reworked some of the steps and code , so that you can run more of the steps within the QIIME 2 environment (prior code had you jumping back and forth between QIIME 1 and QIIME 2). In general, use the latest version, or follow the steps outlined in the pipeline of my original post.
Thoughts on my question regarding advantages of using primer locations as opposed to primer sequences to train the classifier? I saw your comment about it retaining more sequences and was wondering if you could expand on it more.
Sorry I may not have framed that question properly before.
I prefer to extract the amplicon region from an alignment rather than a primer search because of one simple issue: the reality that a primer can, and often does, amplify DNA in which it was not intended to amplify. This is called specificity:
“… defined as the frequency with which a mis-priming event occurs. Primers with mediocre to poor specificity tend to produce PCR products with extra unrelated and undesirable amplicons”.
That is, I often like to know exactly where my “off-target amplicons” are coming from. By retaining as much of the reference data set in my classifier, even those that may not be a great “string match” of the primer to a given reference sequence, enables me to classify more of these off-targets, i.e. less sequences being returned as “unclassified”. This is my worry of using a primer match (even with adjustments to the mismatch parameters). However, that being said…
The caveat of using alignment positions is that it assumes you have a well-curated and trustworthy alignment, in which those alignment positions are meaningful for the group(s) under investigation! That is, your gene can be globally aligned.
Depending on the amplicon under study, e.g. Fungal ITS, using primer search to extract your region of interest is your only recourse, as some marker genes are very difficult (nay impossible!) to globally align.
In a nutshell , for the 16S rRNA gene data, using the curated alignment to exctract the amplicon region of interest saves me from having to run BLAST (or another tool) on most of my “unclassified” sequences, as my off-targets will be readily classified.
@SoilRotifer Thank you for providing the qiime compatible files for silva release 138. I’m going to be using qiime 2020.2, and so I see from another user’s post that I’ll have to train the classifier, using the sequence and taxonomy .qza files you provided. I used the primers 515/806R for my sequencing. If I use the files in the EMP_V4_515f-806r folder, would I then start in at the ‘Train the classifier’ step, as explained here (i.e. I could skip the ‘Extract reference reads’ step)? Thank you!
Thank you for this! I’m pretty new to all of this, so please excuse the ‘stupid questions’
I used the 515f and 926r primer set for V4-V5 - if I take the qza files in the full-length folder for the ref seqs and taxonomy you have provided and run through the classifier training instructions, replacing the primer set in the instructions with 515f and 926r, I should should end up with classifier trained on the 138 silva release, extracted with the same primers used for amplification. Re-training a classifier this way will:
fix the scikit-learn version issues (trained and used on 0.22.1)
end up with a classifier trained on the same region as the query sequences
using your files that do not include species names will circumvent known issues with species level taxonomy in this release but only classify to the genus level
from what I can tell the main difference between v. 0.01 and v 0.02 is that in the latter you truncated the species labels for consistency. This shouldn’t affect the non-species files? You mention that you tweaked the code a little and recommend using the latest version. Is this recommendation so that the version used matches the pipeline version for reproducibility? Or are the other differences between the two versions?
Thank you very much - looking forward to getting this running!
Looks like you have done some great detective work! Yes to everything! You can use the approach you linked to or the pipeline I outlined in the original post of this thread. Either should be sufficient.
For version 0.02, yes, I simply truncated the species labels to the first two words in a given string. Nothing else should differ. This change should not have an affect on the files in either version that only go to genus-level. I’ll likely remove the v 0.01 files soon. Yes, I would use the files in the v 0.02 folders if you can.