Thank you SoilRotifer for your precious work!
I'm novice here, so if I wrote a silly question please accept my apologies.
I'm trying to train a SILVA 138 classifier with 520-926r primers.
The problem I have is at point 6 of your pipeline that gives me an error:
" filter_fasta_by_seq_id.py: error: unrecognized arguments: -f SILVA_align_seqs.fasta "
should I change -f to -i as an input file? I saw that -f is not an argument defined in your python script.
Thanks for finding that typo! Yes, it should be
-i. For any of the scripts, if you type
-h, as in:
You’ll see some help text with the appropriate options. As ya’ll are finding, this has been very much a work in progress.
@SoilRotifer thanks for the classifier.
Using these classifiers(V3V4), I noted that some taxa have the same name at lower levels:
Is it right to call them unclassified order__Clostridia_UCG-014, order__RF39, family__UCG-010 respectively?
What is the difference between these and taxa which are assigned ;__ at lower taxa levels. Eg. d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;__
To clarify, we do not curate the taxonomy, we are simply parsing the files as provided by SILVA. As for the upper-level taxonomy being propagated downward… this is intended by me. The reason for this is that not all reference sequences have taxonomic annotations for each rank. When a particular rank is missing, in this case the family level, the taxonomic rank above is propagated downward to a lower rank position until it finds a rank at a lower level, then it propagates that rank downwards towards, and so on… to the genus or species level. For example:
d__Bacteria; p__Firmicutes; c__Clostridia; o__Peptostreptococcales-Tissierellales; f__Peptostreptococcales-Tissierellales; g__Peptoniphilus; s__Peptoniphilaceae_bacterium
Note that the
o__ rank is propagated down to
f__, but we found rank information for
This is propagation of rank information is a convention followed by other research groups and tool developers too. This makes it easier to meet the requirements of various taxonomy classifiers (e.g. some tools require that all reference sequences have the same number of ranks).
While this might be technically correct, this would make parsing taxonomy onerous. Downstream analyses may fail as there would be two
o__ levels. This is why we prepend each rank with
f__ , etc… to provide some level of unique rank-level information.
Basically consider these annotations as saying: “We are using the name
Clostridia_UCG-014 to fill the
f__ slot, and again for the
g__ slot.” This is how these taxonomy annotations should be interpreted, e.g. there are many cases of
s__gut_metagenome, which is not a legitimate taxonomic rank in its own right. So, the annotation
gut_metagenome is being used to fill-in the
This is not perfect, but what we have to work with… Curating taxonomy is hard work, which is why we greatly appreciate those that do it!
As for the
;__, there are a few answers to this on the forum, but here is one explanation:
-Hope this helps!
Once again, thanks for the clarification.
In reference to these:
As for the upper-level taxonomy being propagated downward… this is intended by me…When a particular rank is missing, in this case the family level, the taxonomic rank above is propagated downward to a lower rank position until it finds a rank at a lower level, then it propagates that rank downwards towards, and so on… to the genus or species level.
This particular sequence is classified as d__Bacteria;p__Firmicutes;c__Clostridia; o__Clostridia_UCG-014;f__Clostridia_UCG-014;g__Clostridia_UCG-014;__
I guess the reference sequences do not have annotations beyond o__ rank.
Can you help me clarify this?
Right, I just realized that I did not provide enough details for one particular case… I forgot that you had mentioned the use of the V3V4 (amplicon specific) classifier. When these are made, any short sequence that became identical in sequence (after extraction of the amplicon region), but had different taxonomic annotation, will effectively have their taxonomy truncated to the lowest common ancestor (LCA). See the pipeline of the original post of this thread.
So, in this case, the lower rank information would not be available. But the base reference files do contain all the taxonomic ranks as I outlined earlier. This changes when we are extracting the amplicon region, as we are more likely to have identical sequences over various shorted regions of the gene.
Does this help?
Thanks so much for this. Enormously helpful and much appreciated. Any difference between ver_0.01 and ver_0.02?
Also, mostly out of curiosity and interest to learn, is training the classifier directly using the primer sequences on QIIME2 not advisable? I saw your comment about retaining more sequences and was hoping you could expand on it a little.
Great question! I apologize for the lack of documentation. I should remove the older version, and will likely do so soon. But for the most part I had truncated the species labels. The reason for this, is, you may have (nearly) identical sequences that point to very slightly different species label annotations, such as:
So, if your sequence is similar to these, you’d think it should be classified as
s__Clostridioides_difficile. This will not be the case, as the specific species strings are different. What the classifier may actually return is the upper-level taxonomy g__Clostridioides.
This is not the fault of the classifier per se, but a problem of annotation which negatively affects the classifier. Because of this, I decided to only return the first two words (i.e. Clostridioides and difficile) of the “species” string. This should not affect the “no-species” labeled versions. In fact, I think the file sizes stay the same for those (there are no species labels to begin with). The files with updated species labels, should be slightly smaller in version 0.02 as the species labels are shorter. Some of the “species” labels are very long.
I also reworked some of the steps and code , so that you can run more of the steps within the QIIME 2 environment (prior code had you jumping back and forth between QIIME 1 and QIIME 2). In general, use the latest version, or follow the steps outlined in the pipeline of my original post.
I hope this clarifies things!
Got it! Thank you!
Thoughts on my question regarding advantages of using primer locations as opposed to primer sequences to train the classifier? I saw your comment about it retaining more sequences and was wondering if you could expand on it more.
Sorry I may not have framed that question properly before.
I prefer to extract the amplicon region from an alignment rather than a primer search because of one simple issue: the reality that a primer can, and often does, amplify DNA in which it was not intended to amplify. This is called specificity:
“… defined as the frequency with which a mis-priming event occurs. Primers with mediocre to poor specificity tend to produce PCR products with extra unrelated and undesirable amplicons”.
That is, I often like to know exactly where my “off-target amplicons” are coming from. By retaining as much of the reference data set in my classifier, even those that may not be a great “string match” of the primer to a given reference sequence, enables me to classify more of these off-targets, i.e. less sequences being returned as “unclassified”. This is my worry of using a primer match (even with adjustments to the mismatch parameters). However, that being said…
- The caveat of using alignment positions is that it assumes you have a well-curated and trustworthy alignment, in which those alignment positions are meaningful for the group(s) under investigation! That is, your gene can be globally aligned.
- Depending on the amplicon under study, e.g. Fungal ITS, using primer search to extract your region of interest is your only recourse, as some marker genes are very difficult (nay impossible!) to globally align.
In a nutshell , for the 16S rRNA gene data, using the curated alignment to exctract the amplicon region of interest saves me from having to run BLAST (or another tool) on most of my “unclassified” sequences, as my off-targets will be readily classified.
-Does this help?
Yes, this is great. Thank you!
@SoilRotifer Thank you for providing the qiime compatible files for silva release 138. I’m going to be using qiime 2020.2, and so I see from another user’s post that I’ll have to train the classifier, using the sequence and taxonomy .qza files you provided. I used the primers 515/806R for my sequencing. If I use the files in the EMP_V4_515f-806r folder, would I then start in at the ‘Train the classifier’ step, as explained here (i.e. I could skip the ‘Extract reference reads’ step)? Thank you!
Correct, you can skip right to the training step.