SILVA 138 Classifiers

SoilRotifer · March 6, 2020, 2:24pm

Thanks for finding that typo! Yes, it should be -i. For any of the scripts, if you type -h, as in:

filter_fasta_by_seq_id.py -h

You'll see some help text with the appropriate options. As ya'll are finding, this has been very much a work in progress.

-Mike

prince · March 27, 2020, 4:59am

@SoilRotifer thanks for the classifier.
Using these classifiers(V3V4), I noted that some taxa have the same name at lower levels:
d__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridia_UCG-014;f__Clostridia_UCG-014;g__Clostridia_UCG-014

d__Bacteria;p__Firmicutes;c__Bacilli;o__RF39;f__RF39;g__RF39

d__Bacteria;p__Firmicutes;c__Clostridia;o__Oscillospirales;f__UCG-010;g__UCG-010

Is it right to call them unclassified order__Clostridia_UCG-014, order__RF39, family__UCG-010 respectively?

What is the difference between these and taxa which are assigned ;__ at lower taxa levels. Eg. d__Bacteria;p__Firmicutes;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;__
or d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;;

Thanks

SoilRotifer · March 27, 2020, 5:14pm

Hi @prince,

To clarify, we do not curate the taxonomy, we are simply parsing the files as provided by SILVA. As for the upper-level taxonomy being propagated downward... this is intended by me. The reason for this is that not all reference sequences have taxonomic annotations for each rank. When a particular rank is missing, in this case the family level, the taxonomic rank above is propagated downward to a lower rank position until it finds a rank at a lower level, then it propagates that rank downwards towards, and so on... to the genus or species level. For example:

d__Bacteria; p__Firmicutes; c__Clostridia; o__Peptostreptococcales-Tissierellales; f__Peptostreptococcales-Tissierellales; g__Peptoniphilus; s__Peptoniphilaceae_bacterium

Note that the o__ rank is propagated down to f__, but we found rank information for g__ and s__.

This is propagation of rank information is a convention followed by other research groups and tool developers too. This makes it easier to meet the requirements of various taxonomy classifiers (e.g. some tools require that all reference sequences have the same number of ranks).

While this might be technically correct, this would make parsing taxonomy onerous. Downstream analyses may fail as there would be two o__ levels. This is why we prepend each rank with o__, f__ , etc... to provide some level of unique rank-level information.

Basically consider these annotations as saying: "We are using the name Clostridia_UCG-014 to fill the f__ slot, and again for the g__ slot." This is how these taxonomy annotations should be interpreted, e.g. there are many cases of s__gut_metagenome, which is not a legitimate taxonomic rank in its own right. So, the annotation gut_metagenome is being used to fill-in the s__ slot.

This is not perfect, but what we have to work with... Curating taxonomy is hard work, which is why we greatly appreciate those that do it!

As for the ;__, there are a few answers to this on the forum, but here is one explanation:

-Hope this helps!
-Best wishes.

prince · March 30, 2020, 1:10am

Thanks @SoilRotifer.
Got it

prince · March 30, 2020, 6:53am

Hi @SoilRotifer,
Once again, thanks for the clarification.

In reference to these:

As for the upper-level taxonomy being propagated downward… this is intended by me........When a particular rank is missing, in this case the family level, the taxonomic rank above is propagated downward to a lower rank position until it finds a rank at a lower level, then it propagates that rank downwards towards, and so on… to the genus or species level.

is
d__Bacteria;p__Firmicutes;c__Clostridia; o__Clostridia_UCG-014;f__Clostridia_UCG-014;g__Clostridia_UCG-014

equivalent to

d__Bacteria;p__Firmicutes;c__Clostridia; o__Clostridia_UCG-014;f__;g__

This particular sequence is classified as d__Bacteria;p__Firmicutes;c__Clostridia; o__Clostridia_UCG-014;f__Clostridia_UCG-014;g__Clostridia_UCG-014;__

I guess the reference sequences do not have annotations beyond o__ rank.

Can you help me clarify this?

SoilRotifer · March 30, 2020, 1:48pm

Right, I just realized that I did not provide enough details for one particular case... I forgot that you had mentioned the use of the V3V4 (amplicon specific) classifier. When these are made, any short sequence that became identical in sequence (after extraction of the amplicon region), but had different taxonomic annotation, will effectively have their taxonomy truncated to the lowest common ancestor (LCA). See the pipeline of the original post of this thread.

So, in this case, the lower rank information would not be available. But the base reference files do contain all the taxonomic ranks as I outlined earlier. This changes when we are extracting the amplicon region, as we are more likely to have identical sequences over various shorted regions of the gene.

Does this help?

-Best
-Mike

prince · March 31, 2020, 10:53am

Yes.
Thanks @SoilRotifer

Anahita_Bharadwaj · April 15, 2020, 2:25pm

Thanks so much for this. Enormously helpful and much appreciated. Any difference between ver_0.01 and ver_0.02?

Also, mostly out of curiosity and interest to learn, is training the classifier directly using the primer sequences on QIIME2 not advisable? I saw your comment about retaining more sequences and was hoping you could expand on it a little.

SoilRotifer · April 15, 2020, 4:26pm

Hi @Anahita_Bharadwaj,

Great question! I apologize for the lack of documentation. I should remove the older version, and will likely do so soon. But for the most part I had truncated the species labels. The reason for this, is, you may have (nearly) identical sequences that point to very slightly different species label annotations, such as:

s__Clostridioides_difficile
s__Clostridioides_difficile_R20291

So, if your sequence is similar to these, you'd think it should be classified as s__Clostridioides_difficile. This will not be the case, as the specific species strings are different. What the classifier may actually return is the upper-level taxonomy g__Clostridioides.

This is not the fault of the classifier per se, but a problem of annotation which negatively affects the classifier. Because of this, I decided to only return the first two words (i.e. Clostridioides and difficile) of the "species" string. This should not affect the "no-species" labeled versions. In fact, I think the file sizes stay the same for those (there are no species labels to begin with). The files with updated species labels, should be slightly smaller in version 0.02 as the species labels are shorter. Some of the "species" labels are very long.

I also reworked some of the steps and code , so that you can run more of the steps within the QIIME 2 environment (prior code had you jumping back and forth between QIIME 1 and QIIME 2). In general, use the latest version, or follow the steps outlined in the pipeline of my original post.

I hope this clarifies things!

Anahita_Bharadwaj · April 15, 2020, 4:44pm

Got it! Thank you!

Thoughts on my question regarding advantages of using primer locations as opposed to primer sequences to train the classifier? I saw your comment about it retaining more sequences and was wondering if you could expand on it more.

Sorry I may not have framed that question properly before.

SoilRotifer · April 15, 2020, 5:18pm

Oh right!

I prefer to extract the amplicon region from an alignment rather than a primer search because of one simple issue: the reality that a primer can, and often does, amplify DNA in which it was not intended to amplify. This is called specificity:

"... defined as the frequency with which a mis-priming event occurs. Primers with mediocre to poor specificity tend to produce PCR products with extra unrelated and undesirable amplicons".

That is, I often like to know exactly where my "off-target amplicons" are coming from. By retaining as much of the reference data set in my classifier, even those that may not be a great "string match" of the primer to a given reference sequence, enables me to classify more of these off-targets, i.e. less sequences being returned as "unclassified". This is my worry of using a primer match (even with adjustments to the mismatch parameters). However, that being said...

The caveat of using alignment positions is that it assumes you have a well-curated and trustworthy alignment, in which those alignment positions are meaningful for the group(s) under investigation! That is, your gene can be globally aligned.
Depending on the amplicon under study, e.g. Fungal ITS, using primer search to extract your region of interest is your only recourse, as some marker genes are very difficult (nay impossible!) to globally align.

In a nutshell , for the 16S rRNA gene data, using the curated alignment to exctract the amplicon region of interest saves me from having to run BLAST (or another tool) on most of my "unclassified" sequences, as my off-targets will be readily classified.

-Does this help?

Anahita_Bharadwaj · April 15, 2020, 6:18pm

Yes, this is great. Thank you!

EmilieS · April 22, 2020, 3:28pm

@SoilRotifer Thank you for providing the qiime compatible files for silva release 138. I'm going to be using qiime 2020.2, and so I see from another user's post that I'll have to train the classifier, using the sequence and taxonomy .qza files you provided. I used the primers 515/806R for my sequencing. If I use the files in the EMP_V4_515f-806r folder, would I then start in at the 'Train the classifier' step, as explained here (i.e. I could skip the 'Extract reference reads' step)? Thank you!

SoilRotifer · April 22, 2020, 4:31pm

Hi @EmilieS

Correct, you can skip right to the training step.

-Mike

hsapers · April 27, 2020, 6:27am

Thank you for this! I'm pretty new to all of this, so please excuse the 'stupid questions'

I used the 515f and 926r primer set for V4-V5 - if I take the qza files in the full-length folder for the ref seqs and taxonomy you have provided and run through the classifier training instructions, replacing the primer set in the instructions with 515f and 926r, I should should end up with classifier trained on the 138 silva release, extracted with the same primers used for amplification. Re-training a classifier this way will:

fix the scikit-learn version issues (trained and used on 0.22.1)
end up with a classifier trained on the same region as the query sequences
using your files that do not include species names will circumvent known issues with species level taxonomy in this release but only classify to the genus level
from what I can tell the main difference between v. 0.01 and v 0.02 is that in the latter you truncated the species labels for consistency. This shouldn't affect the non-species files? You mention that you tweaked the code a little and recommend using the latest version. Is this recommendation so that the version used matches the pipeline version for reproducibility? Or are the other differences between the two versions?

Thank you very much - looking forward to getting this running!

SoilRotifer · April 27, 2020, 7:16pm

Hi @hsapers,

Looks like you have done some great detective work! Yes to everything! You can use the approach you linked to or the pipeline I outlined in the original post of this thread. Either should be sufficient.

For version 0.02, yes, I simply truncated the species labels to the first two words in a given string. Nothing else should differ. This change should not have an affect on the files in either version that only go to genus-level. I'll likely remove the v 0.01 files soon. Yes, I would use the files in the v 0.02 folders if you can.

-Let us know how it goes!
-Best wishes!
-Mike