Training Silva 132 classifier for qiime2-amplicon-2023.9

Hi,

I have been trying to train a SILVA 132 classifier for qiime2-amplicon-2023.9, due to the slightly higher number of available sequences, and have been following the protocol described in Training feature classifiers with q2-feature-classifier — QIIME 2 2019.10.0 documentation. I downloaded the QIIME-compatible SILVA 132 archive and used the rep_set_all/99 sequences. For the reference taxonomy, I imported taxonomy_all/99/consensus_taxonomy_7_levels.txt.
As we have been using the SSU_F04 and SSU_R22 primers, which produce an amplicon of mostly ~415 bp but occasionally up to a maximum of ~500 bp, I used --p-trunc-len 500 while extracting reads.
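For reference, here is a rough sketch of the import commands I ran (the sequence file path is approximate; my actual paths differ):

qiime tools import \
    --type 'FeatureData[Sequence]' \
    --input-path rep_set_all/99/silva_132_99.fna \
    --output-path silva-132-99-seqs.qza

qiime tools import \
    --type 'FeatureData[Taxonomy]' \
    --input-format HeaderlessTSVTaxonomyFormat \
    --input-path taxonomy_all/99/consensus_taxonomy_7_levels.txt \
    --output-path silva-132-99-tax.qza
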
However, once the classifier is trained, it is only around 7 MB, and the hits are not even similar to what I get with the pre-trained SILVA 138.1 full-length classifier.
Is there any potential error in my workflow?

Thank You

Hi @Mudit_Bhatia,

Is there a reason why you are using the older SILVA 132 database? The newer 138 classifier contains far more taxa and an updated taxonomy, which partially explains the differences you're observing. Additionally, later versions of the SILVA databases provided via QIIME 2 have been curated quite differently: they are processed via RESCRIPt as outlined here, which potentially contributes other differences.

Feel free to use that tutorial and curate the SILVA database differently if you'd like. Also, RESCRIPt is now included with the 2024.2 release of QIIME 2.


Thank you, Mike, for responding so quickly.

One of our collaborators, who owns the data, has been performing the analysis with the Mothur pipeline, and they mention on their website that they observe some odd results with versions 138 and 138.1. The website also shows that the number of Eukaryote sequences in these versions has decreased by around 6,000 (although I am not sure whether that translates directly into fewer taxa).
Please find the link here.

Therefore, to keep a consistent comparison I have been trying to use the SILVA 132 data reference.

Further, I did follow the RESCRIPt protocol, but it looks like that SILVA protocol only works for 16S sequences rather than both classes (when I followed it for 132, I only got Archaea hits). I am not sure whether the same workflow should work for all of them or whether some modifications are needed.

I am fairly new to the field of bioinformatics, so my questions may not be phrased as clearly as they could be. If so, please let me know and I will try to explain them better.

Thank You

You can use RESCRIPt to download version 132. Just set --p-version 132. You can add the flag --help to any QIIME 2 command and it will print out the help text. For example:

qiime rescript get-silva-data --help

I'm not too concerned about this. These databases are curated often, and sometimes sequences are removed if they are found not to meet current quality-control criteria. But that is just an assumption on my part. :slight_smile:

This is not true. SILVA contains reference sequences from all three domains, covering both 16S and 18S; that is, all of the SSU data. Remember that the 16S and 18S rRNA genes are homologues and are in the same alignment / database. We retain all of this via RESCRIPt processing.

Not a problem at all. This is exactly why this forum exists! We can all help each other. :slight_smile:


Thank You Mike.

I followed the same RESCRIPt workflow as mentioned here and just changed 138.1 to 132 (I used the complete set of reads and did not extract them based on the primers), but as I explained, it did not work for the 18S data.

Further, my biggest suspicion arises from the size of the trained classifier, which is far smaller than the pre-trained SILVA 132 classifier. Could I share my workflow and, if possible, some artifact or visualization files, if that would help?

Thank You

This leads me to conclude that the primers you are using may not be able to amplify the 18S rRNA gene from the organisms you are interested in, or that these templates are being outcompeted by other 16S templates in the sample. Many of these primer sets are "leaky" and can amplify 16S too, and sometimes other non-SSU sequences. Alternatively, SILVA may not contain any good 18S rRNA reference sequences for your taxa of interest, leaving the classifier unable to classify your target organisms. What organisms do you expect to see in your samples?

You can confirm if your target organisms are available within the SILVA database by using their online browser.

I used to use SILVA to classify my 18S invertebrate sequences (e.g. rotifers). These and many other 18S sequences are within the formatted files for QIIME 2. Generally speaking, any taxonomy with the domain 'Eukaryota' is likely an 18S sequence, not 16S (e.g. mitochondria / chloroplast, as these fall within Bacteria). Additionally, I also just successfully used the F04 and R22 primers to extract the amplicon region from the full-length SILVA sequences.

Again, the way in which the data are curated prior to training the classifiers is quite different. One of the main reasons the file size is smaller is that we now dereplicate the reference data and use a more compact formatting of the taxonomy strings, which drastically reduces the file size.

Of course! Just DM me a link to your data and I should be able to take a quick look. Anything more than that will take me much longer. :slight_smile:

Thank You Mike.

I have shared the Dropbox link to the training dataset for your reference. As the primers produce the expected hits with the pre-trained SILVA 138.1 classifier, and in our collaborator's Mothur analysis with SILVA 132, I would like to believe they are not leaky. This makes me believe there is an issue with my workflow, or that I am missing some detail in the process.

Appreciate the quick response on the matter.

No worries.

After looking at the provenance information of your QZAs, I noticed that you are using --p-trunc-len 500 when extracting the amplicon region from the SILVA database. This is likely the problem.

FYI, truncation will trim longer reads down to 500 bp. Any sequences shorter than 500 will be discarded. This leaves you with ~ 8k reference sequences (assuming I am looking at the correct file). When I extracted the amplicon region from 132 earlier today, I was able to get ~30k reference sequences. Thus, I suggest not using --p-trunc-len in this instance.

I wish that were the case too, but sadly ALL primers are leaky. All you can do is optimize the primers and PCR reaction to minimize off-target amplification. In fact, this is why blocking primers and peptide nucleic acid clamps were invented. :slight_smile:

Try rerunning without --p-trunc-len and let me know how it goes.
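Something along these lines (a minimal sketch; the artifact names and primer sequences below are placeholders, so substitute your own):

qiime feature-classifier extract-reads \
    --i-sequences silva-132-99-seqs.qza \
    --p-f-primer <SSU_F04 sequence> \
    --p-r-primer <SSU_R22 sequence> \
    --o-reads silva-132-99-extracted-seqs.qza
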

-Mike

Hi Mike,

Sorry for the delay. It took some time to extract the reads without truncation.

However, I still see only 8,002 reads.

Here is the code I used:

Also, please find attached the qzv file after conversion.

rep-seqs-extracted.qzv (2.4 MB)

Hi @Mudit_Bhatia,

I see that you are using the premade QIIME files from the SILVA archive. As I've mentioned before, we've updated how we prepare and curate the SILVA reference database. That is, the files we used to upload to SILVA were prepared quite differently from the current approach. Compare the processing notes that come with the pre-formatted QIIME SILVA 132 database on the SILVA website with the RESCRIPt approach.

A brief note on RESCRIPt... I should point out that the philosophy of RESCRIPt is to leave decisions about reference database curation up to the end user. That is, there are many different perspectives on how a reference database should be curated, what taxonomic schema or nomenclatural rules should be followed, etc. For example, below I am using the pre-clustered NR99 database, SSURef_NR99, as my starting point. But you do not have to; you can download the full raw reference data instead, i.e. 'SSURef'.

Here are the commands I used:

qiime rescript get-silva-data \
    --p-version '132' \
    --p-target 'SSURef_NR99' \
    --o-silva-sequences silva-132.0-ssu-nr99-rna-seqs.qza \
    --o-silva-taxonomy silva-132.0-ssu-nr99-tax.qza \
    --verbose

qiime rescript reverse-transcribe \
    --i-rna-sequences silva-132.0-ssu-nr99-rna-seqs.qza \
    --o-dna-sequences silva-132.0-ssu-nr99-seqs.qza

qiime rescript dereplicate \
    --i-sequences silva-132.0-ssu-nr99-seqs.qza  \
    --i-taxa silva-132.0-ssu-nr99-tax.qza \
    --p-mode 'uniq' \
    --o-dereplicated-sequences silva-132.0-ssu-nr99-seqs-derep-uniq.qza \
    --o-dereplicated-taxa silva-132.0-ssu-nr99-tax-derep-uniq.qza

qiime feature-classifier extract-reads \
    --i-sequences silva-132.0-ssu-nr99-seqs-derep-uniq.qza \
    --p-f-primer CAGYMGCCRCGGKAAHACC \
    --p-r-primer GCCTGCTGCCTTCCTTGGA \
    --p-n-jobs 8 \
    --p-read-orientation 'forward' \
    --o-reads silva-132.0-ssu-nr99-seqs-F04-R22.qza

qiime rescript dereplicate \
    --i-sequences silva-132.0-ssu-nr99-seqs-F04-R22.qza \
    --i-taxa silva-132.0-ssu-nr99-tax-derep-uniq.qza \
    --p-mode 'uniq' \
    --o-dereplicated-sequences silva-132.0-ssu-nr99-seqs-F04-R22-uniq.qza \
    --o-dereplicated-taxa  silva-132.0-ssu-nr99-tax-F04-R22-derep-uniq.qza

qiime feature-table tabulate-seqs \
    --i-data silva-132.0-ssu-nr99-seqs-F04-R22-uniq.qza \
    --o-visualization silva-132.0-ssu-nr99-seqs-F04-R22-uniq.qzv

qiime metadata tabulate \
    --m-input-file silva-132.0-ssu-nr99-tax-F04-R22-derep-uniq.qza \
    --o-visualization silva-132.0-ssu-nr99-tax-F04-R22-derep-uniq.qzv

I obtained 9,547 unique reads with unique taxonomy strings. I think I made a mistake earlier when I mentioned obtaining ~30k reads; I must have accidentally read in the wrong file. Sorry about that!

Sadly, only ~55 of these extracted reads appear to be from eukaryotes; the rest are bacteria and archaea. I think this has more to do with the fact that many sequences in the reference database are not full-length / complete. Also, much of the reference data in SILVA may not cover this particular region of 18S, especially as this primer set amplifies the beginning of the 18S gene (positions ~12 through ~379). This is why, even though there are many 18S sequences in SILVA, the variable-region extraction approach is not working here: either the primer sequence is not contained within the reference sequence, and you need to try another approach (see here), or that particular amplicon region is not commonly sequenced, so reference information for that region is depauperate and simply not available. This is why it is very important to check the availability of reference data covering your amplicon region of interest before generating sequencing results.

I assume the full-length classifier is not working either, right? If so, it could be partly due to the reasons outlined above.


A final note on trying to keep things consistent with other pipelines: although you might be using the same SILVA reference database between two different tools / pipelines, there can be differences in how that database is curated for each pipeline. Not only are the databases potentially curated differently, but differences in the classification algorithms used by the pipelines can introduce further discrepancies. This is why it is usually best to stick with one approach for a given project; otherwise you may introduce inconsistent biases throughout your analysis, which can negatively affect your interpretation.


Thank you, Mike, for the response.

A few questions and points:

  1. I have tried using RESCRIPt to train SILVA 132 with the pre-clustered reads, both with and without primer trimming, and have had no luck getting good assignments.
  2. The pre-trained full-length classifiers available on the QIIME 2 website work well and classify the observed ASVs well, with less than 0.1% of ASVs being unassigned or assigned to Bacteria/Archaea, if any at all.
  3. I was expecting that the pre-trained classifier and a classifier trained using the above protocol (with or without RESCRIPt) should ideally give coherent results. However, that did not appear to be the case.
    All along I have been using the same protocols as mentioned, and I do observe results similar to yours.
    So I am wondering whether the same method is used to train the full-length classifier, or whether it involves different steps?

Further, should I try using SSURef instead of SSURef_NR99 to see if that works?

Also, is there a way to convert a classifier ".qza" file into a ".qzv" file so that classifiers can be compared?

Again, thank you for the response and guidance. It is a great help.

Regards

Based on what I've been able to extract from the reference database, I'm not surprised.

To make sure I understand correctly: are you saying that you are able to identify most of your 18S sequences as Eukaryotes using the full-length classifier, and it's just the amplicon-specific one that is not working well?

Most of the time the differences are not substantial, but generally speaking, most users report improved classification when training a classifier on the amplicon region. However, it appears that a substantial portion of the region you need to construct an amplicon-specific classifier is not completely present in the database, which reduces the effectiveness of the classifier. In these cases, we suggest using this approach, as I mentioned earlier. But it is understandable why the full-length classifier might be better in this case, as your sequences are hitting a partial region of the full-length sequence that is removed by the PCR-primer-based extraction method.

Currently, I think we use the pipeline as presented for both the full-length and amplicon-region classifiers. Again, we provide those pre-made classifiers as a convenience, and might further optimize them in the future (some of us devs have been discussing ideas on this). But generally, you can curate the reference database as you like. In fact, you can simply make use of the full-length SSURef_NR99 and not perform any curation at all, though I'd recommend at least dereplicating the database to shrink the file and memory requirements of the classifier.

Not currently. But you should be able to use several of the evaluate functions in RESCRIPt to compare the taxonomy files (the ones used as input to the classifier). See the SILVA tutorial for examples.
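For example, something along these lines (a sketch; option names may differ slightly between RESCRIPt versions, so check --help first):

qiime rescript evaluate-taxonomy \
    --i-taxonomies silva-132.0-ssu-nr99-tax-derep-uniq.qza silva-132.0-ssu-nr99-tax-F04-R22-derep-uniq.qza \
    --p-labels full-length amplicon \
    --o-taxonomy-stats silva-132.0-taxonomy-evaluation.qzv

There is also evaluate-seqs for summarizing and comparing the sequence artifacts themselves.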


Mike,

I have been getting good results from the pre-trained full-length classifiers on the website, but not from the ones I train myself, whether full-length or amplicon-targeted, which is the main source of my confusion. Even if the amplicon extraction does not do a good job for this pair of primers, the full-length results should still be the same for both the pre-trained and self-trained classifiers.

As for full-length classification using RESCRIPt, I previously performed all the steps except the extraction on 132 and was still unable to get proper classifications.

I guess I will have to stick with the pre-trained classifiers for now.

Thank you for your continuous support.

This is indeed confusing.

So are you simply importing and training the classifier, or are you also performing the suite of filtering and other steps contained within the tutorial?

I ask because one thing I suspect is that the filter-seqs-length-by-taxon step is removing too many reference sequences. If you try making the classifier yourself, skip that step. In fact, simply train the classifier based on the example commands I provided earlier in this thread, that is, using the resulting files silva-132.0-ssu-nr99-seqs-derep-uniq.qza and silva-132.0-ssu-nr99-tax-derep-uniq.qza.
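For example, a minimal sketch of that training step (the output name is arbitrary; note that fitting the classifier can require a lot of memory):

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads silva-132.0-ssu-nr99-seqs-derep-uniq.qza \
    --i-reference-taxonomy silva-132.0-ssu-nr99-tax-derep-uniq.qza \
    --o-classifier silva-132.0-ssu-nr99-full-length-classifier.qza
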

Would you be willing to privately share with me your data so I can investigate how the different classifiers work on your data? This will help us optimize things moving forward.


Thank you Mike,

I did indeed include the filter-seqs-length-by-taxon step while preparing the database with the RESCRIPt pipeline. I will try your suggestion and get back to you once I have an update.

Regarding the data, I will have to ask my collaborator, who actually owns it.

Thank You for the continuous support.

Regards

Thank you @Mudit_Bhatia! :+1:
If you figure anything out please let us know.

Not a problem at all, and totally understandable. No worries if you are unable to do so. If there are other published works that also target your organisms of interest using the same amplicon region, link me the references. I can always try downloading that public data instead and test the classifiers. :slight_smile: