Training Silva 132 classifier for qiime2-amplicon-2023.9

Hi @Mudit_Bhatia,

I see that you are using the premade QIIME files from the SILVA archive. As I've mentioned before, we've updated how we prepare and curate the SILVA reference database. That is, the files we used to upload to SILVA were prepared quite differently from the current approach. Compare the processing notes that come with the pre-formatted QIIME SILVA 132 database on the SILVA website with the RESCRIPt approach.

A brief note on RESCRIPt: its philosophy is to leave decisions about reference database curation up to the end user. There are many different perspectives on how a reference database should be curated, which taxonomic schema or nomenclatural rules should be followed, etc. For example, below I am using the pre-clustered NR99 database, SSURef_NR99, as my starting point. You do not have to; you can download the full raw reference data instead, i.e. 'SSURef'.

Here are the commands I used:

qiime rescript get-silva-data \
    --p-version '132' \
    --p-target 'SSURef_NR99' \
    --o-silva-sequences silva-132.0-ssu-nr99-rna-seqs.qza \
    --o-silva-taxonomy silva-132.0-ssu-nr99-tax.qza \
    --verbose

qiime rescript reverse-transcribe \
    --i-rna-sequences silva-132.0-ssu-nr99-rna-seqs.qza \
    --o-dna-sequences silva-132.0-ssu-nr99-seqs.qza

qiime rescript dereplicate \
    --i-sequences silva-132.0-ssu-nr99-seqs.qza \
    --i-taxa silva-132.0-ssu-nr99-tax.qza \
    --p-mode 'uniq' \
    --o-dereplicated-sequences silva-132.0-ssu-nr99-seqs-derep-uniq.qza \
    --o-dereplicated-taxa silva-132.0-ssu-nr99-tax-derep-uniq.qza

qiime feature-classifier extract-reads \
    --i-sequences silva-132.0-ssu-nr99-seqs-derep-uniq.qza \
    --p-f-primer CAGYMGCCRCGGKAAHACC \
    --p-r-primer GCCTGCTGCCTTCCTTGGA \
    --p-n-jobs 8 \
    --p-read-orientation 'forward' \
    --o-reads silva-132.0-ssu-nr99-seqs-F04-R22.qza

qiime rescript dereplicate \
    --i-sequences silva-132.0-ssu-nr99-seqs-F04-R22.qza \
    --i-taxa silva-132.0-ssu-nr99-tax-derep-uniq.qza \
    --p-mode 'uniq' \
    --o-dereplicated-sequences silva-132.0-ssu-nr99-seqs-F04-R22-uniq.qza \
    --o-dereplicated-taxa silva-132.0-ssu-nr99-tax-F04-R22-derep-uniq.qza

qiime feature-table tabulate-seqs \
    --i-data silva-132.0-ssu-nr99-seqs-F04-R22-uniq.qza \
    --o-visualization silva-132.0-ssu-nr99-seqs-F04-R22-uniq.qzv

qiime metadata tabulate \
    --m-input-file silva-132.0-ssu-nr99-tax-F04-R22-derep-uniq.qza \
    --o-visualization silva-132.0-ssu-nr99-tax-F04-R22-derep-uniq.qzv

I obtained 9,547 unique reads with unique taxonomy strings. I think I made a mistake when I mentioned earlier that I obtained 32k reads. I must have accidentally read in the wrong file. Sorry about that!
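If you want to double-check how reads like these are distributed across domains, one option is to export the taxonomy artifact and tally the first rank of each taxonomy string. This is only a rough sketch: it assumes the artifact was exported with `qiime tools export` to a tab-separated `taxonomy.tsv` with one header line, and that ranks are separated by semicolons (e.g. `d__Bacteria; p__...`). The function name `count_domains` and the `exported-tax` directory are hypothetical names I made up for illustration.

```shell
# Export step (requires QIIME 2; shown for context only, not run here):
#   qiime tools export \
#       --input-path silva-132.0-ssu-nr99-tax-F04-R22-derep-uniq.qza \
#       --output-path exported-tax

# Tally taxonomy strings by their first (domain-level) rank.
# Assumes a tab-separated file with one header line and the taxonomy
# string in column 2, with ranks separated by ';'.
count_domains() {
    awk -F'\t' '
        NR > 1 {
            split($2, ranks, ";")
            counts[ranks[1]]++
        }
        END {
            for (domain in counts) print counts[domain] "\t" domain
        }
    ' "$1" | sort -k2,2
}

# Usage, once the export above has been run:
#   count_domains exported-tax/taxonomy.tsv
```

Each output line is a count followed by the domain label, so a heavy skew towards bacteria and archaea should be obvious at a glance.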

Sadly, only ~55 of these extracted reads appeared to be from eukaryotes; the rest were bacteria and archaea. I think this has more to do with the fact that many sequences in the reference database are not full-length / complete. Also, much of the reference data in SILVA may not contain this particular region of 18S, especially as this primer set amplifies the beginning of the 18S gene (positions ~12 through ~379). This is why, even though there are many 18S sequences in SILVA, the variable region extraction approach is not working here. Either the primer sequence is not contained within the reference sequence and you need to try another approach (see here), or that particular amplicon region is not commonly used, so reference data for that region is sparse or simply not available. This is why it is very important to check the availability of reference data that covers your amplicon region of interest before generating sequencing data.
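As a rough way to check whether a primer site is present in the reference data at all, you could expand the degenerate bases of the primer into a regular expression and count how many reference sequences contain an exact match. This is a sketch under several assumptions: the sequences have been exported to an uppercase FASTA file with one line per sequence, the helper names `primer_to_regex` and `count_primer_hits` are hypothetical, and only exact matches are counted. `extract-reads` tolerates some mismatches, so this will tend to undercount, and the reverse primer would need to be reverse-complemented before being searched this way.

```shell
# Expand IUPAC degenerate bases in a primer into regex character classes.
# The replacements contain only A, C, G and T, so the order of the
# substitutions does not matter.
primer_to_regex() {
    printf '%s\n' "$1" | sed \
        -e 's/R/[AG]/g'  -e 's/Y/[CT]/g'  -e 's/S/[CG]/g' \
        -e 's/W/[AT]/g'  -e 's/K/[GT]/g'  -e 's/M/[AC]/g' \
        -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' -e 's/H/[ACT]/g' \
        -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Count the sequence lines of a FASTA file that contain an exact match
# to the primer. Header lines (starting with '>') are skipped.
# '|| true' keeps the exit status at 0 when there are no matches.
count_primer_hits() {
    grep -v '^>' "$2" | grep -c -E "$(primer_to_regex "$1")" || true
}

# Usage (the FASTA path is a placeholder for your exported sequences):
#   count_primer_hits CAGYMGCCRCGGKAAHACC exported-seqs/dna-sequences.fasta
```

Comparing the hit count against the total number of eukaryotic reference sequences would give a feel for how much of the database actually covers the start of the 18S gene.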

I assume the full-length classifier is not working either, right? If so, it could be partly due to the reasons outlined above.


A final note on trying to keep things consistent with other pipelines. Although you might be using the same SILVA reference database in two different tools / pipelines, there can be differences in how that database is curated for each of them. On top of that, differences between the classification algorithms the pipelines use can introduce further discrepancies. This is why it is usually best to stick with one approach for a given project. Otherwise you may introduce inconsistent biases into your analysis, which can distort your interpretation.
