Training SILVA 18S feature classifiers and Taxonomic annotation


I am analyzing 18S V4-5 NGS data using the QIIME-compatible SILVA 132 release.

18S V4-5 Primer

Training SILVA 18S classifier:
qiime tools import --type 'FeatureData[Sequence]' --input-path SILVA_132_QIIME_release/rep_set/rep_set_18S_only/99/silva_132_99_18S.fna --output-path silva_132_99_18S_otus

I think silva_132_99_18S.fna contains full-length 18S rRNA gene sequences.
Q1. Why is the Min Length here 900? How should I interpret the short sequences?

time qiime feature-classifier extract-reads --i-sequences silva_132_99_18S_otus.qza --p-f-primer TTAAARVGYTCGTAGTYG --p-r-primer CCGTCAATTHCTTYAART --o-reads silva_132_99_18S_616-1132-ref-seq.qza

Q2. The sequence count decreased from 55145 to 53880. How should I interpret this drop?
Q3. Is there a way to find out which sequences/taxa were removed?

After training the classifiers, I compared the taxa assigned by the full-length classifier and the primer-extracted classifier:

qiime feature-classifier classify-sklearn --i-classifier silva_132_99_18S_full_all_level_classifier.qza --i-reads representative_sequences.qza --o-classification taxonomy-all-level.qza
qiime taxa barplot --i-table table.qza --i-taxonomy taxonomy-all-level.qza --o-visualization silva-all-level-barplot --m-metadata-file metadata.tsv

D_8__Eurotiales;D_9__Aspergillaceae;D_10__Aspergillus: Aspergillus could be identified.

qiime feature-classifier classify-sklearn --i-classifier silva_132_99_18S_all-level_616-1132_classifier.qza --i-reads rep-seq-filtered.qza --o-classification taxonomy-silva_132_all-level_616-1132
qiime taxa barplot --i-table table.qza --i-taxonomy taxonomy-silva_132_all-level_616-1132.qza --m-metadata-file metadata.tsv --o-visualization Silva-taxa-barplot_616-1132

D_8__Eurotiales;D_9__Aspergillaceae;__: The family Aspergillaceae could be identified, but the genus Aspergillus could not.

Q4. Which classifier is more precise? Why could Aspergillus not be identified using the primer-extracted classifier?

Any comments would be greatly appreciated.
Thank you so much!


I would recommend using the latest version of the QIIME 2 formatted SILVA database (version 138), as provided on our Data resources page. The base sequence and taxonomy files, from which you can extract your amplicon region and which you can use as input to train your own classifier, are provided here. These files were generated using RESCRIPt.

Alternatively, you can download and curate your own version of the SILVA database in any way you'd like, using this tutorial as a guide.

This is generally explained in this part of the linked tutorial. Feel free to curate the database in a way that best suits your needs. Skip, reorder, or change the various steps as needed.
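For the question about short reference sequences, it can help to look at which sequences are actually the short ones. The following is only a rough sketch: `fasta_lengths` is a hypothetical helper name (not a QIIME 2 or RESCRIPt command), and it assumes a plain FASTA file such as the silva_132_99_18S.fna used for the import above.

```shell
# fasta_lengths FILE
# Hypothetical helper: prints "sequence-ID <TAB> length" for every record
# in a FASTA file, handling sequences that are wrapped over several lines.
fasta_lengths() {
    awk '
        /^>/ {
            # New record: report the previous one, then start counting again
            if (id != "") print id "\t" len
            id = substr($1, 2)   # first word of the header, without ">"
            len = 0
            next
        }
        {
            gsub(/[ \t\r]/, "")  # ignore stray whitespace and carriage returns
            len += length($0)
        }
        END { if (id != "") print id "\t" len }
    ' "$1"
}

# Possible usage (path taken from the import command above):
# fasta_lengths SILVA_132_QIIME_release/rep_set/rep_set_18S_only/99/silva_132_99_18S.fna \
#     | sort -k2,2n | head
```

Sorting by the second column lists the shortest records first, and their IDs can then be looked up in the matching taxonomy file.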

Not all reference sequences within SILVA may have a region that matches your primer pairs. Or the primers may simply not match well enough. For more details, read this tutorial, which can also be used in conjunction with the base RESCRIPt SILVA tutorial.

You can run the following command to generate a taxonomy file that contains only the taxonomy of your representative sequences. Run it twice: once on the full-length references and once on the extracted amplicons:

qiime rescript filter-taxa \
    --i-taxonomy taxonomy.qza \
    --m-ids-to-keep-file rep-seqs.qza \
    --o-filtered-taxonomy rep-seqs-taxonomy.qza

Then you can tabulate (visualize) each of the outputs like so:

qiime metadata tabulate \
    --m-input-file rep-seqs-taxonomy.qza \
    --o-visualization rep-seqs-taxonomy.qzv

You can also export the TSV file from each visualization and view it in a spreadsheet program.
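To list exactly which reference IDs were dropped by extract-reads, one option is to compare the sequence IDs before and after extraction. This is a minimal sketch, not part of RESCRIPt: `list_dropped_ids` is a hypothetical helper name, and the usage lines assume that exporting a FeatureData[Sequence] artifact produces a file called dna-sequences.fasta in the output directory.

```shell
# list_dropped_ids FULL.fasta EXTRACTED.fasta
# Hypothetical helper: prints the IDs present in FULL but absent from EXTRACTED.
list_dropped_ids() {
    full_ids=$(mktemp)
    kept_ids=$(mktemp)
    # Keep only the ID (first word of each header line), sorted for comm
    grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*$//' | sort -u > "$full_ids"
    grep '^>' "$2" | sed 's/^>//; s/[[:space:]].*$//' | sort -u > "$kept_ids"
    # comm -23 keeps only the lines that are unique to the first file
    comm -23 "$full_ids" "$kept_ids"
    rm -f "$full_ids" "$kept_ids"
}

# Possible usage (artifact names taken from the commands earlier in this thread;
# the export directory names are arbitrary):
# qiime tools export --input-path silva_132_99_18S_otus.qza --output-path full_export
# qiime tools export --input-path silva_132_99_18S_616-1132-ref-seq.qza --output-path extracted_export
# list_dropped_ids full_export/dna-sequences.fasta extracted_export/dna-sequences.fasta > dropped_ids.txt
```

The resulting IDs can then be looked up in the SILVA taxonomy file to see which taxa lost references.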

There is no simple answer to this. Sometimes an amplicon-specific classifier is better; other times, not so much. However, you can use some of the RESCRIPt tools to evaluate the reference databases against one another and see which might be better overall.
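Alongside the RESCRIPt evaluations, a quick per-feature comparison of the two classification results shows where the classifiers disagree (for example, the Aspergillus case above). This is only a sketch: `diff_taxonomy` is a hypothetical helper name, and it assumes each FeatureData[Taxonomy] artifact has been exported to a tab-separated taxonomy.tsv with a single header line, the feature ID in column 1, and the taxon string in column 2.

```shell
# diff_taxonomy FULL.tsv AMPLICON.tsv
# Hypothetical helper: prints "feature-ID <TAB> taxon in FULL <TAB> taxon in AMPLICON"
# for every feature whose assignment differs between the two files.
diff_taxonomy() {
    awk -F '\t' '
        FNR == 1 { next }                 # skip the header line of each file
        NR == FNR { full[$1] = $2; next } # first file: remember taxon per feature
        ($1 in full) && full[$1] != $2 {  # second file: report disagreements
            print $1 "\t" full[$1] "\t" $2
        }
    ' "$1" "$2"
}

# Possible usage (artifact names taken from the commands earlier in this thread;
# the export directory names are arbitrary):
# qiime tools export --input-path taxonomy-all-level.qza --output-path full_tax
# qiime tools export --input-path taxonomy-silva_132_all-level_616-1132.qza --output-path amplicon_tax
# diff_taxonomy full_tax/taxonomy.tsv amplicon_tax/taxonomy.tsv > taxonomy_differences.tsv
```

Features that appear in only one of the two files are ignored here, which matters if the two classifications were run on different representative-sequence files.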
