Trouble with self-trained 18S Classifier - not assigning lower taxa levels

Hello QIIMERs.

I am running into a problem where I don't see any lower levels of taxonomy assigned to a 18S dataset. The data is from Icelandic soil, beginning at the top of a pit and traversing down about 1.5 meters into the ground. MiSeq v3 run.

Here's a Krona file to view the problem. STP-18S-single-end-SILVA-krona.qzv

For comparison, I made a 16S classifier following the same workflow below but with 16S primers that seems to be working fine - Krona file here: STP-16S-All-single-end-SILVA-krona.qzv

And I should add - the 16S and 18S data come from the exact same samples.

I am using qiime2/2023.7-RESCRIPt to train my classifier. I followed the tutorial https://forum.qiime2.org/t/processing-filtering-and-evaluating-the-silva-database-and-other-reference-sequence-data-with-rescript/ (which was very helpful!)

Here's my workflow (after downloading and importing the relevant files):

Note, the data exist within SILVA as RNA sequences, and thus have been imported as FeatureData[RNASequence]. To make sure things run smoothly downstream we'll convert the data to FeatureData[DNASequence] like so:

qiime rescript reverse-transcribe \
    --i-rna-sequences silva-138.1-ssu-nr99-rna-seqs.qza \
    --o-dna-sequences silva-138.1-ssu-nr99-seqs.qza

prepare the silva taxonomy prior to use
added in --p-no-rank-propagation to this command to obtain only the standard 6-rank taxonomy with holes. This option can be changed.

qiime rescript parse-silva-taxonomy \
    --i-taxonomy-tree taxtree-silva-138.1-nr99.qza \
    --i-taxonomy-map taxmap-silva-138.1-ssu-nr99.qza \
    --i-taxonomy-ranks taxranks-silva-138.1-ssu-nr99.qza \
    --p-no-rank-propagation \
    --o-taxonomy silva-138.1-ssu-nr99-tax.qza

remove sequences that contain 5 or more ambiguous bases (IUPAC compliant ambiguity bases) and any homopolymers that are 8 or more bases in length. These are the default parameters. See the --help text for more details.
this command takes a while

qiime rescript cull-seqs \
    --i-sequences silva-138.1-ssu-nr99-seqs.qza \
    --o-clean-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza

remove rRNA gene sequences that do not meet the following criteria: Archaea (16S) >= 900 bp, Bacteria (16S) >= 1200 bp, and any Eukaryota (18S) >= 1400 bp.

qiime rescript filter-seqs-length-by-taxon \
    --i-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza \
    --i-taxonomy silva-138.1-ssu-nr99-tax.qza \
    --p-labels Archaea Bacteria Eukaryota \
    --p-min-lens 900 1200 1400 \
    --o-filtered-seqs silva-138.1-ssu-nr99-seqs-filt.qza \
    --o-discarded-seqs silva-138.1-ssu-nr99-seqs-discard.qza

Dereplication of sequences and taxonomy
here I will use lowest common ancestor - lca to back-fill the taxonomy
removed the --p-perc-identity parameter from the command as listed in the tutorial

qiime rescript dereplicate \
    --i-sequences silva-138.1-ssu-nr99-seqs-filt.qza \
    --i-taxa silva-138.1-ssu-nr99-tax.qza \
    --p-mode 'lca' \
    --o-dereplicated-sequences silva-138.1-ssu-seqs-derep-lca.qza \
    --o-dereplicated-taxa silva-138.1-ssu-tax-derep-lca.qza

Now we are ready to make our classifier for use on full-length SSU sequences

nohup qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads  silva-138.1-ssu-seqs-derep-lca.qza \
  --i-reference-taxonomy silva-138.1-ssu-tax-derep-lca.qza \
  --o-classifier silva-138.1-ssu-nr99-lca-classifier.qza &

################ Training 18S Classifier ##############

Here are the primer sequences:
5’-Adapter: AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
18S V8-9 Forward CCCTACACGACGCTCTTCCGATCTTCCGTAGGTGAACCTGCGGATAACAGGTCTGTGATGCCCT

3’-Adapter: CAAGCAGAAGACGGCATACGAGATXXXXXXXXXXXX1GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT (note: I have no idea what a 1 is doing in there, this is what the sequencing core sent me)
18S V8-9 Reverse GTTCAGACGTGTGCTCTTCCGATCTCCTTCYGCAGGTTCACCTAC


qiime feature-classifier extract-reads \
    --i-sequences silva-138.1-ssu-seqs-derep-lca.qza \
    --p-f-primer TCCGTAGGTGAACCTGCGGATAACAGGTCTGTGATGCCCT \
    --p-r-primer CCTTCYGCAGGTTCACCTAC \
    --p-n-jobs 8 \
    --p-read-orientation 'forward' \
    --o-reads silva-138.1-ssu-nr99-seqs-2019-18S.qza 

qiime rescript dereplicate \
    --i-sequences silva-138.1-ssu-nr99-seqs-2019-18S.qza \
    --i-taxa silva-138.1-ssu-tax-derep-lca.qza \
    --p-mode 'lca' \
    --o-dereplicated-sequences silva-138.1-ssu-nr99-2019-18S-seqs-lca.qza \
    --o-dereplicated-taxa  silva-138.1-ssu-nr99-tax-2019-18S-derep-lca.qza 

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads silva-138.1-ssu-nr99-2019-18S-seqs-lca.qza \
    --i-reference-taxonomy silva-138.1-ssu-nr99-tax-2019-18S-derep-lca.qza  \
    --o-classifier silva-138.1-ssu-nr99-2019-18S-lca-classifier.qza

Given that the 16S classifier seemed to work fine, while the 18S didn't, and I used the exact same steps (as far as I can tell), I'm at a bit of a loss! Any suggestions are welcome! Thanks!

1 Like