Trouble with self-trained 18S Classifier - not assigning lower taxa levels

16sIceland · April 24, 2024, 9:09pm

Hello QIIMERs.

I am running into a problem where I don't see any lower levels of taxonomy assigned to a 18S dataset. The data is from Icelandic soil, beginning at the top of a pit and traversing down about 1.5 meters into the ground. MiSeq v3 run.

Here's a Krona file to view the problem. STP-18S-single-end-SILVA-krona.qzv

For comparison, I made a 16S classifier following the same workflow below but with 16S primers that seems to be working fine - Krona file here: STP-16S-All-single-end-SILVA-krona.qzv

And I should add - the 16S and 18S data come from the exact same samples.

I am using qiime2/2023.7-RESCRIPt to train my classifier. I followed the tutorial https://forum.qiime2.org/t/processing-filtering-and-evaluating-the-silva-database-and-other-reference-sequence-data-with-rescript/ (which was very helpful!)

Here's my workflow (after downloading and importing the relevant files):

Note, the data exist within SILVA as RNA sequences, and thus have been imported as FeatureData[RNASequence]. To make sure things run smoothly downstream we'll convert the data to FeatureData[DNASequence] like so:

qiime rescript reverse-transcribe \
    --i-rna-sequences silva-138.1-ssu-nr99-rna-seqs.qza \
    --o-dna-sequences silva-138.1-ssu-nr99-seqs.qza

prepare the silva taxonomy prior to use
added in --p-no-rank-propagation to this command to obtain only the standard 6-rank taxonomy with holes. This option can be changed.

qiime rescript parse-silva-taxonomy \
    --i-taxonomy-tree taxtree-silva-138.1-nr99.qza \
    --i-taxonomy-map taxmap-silva-138.1-ssu-nr99.qza \
    --i-taxonomy-ranks taxranks-silva-138.1-ssu-nr99.qza \
    --p-no-rank-propagation \
    --o-taxonomy silva-138.1-ssu-nr99-tax.qza

remove sequences that contain 5 or more ambiguous bases (IUPAC compliant ambiguity bases) and any homopolymers that are 8 or more bases in length. These are the default parameters. See the --help text for more details.
this command takes a while

qiime rescript cull-seqs \
    --i-sequences silva-138.1-ssu-nr99-seqs.qza \
    --o-clean-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza

remove rRNA gene sequences that do not meet the following criteria: Archaea (16S) >= 900 bp, Bacteria (16S) >= 1200 bp, and any Eukaryota (18S) >= 1400 bp.

qiime rescript filter-seqs-length-by-taxon \
    --i-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza \
    --i-taxonomy silva-138.1-ssu-nr99-tax.qza \
    --p-labels Archaea Bacteria Eukaryota \
    --p-min-lens 900 1200 1400 \
    --o-filtered-seqs silva-138.1-ssu-nr99-seqs-filt.qza \
    --o-discarded-seqs silva-138.1-ssu-nr99-seqs-discard.qza

Dereplication of sequences and taxonomy
here I will use lowest common ancestor - lca to back-fill the taxonomy
removed the --p-perc-identity parameter from the command as listed in the tutorial

qiime rescript dereplicate \
    --i-sequences silva-138.1-ssu-nr99-seqs-filt.qza \
    --i-taxa silva-138.1-ssu-nr99-tax.qza \
    --p-mode 'lca' \
    --o-dereplicated-sequences silva-138.1-ssu-seqs-derep-lca.qza \
    --o-dereplicated-taxa silva-138.1-ssu-tax-derep-lca.qza

Now we are ready to make our classifier for use on full-length SSU sequences

nohup qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads  silva-138.1-ssu-seqs-derep-lca.qza \
  --i-reference-taxonomy silva-138.1-ssu-tax-derep-lca.qza \
  --o-classifier silva-138.1-ssu-nr99-lca-classifier.qza &

################ Training 18S Classifier ##############

Here are the primer sequences:
5’-Adapter: AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
18S V8-9 Forward CCCTACACGACGCTCTTCCGATCTTCCGTAGGTGAACCTGCGGATAACAGGTCTGTGATGCCCT

3’-Adapter: CAAGCAGAAGACGGCATACGAGATXXXXXXXXXXXX1GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT (note: I have no idea what a 1 is doing in there, this is what the sequencing core sent me)
18S V8-9 Reverse GTTCAGACGTGTGCTCTTCCGATCTCCTTCYGCAGGTTCACCTAC


qiime feature-classifier extract-reads \
    --i-sequences silva-138.1-ssu-seqs-derep-lca.qza \
    --p-f-primer TCCGTAGGTGAACCTGCGGATAACAGGTCTGTGATGCCCT \
    --p-r-primer CCTTCYGCAGGTTCACCTAC \
    --p-n-jobs 8 \
    --p-read-orientation 'forward' \
    --o-reads silva-138.1-ssu-nr99-seqs-2019-18S.qza 

qiime rescript dereplicate \
    --i-sequences silva-138.1-ssu-nr99-seqs-2019-18S.qza \
    --i-taxa silva-138.1-ssu-tax-derep-lca.qza \
    --p-mode 'lca' \
    --o-dereplicated-sequences silva-138.1-ssu-nr99-2019-18S-seqs-lca.qza \
    --o-dereplicated-taxa  silva-138.1-ssu-nr99-tax-2019-18S-derep-lca.qza 

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads silva-138.1-ssu-nr99-2019-18S-seqs-lca.qza \
    --i-reference-taxonomy silva-138.1-ssu-nr99-tax-2019-18S-derep-lca.qza  \
    --o-classifier silva-138.1-ssu-nr99-2019-18S-lca-classifier.qza

Given that the 16S classifier seemed to work fine, while the 18S didn't, and I used the exact same steps (as far as I can tell), I'm at a bit of a loss! Any suggestions are welcome! Thanks!

UPDATE

I've been looking back at the 16S data and it seems to have the same problem. I have a little R script that I use to aggregate taxonomy files, and I'm seeing a big change. Here's the R script, if anyone is interested:

# load the packages
library(qiime2R)
library(tidyverse)
library(dplyr)
# set working directory

# load the necessary files
taxonomy <- read_qza("STP-18S-single-end-SILVA-taxonomy.qza")$data
SVs <- read_qza("STP-18S-single-end-trimmed-table-5.qza")$data

# The first column of the SVs file isn't really a column, it is the row names
# We must make it a real column and give it a name so it will merge with taxonomy
SVs <- as.data.frame(SVs)
SVs <- tibble::rownames_to_column(SVs, "Feature.ID")

# join the read counts in the SVs file, with the taxonomy data 
taxonomyCounts <- merge(taxonomy, SVs) 

# remove unnecessary columns, such as Feature.ID, Confidence and Mock1
# check to make sure you're removing the correct columns!!! 
taxonomyCounts <- taxonomyCounts[,c(2,7:19)]

# Condense the table so that any duplicated taxonomy calls are summed into one row
# eg there are currently multiple rows labeled "D_0_Bacteria" and they will all be 
# collapsed into one row labeled "D_0_Baceria"
# The . ~ tells the command to aggregate all the columns by the Taxon column
taxCounts <- aggregate(. ~ Taxon, taxonomyCounts, sum)

When I run this, I end up with the following numbers:

Using my trained classifier, 18S data starts with 1254 observations, ends with 32 (!!)
Using my trained classifier, the 16S data starts with 6263 observations, ends with 42 (!!)
Using the provided SILVA classifier, the 16S data starts with 6263 observations and ends with 1314
*Using the provided Greengenes2 classifier, the 16S data starts with 6263 observations and ends with 1440

So I have messed up somewhere in both classifiers (16S and 18S) that I tried to make myself.

Nicholas_Bokulich · May 7, 2024, 4:48am

Hi @16sIceland ,

I am not able to view the files that you linked to, but I suspect that this is your issue:

This is a very long primer sequence! I suspect that this is not just the primer, but also some other piece of the adapter/primer construct used for library prep (i.e., containing more than the actual primer sequence). So you might be accidentally discarding a lot of sequences because they do not contain this full sequence. Same for the 16S.

It looks like maybe you are using the V8 primer from this study? Design and Evaluation of Illumina MiSeq-Compatible, 18S rRNA Gene-Specific Primers for Improved Characterization of Mixed Phototrophic Communities - PMC

in which case you should use ATAACAGGTCTGTGATGCCCT for the primer sequence, not the full construct (which also contains non-biological sequence).

Good luck!

16sIceland · May 7, 2024, 2:31pm

Thanks for the reply!

When I went through the workflow, I selected the sequence by taking the Illumina adapters and aligning them to the 18S/16S primer sequences.

I didn't realize there was still sequence in the primers that was not biological!

After re-making the classifier, I was able to improve the taxonomy from 32 unique classifications to 254 (from 1254 ASVs). Next step: look into other databases for 18S sequences.

system · June 7, 2024, 8:32pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.