Hello QIIMERs.
I am running into a problem where I don't see any lower levels of taxonomy assigned to a 18S dataset. The data is from Icelandic soil, beginning at the top of a pit and traversing down about 1.5 meters into the ground. MiSeq v3 run.
Here's a Krona file to view the problem. STP-18S-single-end-SILVA-krona.qzv
For comparison, I made a 16S classifier following the same workflow below but with 16S primers that seems to be working fine - Krona file here: STP-16S-All-single-end-SILVA-krona.qzv
And I should add - the 16S and 18S data come from the exact same samples.
I am using qiime2/2023.7-RESCRIPt
to train my classifier. I followed the tutorial https://forum.qiime2.org/t/processing-filtering-and-evaluating-the-silva-database-and-other-reference-sequence-data-with-rescript/ (which was very helpful!)
Here's my workflow (after downloading and importing the relevant files):
Note, the data exist within SILVA as RNA sequences, and thus have been imported as FeatureData[RNASequence]. To make sure things run smoothly downstream we'll convert the data to FeatureData[DNASequence] like so:
qiime rescript reverse-transcribe \
--i-rna-sequences silva-138.1-ssu-nr99-rna-seqs.qza \
--o-dna-sequences silva-138.1-ssu-nr99-seqs.qza
prepare the silva taxonomy prior to use
added in --p-no-rank-propagation to this command to obtain only the standard 6-rank taxonomy with holes. This option can be changed.
qiime rescript parse-silva-taxonomy \
--i-taxonomy-tree taxtree-silva-138.1-nr99.qza \
--i-taxonomy-map taxmap-silva-138.1-ssu-nr99.qza \
--i-taxonomy-ranks taxranks-silva-138.1-ssu-nr99.qza \
--p-no-rank-propagation \
--o-taxonomy silva-138.1-ssu-nr99-tax.qza
remove sequences that contain 5 or more ambiguous bases (IUPAC compliant ambiguity bases) and any homopolymers that are 8 or more bases in length. These are the default parameters. See the --help text for more details.
this command takes a while
qiime rescript cull-seqs \
--i-sequences silva-138.1-ssu-nr99-seqs.qza \
--o-clean-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza
remove rRNA gene sequences that do not meet the following criteria: Archaea (16S) >= 900 bp, Bacteria (16S) >= 1200 bp, and any Eukaryota (18S) >= 1400 bp.
qiime rescript filter-seqs-length-by-taxon \
--i-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza \
--i-taxonomy silva-138.1-ssu-nr99-tax.qza \
--p-labels Archaea Bacteria Eukaryota \
--p-min-lens 900 1200 1400 \
--o-filtered-seqs silva-138.1-ssu-nr99-seqs-filt.qza \
--o-discarded-seqs silva-138.1-ssu-nr99-seqs-discard.qza
Dereplication of sequences and taxonomy
here I will use lowest common ancestor - lca to back-fill the taxonomy
removed the --p-perc-identity parameter from the command as listed in the tutorial
qiime rescript dereplicate \
--i-sequences silva-138.1-ssu-nr99-seqs-filt.qza \
--i-taxa silva-138.1-ssu-nr99-tax.qza \
--p-mode 'lca' \
--o-dereplicated-sequences silva-138.1-ssu-seqs-derep-lca.qza \
--o-dereplicated-taxa silva-138.1-ssu-tax-derep-lca.qza
Now we are ready to make our classifier for use on full-length SSU sequences
nohup qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads silva-138.1-ssu-seqs-derep-lca.qza \
--i-reference-taxonomy silva-138.1-ssu-tax-derep-lca.qza \
--o-classifier silva-138.1-ssu-nr99-lca-classifier.qza &
################ Training 18S Classifier ##############
Here are the primer sequences:
5’-Adapter: AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
18S V8-9 Forward CCCTACACGACGCTCTTCCGATCTTCCGTAGGTGAACCTGCGGATAACAGGTCTGTGATGCCCT
3’-Adapter: CAAGCAGAAGACGGCATACGAGATXXXXXXXXXXXX1GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT (note: I have no idea what a 1 is doing in there, this is what the sequencing core sent me)
18S V8-9 Reverse GTTCAGACGTGTGCTCTTCCGATCTCCTTCYGCAGGTTCACCTAC
qiime feature-classifier extract-reads \
--i-sequences silva-138.1-ssu-seqs-derep-lca.qza \
--p-f-primer TCCGTAGGTGAACCTGCGGATAACAGGTCTGTGATGCCCT \
--p-r-primer CCTTCYGCAGGTTCACCTAC \
--p-n-jobs 8 \
--p-read-orientation 'forward' \
--o-reads silva-138.1-ssu-nr99-seqs-2019-18S.qza
qiime rescript dereplicate \
--i-sequences silva-138.1-ssu-nr99-seqs-2019-18S.qza \
--i-taxa silva-138.1-ssu-tax-derep-lca.qza \
--p-mode 'lca' \
--o-dereplicated-sequences silva-138.1-ssu-nr99-2019-18S-seqs-lca.qza \
--o-dereplicated-taxa silva-138.1-ssu-nr99-tax-2019-18S-derep-lca.qza
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads silva-138.1-ssu-nr99-2019-18S-seqs-lca.qza \
--i-reference-taxonomy silva-138.1-ssu-nr99-tax-2019-18S-derep-lca.qza \
--o-classifier silva-138.1-ssu-nr99-2019-18S-lca-classifier.qza
Given that the 16S classifier seemed to work fine, while the 18S didn't, and I used the exact same steps (as far as I can tell), I'm at a bit of a loss! Any suggestions are welcome! Thanks!
UPDATE
I've been looking back at the 16S data and it seems to have the same problem. I have a little R script that I use to aggregate taxonomy files, and I'm seeing a big change. Here's the R script, if anyone is interested:
# load the packages
library(qiime2R)
library(tidyverse)
library(dplyr)
# set working directory
# load the necessary files
taxonomy <- read_qza("STP-18S-single-end-SILVA-taxonomy.qza")$data
SVs <- read_qza("STP-18S-single-end-trimmed-table-5.qza")$data
# The first column of the SVs file isn't really a column, it is the row names
# We must make it a real column and give it a name so it will merge with taxonomy
SVs <- as.data.frame(SVs)
SVs <- tibble::rownames_to_column(SVs, "Feature.ID")
# join the read counts in the SVs file, with the taxonomy data
taxonomyCounts <- merge(taxonomy, SVs)
# remove unnecessary columns, such as Feature.ID, Confidence and Mock1
# check to make sure you're removing the correct columns!!!
taxonomyCounts <- taxonomyCounts[,c(2,7:19)]
# Condense the table so that any duplicated taxonomy calls are summed into one row
# eg there are currently multiple rows labeled "D_0_Bacteria" and they will all be
# collapsed into one row labeled "D_0_Baceria"
# The . ~ tells the command to aggregate all the columns by the Taxon column
taxCounts <- aggregate(. ~ Taxon, taxonomyCounts, sum)
When I run this, I end up with the following numbers:
- Using my trained classifier, 18S data starts with 1254 observations, ends with 32 (!!)
- Using my trained classifier, the 16S data starts with 6263 observations, ends with 42 (!!)
- Using the provided SILVA classifier, the 16S data starts with 6263 observations and ends with 1314
*Using the provided Greengenes2 classifier, the 16S data starts with 6263 observations and ends with 1440
So I have messed up somewhere in both classifiers (16S and 18S) that I tried to make myself.