I'm reanalyzing a sequence set that was originally analyzed with QIIME 1.9 and SILVA 128. Our project included a focus on methanogens, which are all in Archaea. Our original 1.9 results included a solid ~5% Archaea in several samples. However, with QIIME 2021.11 and SILVA 138 after filtering singletons I'm only getting 33 sequences of a single ASV of Archaea in the wrong phylum, and it's not even in the most originally Archaea-heavy samples. I know there was a big naming reshuffle a couple of years ago at the phylum level, but surely that didn't impact whether something was marked as Archaea versus Bacteria. Several of my original phyla have remained at roughly the same proportions to each other as they were in the 1.9 version such as Verrucomicrobia and Bacteroidetes/Bacteroidota, so that makes me think I ran the analysis correctly overall.
This is interesting. Perhaps we perform some sanity-checking here. Can you classify your reads against the full-length classifier, i.e. skip running the feature-classifier extract-reads step. If the results make more sense compared to your amplicon specific classifier, then this would tell me it is a PCR primer search and extraction issue.
Sometimes when using PCR primer based extraction of an amplicon region, you can lose some data. Likely because the reference sequences you need do not have the portion of sequence you need for a primer match for extraction. You can try our new extract-seq-segments function in RESCRIPt to improve the extraction of amplicon segments from reference sequences that may be missing, one or both of, your primer regions.
If neither work, perhaps try curating your own SILVA reference database as outlined here, to optimize for the retention and classification of archaeal sequences.
Then I will use this classifier output with classify-sklearn, correct? If this gives me the taxonomy I was expecting, I can just use the data as-is at that point, right? I wouldn't have to go back and try something different with extract-reads?
Based on what you said, another possible source of the problem occurred to me. Here is how I processed the sequences right after demultiplexing:
I chose those trimming and truncating locations based on my understanding of how to interpret the denoising stats visualization, but maybe I trimmed too hard. Would it be worth trying this with different numbers?
Correct. On occasion an amplicon specific classifier can improve classification. Though this depends on how the reference data is curated. But is quite optional. You can change-up the order on how to prepare your database too.
For example, if you only want to make an amplicon-region specific classifier with RESCRIPt, you can run the following:
qiime rescript get-silva-data ...
qiime rescript reverse-transcribe ...
qiime rescript dereplicate ... Optional derep of full-length seqs. You can skip to next step, 4 .
qiime feature-classifier extract-reads ... extract region using your primers
Optionally supplement with qiime extract-seq-seqments ... to extract amplicons that are missing primer sequence.
qiime rescript dereplicate ... Derep extracted seqs. Seqs might be identical over the region extracted
qiime rescript cull-seqs ... Optional, but highly recommended.
qiime rescript filter-seqs-length-by-taxon ... or qiime rescript filter-seqs-length ... Optional.
The new classifier didn't bring back the Archaea, unfortunately. So I dug around some more and realized I had probably used the wrong trimming and truncating parameters during the DADA2 step. I chopped my forward reads far too harshly and the reverse reads not harshly enough. Now the taxonomy mostly resembles what we found with QIIME 1, and I think the remaining differences could easily be due to updated taxonomy since SILVA 128. Thanks for giving me a few ideas to try!