Host DNA or incorrect classifer workflow leading to high percentages just domain level classification

Dear QIIME2 Developers,

I have searched the forum for questions similar to this and have found the following thread that was very helpful D_0__Bacteria - uncultured bacterium from water samples, but I wanted to get advice specific to my situation and honestly my classifer making scripts before moving forward with analysis.

Versions:
QIIME2 2021.8 via conda
Minimal RESCRIPT installed following directions found here https://github.com/bokulich-lab/RESCRIPt

I used dada2 to analyze sequences from Genewiz 16S EZ, whose primers are similar to the 341f & 805r primers, thus I used RESCRIPT to build a classifier for them per advice from this post Training classifier without primers information. I previously used these steps to analyze whole fish larvae for microbiome analysis and it worked great, but now I am analyzing samples from gut contents (scrapped the interior of guts to get both fecal and resident bacteria) and I have a large percentage (ranging from 85-49%) of sequences just classified to the domain level as Bacteria (d__Bacteria; p__;etc..). I have BLASTed a couple of these sequences (they are actually my top 6 most common sequences) and none match well to 16S sequences coming up as what I assume is host DNA rep_seqs_final.qzv (1.2 MB) .

My main question would be have I just sequenced a bunch of host DNA (as mentioned in the post I cited as a possibility especially since I used dada2 and they are unlikely to be new phyla or chimeras) or did I mess up my classifer procedure somehow? Please see steps I ran below.

Thank you for your time and help,

Sincerely,

David Bradshaw

#Note Genewiz says that these primers supposedly help id species better, hence me keeping the labels despite warning on guidance to see if true
qiime rescript get-silva-data
--p-version '138.1'
--p-target 'SSURef_NR99'
--p-include-species-labels
--o-silva-sequences silva-138.1-ssu-nr99-rna.qza
--o-silva-taxonomy silva-138.1-ssu-nr99-tax.qza

qiime rescript reverse-transcribe
--i-rna-sequences silva-138.1-ssu-nr99-rna.qza
--o-dna-sequences silva-138.1-ssu-nr99-seqs.qza

qiime rescript cull-seqs
--p-n-jobs 4
--i-sequences silva-138.1-ssu-nr99-seqs.qza
--o-clean-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza

#Note I do not need any eukaryota in my classifier, thus 9999 instead of suggested number in tutorial
qiime rescript filter-seqs-length-by-taxon
--i-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza
--i-taxonomy silva-138.1-ssu-nr99-tax.qza
--p-labels Archaea Bacteria Eukaryota
--p-min-lens 900 1200 9999
--o-filtered-seqs silva-138.1-ssu-nr99-seqs-filt.qza
--o-discarded-seqs silva-138.1-ssu-nr99-seqs-discard.qza

qiime rescript dereplicate
--i-sequences silva-138.1-ssu-nr99-seqs-filt.qza
--i-taxa silva-138.1-ssu-nr99-tax.qza
--p-rank-handles 'silva'
--p-mode 'uniq'
--o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-derep-uniq.qza
--o-dereplicated-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza

qiime feature-classifier extract-reads
--i-sequences silva-138.1-ssu-nr99-seqs-derep-uniq.qza
--p-f-primer CCTACGGGNGGCWGCAG
--p-r-primer GACTACHVGGGTATCTAATCC
--p-n-jobs 4
--p-read-orientation 'forward'
--o-reads silva-138.1-ssu-nr99-341f-805r-seqs.qza

#I liked the combo approach of the super mode over the other modes, could this be a problem?
qiime rescript dereplicate
--i-sequences silva-138.1-ssu-nr99-341f-805r-seqs.qza
--i-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza
--p-rank-handles 'silva'
--p-mode 'super'
--o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-341f-805r-derep-super.qza
--o-dereplicated-taxa silva-138.1-ssu-nr99-tax-341f-805r-derep-super.qza

qiime rescript evaluate-fit-classifier
--i-sequences silva-138.1-ssu-nr99-seqs-341f-805r-derep-super.qza
--i-taxonomy silva-138.1-ssu-nr99-tax-341f-805r-derep-super.qza
--o-classifier silva-138.1-99-341f-805r-2021.8-classifier.qza
--o-observed-taxonomy silva-138-99-341f-805r--derep-super-taxonomy-predicted-taxonomy.qza
--o-evaluation silva-138-99-341f-805r--derep-super-taxonomy-fit-classifier-evaluation.qzv
--p-reads-per-batch 10000

Hi @David_Bradshaw,

I think much of what you have looks fine. Just a couple comments / suggestions:

If you are running qiime rescript get-silva-data, there is no need to run the above command, as get-silva-data runs this for you behind the scenes. :no_entry:

I'd avoid removing Eukaryotes. A good classifier will contain a bunch of "outgroup" taxa. That is you should want to identify Eukaryotes, or any other "bad" items. Otherwise, Eukaryotic sequences might be incorrectly classified as Archaea or Bacteria. See here and here. :bar_chart:

This is perfectly fine. There is more than :one: way to do things.

:thinking: Finally, I'd highly suggest you make sure that your reads are not in mixed orientation. For more details on what I mean please see:

1 Like