Dear QIIME2 Developers,
I have searched the forum for similar questions and found this very helpful thread: "D_0__Bacteria - uncultured bacterium from water samples". However, I wanted advice specific to my situation and, honestly, a check of my classifier-building scripts before moving forward with analysis.
Versions:
QIIME2 2021.8 via conda
Minimal RESCRIPt install, following the directions found here: GitHub - bokulich-lab/RESCRIPt: REference Sequence annotation and CuRatIon Pipeline
I used DADA2 to analyze sequences from Genewiz 16S EZ, whose primers are similar to the 341f & 805r primers, so I used RESCRIPt to build a classifier for them, per the advice in this post: "Training classifier without primers information". I previously used these steps for microbiome analysis of whole fish larvae and they worked great. Now I am analyzing samples of gut contents (I scraped the interior of the guts to get both fecal and resident bacteria), and a large percentage of the sequences (ranging from 49% to 85%) are classified only to the domain level as Bacteria (d__Bacteria; p__; etc.). I have BLASTed a few of these sequences (they are actually my six most common sequences), and none match 16S sequences well; instead they come up as what I assume is host DNA (see the attached rep_seqs_final.qzv).
My main question is: have I just sequenced a bunch of host DNA (mentioned as a possibility in the post I cited, especially since I used DADA2, so these sequences are unlikely to be new phyla or chimeras), or did I mess up my classifier procedure somehow? Please see the steps I ran below.
Thank you for your time and help,
Sincerely,
David Bradshaw
#Note: Genewiz says these primers supposedly help identify species better, so I kept the species labels, despite the warning in the guidance, to see whether that is true
qiime rescript get-silva-data \
  --p-version '138.1' \
  --p-target 'SSURef_NR99' \
  --p-include-species-labels \
  --o-silva-sequences silva-138.1-ssu-nr99-rna.qza \
  --o-silva-taxonomy silva-138.1-ssu-nr99-tax.qza
qiime rescript reverse-transcribe \
  --i-rna-sequences silva-138.1-ssu-nr99-rna.qza \
  --o-dna-sequences silva-138.1-ssu-nr99-seqs.qza
qiime rescript cull-seqs \
  --p-n-jobs 4 \
  --i-sequences silva-138.1-ssu-nr99-seqs.qza \
  --o-clean-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza
#Note: I do not need any Eukaryota in my classifier, hence 9999 instead of the number suggested in the tutorial
qiime rescript filter-seqs-length-by-taxon \
  --i-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza \
  --i-taxonomy silva-138.1-ssu-nr99-tax.qza \
  --p-labels Archaea Bacteria Eukaryota \
  --p-min-lens 900 1200 9999 \
  --o-filtered-seqs silva-138.1-ssu-nr99-seqs-filt.qza \
  --o-discarded-seqs silva-138.1-ssu-nr99-seqs-discard.qza
qiime rescript dereplicate \
  --i-sequences silva-138.1-ssu-nr99-seqs-filt.qza \
  --i-taxa silva-138.1-ssu-nr99-tax.qza \
  --p-rank-handles 'silva' \
  --p-mode 'uniq' \
  --o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-derep-uniq.qza \
  --o-dereplicated-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza
qiime feature-classifier extract-reads \
  --i-sequences silva-138.1-ssu-nr99-seqs-derep-uniq.qza \
  --p-f-primer CCTACGGGNGGCWGCAG \
  --p-r-primer GACTACHVGGGTATCTAATCC \
  --p-n-jobs 4 \
  --p-read-orientation 'forward' \
  --o-reads silva-138.1-ssu-nr99-341f-805r-seqs.qza
#I liked the combined approach of the 'super' mode over the other modes. Could this be a problem?
qiime rescript dereplicate \
  --i-sequences silva-138.1-ssu-nr99-341f-805r-seqs.qza \
  --i-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza \
  --p-rank-handles 'silva' \
  --p-mode 'super' \
  --o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-341f-805r-derep-super.qza \
  --o-dereplicated-taxa silva-138.1-ssu-nr99-tax-341f-805r-derep-super.qza
qiime rescript evaluate-fit-classifier \
  --i-sequences silva-138.1-ssu-nr99-seqs-341f-805r-derep-super.qza \
  --i-taxonomy silva-138.1-ssu-nr99-tax-341f-805r-derep-super.qza \
  --o-classifier silva-138.1-99-341f-805r-2021.8-classifier.qza \
  --o-observed-taxonomy silva-138-99-341f-805r--derep-super-taxonomy-predicted-taxonomy.qza \
  --o-evaluation silva-138-99-341f-805r--derep-super-taxonomy-fit-classifier-evaluation.qzv \
  --p-reads-per-batch 10000
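#For anyone wanting to double-check the domain-only share I mention above, here is a rough sketch of how it could be tallied from an exported taxonomy table. The function name domain_only_fraction is made up for illustration, and it assumes the usual three-column taxonomy.tsv layout (Feature ID, Taxon, Confidence) with a single header line. Note that it counts features (ASVs), not reads, so it will not reproduce per-sample read percentages.

```shell
# Hypothetical helper (not part of QIIME 2): count features whose taxonomy
# stops at the domain level in an exported taxonomy.tsv.
# Prints: <domain-only features> <total features> <percent domain-only>
domain_only_fraction() {
  awk -F'\t' '
    NR == 1 { next }                        # skip the header row
    {
      total++
      taxon = $2
      # Domain-only: starts with a domain label and has no named phylum
      # (either no "p__" field at all, or an empty "p__" label).
      if (taxon ~ /^d__/ && taxon !~ /p__[^; ]/) domain_only++
    }
    END {
      if (total == 0) { print "0 0 0.0"; exit }
      printf "%d %d %.1f\n", domain_only, total, 100 * domain_only / total
    }
  ' "$1"
}
```

#Assuming the classification results were saved as taxonomy.qza (a placeholder name), the table can be exported with "qiime tools export --input-path taxonomy.qza --output-path exported" and then checked with "domain_only_fraction exported/taxonomy.tsv".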