Host DNA or incorrect classifer workflow leading to high percentages just domain level classification

David_Bradshaw · September 14, 2021, 7:52pm

Dear QIIME2 Developers,

I have searched the forum for questions similar to this and have found the following thread that was very helpful D_0__Bacteria - uncultured bacterium from water samples, but I wanted to get advice specific to my situation and honestly my classifer making scripts before moving forward with analysis.

Versions:
QIIME2 2021.8 via conda
Minimal RESCRIPT installed following directions found here GitHub - bokulich-lab/RESCRIPt: REference Sequence annotation and CuRatIon Pipeline

I used dada2 to analyze sequences from Genewiz 16S EZ, whose primers are similar to the 341f & 805r primers, thus I used RESCRIPT to build a classifier for them per advice from this post Training classifier without primers information. I previously used these steps to analyze whole fish larvae for microbiome analysis and it worked great, but now I am analyzing samples from gut contents (scrapped the interior of guts to get both fecal and resident bacteria) and I have a large percentage (ranging from 85-49%) of sequences just classified to the domain level as Bacteria (d__Bacteria; p__;etc..). I have BLASTed a couple of these sequences (they are actually my top 6 most common sequences) and none match well to 16S sequences coming up as what I assume is host DNA rep_seqs_final.qzv (1.2 MB) .

My main question would be have I just sequenced a bunch of host DNA (as mentioned in the post I cited as a possibility especially since I used dada2 and they are unlikely to be new phyla or chimeras) or did I mess up my classifer procedure somehow? Please see steps I ran below.

Thank you for your time and help,

Sincerely,

David Bradshaw

#Note Genewiz says that these primers supposedly help id species better, hence me keeping the labels despite warning on guidance to see if true
qiime rescript get-silva-data
--p-version '138.1'
--p-target 'SSURef_NR99'
--p-include-species-labels
--o-silva-sequences silva-138.1-ssu-nr99-rna.qza
--o-silva-taxonomy silva-138.1-ssu-nr99-tax.qza

qiime rescript reverse-transcribe
--i-rna-sequences silva-138.1-ssu-nr99-rna.qza
--o-dna-sequences silva-138.1-ssu-nr99-seqs.qza

qiime rescript cull-seqs
--p-n-jobs 4
--i-sequences silva-138.1-ssu-nr99-seqs.qza
--o-clean-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza

#Note I do not need any eukaryota in my classifier, thus 9999 instead of suggested number in tutorial
qiime rescript filter-seqs-length-by-taxon
--i-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza
--i-taxonomy silva-138.1-ssu-nr99-tax.qza
--p-labels Archaea Bacteria Eukaryota
--p-min-lens 900 1200 9999
--o-filtered-seqs silva-138.1-ssu-nr99-seqs-filt.qza
--o-discarded-seqs silva-138.1-ssu-nr99-seqs-discard.qza

qiime rescript dereplicate
--i-sequences silva-138.1-ssu-nr99-seqs-filt.qza
--i-taxa silva-138.1-ssu-nr99-tax.qza
--p-rank-handles 'silva'
--p-mode 'uniq'
--o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-derep-uniq.qza
--o-dereplicated-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza

qiime feature-classifier extract-reads
--i-sequences silva-138.1-ssu-nr99-seqs-derep-uniq.qza
--p-f-primer CCTACGGGNGGCWGCAG
--p-r-primer GACTACHVGGGTATCTAATCC
--p-n-jobs 4
--p-read-orientation 'forward'
--o-reads silva-138.1-ssu-nr99-341f-805r-seqs.qza

#I liked the combo approach of the super mode over the other modes, could this be a problem?
qiime rescript dereplicate
--i-sequences silva-138.1-ssu-nr99-341f-805r-seqs.qza
--i-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza
--p-rank-handles 'silva'
--p-mode 'super'
--o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-341f-805r-derep-super.qza
--o-dereplicated-taxa silva-138.1-ssu-nr99-tax-341f-805r-derep-super.qza

qiime rescript evaluate-fit-classifier
--i-sequences silva-138.1-ssu-nr99-seqs-341f-805r-derep-super.qza
--i-taxonomy silva-138.1-ssu-nr99-tax-341f-805r-derep-super.qza
--o-classifier silva-138.1-99-341f-805r-2021.8-classifier.qza
--o-observed-taxonomy silva-138-99-341f-805r--derep-super-taxonomy-predicted-taxonomy.qza
--o-evaluation silva-138-99-341f-805r--derep-super-taxonomy-fit-classifier-evaluation.qzv
--p-reads-per-batch 10000

SoilRotifer · September 16, 2021, 7:38pm

Hi @David_Bradshaw,

I think much of what you have looks fine. Just a couple comments / suggestions:

EDIT (05/22/2023): ~~If you are running qiime rescript get-silva-data, there is no need to run the above command, as get-silva-data runs this for you behind the scenes.~~ My mistake, this command should be run.

I'd avoid removing Eukaryotes. A good classifier will contain a bunch of "outgroup" taxa. That is you should want to identify Eukaryotes, or any other "bad" items. Otherwise, Eukaryotic sequences might be incorrectly classified as Archaea or Bacteria. See here and here.

This is perfectly fine. There is more than way to do things.

Finally, I'd highly suggest you make sure that your reads are not in mixed orientation. For more details on what I mean please see:

David_Bradshaw · September 20, 2021, 1:28pm

Dear Mike Robeson,

Thank you very much for all the help and guidance.

I did not realize that the get-silva-data already runs the reverse-transcribe script, thank you for pointing that out.

Good to know about needing to contain some outgroup taxa, I will remake the classifier with Eukaryotes then.

Glad to know that super is still a good mode to use, it was one of the things I did differently from the main RESCRIPT pipeline, so wanted to be sure.

I will use the classify-consensus-vsearch method in the posts you provided to see if the mixed orientation could be a problem.

Sincerely,

David

SoilRotifer · September 20, 2021, 1:53pm

You're welcome!

Do not forget you can try the orient-seqs command in RESCRIPt too.

Keep us posted!

David_Bradshaw · September 23, 2021, 8:54pm

Dear Mike Robeson,

Thank you I will try the orient-seqs too, but near as I can tell I can only do it on my post dada2 sequences because I have imported my sequences in as SampleData[PairedEndSequencesWithQuality] instead of the required FeatureData[Sequence] which is thankfully the output of dada2. Although I imagine I should be reorienting before dada2 to be most effective?

I will try all methods and keep you posted as soon as I can! Thank you very much!

Sincerely,

David

David_Bradshaw · October 12, 2021, 12:54pm

Dear Mike Robeson,

Orient-seqs with sklearn and classify-consensus-vsearch (both with the remade classifier with Eukaryotes) essentially lead to similar results so thank you for suggesting both of them. Seeing them have similar results makes me fell more confident in the results. After these two steps all of the samples had similar microbial communities instead of some having very high percentages of D_1__Bacteria;; etc... which mostly turned out unassigned with vsearch or were straight up removed by orient-seqs. There were also no obvious outliers in terms of community members in the samples. Thank you very much for your time and help. I will call this thread solved now.

Sincerely,

David

system · November 12, 2021, 6:55pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.