Feature classifier fails to accurately classify one set of samples (works for other sets)

mike_kratz · August 3, 2022, 7:50pm

Hi there!

I looked for similar posts but didn't see any offhand (I'm sorry if this has been answered before; I don't want to waste your time!). Briefly, I have three sequencing datasets (from river water samples), all using the same primer pair (515yF and 926pfR) and generated from three different sequencing runs (two from the same company [Mr. DNA] and one from another [RTL Genomics]). Two of these sets appear to be "properly classified" (i.e., they had hits past "Domain Bacteria" and taxa similar to prior river samples), while one set fails to be "properly classified". The classification was carried out with a self-trained q2-feature-classifier classifier built from SILVA v138 and trimmed to the primer-specific region with the RESCRIPt plugin (following the tutorial's instructions). This classifier did not have this issue with a previous, more complex dataset. This was all carried out with QIIME 2 v2022.2 installed using conda on Ubuntu v 20.24.

Details of analysis:

Demux quality plots for each sequencing run:
mr.dna-aug-demux.qzv (315.9 KB)
mr.dna-dec-demux.qzv (313.6 KB)
rtl-aug-demux.qzv (316.6 KB)

#I used DADA2 (denoise-single) separately for each set of samples with the same trim and truncation parameters to follow the assumptions of the DADA2 error model. Ex.
qiime dada2 denoise-single \
  --i-demultiplexed-seqs mr.dna-aug-demux.qza \
  --p-trim-left 9 \
  --p-trunc-len 294 \
  --o-representative-sequences mr.dna.aug.rep-seqs.qza \
  --o-table mr.dna.aug.table.qza \
  --o-denoising-stats mr.dna.aug.stats.qza

Output from all three runs:
mr.dna.aug.stats.qzv (1.2 MB)
mr.dna.dec.stats.qzv (1.2 MB)
rtl.aug.stats.qzv (1.2 MB)
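
One way to make sure every run really gets the same trim and truncation settings is to generate the three commands from a single template. The snippet below is only a sketch: it prints the commands (a dry run) rather than executing them, the function name print_denoise_cmd is made up for illustration, and the run names and flags are copied from the example above.

```shell
# Dry run: print one denoise-single command per sequencing run so that
# all runs share the same trim/truncation settings.
# Run names and flags are taken from the commands shown in this post.
TRIM_LEFT=9
TRUNC_LEN=294

# print_denoise_cmd is a hypothetical helper, not part of QIIME 2.
print_denoise_cmd() {
  run="$1"                                   # e.g. mr.dna-aug
  prefix=$(printf '%s' "$run" | tr '-' '.')  # e.g. mr.dna.aug
  printf 'qiime dada2 denoise-single --i-demultiplexed-seqs %s-demux.qza --p-trim-left %s --p-trunc-len %s --o-representative-sequences %s.rep-seqs.qza --o-table %s.table.qza --o-denoising-stats %s.stats.qza\n' \
    "$run" "$TRIM_LEFT" "$TRUNC_LEN" "$prefix" "$prefix" "$prefix"
}

for run in mr.dna-aug mr.dna-dec rtl-aug; do
  print_denoise_cmd "$run"
done
```

Once the printed commands look right, they can be pasted into the terminal or piped to a shell.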

#All three DADA2 output files were merged using:
qiime feature-table merge \
  --i-tables ./mr.dna-aug-seqs/mr.dna.aug.table.qza \
  --i-tables ./mr.dna-dec-seqs/mr.dna.dec.table.qza \
  --i-tables ./rtl-aug-seqs/rtl.aug.table.qza \
  --o-merged-table dada2.merged.table.qza

qiime feature-table merge-seqs \
  --i-data ./mr.dna-aug-seqs/mr.dna.aug.rep-seqs.qza \
  --i-data ./mr.dna-dec-seqs/mr.dna.dec.rep-seqs.qza \
  --i-data ./rtl-aug-seqs/rtl.aug.rep-seqs.qza \
  --o-merged-data merged.rep-seqs.qza

Output from merging:
merged.rep-seqs.qzv (1.1 MB)
dada2.merged.table.qzv (646.8 KB)

#All well and smooth. Now here comes the fun part...
#I run the RESCRIPt made q2-feature classifier (mentioned above):
qiime feature-classifier classify-sklearn \
  --p-n-jobs -1 \
  --p-reads-per-batch 5000 \
  --i-classifier silva-138-ssu-nr99-515f-926r-classifier.qza \
  --i-reads merged.rep-seqs.qza \
  --o-classification bonita.taxonomy.qza

#Filter the output
qiime taxa filter-table \
  --i-table dada2.merged.table.qza \
  --i-taxonomy bonita.taxonomy.qza \
  --p-exclude mitochondria,chloroplast,eukaryota \
  --o-filtered-table dada2.merged.filtered.table.qza

#And generate barplots
qiime taxa barplot \
  --i-table dada2.merged.filtered.table.qza \
  --i-taxonomy bonita.taxonomy.qza \
  --m-metadata-file bonita.metadata.tsv \
  --o-visualization filtered-taxa-bar-plots.qzv

#Barplots
filtered-taxa-bar-plots.qzv (2.0 MB)

So something is suspect about the December sequencing run (mr.dna-dec-demux.qza) compared to the other two, even though all three were processed identically. I am not sure whether I made an error earlier on or am overlooking something, but I am quite confused. Attached below are the counts/taxonomic classifications I got back from Mr. DNA, but I wanted to run all the datasets through the same pipeline for realistic comparisons. I am not sure whether this has something to do with the primer-specific region used to train the RESCRIPt classifier, but that shouldn't really affect the results if the samples were sequenced with the same primers.
FullTaxa.genus.counts.txt (55.7 KB)
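
To put a number on "hits past Domain Bacteria", the classification can be exported (qiime tools export writes a taxonomy.tsv) and the features tallied by how many ranks they were assigned. The snippet below is a minimal sketch under two assumptions: the exported file has the usual three columns (Feature ID, Taxon, Confidence), and ranks are separated by semicolons as in SILVA-style strings. The function name count_by_depth and the demo table are invented for illustration and are not data from this thread.

```shell
# Count features by number of assigned ranks in an exported taxonomy.tsv.
# Assumes columns "Feature ID<TAB>Taxon<TAB>Confidence" and ranks separated
# by ";" (for example "d__Bacteria; p__Proteobacteria").
# count_by_depth is a hypothetical helper name.
count_by_depth() {
  awk -F '\t' 'NR > 1 {
      # "Unassigned" counts as depth 0; otherwise count the ranks
      n = ($2 == "Unassigned") ? 0 : split($2, ranks, ";")
      depth[n]++
    }
    END { for (d in depth) print d, depth[d] }' "$1" | sort -n
}

# Tiny made-up table for demonstration only.
tmp=$(mktemp)
printf 'Feature ID\tTaxon\tConfidence\n' > "$tmp"
printf 'f1\td__Bacteria\t0.99\n' >> "$tmp"
printf 'f2\td__Bacteria\t0.98\n' >> "$tmp"
printf 'f3\td__Bacteria; p__Proteobacteria; c__Gammaproteobacteria\t0.91\n' >> "$tmp"
count_by_depth "$tmp"   # each output line is "<depth> <number of features>"
rm -f "$tmp"
```

A run in which nearly every feature sits at depth 0 or 1 would match what the barplots show for the December samples.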

Thank you for your help!

SoilRotifer · August 4, 2022, 4:05pm

Hi @mike_kratz, welcome to :qiime2:!

Hopefully, we can help you out here. When I see very few taxonomic groups being assigned, as you have here, the usual culprit is that the sequencing reads are in mixed orientation. You can use qiime rescript orient-seqs ... to help resolve some of this, especially if you plan to construct phylogenetic trees or are using the qiime feature-classifier classify-sklearn classifier.

But before doing that, you can quickly sanity-check the taxonomy assignments with qiime feature-classifier classify-consensus-vsearch. If I remember correctly, this approach can assign taxonomy regardless of the direction of the reads. This would solve the taxonomy assignment issue but not the phylogeny issue, as the reads need to be oriented in the same direction for a proper alignment and the resulting phylogeny to be constructed.

If you observe something that makes sense via classify-consensus-vsearch, and would like to use qiime feature-classifier classify-sklearn then you'll need to use the orient-seqs command I referred to earlier.

If you search the forum for mixed orientation, you'll see there are quite a few different threads and potential solutions for this.

Let us know what you observe.

mike_kratz · August 5, 2022, 4:11am

Thank you @SoilRotifer for your helpful response! That seems to make perfect sense, I will work on using qiime feature-classifier classify-consensus-vsearch and get back to you if it works!

Thank you again for your response!

mike_kratz · August 5, 2022, 9:19pm

@SoilRotifer Hi Mike, unfortunately qiime feature-classifier classify-consensus-vsearch did not help; I ended up with similar results (almost none were classified in Domain Bacteria this time).

vsearch-taxa-bar-plots.qzv (1.8 MB)

I read through some of the mixed-orientation posts, but I just want to confirm: mixed-orientation reads are due to forward and reverse reads not being properly separated into separate R1 and R2 files, correct? (This leads to some sequences in the same file beginning with different primers.)
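
A quick way to see whether a single FASTQ file contains reads starting with different primers is to count how many reads carry the forward primer versus the reverse primer near their 5' end. The snippet below is only a sketch. The primer sequences are the commonly published 515F-Y and 926R sequences and are an assumption here, so they should be checked against what the sequencing facility actually used. The 40-base search window and the function names iupac_to_regex and count_orientation are also made up for illustration, and the FASTQ is assumed to be uncompressed.

```shell
# Expand IUPAC degenerate bases in a primer into a regular expression.
iupac_to_regex() {
  printf '%s' "$1" | sed -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/M/[AC]/g' \
    -e 's/K/[GT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' -e 's/B/[CGT]/g' \
    -e 's/D/[AGT]/g' -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Usage: count_orientation reads.fastq FORWARD_PRIMER REVERSE_PRIMER
# Prints three counts: reads with the forward primer, reads with the
# reverse primer, and reads with neither, looking only at the first
# 40 bases of each read (an arbitrary window chosen for this sketch).
count_orientation() {
  fwd=$(iupac_to_regex "$2")
  rev=$(iupac_to_regex "$3")
  awk -v fwd="$fwd" -v rev="$rev" 'NR % 4 == 2 {
      head = substr($0, 1, 40)
      if (head ~ fwd) f++
      else if (head ~ rev) r++
      else n++
    }
    END { print f + 0, r + 0, n + 0 }' "$1"
}

# Toy FASTQ invented for illustration: one read per category.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
@fwd_read
GTGCCAGCAGCCGCGGTAAACGT
+
IIIIIIIIIIIIIIIIIIIIIII
@rev_read
CCGTCAATTCATTTGAGTTTACGT
+
IIIIIIIIIIIIIIIIIIIIIIII
@other_read
ACGTACGTACGT
+
IIIIIIIIIIII
EOF
count_orientation "$tmp" GTGYCAGCMGCCGCGGTAA CCGYCAATTYMTTTRAGTTT   # prints: 1 1 1
rm -f "$tmp"
```

If a file that should hold only forward reads shows a large share of reverse-primer hits, that is consistent with mixed orientation.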

Would I have to convert my demux.qza file to the 'FeatureData[Sequence]' filetype in order to properly run 'orient-seqs'? I ran 'orient-seqs' with my dada2 denoised output and it worked (didn't give an error message) but I'm not sure if that procedure is logically sound (I'm assuming it is better to orient before dada2 processing for the sake of the error model parameters and I'm also missing all of the forward reads from the R2 file).

If converting the demux.qza file is the right way to go, how would I go about doing that?

Thank you for your help!

SoilRotifer · August 6, 2022, 11:52pm

Correct.

Yes, exactly.

I had a lapse in my thinking when I recommended the orient-seqs command. Sorry to lead you astray!

I think the only tool we have implemented in QIIME 2 that can demux reads in mixed orientation is the qiime cutadapt demux-paired ... command with the --p-mixed-orientation parameter. But I think cutadapt expects the barcode to be within the sequence itself, which you can arrange by concatenating the I1 sequence to the R1 as outlined here. Even then, I think you'd still have to know something about orientation.
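
The concatenation step mentioned above could look roughly like the following. This is a sketch only, assuming uncompressed four-line FASTQ files whose I1 and R1 records are in the same order; the function name prepend_barcodes and the demo reads are made up for illustration. It shows nothing more than the concatenation itself, and whether the result then demultiplexes correctly is not something this thread confirms.

```shell
# Usage: prepend_barcodes I1.fastq R1.fastq > R1_with_barcodes.fastq
# Prepends each index (I1) read to the matching R1 read, for both the
# sequence and the quality line, so the barcode sits at the 5' end of R1.
# Assumes uncompressed four-line FASTQ records in identical order.
prepend_barcodes() {
  paste "$1" "$2" | awk -F '\t' '{
      line = NR % 4
      if (line == 1)      print $2      # header: keep the R1 header
      else if (line == 3) print "+"     # separator line
      else                print $1 $2   # sequence or quality: I1 then R1
    }'
}

# Toy records invented for illustration.
dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$dir/I1.fastq"
printf '@read1\nGGGTTT\n+\nFFFFFF\n' > "$dir/R1.fastq"
prepend_barcodes "$dir/I1.fastq" "$dir/R1.fastq"
rm -rf "$dir"
```

For the toy records, the output read has the sequence ACGTGGGTTT and the quality string IIIIFFFFFF.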

I'd ask your sequencing facility to help you out with this. I am sure they have code or tools available to help you get the data in the correct format.

mike_kratz · August 7, 2022, 12:10am

@SoilRotifer No worries! OK, Mr. DNA (the sequencing facility) does have processing tools that are free to use, but I'll have to check whether any of them can be used in this scenario. If they aren't useful, I'll talk with my advisor and we will contact the facility. Thank you again for your help; you definitely clarified things and prevented me from wasting more time on an "impossible" task.

Take care! (also thank you for your work on the RESCRIPt package, it was very helpful)

SoilRotifer · August 7, 2022, 1:09am

Great! Keep us posted!
