Hello,
I am a Qiime2 user. [Usage information: Qiime2 (ver. 2023.2), Conda environment]
I am writing to ask for a solution to an error that occurred during the taxonomy classification step.
Earlier this year, I conducted an analysis involving 28 samples of 16S rRNA gene sequencing data, starting from raw data and proceeding to diversity, taxonomy, and beyond using Qiime2. Furthermore, I utilized tools such as PICRUSt2 to extract valuable results from the data.
However, being my first foray into microbiome analysis, I wasn't as proficient in handling sequence data. I wasn't aware that removing primers beforehand using tools like Cutadapt could yield a greater number of reads prior to performing DADA2 trimming.
I therefore recently removed the primers with Cutadapt before running DADA2, in order to retain a larger pool of reads, and then attempted to reanalyze the data with this enhanced dataset.
The sequences of the primers targeting the V3-V4 region, as well as the non-trimmed and trimmed DADA2 results for each sample, are presented below.
[Raw data information]
paired end sequencing (300bp), adapter removed, primer included
[Primers information]
- 341F : CCT ACG GGN GGC WGC AG
- 805R : GAC TAC HVG GGT ATC TAA TCC
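For reference, here is a small sketch that derives the numbers used later in this post from the two primers: their lengths (used as the DADA2 trim-left values) and the reverse complement of 805R (used in one of my extract-reads attempts). The `revcomp` helper is only an illustration written for this post, not part of Qiime2, and it assumes `rev` and `tr` are available.

```shell
# Primers as listed above (IUPAC codes, 5'->3', spaces removed).
fwd_primer="CCTACGGGNGGCWGCAG"
rev_primer="GACTACHVGGGTATCTAATCC"

# Illustrative helper: reverse-complement a sequence, including IUPAC
# ambiguity codes (W and S are their own complements, so they are left as-is).
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTRYKMBDHVN' 'TGCAYRMKVHDBN'
}

echo "341F length: ${#fwd_primer}"
echo "805R length: ${#rev_primer}"
echo "805R reverse complement: $(revcomp "$rev_primer")"
```

The two lengths come out as 17 and 21, and the reverse complement is GGATTAGATACCCBDGTAGTC.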
[The codes used]
# The data that went directly into DADA2 analysis without primer removal
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux-trimmed.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 300 \
  --p-trunc-len-r 256 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza
# When trimming primers with Cutadapt
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences paired-end-demux.qza \
  --p-front-f CCTACGGGNGGCWGCAG \
  --p-front-r GACTACHVGGGTATCTAATCC \
  --p-match-adapter-wildcards \
  --p-match-read-wildcards \
  --p-discard-untrimmed \
  --o-trimmed-sequences paired-end-demux-trimmed.qza

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux-trimmed.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 283 \
  --p-trunc-len-r 247 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza
[Primer non-trimmed result]
[Primer trimmed result (Cutadapt)]
In addition, along with Cutadapt-based primer removal, I also attempted DADA2-based primer trimming.
[The codes used]
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux-trimmed.qza \
  --p-trim-left-f 17 \
  --p-trim-left-r 21 \
  --p-trunc-len-f 300 \
  --p-trunc-len-r 256 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza
[Primer trimmed result (DADA2)]
As evident from the results above, when performing primer trimming, I observed a significant increase in the number of reads during the DADA2 process compared to the previous analysis. Encouraged by this, I aimed to proceed with taxonomy classification.
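As a rough sanity check on the truncation settings in the three DADA2 runs, the expected overlap between forward and reverse reads can be estimated with simple arithmetic. This is only a sketch: the 460 bp amplicon length (primers included) is an assumption for a typical 341F/805R product and not something I measured, the `overlap` helper is made up for this post, and the third case assumes the reads still contain their primers when trim-left is applied. As far as I understand, DADA2 needs roughly 12 overlapping bases by default to merge a pair.

```shell
# ASSUMPTION: ~460 bp amplicon including both primers; adjust for your target.
amplicon_len=460
fwd_primer_len=17   # length of 341F
rev_primer_len=21   # length of 805R

# Region the reads must span once both primers are removed.
insert_len=$(( amplicon_len - fwd_primer_len - rev_primer_len ))

# Illustrative helper: overlap = kept forward bases + kept reverse bases - target length.
overlap() {  # args: fwd_kept rev_kept target_len
  echo $(( $1 + $2 - $3 ))
}

echo "No primer removal (trunc 300/256): $(overlap 300 256 "$amplicon_len")"
echo "Cutadapt, then DADA2 (trunc 283/247): $(overlap 283 247 "$insert_len")"
# With trim-left, the kept length is trunc-len minus trim-left.
echo "DADA2 trim-left 17/21 (trunc 300/256): $(overlap $((300 - 17)) $((256 - 21)) "$insert_len")"
```

Under these assumptions all three settings leave roughly 100 bases of overlap, so the truncation lengths themselves do not look like the limiting factor for merging.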
In my case, I have been focusing on the detection of Salmonella. In my previous analyses, classifiers based on the SILVA or Greengenes databases had difficulty detecting Salmonella.
(Related URL)
Hence, I trained a classifier based on the EzBioCloud 16S database for taxonomy classification. Consequently, I employed the same EzBioCloud-based classifier for classification attempts this time as well.
[The codes used]
# Attempt 1: reverse primer entered as listed above
qiime feature-classifier extract-reads \
  --i-sequences ezbiocloud_qiime_full.qza \
  --p-f-primer CCTACGGGNGGCWGCAG \
  --p-r-primer GACTACHVGGGTATCTAATCC \
  --o-reads ref-seqs-V34.qza

# Attempt 2: reverse primer entered as its reverse complement, with explicit length limits
qiime feature-classifier extract-reads \
  --i-sequences ezbiocloud_qiime_full.qza \
  --p-f-primer CCTACGGGNGGCWGCAG \
  --p-r-primer GGATTAGATACCCBDGTAGTC \
  --p-min-length 0 \
  --p-max-length 400 \
  --o-reads ref-seqs-V34.qza
Despite entering the primer sequences as shown above and experimenting with the length parameters (--p-min-length and --p-max-length), using both the defaults and maximum lengths such as 400, 500, and 600, the taxonomic classification results were the same in every case. All features were assigned to one of just two species: 'Bacteria;Proteobacteria;Deltaproteobacteria;Desulfobacterales;Desulfobacteraceae;Desulfamplus;Desulfobacterium niacini' or 'Bacteria;Firmicutes;Clostridia;Clostridiales;Peptostreptococcaceae;Romboutsia;HQ790341_s'.
Previously, when I trained the classifier with default values (using non-primer-trimmed data), I obtained a diverse range of taxonomic classifications. This time, however, I got the following results.
[Wrong results]
Based on these results, I searched the Qiime forums and, taking the threads linked below into account, adjusted the --p-identity parameter and ran further analyses.
(Related link1, Related link2)
[The codes used]
qiime feature-classifier extract-reads \
  --i-sequences ezbiocloud_qiime_full.qza \
  --p-f-primer CCTACGGGNGGCWGCAG \
  --p-r-primer GACTACHVGGGTATCTAATCC \
  --p-min-length 0 \
  --p-max-length 400 \
  --p-identity 0.90 \
  --o-reads ref-seqs-V34.qza
When I set the --p-identity parameter, the results improved compared with the previous run: several additional taxa, including Aeromonas, were classified. However, I still observed that this improvement remained limited to a relatively low taxa level.
I'm puzzled by the discrepancies in taxonomic classification due to primer trimming, and I'm struggling to anticipate which aspect of my analysis process might be responsible. I've also explored forums regarding 'hot spring metagenome,' but I couldn't identify any distinct length-related characteristics that would clearly differentiate Desulfobacterium niacini or HQ790341_s, both of which are classified as single species, from other species sequences.
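In case it helps anyone reproduce the length comparison, this is a minimal sketch of how sequence lengths in an exported FASTA file can be tallied. The `fasta_length_counts` function is a helper written for this post, and the export directory name in the usage comment is hypothetical; I am assuming that exporting a sequence artifact with `qiime tools export` yields a file named dna-sequences.fasta.

```shell
# Illustrative helper: print a "length<TAB>count" table for a FASTA file.
fasta_length_counts() {
  awk '
    /^>/ { if (len) counts[len]++; len = 0; next }  # header: store the previous record length
         { len += length($0) }                      # sequence line (records may be wrapped)
    END  { if (len) counts[len]++
           for (l in counts) print l "\t" counts[l] }
  ' "$1" | sort -n
}

# Hypothetical usage on the extracted reference reads:
#   qiime tools export --input-path ref-seqs-V34.qza --output-path ref-seqs-V34-export
#   fasta_length_counts ref-seqs-V34-export/dna-sequences.fasta
```

Running the same tally on the exported representative sequences would show whether the reference reads and the query reads cover a similar length range.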
Could someone provide advice on how to address this issue and achieve more comprehensive classification, please?