Taxonomic classification failure with primer trimmed data, EzBioCloud classifier

Hello,

I am a user using Qiime2. [Usage information : Qiime2(ver. 2023.2), Conda enviorment]
I am writing to ask for a solution to an error that occurred during the taxonomy classification step.

Earlier this year, I conducted an analysis involving 28 samples of 16S rRNA gene sequencing data, starting from raw data and proceeding to diversity, taxonomy, and beyond using Qiime2. Furthermore, I utilized tools such as PICRUSt2 to extract valuable results from the data.

However, being my first foray into microbiome analysis, I wasn't as proficient in handling sequence data. I wasn't aware that removing primers beforehand using tools like Cutadapt could yield a greater number of reads prior to performing DADA2 trimming.

Consequently, recently, I preemptively removed primers using Cutadapt and then proceeded with the DADA2 step to secure a larger pool of reads. With this enhanced dataset, I attempted a reanalysis of the data.

The sequences of the primers targeting the V3-V4 region, as well as the non-trimmed and trimmed DADA2 results for each sample, are presented below.

[Raw data information]
paired end sequencing (300bp), adapter removed, primer included

[Primers information]

  • 341F : CCT ACG GGN GGC WGC AG
  • 805R : GAC TAC HVG GGT ATC TAA TCC

[The codes used]
# The data that went directly into DADA2 analysis without primer removal

qiime dada2 denoise-paired
--i-demultiplexed-seqs paired-end-demux-trimmed.qza
--p-trim-left-f 0
--p-trim-left-r 0
--p-trunc-len-f 300
--p-trunc-len-r 256
--o-representative-sequences rep-seqs-dada2.qza
--o-table table-dada2.qza
--o-denoising-stats stats-dada2.qza

# When trimming primers with Cutadapt

qiime cutadapt trim-paired
--i-demultiplexed-sequences paired-end-demux.qza
--p-front-f CCTACGGGNGGCWGCAG
--p-front-r GACTACHVGGGTATCTAATCC
--p-match-adapter-wildcards
--p-match-read-wildcards
--p-discard-untrimmed
--o-trimmed-sequences paired-end-demux-trimmed.qza

qiime dada2 denoise-paired
--i-demultiplexed-seqs paired-end-demux-trimmed.qza
--p-trim-left-f 0
--p-trim-left-r 0
--p-trunc-len-f 283
--p-trunc-len-r 247
--o-representative-sequences rep-seqs-dada2.qza
--o-table table-dada2.qza
--o-denoising-stats stats-dada2.qza

[Primer non-trimmed result]

[Primer trimmed result (Cutadapt)]

In addition, along with Cutadapt-based primer removal, I also attempted DADA2-based primer trimming.

[The codes used]

qiime dada2 denoise-paired
--i-demultiplexed-seqs paired-end-demux-trimmed.qza
--p-trim-left-f 17
--p-trim-left-r 21
--p-trunc-len-f 300
--p-trunc-len-r 256
--o-representative-sequences rep-seqs-dada2.qza
--o-table table-dada2.qza
--o-denoising-stats stats-dada2.qza

[Primer trimmed result (DADA2)]

As evident from the results above, when performing primer trimming, I observed a significant increase in the number of reads during the DADA2 process compared to the previous analysis. Encouraged by this, I aimed to proceed with taxonomy classification.

In my case, I have been focusing on the detection of Salmonella. Previous results using SILVA or Greengenes databases showed challenges in detecting Salmonella during analysis.
(Related URL)

Hence, I trained a classifier based on the EzBioCloud 16S database for taxonomy classification. Consequently, I employed the same EzBioCloud-based classifier for classification attempts this time as well.

[The codes used]

qiime feature-classifier extract-reads
--i-sequences ezbiocloud_qiime_full.qza
--p-f-primer CCTACGGGNGGCWGCAG
--p-r-primer GACTACHVGGGTATCTAATCC
--o-reads ref-seqs-V34.qza

qiime feature-classifier extract-reads
--i-sequences ezbiocloud_qiime_full.qza
--p-f-primer CCTACGGGNGGCWGCAG
--p-r-primer GGATTAGATACCCBDGTAGTC
--p-min-length 0
--p-max-length 400
--o-reads ref-seqs-V34.qza

Despite entering primer sequences as mentioned above and experimenting with various conditions for -p-length, including both default and maximum values like 400, 500, and 600, the taxonomic classification results remained consistent across all features. They were identified as either a single species, 'Bacteria;Proteobacteria;Deltaproteobacteria;Desulfobacterales;Desulfobacteraceae;Desulfamplus;Desulfobacterium niacini,' or 'Bacteria;Firmicutes;Clostridia;Clostridiales;Peptostreptococcaceae;Romboutsia;HQ790341_s.'

Previously, when training the classifier with default values (using non-primer trimmed data), a diverse range of taxonomic classifications were achieved successfully. However, this time, I encountered the following challenging dilemma.

[Wrong results]

Consequently, based on these outcomes, I scoured various Qiime forums and considered the following aspects while adjusting the -p-identity parameter to perform further analyses.
(Related link1, Related link2)

[The codes used]

qiime feature-classifier extract-reads
--i-sequences ezbiocloud_qiime_full.qza
--p-f-primer CCTACGGGNGGCWGCAG
--p-r-primer GACTACHVGGGTATCTAATCC
--p-min-length 0
--p-max-length 400
--p-identity 0.90
--o-reads ref-seqs-V34.qza

When the -p-identity parameter was introduced, unlike the previous results, there was a positive outcome where several additional taxonomies, including Aeromonas, were classified. However, I still observed that this improvement remained limited to a relatively low taxa level.

I'm puzzled by the discrepancies in taxonomic classification due to primer trimming, and I'm struggling to anticipate which aspect of my analysis process might be responsible. I've also explored forums regarding 'hot spring metagenome,' but I couldn't identify any distinct length-related characteristics that would clearly differentiate Desulfobacterium niacini or HQ790341_s, both of which are classified as single species, from other species sequences.

Could someone provide advice on how to address this issue and achieve more comprehensive classification, please?

1 Like

Hello @yonghyun09,

You could try using a mock community dataset with your classifier to isolate that part of your analysis.

Hello @colinvwood, Thank you for your response.
Does using a mock community dataset mean to assess the quality of a classifier?
While the quality of the classifier has been confirmed to be satisfactory, could this possibly pertain to other methods of analysis?
I would appreciate it if you could provide details about the specific methods. Thank you.

Hello @yonghyun09,

Just to confirm. You have one original dataset and one trained classifier. When you do not trim primers, run dada2, and classify you see higher diversity. When you do trim primers, run dada2, and classify you see lower diversity. There are no variables other than primer trimming?

Hello @colinvwood,

Yes, that's correct. The only difference from the previous analysis and the changed approach is the trimming of the primer, so there were no other variables aside from the primer trimming.

Hello @yonghyun09,

What are the relative abundances of those two species and unclassified/unknown assignments (if any) in your samples?

What commands did you use to train the classifier and to classify your sequences?

Hello @colinvwood

When Desulfobacterium niacini was classified:
Desulfobacterium niacini (99.975~100%), unclassified/unknown (0.0~0.025%)
taxa-bar-plots_De.qzv (338.8 KB)

When HQ790341_s was classified:
HQ790341_s (48.082~85.542%), unclassified/unknown (14.458~51.918%)
taxa-bar-plots_HQ.qzv (339.7 KB)

[Codes to classify my samples]

qiime tools import
--type 'FeatureData[Sequence]'
--input-path ezbiocloud_qiime_full.fasta
--output-path ezbiocloud_qiime_full.qza

qiime tools import
--type 'FeatureData[Taxonomy]'
--input-format HeaderlessTSVTaxonomyFormat
--input-path ezbiocloud_id_taxonomy.txt
--output-path ref-taxonomy.qza

qiime feature-classifier extract-reads
--i-sequences ezbiocloud_qiime_full.qza
--p-f-primer CCTACGGGNGGCWGCAG
--p-r-primer GACTACHVGGGTATCTAATCC
--p-min-length 0
--p-max-length 0
--o-reads ref-seqs-V34.qza

qiime feature-classifier fit-classifier-naive-bayes
--i-reference-reads ref-seqs-V34.qza
--i-reference-taxonomy ref-taxonomy.qza
--o-classifier classifier.qza

qiime feature-classifier classify-sklearn
--i-classifier classifier.qza
--i-reads rep-seqs-dada2.qza
--o-classification taxonomy.qza

qiime metadata tabulate
--m-input-file taxonomy.qza
--o-visualization taxonomy.qzv

The provided codes were used, and various classifiers were trained by adjusting the --p-length parameter. When analyzing samples with non-trimmed primers in the past, applying the variables as shown below resulted in consistent classification of my samples:

For Desulfobacterium niacini classification (example):
--p-min-length 400
--p-max-length 600

For HQ790341_s classification (example):
--p-min-length 0
--p-max-length 0 (or 500, etc.)

The mentioned parameters were employed for training.

And when --p-max-length is set to 400 (min = 0), some species are observed more specifically as shown below, but unlike the data at the time of the first analysis, there was a markedly insufficient aspect.
taxa-bar-plots.qzv (363.9 KB)

1 Like

Hello @yonghyun09,

From reading other posts it sounds like one way to "solve" this issue is to remove the problematic sequences (i.e. the ones that are dominating the classification) from the classifier.

A higher-level solution is to try another classifier if the one you're using ends up not working for your sequences.

1 Like

Hello @colinvwood ,

Thank you for your response.
I was hesitant about making arbitrary modifications to the contents in the database, so I followed your suggestion and removed Desulfobacterium niacini and HQ790341_s from both FeatureData[Sequence] and FeatureData[Taxonomy].
After doing this, I attempted the analysis again and successfully completed the classification process!

While there weren't any distinctive differences like sequence length or other factors in the original data for Desulfobacterium niacini and HQ790341_s compared to other species, it's puzzling how the mere presence or absence of these species could lead to such variations in results. I'm grateful for your ongoing assistance and will proceed with the next analysis. Thank you.

1 Like