Hello everyone,
I'm currently analyzing the COI amplicon region using QIIME2 (version 2022.2) with samples originating from a lake and sequenced using the PE300 sequencing strategy. My amplification primers are as follows:
- Forward: GGWACWGGWTGAACWGTWTAYCCYCC (mlCOIintF)
- Reverse: TANACYTCTGGRTGICCRAARAAYCA (jgHCO2198)
In my analysis workflow, I first imported the data and applied DADA2 denoising with the following commands:
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path ${Data} \
--input-format CasavaOneEightSingleLanePerSampleDirFmt \
--output-path atcc-paired-end.qza
qiime dada2 denoise-paired \
--i-demultiplexed-seqs atcc-paired-end.qza \
--p-trim-left-f 26 \
--p-trim-left-r 26 \
--p-trunc-len-f 0 \
--p-trunc-len-r 0 \
--p-n-threads 20 \
--o-table atcc_table.qza \
--o-representative-sequences atcc_seqs.qza \
--o-denoising-stats atcc_stats.qza
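The value 26 for --p-trim-left-f and --p-trim-left-r is meant to match the primer lengths. A minimal sketch of that check, using the two primer sequences listed above (the variable names are just for illustration). It only confirms the lengths; it assumes each read starts exactly at the first primer base, which I have not verified separately:

```shell
#!/usr/bin/env bash
# Primer sequences copied from the post above.
fwd_primer="GGWACWGGWTGAACWGTWTAYCCYCC"   # mlCOIintF
rev_primer="TANACYTCTGGRTGICCRAARAAYCA"   # reverse primer

# ${#var} expands to the length of the string in characters.
fwd_len=${#fwd_primer}
rev_len=${#rev_primer}

echo "forward primer length: ${fwd_len}"
echo "reverse primer length: ${rev_len}"
```

Both lengths come out as 26, which is where the trim-left values come from.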
Subsequently, I trained a COI classifier following the tutorial "Building a COI database from BOLD references" on the QIIME2 forum and used the following command for taxonomic annotation:
qiime feature-classifier classify-sklearn \
--i-classifier COI.qza \
--i-reads atcc_seqs.qza \
--p-n-jobs 8 \
--o-classification taxonomy.qza
My issue is that the annotation results from classify-sklearn included only 20 genus-level classifications. However, when directly comparing the same sequences with the NCBI NT database via BLAST, I can find many more COI gene matches.
To explore further, I tried classify-consensus-blast for comparison:
qiime feature-classifier classify-consensus-blast \
--i-query atcc_seqs.qza \
--i-reference-reads bold_derep1_seq.qza \
--i-reference-taxonomy bold_derep1_taxa.qza \
--p-perc-identity 0.8 \
--o-classification otu-taxonomy.qza
This time, the results showed annotations for 700 genera. However, when I raised the --p-perc-identity parameter to 0.9, the number of annotated genera decreased to 30.
Why is there such a large discrepancy between the classify-sklearn annotation results and the BLAST comparisons? What is the impact of adjusting the --p-perc-identity value on the annotation results? Are there recommended approaches to increase the number of genus-level annotations so that they better match the results from the NT database comparisons?
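For anyone who wants to compare numbers, here is a rough sketch of one way the genus counts could be tallied from an exported taxonomy table. It assumes the table is the tab-separated taxonomy.tsv written by `qiime tools export` (header row, then Feature ID and Taxon columns) and that genus labels carry a `g__` prefix inside a semicolon-delimited lineage. The prefix format and the function name `count_genera` are assumptions for illustration, so adjust them to your reference taxonomy:

```shell
#!/usr/bin/env bash
# count_genera FILE
# Prints the number of distinct, non-empty genus labels in the Taxon column
# (column 2) of a tab-separated taxonomy table with one header row.
# Assumption: ranks are separated by ";" and genus labels start with "g__".
count_genera() {
    awk -F'\t' 'NR > 1 {
        n = split($2, ranks, ";")
        for (i = 1; i <= n; i++) {
            r = ranks[i]
            gsub(/^[ \t]+|[ \t]+$/, "", r)   # strip surrounding whitespace
            if (r ~ /^g__./) seen[r] = 1     # skip empty "g__" placeholders
        }
    }
    END {
        c = 0
        for (g in seen) c++
        print c
    }' "$1"
}

# Example usage (paths are hypothetical):
#   qiime tools export --input-path taxonomy.qza --output-path exported_sklearn
#   count_genera exported_sklearn/taxonomy.tsv
```

Running the same function on the classify-sklearn and classify-consensus-blast exports gives counts that are directly comparable, since features without a genus-level label are skipped in both cases.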
Thank you all for your assistance!