Dear supporters,
I’m relatively new to QIIME2, so apoplogies if I’m missing something obvious.
I’m working with fungal ITS2 data and ran into an issue during taxonomic classification with feature-classifier classify-sklearn. I’m using QIIME2 (q2cli version 2024.2.0, installed via conda on Linux).
Because of large length variation, I couldn’t reliably merge reads within DADA2, so I merged them externally using PEAR and BBmerge, then imported them as single-end reads into QIIME2.
So I have three datasets:
- DADA2 paired-end
- PEAR merged (single-end)
- BBmerge merged (single-end)
Denoising worked fine in all cases (no truncation, only max-ee filtering), and I confirmed that some ASVs are identical in sequence and length across datasets.
I trained a classifier using the UNITE database. In the DADA2 dataset, classification works normally (including species-level). However, in PEAR and BBmerge datasets, almost everything is unassigned (only one species classified).
What’s confusing is that some ASVs that are unassigned in PEAR/BBmerge are exactly the same sequences as ASVs in the DADA2 dataset that are successfully classified. I checked sequence identity, length, and BLAST results, and they are identical.
A few additional observations:
- DADA2 has ~4k ASVs, while PEAR/BBmerge have ~15k
- Confidence scores for the same ASV are slightly lower in merged data (~95% vs ~85%)
- When I test a single sample, PEAR/BBmerge classification works fine
- When using the full dataset, the issue reappears
Since identical sequences are classified differently depending on the dataset size, I suspect this may be related to batch processing or classifier behavior rather than sequence differences.
There are no error messages during execution.
Here are the commands I used:
- DADA2
qiime dada2 denoise-paired \
--i-demultiplexed-seqs paired-end.qza \
--p-trunc-len-f 0 \
--p-trunc-len-r 0 \
--p-max-ee-f 5.0 \
--p-max-ee-r 8.0 \
--p-min-overlap 12 \
--p-n-threads 0 \
--o-table dada2_table.qza \
--o-representative-sequences dada2_rep_seq.qza \
--o-denoising-stats dada2_stats.qza
qiime feature-classifier classify-sklearn \
-i-classifier classifier.qza \
--i-reads dada2_rep_seq.qza \
--p-n-jobs 1 \
--p-reads-per-batch 1000 \
--o-classification dada2_tax.qza
- PEAR
pear -f "$r1_file" -r "$r2_file" -o "$RESULT_DIR/$sample_id" \
-v 20 \
-m 500 \
-n 50 \
-p 0.001 \
-u 0 \
-j 10 \
-y 8G
qiime dada2 denoise-single \
--i-demultiplexed-seqs pear_merged.qza \
--p-trim-left 0 \
--p-trunc-len 0 \
--p-max-ee 5.0 \
--o-table pear_table.qza \
--o-representative-sequences pear_rep_seq.qza \
--o-denoising-stats pear_stats.qza
qiime feature-classifier classify-sklearn \
--i-classifier classifier.qza \
--i-reads pear_rep_seq.qza \
--p-n-jobs 1 \
--p-reads-per-batch 1000 \
--o-classification pear_tax.qza
- BBmerge
bbmerge.sh in1="$r1_file" in2="$r2_file" \
out="${OUTPUT_DIR}/${sample_id}_merged.fastq.gz" \
outu="${OUTPUT_DIR}/${sample_id}_unmerged.fastq.gz" \
mininsert=50 \
maxlength=500 \
k=62 \
extend2=50 \
iterations=5 \
vstrict=f \
loose=t \
threads=10
qiime dada2 denoise-single \
--i-demultiplexed-seqs bb_merged.qza \
--p-trim-left 0 \
--p-trunc-len 0 \
--p-max-ee 5.0 \
--o-table bb_table.qza \
--o-representative-sequences bb_rep_seq.qza \
--o-denoising-stats bb_stats.qza
qiime feature-classifier classify-sklearn \
--i-classifier classifier.qza \
--i-reads bb_rep_seq.qza \
--p-n-jobs 1 \
--p-reads-per-batch 1000 \
--o-classification bb_tax.qza
At this point I’m not sure what could be causing this difference.
Could this be related to dataset size, batch processing (reads-per-batch), or something about how merged reads are handled?
Any suggestions would be really helpful.
Thanks a lot!
Nayun.