ITS2 classification: identical sequences but different results (DADA2 vs merged)

Dear supporters,
I’m relatively new to QIIME2, so apoplogies if I’m missing something obvious.

I’m working with fungal ITS2 data and ran into an issue during taxonomic classification with feature-classifier classify-sklearn. I’m using QIIME2 (q2cli version 2024.2.0, installed via conda on Linux).

Because of large length variation, I couldn’t reliably merge reads within DADA2, so I merged them externally using PEAR and BBmerge, then imported them as single-end reads into QIIME2.

So I have three datasets:

  • DADA2 paired-end
  • PEAR merged (single-end)
  • BBmerge merged (single-end)

Denoising worked fine in all cases (no truncation, only max-ee filtering), and I confirmed that some ASVs are identical in sequence and length across datasets.

I trained a classifier using the UNITE database. In the DADA2 dataset, classification works normally (including species-level). However, in PEAR and BBmerge datasets, almost everything is unassigned (only one species classified).

What’s confusing is that some ASVs that are unassigned in PEAR/BBmerge are exactly the same sequences as ASVs in the DADA2 dataset that are successfully classified. I checked sequence identity, length, and BLAST results, and they are identical.

A few additional observations:

  • DADA2 has ~4k ASVs, while PEAR/BBmerge have ~15k
  • Confidence scores for the same ASV are slightly lower in merged data (~95% vs ~85%)
  • When I test a single sample, PEAR/BBmerge classification works fine
  • When using the full dataset, the issue reappears

Since identical sequences are classified differently depending on the dataset size, I suspect this may be related to batch processing or classifier behavior rather than sequence differences.

There are no error messages during execution.

Here are the commands I used:

  1. DADA2
qiime dada2 denoise-paired \
--i-demultiplexed-seqs paired-end.qza \
--p-trunc-len-f 0 \
--p-trunc-len-r 0 \
--p-max-ee-f 5.0 \
--p-max-ee-r 8.0 \
--p-min-overlap 12 \
--p-n-threads 0 \
--o-table dada2_table.qza \
--o-representative-sequences dada2_rep_seq.qza \
--o-denoising-stats dada2_stats.qza

qiime feature-classifier classify-sklearn \
-i-classifier classifier.qza \
--i-reads dada2_rep_seq.qza \
--p-n-jobs 1 \
--p-reads-per-batch 1000 \
--o-classification dada2_tax.qza
  1. PEAR
pear -f "$r1_file" -r "$r2_file" -o "$RESULT_DIR/$sample_id" \
-v 20 \
-m 500 \
-n 50 \
-p 0.001 \
-u 0 \
-j 10 \
-y 8G

qiime dada2 denoise-single \
--i-demultiplexed-seqs pear_merged.qza \
--p-trim-left 0 \
--p-trunc-len 0 \
--p-max-ee 5.0 \
--o-table pear_table.qza \
--o-representative-sequences pear_rep_seq.qza \
--o-denoising-stats pear_stats.qza

qiime feature-classifier classify-sklearn \
--i-classifier classifier.qza \
--i-reads pear_rep_seq.qza \
--p-n-jobs 1 \
--p-reads-per-batch 1000 \
--o-classification pear_tax.qza

  1. BBmerge
bbmerge.sh in1="$r1_file" in2="$r2_file" \
out="${OUTPUT_DIR}/${sample_id}_merged.fastq.gz" \
outu="${OUTPUT_DIR}/${sample_id}_unmerged.fastq.gz" \
mininsert=50 \
maxlength=500 \
k=62 \
extend2=50 \
iterations=5 \
vstrict=f \
loose=t \
threads=10

qiime dada2 denoise-single \
--i-demultiplexed-seqs bb_merged.qza \
--p-trim-left 0 \
--p-trunc-len 0 \
--p-max-ee 5.0 \
--o-table bb_table.qza \
--o-representative-sequences bb_rep_seq.qza \
--o-denoising-stats bb_stats.qza

qiime feature-classifier classify-sklearn \
--i-classifier classifier.qza \
--i-reads bb_rep_seq.qza \
--p-n-jobs 1 \
--p-reads-per-batch 1000 \
--o-classification bb_tax.qza

At this point I’m not sure what could be causing this difference.

Could this be related to dataset size, batch processing (reads-per-batch), or something about how merged reads are handled?

Any suggestions would be really helpful.

Thanks a lot!
Nayun.

Hello Nayun,

Welcome to the forums! :qiime2: This is a great first post, very detailed. :magnifying_glass_tilted_right:

So I have three datasets

I have been there too. This is hard, especially when the wet lap does not match exactly.

I'm glad you were able to get the paired end reads to merge (eventually!) and imported into Qiime2. Better something than nothing!

DADA2 produces stable ASVs, so as long as they all make it through DADA2, and you can trim the amplicons to exactly the same length (so their hashes match), you should be good to go!

And yet... that's not the case!

What’s confusing is that some ASVs that are unassigned in PEAR/BBmerge are exactly the same sequences as ASVs in the DADA2 dataset that are successfully classified. I checked sequence identity, length, and BLAST results, and they are identical.

Like, 100% identical in the area of overlap? Because a 1 bp overhang or gap would mean that the ASVs are technically different!

Try this: the qiime feature-table merge-seqs command:

# maybe like this
qiime feature-table merge-seqs \
  --i-data bb_rep_seq.qza dada2_rep_seq.qza pear_rep_seq.qza \
  --o-merged-data all_rep_seq.qza

qiime feature-table summarize \
  --i-table all_rep_seq.qza \
  --o-summary all_rep_seq.qzv

qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads all_rep_seq.qza \
  --p-n-jobs 1 \
  --p-reads-per-batch 1000 \
  --o-classification all_rep_seq_tax.qza

There's a lot more questions to answer, so let's start here.

1 Like

Hello Colin,
Thanks for the warm welcome, and I appreciate your suggestion!

I followed your approach and wanted to share an update.

I merged all representative sequences and re-ran the classification:

qiime feature-table merge-seqs \
  --i-data bb_rep_seq.qza dada2_rep_seq.qza pear_rep_seq.qza \
  --o-merged-data all_rep_seq.qza

qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads all_rep_seq.qza \
  --p-n-jobs 1 \
  --p-reads-per-batch 1000 \
  --o-classification all_tax.qza

After visualizing the results (qzv), classification appears to work much better in this merged dataset:

  • Total sequences: 17,639
  • Assigned to Fungi: 3,989
  • Species-level assignments: 1,569 (including 231 unknown sp.)

So in this combined context, classification seems to behave more reasonably.

However, I’m still a bit confused about the inconsistency:

  • The same sequences remain unassigned in the PEAR/BBmerge datasets when processed independently
  • But they become classified when included in the merged dataset

Additionally, I tested splitting the dataset physically into smaller FASTA files (e.g., ~4,000 sequences per file), and classification also worked in that case.

In contrast:

  • Using --p-reads-per-batch 100 did not resolve the issue
  • Only physical splitting or merging datasets changed the outcome

Given this, it seems like classification results are influenced by the overall dataset composition rather than sequence identity alone.

Do you think this could be related to how the classifier handles heterogeneous sequence length distributions or dataset context?

Thanks again for your help!
Nayun

Great! We are making good progress!

When only looking at the end taxonomy, it's hard to see what happened up stream.

Let's see if the ASVs were identical this whole time!

How many sequences are in each of those rep_seq.qza files?
After they are merged, are there fewer because they are the same and they merged?
Or are there more ASVs in the all_rep_seq.qza table because the ASVs were NOT the same?

I want to check if the ASVs match because I think the taxonomy strangeness is it's own issue:


Sort of. Here are the documentation for this plugin:

read_orientation auto will autodetect orientation based on the confidence estimates for the first 100 reads

So the first 100 features / ASV sequences can confuse the classifier, which will mess up the taxonomy classification results. Try 'same' or 'both' instead.

Hello Colin,

Thanks again for your helpful suggestion — I followed your approach and checked the ASV counts across datasets.

Here’s what I found:

  • DADA2: 4,713 ASVs
  • BBmerge: 15,192 ASVs
  • PEAR: 16,340 ASVs
  • Merged (all rep. seq.): 17,639 ASVs

After merging, the total number of ASVs increased rather than decreased, suggesting that many ASVs are not identical across datasets (even if some sequences match exactly). So it seems the datasets are contributing additional unique variants rather than collapsing into shared sequences.

Regarding read orientation, I noticed that my QIIME2 version (2024.2.0) does not support the both option, so I tested using same instead of the default auto.

Using same substantially improved classification:

  • PEAR (same): 1,408 species-level assignments within Fungi (213 unknown species)
  • BBmerge (same): 1,372 species-level assignments within Fungi (207 unknown species)

These results are now consistent across datasets and much more comparable to what I observed in the merged dataset.

For comparison, when I previously classified split subsets (~4,000 sequences per file), I obtained slightly fewer assignments (~1,294 species-level, 195 unknown), which I suspect reflects minor sensitivity of the classifier to dataset composition for borderline sequences.

Overall, this suggests that the main issue was indeed related to orientation detection with auto, rather than sequence identity itself. After fixing orientation, the classification results are stable and consistent across different datasets.

Best,
Nayun

1 Like

Great! I'm glad to hear that 'auto' was causing problems and using 'same' fixed it!

Hey, before you go! Check this out:

After merging, the total number of ASVs increased rather than decreased, suggesting that many ASVs are not identical across datasets (even if some sequences match exactly).

:warning: There's a contradiction in here! Do you see it? :thinking:

Hints:

  • if all the ASVs were different, how many would you have after merging?
  • if all the ASVs were the same, how many would you have after merging?
  • What's the smallest different between two ASVs that they would still merge?
  • If two ASVs 'match exactly' what would cause their sequences not to merge?

Hello Colin,

You’re right — that statement was inaccurate.

Since ASVs are defined by exact sequence identity, identical sequences should collapse into the same feature after merging.

Based on the merged ASV count (17,639), which is lower than the sum of all datasets but higher than any individual dataset, it appears that there is partial overlap: some sequences are identical across datasets, but many differ by small variations (e.g., single nucleotide differences or length variation) and are therefore retained as distinct ASVs.

Thanks for pointing that out.

Best regards,
Nayun

1 Like