Issue with feature-classifier (single sample vs multiple samples)

I have a problem a problem with results of feature-classifier and can't figure out why is this happening.
When I'm analysing my three samples (R1, R2, R3) together (i.e. in one analysis workflow) my R3 sample is not properly classified (90% of R3 reads are classified as 'unassigned', R1 and R2 are totally fine). On the other hand if I analyse R3 sample alone, using the same workflow, then everything is classified properly and it looks similar to R1 and R2. What can cause this issue and how to fix this?

My workflow (I'm using cutadapt quality and length filtered reads):

q2cli version 2023.9.1

qiime tools import \
  --input-path $dir/ \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path $dir/demux-paired-end.qza
qiime dada2 denoise-paired \
--i-demultiplexed-seqs $dir/demux-paired-end.qza \
--p-trunc-len-f 0 \
--p-trunc-len-r 0 \
--o-representative-sequences $dir/rep-seqs.qza \
--o-table $dir/table.qza \
--o-denoising-stats $dir/stats.qza \
--p-n-threads 16

qiime metadata tabulate \
--m-input-file $dir/stats.qza \
--o-visualization $dir/stats.qzv

qiime feature-classifier classify-sklearn \
--i-classifier $db/silva-138-99-nb-classifier.qza \
--i-reads $dir/rep-seqs.qza \
--o-classification $dir/taxonomy.qza \
--p-n-jobs -1

qiime metadata tabulate \
--m-input-file $dir/taxonomy.qza \
--o-visualization $dir/taxonomy.qzv

qiime taxa barplot \
--i-table $dir/table.qza \
--i-taxonomy $dir/taxonomy.qza \
--o-visualization $dir/taxa-bar-plots.qzv

Hello @kshanowski,

Welcome to the forums!

Let's investigate each step performed.

dada2 denoise-paired makes the ASVs and counts them to make a table.

When running this with just R3 vs R1+R2+R3, how similar are the output ASVs? How similar are the feature counts for R3 in the two tables?

(We can look into taxonomy classification once we prove to ourselves that the DADA2 results are stable.)

Hello, nice to be here!

I should've mentioned that, sorry. They are very similar (difference of ~1-5 %). I don't have .qzv files with me right now, but what I recall - for example for reads left after chimeras removal - in both analysis R3 values are around 74%.

OK, I think that similarity is a good sign.

Can you post the dada2 .qza files when you have a chance? I want to isolate the problem to the dada2 step or the classification step, and to do that I want to see if the .qza files match.


Thank you so much for your time. I attach .qza files.

As I've wanted to simplify sample names, I've named them Rx, but now it actually could become more confusing, sorry for that. Anyway, the name in the analysis of the problematic sample is V3-12-21-R23.

edit: I can't attach more than two files using forums 'upload', here is the link to the files: Nextcloud

1 Like

Thanks for sharing the files!

I looked at the reads inside of rep-seqs_R1_R2_R3.qza, and ran it through vsearch dereplicate.
As expected, all reads were unique.

Then I ran vsearch dereplicate again with --strand both, which checks for identical reads in both directions (forward and reverse.)
This function found lots of hits in the reverse strand!

This means your input reads were in a mixed orientation.
classify-sklearn assumes that all reads are in the same orientation.

To fix this issue, use the RESCRIPt action orient-seqs