I noticed that sometimes, depending on the number of reads, the plugin “qiime feature-classifier classify-sklearn” produces an output where most of the features are “unidentified” and the few remaining features are assigned to just two alternative classifications.
In particular, I get very different results when I use reads from different DADA2 denoising runs: denoised reads that gave me, for example, a total of 1585260 reads (for 90 samples) and 1157 different features were classified as follows:
That's not supposed to happen! The same classifier should always report the same value for the same ASVs. Of course, using a different classifier with --i-classifier will give you different results.
Would you be willing to help us replicate this bug? If you could post the full command you ran along with the input files, we can see how it works on our systems.
Ah, this is becoming clearer. While your --i-classifier is the same, changing your dada2 --p-trunc-len-r changes where the reverse reads are truncated, which can change how the reads join, and those differences propagate downstream.
Have you compared your two --o-denoising-stats files to see how many reads were able to join, and how long, on average, the joined reads were?
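As a rough sketch of that comparison, the snippet below computes the fraction of input reads that joined in each run. It assumes the stats were exported to TSV and have `input` and `merged` columns plus a `#q2:types` row; the two tables and all their values are invented for illustration. It covers only the join counts, not the joined read lengths.

```python
import csv
import io

# Hypothetical excerpts of two exported denoising-stats tables.
# All numbers are invented; real files also carry percentage columns.
RUN_A = """sample-id\tinput\tfiltered\tdenoised\tmerged\tnon-chimeric
#q2:types\tnumeric\tnumeric\tnumeric\tnumeric\tnumeric
S1\t1000\t900\t880\t400\t390
S2\t2000\t1800\t1750\t900\t850
"""

RUN_B = """sample-id\tinput\tfiltered\tdenoised\tmerged\tnon-chimeric
#q2:types\tnumeric\tnumeric\tnumeric\tnumeric\tnumeric
S1\t1000\t950\t930\t800\t780
S2\t2000\t1900\t1880\t1600\t1550
"""


def merged_fraction(tsv_text):
    """Return total merged reads divided by total input reads."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    total_input = 0
    total_merged = 0
    for row in reader:
        # Skip the QIIME 2 metadata type-declaration row.
        if row["sample-id"].startswith("#q2:"):
            continue
        total_input += int(row["input"])
        total_merged += int(row["merged"])
    return total_merged / total_input


for name, table in (("run A", RUN_A), ("run B", RUN_B)):
    print(f"{name}: {merged_fraction(table):.2%} of input reads merged")
```

A large gap between the two runs would suggest that the truncation setting is changing how many read pairs overlap enough to join.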
Hello Colin,
I compared the two --o-denoising-stats files. After dada2 denoising and filtering I obtained the following numbers of non-chimeric sequences. So I see that if I truncate my reverse reads at 170 I obtain more sequences, perhaps because this way I keep only the better-quality portion of the reverse reads.
This is not a bug; it is most likely due to mixed read orientations. Trimming to different lengths is probably changing the inclusion or order of the sequences that are used for read orientation prediction. See here for an explanation:
Some of your reads are being poorly classified no matter the direction, so it could also be that there are many non-target reads in your samples that are interfering with the orientation detector (since the orientation is chosen based on match to a reference sequence).
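To make the idea of choosing an orientation by match to a reference concrete, here is a toy sketch. It is not the algorithm classify-sklearn uses; it only compares shared k-mers between a reference and each read in both orientations. The function names, the reference sequence, and the choice of k are all made up for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def kmers(seq, k=5):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def guess_orientation(reads, reference, k=5):
    """Toy vote: which orientation of the reads shares more k-mers with the reference?"""
    ref_kmers = kmers(reference, k)
    same_score = 0
    rc_score = 0
    for read in reads:
        same_score += len(kmers(read, k) & ref_kmers)
        rc_score += len(kmers(reverse_complement(read), k) & ref_kmers)
    return "same" if same_score >= rc_score else "reverse-complement"


# Invented reference and a read taken from it.
reference = "ACGGATTCAGGCTAACCGTATGGCA"
read = reference[:15]

print(guess_orientation([read], reference))
print(guess_orientation([reverse_complement(read)], reference))
```

In this toy, the vote depends entirely on which reads are passed in, so a different subset of reads, or reads that match the reference poorly in either orientation, can change or weaken the outcome.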
You can also specify the read orientation if this is known and you do not want classify-sklearn to choose for you.
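A minimal sketch of fixing the orientation from a script is shown below. The artifact file names are hypothetical placeholders, and the set of accepted orientation values is an assumption to verify with `qiime feature-classifier classify-sklearn --help`. The command is only executed if `qiime` is on the PATH and the input files exist.

```python
import os
import shlex
import shutil
import subprocess

# Assumed choices for --p-read-orientation; verify against the plugin's --help.
ORIENTATIONS = {"same", "reverse-complement", "auto"}


def build_classify_command(classifier, reads, output, orientation="same"):
    """Build the classify-sklearn command line with an explicit read orientation."""
    if orientation not in ORIENTATIONS:
        raise ValueError(f"unexpected read orientation: {orientation!r}")
    return [
        "qiime", "feature-classifier", "classify-sklearn",
        "--i-classifier", classifier,
        "--i-reads", reads,
        "--p-read-orientation", orientation,
        "--o-classification", output,
    ]


# Hypothetical file names; substitute your own artifacts.
cmd = build_classify_command("classifier.qza", "rep-seqs.qza", "taxonomy.qza")
print(shlex.join(cmd))

inputs_present = all(os.path.exists(p) for p in ("classifier.qza", "rep-seqs.qza"))
if shutil.which("qiime") is not None and inputs_present:
    subprocess.run(cmd, check=True)
```

Setting the orientation explicitly skips the automatic choice, so the result no longer depends on which reads happen to be used for orientation prediction.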
I added --p-read-orientation same to the classification step and the classification finally made sense, regardless of the truncation setting used during denoising and of the number of sequences I had.
My problem is solved! Thank you both for your help.