qiime feature-classifier extract-reads successfully extracts different number of reads each time

I have prepared an amplicon-region specific classifier with the SILVA database (V3V4) as described in the RESCRIPt tutorial here.

I also added in steps in between to count the number of sequences remaining in the reference database after each curation step (culling low quality, length filter, amplicon region extraction, dereplication) and noticed that each time I run qiime feature-classifier extract-reads I get a slightly different number of sequences left. I thought that the extraction step would be deterministic so am wondering why this doesn't appear to be the case? Is it possible to set a seed if not so it's the same every time?

I am using the 515F and 806R primers, which do contain degenerate bases if that might be a factor.

I'm using the qiime2-amplicon version 2024.10, installed with conda. I've also tried it with different reference databases (GTDB) and amplicon regions (V4) and each time I get a slightly different number of extracted reads.

Thanks in advance!

1 Like

Hi @maisey,

We are under the same impression as you and are investigating.

Will follow up shortly.

2 Likes

Hi @maisey,

Just an update: we've replicated the issue and understand whats going on.

In short: it seems that alignments at the very threshold of your set identity parameter will be non-deterministically dropped depending on the order of expansion of the degenerate bases.

The order of the expanded degenerate sequence (into N non-degenerate ones) isn't strictly defined and will change between Python sessions. So, when the alignment is especially poor, the first one is used which may or may not fail your identity threshold (whereas if a different degenerate expansion had been first, we may see a different result for that read).

To fix in the interim, you can increase --p-identity until the results are stable. I used 0.95 instead of the default 0.8 but I have no special justification for either value.

We're going to discuss this a bit more internally to see how much this impacts things downstream, but I don't think it should be particularly different as this can only happen to alignments that have degenerate expansions that are all bad, but which fall on the edge of the identity threshold.

6 Likes

Thanks for the speedy reply and explanation! I'll try increasing the --p-identity threshold and see if that helps!