Slow performance of feature-classifier extract-reads

I’m having the same issue described in the threads "Extract reference reads" and "How necessary is feature-classifier extract-reads?". The command below took about 8 seconds to trim three sequences. I’m relieved to hear there is a workaround (simply skipping this step), but I wanted to add my feedback. My amplicon locus is eukaryotic COI.

time qiime feature-classifier extract-reads \
  --i-sequences Three_Seqs_COI_seqs.qza \
  --o-reads Three_Seqs_COI_seqs_trimmed_BE.qza
Saved FeatureData[Sequence] to: Three_Seqs_COI_seqs_trimmed_BE.qza

real     0m7.879s
user     0m5.425s
sys      0m0.858s

Hi @Luke_Thompson,

Much of that time is spent reading and writing the sequences, and the total would not scale linearly with the number of sequences, so a three-sequence run is not a very useful estimate of runtime on a full reference database. But extract-reads can indeed be slow; we get this complaint particularly for COI (perhaps because the raw reference reads are longer?).

And indeed, trimming is recommended but not required. We see a small accuracy improvement for 16S, but none for ITS. I have not tested COI but don’t expect dramatically different results. Of course, even though extract-reads may be time-consuming now, using trimmed reference sequences can speed up downstream steps, particularly if you are using an alignment-based classifier.

Thanks for the feedback!

Thanks @Nicholas_Bokulich! I wonder if the slowdown with COI is due to the high degeneracy of the primer sequences. Another user raised this possibility before, and I think it makes sense, as the COI primers are much more degenerate than the 16S and 18S primers where extract-reads runtimes were much shorter.


Very good point: degeneracy will slow this down, so that would explain the longer runtimes (length could also contribute). Thanks!
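
To put a rough number on "more degenerate", here is a minimal sketch that counts how many concrete sequences a degenerate primer stands for, using the standard IUPAC nucleotide codes. The two primers are hypothetical placeholders, not the COI or 16S primers discussed in this thread, and the helper names `degeneracy` and `expand` are made up for illustration. Whether extract-reads runtime actually grows in proportion to this count depends on its implementation, which is not checked here.

```python
from itertools import product
from math import prod

# IUPAC nucleotide codes mapped to the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def degeneracy(primer: str) -> int:
    """Number of distinct non-degenerate sequences the primer encodes."""
    return prod(len(IUPAC[base]) for base in primer.upper())


def expand(primer: str):
    """Yield every concrete sequence encoded by a degenerate primer."""
    for combo in product(*(IUPAC[base] for base in primer.upper())):
        yield "".join(combo)


# Hypothetical primers, for illustration only (not the ones used in this thread).
low = "ACGTRYACGT"              # two 2-fold positions
high = "GGNACNGGRTGYACNGTNTA"   # four 4-fold and two 2-fold positions

print(low, degeneracy(low))
print(high, degeneracy(high))
```

The count is the product of the per-position options, so every added N multiplies it by four. A primer pair with several N positions can therefore represent hundreds or thousands of concrete sequences, while a pair with only a couple of 2-fold positions represents a handful.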

