Slow performance of feature-classifier extract-reads

I’m having the same issue as described in "Extract reference reads" and "How necessary is feature-classifier extract-reads?". The command below took 8 seconds to trim three sequences. I’m relieved to hear that simply skipping this step is a valid workaround, but I wanted to add my feedback. My amplicon locus is eukaryotic COI.

time qiime feature-classifier extract-reads \
  --i-sequences Three_Seqs_COI_seqs.qza \
  --p-f-primer CCDGAYATRGCDTTYCCDCG \
  --p-r-primer GTRATDGCDCCDGCDARDAC \
  --o-reads Three_Seqs_COI_seqs_trimmed_BE.qza
Saved FeatureData[Sequence] to: Three_Seqs_COI_seqs_trimmed_BE.qza

real     0m7.879s
user     0m5.425s
sys      0m0.858s

Hi @Luke_Thompson,

Much of that time is spent reading and writing the sequences, so the runtime would not scale linearly with the number of sequences, and an 8-second run on three sequences is not a very useful estimate. But extract-reads can indeed be slow. We get this complaint for COI in particular (perhaps because the raw reference reads are longer?).

And indeed, trimming is recommended but not required. We see a small accuracy improvement for 16S, but none for ITS. I have not tested COI but don't expect dramatically different results. Of course, even though extract-reads may be time-consuming now, using trimmed reference sequences can speed up downstream steps, particularly if you are using an alignment-based classifier.

Thanks for the feedback!

Thanks @Nicholas_Bokulich! I wonder if the slowdown with COI is due to the high degeneracy of the primer sequences. Another user raised this possibility before, and I think it makes sense, as the COI primers are much more degenerate than the 16S and 18S primers where extract-reads runtimes were much shorter.
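
To put a rough number on that degeneracy, here is a minimal Python sketch that counts how many concrete sequences each primer from the command above stands for, using the standard IUPAC nucleotide codes. The function name `degeneracy` and the lookup table are my own for illustration. This says nothing about how extract-reads matches primers internally; it only quantifies how degenerate the primers are.

```python
# Number of concrete bases each IUPAC nucleotide code can represent.
IUPAC_COUNTS = {
    "A": 1, "C": 1, "G": 1, "T": 1, "U": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def degeneracy(primer: str) -> int:
    """Return how many non-degenerate sequences a primer expands to.

    Hypothetical helper for illustration; raises KeyError on characters
    that are not IUPAC nucleotide codes.
    """
    total = 1
    for base in primer.upper():
        total *= IUPAC_COUNTS[base]
    return total


if __name__ == "__main__":
    # The two COI primers from the extract-reads command in this thread.
    primers = {
        "forward": "CCDGAYATRGCDTTYCCDCG",
        "reverse": "GTRATDGCDCCDGCDARDAC",
    }
    for name, seq in primers.items():
        print(f"{name}: {seq} -> {degeneracy(seq)} possible sequences")
```

For these primers the counts come out to 216 (forward) and 972 (reverse). Running the same function on your 16S or 18S primers would give a like-for-like comparison.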

Very good point. Degeneracy will slow this down, so that would explain the longer runtimes (length could also contribute). Thanks!
