`qiime feature-classifier extract-reads` takes a long time

Hi,

I’ve been running `qiime feature-classifier extract-reads` for over 18 hours on a custom dataset with ~3M sequences. I’ve added the `--verbose` option, but there’s no output and no debugging info. I was wondering if it will ever finish.

Thanks,
Cornel

You should check your system activity (e.g., with Activity Monitor on a Mac). If you do not receive an error message, it is almost certainly still running.

3M strikes me as a very large number of sequences to trim/train. You should dereplicate or even cluster those sequences prior to trimming/training a classifier. The extract-reads method was written with slimmer reference databases in mind — as dereplication/clustering is typically used to reduce redundancy in these datasets (though perhaps you had planned to dereplicate after trimming, which also makes sense).
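
In case it is useful, here is a minimal Python sketch of the kind of dereplication I mean. It is not how QIIME 2 or any plugin does it, just an illustration; the function name `dereplicate` and the `(seq_id, taxonomy, sequence)` record layout are assumptions made up for this example. It keeps the first record seen for each distinct (taxonomy, sequence) pair.

```python
def dereplicate(records):
    """Keep the first record for each distinct (taxonomy, sequence) pair.

    `records` is an iterable of (seq_id, taxonomy, sequence) tuples.
    This layout is hypothetical; adapt it to however your data is stored.
    """
    seen = set()
    kept = []
    for seq_id, taxonomy, sequence in records:
        # Compare sequences case-insensitively so "acgt" and "ACGT" match.
        key = (taxonomy, sequence.upper())
        if key not in seen:
            seen.add(key)
            kept.append((seq_id, taxonomy, sequence))
    return kept


records = [
    ("seq1", "k__Bacteria; p__Firmicutes", "ACGTACGT"),
    ("seq2", "k__Bacteria; p__Firmicutes", "acgtacgt"),      # duplicate of seq1
    ("seq3", "k__Bacteria; p__Proteobacteria", "ACGTACGT"),  # same sequence, different taxonomy
]
unique_records = dereplicate(records)
```

Note that identical sequences with different taxonomy labels are kept as separate records here, so this reduces the reference less than dereplicating on sequence alone would.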

I hope that helps!

The process takes 100% of a CPU/core, so it’s doing something, but there’s nothing indicating that it’s actually processing any data. I was hoping the --verbose option would give me some clues.

Yes, I’ve dereplicated them (there are no duplicate (organism, sequence) pairs). I’m not quite sure how I’d cluster them first.

Thanks,
Cornel

Sounds like it is still working. You should get an error message if the job actually crashes (e.g., due to a memory error).

Verbose does not print any output for this particular job. It does for some QIIME2 commands, but for most of them its main purpose is to print error messages to the standard output (as opposed to an error log).

You could use q2-vsearch to cluster these — though that would just cluster the sequences, not the taxonomies. Deciding how to collapse taxonomies by majority or consensus labels could get complicated, so there’s not really an easy way (certainly not in QIIME2).
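
To give a sense of what collapsing taxonomies involves, here is a rough Python sketch of one possible majority rule. It is not part of QIIME 2 or q2-vsearch; the function name `consensus_taxonomy`, the `min_fraction` parameter, and the semicolon-delimited lineage format are all assumptions for illustration. For the members of one cluster, it walks down the ranks and keeps a label only while at least `min_fraction` of all members agree with the lineage so far.

```python
from collections import Counter


def consensus_taxonomy(lineages, min_fraction=0.51):
    """Return the deepest lineage prefix shared by >= min_fraction of members.

    `lineages` is a list of semicolon-delimited taxonomy strings for the
    sequences in one cluster (a hypothetical input format).
    """
    split = [[rank.strip() for rank in lineage.split(";")] for lineage in lineages]
    total = len(split)
    consensus = []
    candidates = split
    depth = 0
    while True:
        # Count the labels at this rank among members that still match.
        labels = Counter(l[depth] for l in candidates if len(l) > depth)
        if not labels:
            break
        label, count = labels.most_common(1)[0]
        # Support is measured against the whole cluster, not just survivors.
        if count / total < min_fraction:
            break
        consensus.append(label)
        candidates = [l for l in candidates if len(l) > depth and l[depth] == label]
        depth += 1
    return "; ".join(consensus)


cluster = [
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "k__Bacteria; p__Firmicutes; c__Clostridia",
    "k__Bacteria; p__Firmicutes; c__Bacilli",
]
majority_label = consensus_taxonomy(cluster)                  # majority rule
strict_label = consensus_taxonomy(cluster, min_fraction=1.0)  # strict consensus
```

With `min_fraction=1.0` this becomes a strict consensus (the label is truncated at the first rank where members disagree), while lower thresholds let the majority win. Which rule is appropriate depends on your data, which is part of why this gets complicated.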

If you’ve already dereplicated then there is not much more that can be done. It sounds like your job is still running, but 3M is a lot of sequences compared to the standard reference databases that we generally use with this command (e.g., SILVA and Greengenes, which have been dereplicated and clustered). You may just need to keep waiting on this job…

good luck!
