Parallelize feature-classifier extract-reads to increase speed?

Thank you for your time,

I’d like to preface this by saying that I am a novice user of QIIME 2 and other bioinformatics techniques, and my ideas likely reflect this. However, I genuinely enjoy speeding up computational tasks and would be thrilled to help improve the speed of QIIME 2.

It strikes me that the qiime feature-classifier extract-reads command uses only 1 CPU core and thus takes a very long time to execute. Can we increase the number of cores to speed up the process? I think this would be an excellent contribution to QIIME 2, especially if training classifiers improves the accuracy of taxonomic assignment (i.e., many researchers may want to apply custom-trained classifiers). I know read/write speeds can be a bottleneck for some operations, but they don’t seem to be here. Additionally, read/write speeds are increasing with advancements in SSD technology such as NVMe.

My idea is as follows:

  1. Take the reference sequence file and divide it into multiple smaller parts. The number of parts would be based on the number of desired cores. Each split would fall at the end of the reverse primer. For a 4-core processor, this would result in 4 files of approximately equal size (e.g., ref-seq-1.fasta, ref-seq-2.fasta, ref-seq-3.fasta, and ref-seq-4.fasta).
  2. Analyze each of the files on a separate core by using GNU ‘parallel’ to run multiple qiime feature-classifier extract-reads commands at once. The output would consist of 4 files whose names mimic the input names (e.g., trimmed-ref-seq-1.fasta, trimmed-ref-seq-2.fasta, trimmed-ref-seq-3.fasta, and trimmed-ref-seq-4.fasta).
  3. Concatenate the files together keeping the original order. This could employ the ‘cat’ command (e.g., ‘cat trimmed-ref-seq-1.fasta trimmed-ref-seq-2.fasta trimmed-ref-seq-3.fasta trimmed-ref-seq-4.fasta’).
  4. Import the final file back into qiime.
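
The split and concatenate steps (1 and 3) can be sketched in Python as follows. This is only a minimal illustration, not anything QIIME 2 provides: the function names are hypothetical, and it cuts at FASTA record boundaries (lines starting with `>`), which is an assumption about where a safe split point lies rather than the primer-based split proposed above.

```python
def split_fasta(text, n_parts):
    """Split FASTA text into at most n_parts chunks, cutting only between records."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line])       # a header line starts a new record
        elif records:
            records[-1].append(line)     # sequence lines belong to the current record
    # Ceiling division so that every record lands in some chunk.
    size = max(1, -(-len(records) // n_parts))
    chunks = []
    for start in range(0, len(records), size):
        group = records[start:start + size]
        chunks.append("\n".join("\n".join(rec) for rec in group) + "\n")
    return chunks


def concatenate(chunks):
    """Join chunks back together in their original order (the 'cat' step)."""
    return "".join(chunks)


example = ">seq1\nACGT\n>seq2\nGGCC\nTTAA\n>seq3\nAAAA\n>seq4\nCCCC\n>seq5\nGGGG\n"
parts = split_fasta(example, 4)
```

Because chunks are built and joined in input order, concatenating them reproduces the original record order, which is the property step 3 depends on.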

If this works, it should reduce the processing time quite dramatically (ideally about 4 times faster with 4 cores, 20 times faster with 20 cores, etc., assuming the work scales close to linearly). I would like to create this functionality. Am I missing something major?

Thank you again for your time and consideration. I really appreciate the Qiime resources. Without the tutorials and forum answers, I would have had a very hard time characterizing my microbial samples.

Welcome to the forum and thanks for the feedback @Stephan_Bitterwolf!

extract-reads is already parallelizable: the latest releases have an n-jobs parameter to run multiple processes in parallel. The way this is done is pretty similar to what you describe, but it uses joblib to split the data into chunks and process these in parallel.
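
As a rough illustration of that chunk-and-map pattern (not the plugin's actual code), here is a small sketch using only the Python standard library as a stand-in for joblib. The names `chunked`, `trim_chunk`, and `parallel_trim` are hypothetical, and the "trimming" is a placeholder for real primer-based extraction. Threads are used here only to keep the sketch simple; CPU-bound work would need separate processes to see a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, n_chunks):
    """Split a list into at most n_chunks contiguous pieces of similar size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def trim_chunk(seqs):
    """Placeholder for read extraction: drop two bases from each end."""
    return [s[2:-2] for s in seqs]


def parallel_trim(seqs, n_jobs):
    """Process chunks concurrently and return results in the input order."""
    chunks = chunked(seqs, n_jobs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Executor.map yields results in the order the chunks were submitted.
        results = list(pool.map(trim_chunk, chunks))
    return [s for chunk in results for s in chunk]


trimmed = parallel_trim(["AACCGGTT", "TTGGCCAA", "ACGTACGT"], n_jobs=2)
```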

Is that basically what you had in mind? Are you using an older release of q2-feature-classifier? Or maybe you have multi-node parallelization in mind?

We always appreciate contributions from the community, and there are other ways you could get involved and contribute to QIIME 2: other QIIME 2 plugins could be enhanced by parallelizing some steps, and there are many open issues if you want to put out other fires!

Thanks @Stephan_Bitterwolf!


@Nicholas_Bokulich Thank you so much for your detailed response. This is exactly what I was searching for!

I was following the tutorial on feature classifier training. In this tutorial, there was no mention of the parallelization option. I would be happy to add that detail to the page if it is possible for me to do so.

With --p-n-jobs set to 20, it took only 6 minutes to extract reads from the silva-138-99-seqs.qza file (on one core it took much longer). It is working well!

I have some other optimization questions but I will ask them in a different forum location. Thank you again for your time and helpful answer :smiley:


That would be great! You are welcome to submit a pull request to https://github.com/qiime2/docs/ with your proposed changes and one of the team can review/suggest edits there. Please read the “contributing to the docs” section on that page first.

You should mention parallelization in a note on the page, but not as a code block — that way nobody will mistakenly run extract-reads with more jobs than they have CPUs! If you like, you could also alter the existing example code block to have --p-n-jobs 1 to make the usage transparent.
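
For anyone scripting this, one way to guard against requesting more jobs than there are CPUs is sketched below. The helper names and file names are hypothetical, and the primer values are placeholders. Apart from --p-n-jobs, the option names are given as I recall them from the classifier training tutorial, so please check them against `qiime feature-classifier extract-reads --help`. The sketch only builds the argument list and does not run QIIME 2.

```python
import os


def capped_n_jobs(requested, available=None):
    """Clamp the requested job count to the range [1, available CPUs]."""
    if available is None:
        available = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(requested, available))


def extract_reads_command(n_jobs):
    """Build (but do not run) an extract-reads command line as a list of arguments."""
    return [
        "qiime", "feature-classifier", "extract-reads",
        "--i-sequences", "ref-seqs.qza",       # hypothetical input artifact
        "--p-f-primer", "FORWARD_PRIMER",      # placeholder, not a real primer
        "--p-r-primer", "REVERSE_PRIMER",      # placeholder, not a real primer
        "--p-n-jobs", str(n_jobs),
        "--o-reads", "extracted-reads.qza",    # hypothetical output artifact
    ]


command = extract_reads_command(capped_n_jobs(20, available=8))
```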

Thanks @Stephan_Bitterwolf!
