feature-classifier fit-classifier-naive-bayes optimisation

David_Pearton · July 15, 2019, 8:48am

Hello,

I am running into an issue with training my classifiers. I am using qiime2 2019.4 and running it on a node in a cluster (CentOS Linux release 7.3.1611). Each node has 24 cores and 128 GB (and I've requested all of these).

I'm using the MIDORI COI unique reference set, which is very large. The issue is that this takes over 12 hours to run and I keep running out of wall time (for some reason I can't get it to run as a batch process, so I've been running it in an interactive shell with a max walltime of 12 hours).

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads MIDORI_UNIQUE_20180221_COI.qza \
  --i-reference-taxonomy MIDORI_UNIQUE_20180221_COI-ref-taxonomy.qza \
  --o-classifier MIDORI_UNIQUE_20180221_COI-classifier.qza \
  --verbose

I'm assuming that the process looks at the number of cores available and scales the jobs accordingly. I looked at the forum and previous posters suggested using the --p-n-jobs parameter, but this does not appear to be an option in my version of qiime2... I know I'm not running out of memory and it did work with a smaller database (MIDORI UNIQUE).

Is there any way to optimise threading or chunk size so it is able to run in a shorter time?

Thanks in advance for any suggestions.
Dave

Nicholas_Bokulich · July 15, 2019, 11:33am

Request more time. This command usually takes less time (e.g., ~1hr on 16S and ITS databases that I've worked with) but will take more with a very large database.

No, this step is unfortunately not parallelizable.

--p-n-jobs is a parameter for the classify-sklearn method, not for fitting the classifier.

Unfortunately there is no way to tune threading or chunk size for this step in the current version of QIIME 2; you will just need to give it more time. The good news is that once the classifier is trained you can keep re-using it, and the classification step can be parallelized/optimized.
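To show where the parallelism option sits, here is a minimal sketch that builds (but does not run) a classify-sklearn call with --p-n-jobs. The input and output names rep-seqs.qza and taxonomy.qza are placeholders I made up, and the helper function is purely illustrative; check `qiime feature-classifier classify-sklearn --help` in your version before relying on it.

```python
import shlex


def classify_command(classifier, reads, output, n_jobs=1):
    """Build the argument list for classify-sklearn (nothing is executed here)."""
    return [
        "qiime", "feature-classifier", "classify-sklearn",
        "--i-classifier", classifier,
        "--i-reads", reads,
        "--p-n-jobs", str(n_jobs),
        "--o-classification", output,
    ]


# rep-seqs.qza and taxonomy.qza are hypothetical file names.
cmd = classify_command(
    "MIDORI_UNIQUE_20180221_COI-classifier.qza",
    "rep-seqs.qza",
    "taxonomy.qza",
    n_jobs=24,
)
print(shlex.join(cmd))
```

The printed line can be pasted into a job script; the training command itself takes no such option.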

David_Pearton · July 15, 2019, 11:55am

Hi,

Thanks for the reply. I managed to wrangle more time and tried to rerun the job. Unfortunately, it does appear that the issue is memory: I get a memory error after a short running time. The node I'm using has 128 GB of RAM and that doesn't appear to be enough.

I will try with a fat node with 1TB RAM and see how that works...

Cheers,
Dave

Nicholas_Bokulich · July 15, 2019, 12:08pm

Yikes! Midori must be a massive database. You may want to consider dereplicating or clustering this database somehow to remove redundant records. There is not really a way to do this in QIIME 2 yet (well, you can cluster the sequences but not the taxonomies), but you can see some of the discussion in this post for some ideas:

If you are interested in doing that, I actually just thought of a way to do this with QIIME 2:

1. Use q2-vsearch to cluster the reference sequences (or dereplicate).
2. Use q2-feature-classifier's classify-consensus-vsearch to assign taxonomy to those sequences based on consensus taxonomy classification. If you use the same percent identity for clustering and taxonomy, then the consensus taxonomy will be assigned using more or less the same sequences that were clustered.
3. You have your new reference sequences and taxonomy to use for classification!

It will take time, but this process can be parallelized.
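To make the idea concrete, here is a toy sketch in plain Python. It is not what vsearch does: it only collapses exactly identical sequences (no percent-identity clustering) and gives each representative the last common ancestor of the taxonomies it absorbed. The IDs, sequences, and taxonomy strings are invented for illustration.

```python
from collections import defaultdict


def lca(taxa):
    """Return the leading ranks shared by all semicolon-delimited taxonomy strings."""
    split = [[rank.strip() for rank in taxon.split(";")] for taxon in taxa]
    shared = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return "; ".join(shared)


def dereplicate(seqs, taxonomy):
    """Collapse identical sequences; keep the first ID and a consensus taxonomy."""
    groups = defaultdict(list)
    for seq_id, seq in seqs.items():
        groups[seq].append(seq_id)
    new_seqs, new_tax = {}, {}
    for seq, ids in groups.items():
        representative = ids[0]
        new_seqs[representative] = seq
        new_tax[representative] = lca([taxonomy[i] for i in ids])
    return new_seqs, new_tax


# Invented toy records: "a" and "b" share a sequence but differ at genus level.
seqs = {"a": "ACGT", "b": "ACGT", "c": "TTGA"}
taxonomy = {
    "a": "k__Animalia; p__Mollusca; g__Haliotis",
    "b": "k__Animalia; p__Mollusca; g__Patella",
    "c": "k__Animalia; p__Annelida; g__Nereis",
}
new_seqs, new_tax = dereplicate(seqs, taxonomy)
print(new_tax)
```

Records "a" and "b" collapse into one entry labelled only to phylum, which is the trade-off of any consensus approach: fewer redundant records, but conflicting labels are truncated.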

devonorourke · July 15, 2019, 12:18pm

I'm not here to say that your issue is or is not RAM related @David_Pearton, but I can tell you that I was able to train classifiers on my own custom COI databases using qiime feature-classifier fit-classifier-naive-bayes with a single 128 GB node on our cluster. One reference set had about 1.6 million sequences, the other about 2.1 million. Both took around 16-24 hours.

How many sequences are in MIDORI_UNIQUE_20180221_COI?

You might also try using Terry Porter's database (paper here, database here), also derived from NCBI, but slightly more curated. It's not dereplicated though.

David_Pearton · July 19, 2019, 7:12am

Hi Devon,

Thank you for the feedback.

I am also unsure why I had a memory error. It doesn't make sense to me. I eventually got it to run on the fat node (1TB RAM) - it took 31 hours but only used 75GB of memory according to the job report...

There are only 927386 sequences in the dataset, but they are a mix of sizes. I have (eventually) managed to do an "extract reads" to cut it down to the region targeted by the primers I used and will see whether this (a) helps with accuracy and (b) speeds up the training.
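For anyone wanting to check the same numbers on their own reference, a minimal sketch of counting sequences and summarising lengths from a FASTA file follows. The parser and the toy input are my own; on real data you would open the FASTA exported from the .qza (for example with `qiime tools export`) instead of the in-memory demo.

```python
import io
import statistics


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_summary(handle):
    """Count sequences and report min, median, and max length."""
    lengths = [len(seq) for _, seq in read_fasta(handle)]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


# Toy input standing in for a real FASTA file.
demo = io.StringIO(">s1\nACGTAC\nGT\n>s2\nACG\n>s3\nACGTACGTACGT\n")
summary = length_summary(demo)
print(summary)
```

A wide gap between the minimum and maximum length is a hint that trimming to the amplicon region could shrink the training input considerably.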

I will have a look at the links you provided - thank you.

I am also trying to build a more targeted training database from BOLD - we are working on marine benthos so are not interested in terrestrial vertebrates or invertebrates so that might help improve things.

Is there any convenient way of making a QIIME 2 taxonomy file from sequences derived from BOLD? I'm sure there must be a way to do this, but I'm not a programmer. Does anyone know of a script to do this?

Thanks,
Dave

Nicholas_Bokulich · July 24, 2019, 8:26pm

What format are the BOLD taxonomies in? Is there a separate taxonomy file or do these appear in the header line of the FASTA?
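If the answer turns out to be a tab-separated table with one column per rank, the conversion could look roughly like the sketch below. The column names (processid, phylum_name, class_name, order_name, family_name, genus_name, species_name, nucleotides) are an assumption about the BOLD export and need checking against the actual file; the demo records are invented. The output is a FASTA file plus a two-column taxonomy table with a "Feature ID" / "Taxon" header.

```python
import csv
import io

# Assumed BOLD column names paired with rank prefixes; adjust to the real export.
RANKS = [
    ("phylum_name", "p__"), ("class_name", "c__"), ("order_name", "o__"),
    ("family_name", "f__"), ("genus_name", "g__"), ("species_name", "s__"),
]


def convert(bold_tsv, fasta_out, taxonomy_out):
    """Write FASTA and taxonomy TSV from a BOLD-style table; return records written."""
    reader = csv.DictReader(bold_tsv, delimiter="\t")
    taxonomy_out.write("Feature ID\tTaxon\n")
    written = 0
    for row in reader:
        # Strip alignment gaps; skip records that carry no sequence.
        seq = (row.get("nucleotides") or "").replace("-", "").upper()
        if not seq:
            continue
        taxon = "; ".join(
            prefix + (row.get(column) or "").strip() for column, prefix in RANKS
        )
        fasta_out.write(f">{row['processid']}\n{seq}\n")
        taxonomy_out.write(f"{row['processid']}\t{taxon}\n")
        written += 1
    return written


# Invented demo records; DEMO002 has no sequence and is skipped.
demo = io.StringIO(
    "processid\tphylum_name\tclass_name\torder_name\tfamily_name"
    "\tgenus_name\tspecies_name\tnucleotides\n"
    "DEMO001\tMollusca\tGastropoda\t\tHaliotidae\tHaliotis\tHaliotis midae\tACGT--ACGT\n"
    "DEMO002\tAnnelida\tPolychaeta\t\t\t\t\t\n"
)
fasta, tax = io.StringIO(), io.StringIO()
n_written = convert(demo, fasta, tax)
print(tax.getvalue())
```

With real files, the two outputs should then be importable with `qiime tools import` as FeatureData[Sequence] and FeatureData[Taxonomy]. If the taxonomy sits in the FASTA headers instead, only the parsing step would change.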

jasongallant · August 3, 2020, 2:34pm

Hi @David_Pearton, we are now trying to do the same in my lab.
With the latest MIDORI reference, it's been running for about 24h on a 256GB RAM machine, using about 75GB of RAM steadily, no sign of completion yet.

- Did you have any hints or scripts to facilitate this?
- qiime feature-classifier fit-classifier-naive-bayes is essentially silent about progress when working on the MIDORI reference, so it's hard to say whether it's progressing. Did you observe this?
- Did you ever succeed with BOLD? Any hints or scripts to help with loading this into QIIME 2?

Thanks!

Nicholas_Bokulich · August 3, 2020, 3:19pm

Welcome to the forum, @jasongallant,
Thanks for digging up this old topic... some of the info I wrote in there a year ago is impacted by recent updates.

That is correct — fit-classifier-naive-bayes does not give any "progress update" unfortunately.
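One workaround is a small wrapper that at least logs that the job is still alive. The sketch below is my own and not part of QIIME 2: it reports elapsed time at a fixed interval, which confirms the process has not died but says nothing about how far training has got. The demo runs a short stand-in command rather than qiime.

```python
import subprocess
import sys
import time


def run_with_heartbeat(cmd, interval=60.0, log=print):
    """Run cmd, logging elapsed time every `interval` seconds until it exits."""
    start = time.monotonic()
    proc = subprocess.Popen(cmd)
    while True:
        try:
            returncode = proc.wait(timeout=interval)
            break
        except subprocess.TimeoutExpired:
            log(f"still running after {time.monotonic() - start:.0f} s")
    log(f"finished with exit code {returncode} after {time.monotonic() - start:.0f} s")
    return returncode


# Stand-in command for the demo; swap in the qiime call as a list of arguments.
messages = []
code = run_with_heartbeat(
    [sys.executable, "-c", "import time; time.sleep(0.5)"],
    interval=0.2,
    log=messages.append,
)
print("\n".join(messages))
```

For a real run, an interval of several minutes is plenty, and passing `log=print` inside a batch job sends the heartbeat lines to the job's output file.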

I'd recommend reducing database size to reduce runtime, if possible:

- Use extract-reads to focus on the amplicon region you are using.
- Remove any low-quality sequences.
- Dereplicate the database (ideally after extracting amplicons) to reduce database size and redundancy. You can use RESCRIPt to dereplicate the sequences together with the taxonomy.
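As a rough sketch, the three steps above could be chained as follows. The script only prints the commands rather than running them. The file names and primer strings are placeholders, using cull-seqs for the quality step is my assumption about how to do it with RESCRIPt, and the option names are as I remember them, so check each command's `--help` first.

```python
import shlex

# Placeholder primers and file names; replace with your own.
F_PRIMER = "FORWARD_PRIMER_SEQUENCE"
R_PRIMER = "REVERSE_PRIMER_SEQUENCE"

steps = [
    # 1. Trim the references to the amplicon region.
    ["qiime", "feature-classifier", "extract-reads",
     "--i-sequences", "ref-seqs.qza",
     "--p-f-primer", F_PRIMER, "--p-r-primer", R_PRIMER,
     "--o-reads", "ref-seqs-amplicon.qza"],
    # 2. Drop low-quality sequences (assumed to be done with cull-seqs).
    ["qiime", "rescript", "cull-seqs",
     "--i-sequences", "ref-seqs-amplicon.qza",
     "--o-clean-sequences", "ref-seqs-clean.qza"],
    # 3. Dereplicate sequences together with their taxonomy.
    ["qiime", "rescript", "dereplicate",
     "--i-sequences", "ref-seqs-clean.qza",
     "--i-taxa", "ref-taxonomy.qza",
     "--p-mode", "uniq",
     "--o-dereplicated-sequences", "ref-seqs-derep.qza",
     "--o-dereplicated-taxa", "ref-taxonomy-derep.qza"],
]

for step in steps:
    print(shlex.join(step))
```

The dereplicated sequences and taxonomy would then be the inputs to fit-classifier-naive-bayes.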

RESCRIPt also has a "get-ncbi-data" method that you can use to download data from GenBank and automatically format it and import it as QIIME 2 artifacts. Since BOLD deposits their public data on GenBank (all or most? not sure), it would be possible to use that to grab public BOLD data.

Good luck!