Hello, dear community!
I have been using QIIME2 since 2019 and consider myself relatively experienced with the pipeline at this point. It has been a long time since I last trained my own classifier using SILVA database data. If I remember correctly, it used to take around 1-2 hours to fully train a new classifier. I'm interested in comparing multiple classifiers trained with different parameters to find the optimal classification for our data. However, the classifier I'm currently training has been running for days, which worries me a bit.
To be more specific, I'm executing the RESCRIPt hard mode pipeline (you can find it here: Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt). I'm running fit-classifier-naive-bayes using the "SILVA_138.1_SSURef_tax_silva.fasta" file from the SILVA database, along with its corresponding mapping, taxonomy and tree files. I'm using a machine with 516 GB of RAM available, and the memory usage of the analysis has currently stabilized at 112 GB.
It's been a long time since I last ran the RESCRIPt pipeline. The SILVA version at that time was 128 (360 MB), which is roughly half the size of today's SILVA 138 (698 MB). Nevertheless, it seems rather strange that training is taking so long. Even though the file size has doubled, the run time seems to have increased more than tenfold.
Am I the only one with this issue? Does it really take that much longer to train the classifier now? Or is my memory incorrect?
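Since file size is only a rough proxy for how much data the classifier has to process, a more direct comparison of the two releases could count the sequences and total residues in each FASTA file. Below is a minimal sketch of that idea; the function name `fasta_stats` and the file paths in the usage comment are placeholders I made up for illustration, and it assumes uncompressed, plain-text FASTA input.

```python
def fasta_stats(path):
    """Return (number of sequences, total residues) for a plain-text FASTA file."""
    n_seqs = 0
    n_residues = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                n_seqs += 1  # each header line starts a new record
            else:
                n_residues += len(line)  # sequence lines may wrap over several lines
    return n_seqs, n_residues


# Hypothetical usage (paths are placeholders, not the actual file locations):
# old = fasta_stats("silva_128.fasta")
# new = fasta_stats("SILVA_138.1_SSURef_tax_silva.fasta")
# print("sequence ratio:", new[0] / old[0], "residue ratio:", new[1] / old[1])
```

If the sequence or residue counts differ by much more than the twofold difference in file size, that would at least partly account for a longer run time.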
Thank you so much for your support!