I have been using QIIME2 since 2019 and consider myself relatively experienced with the pipeline at this point. It's been a long time since I last trained my own classifier on SILVA database data. If I remember correctly, it used to take around 1-2 hours to fully train a new classifier. I'm interested in comparing multiple classifiers trained with different parameters to find the optimal classification for our data. However, the classifier I'm currently training has been running for days, which worries me a bit.
To be more specific, I'm executing the RESCRIPt hard mode pipeline (you can find it here: Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt). I'm running fit-classifier-naive-bayes on the "SILVA_138.1_SSURef_tax_silva.fasta" file available from the SILVA database, along with its corresponding mapping, taxonomy and tree files. I'm using a machine with 516 GB of RAM available, and the analysis has currently stabilized at 112 GB of memory usage.
It's been a long time since I ran the RESCRIPt pipeline. The SILVA version at that time was 128 (360 MB), which is roughly half the size of today's SILVA 138 (698 MB). Nevertheless, it seems rather strange that it is taking so long. Even if the file size has doubled, the run time seems to have increased more than tenfold (the struggle). Am I the only one with this issue? Does it really take that much longer to train the classifier now? Or is my memory incorrect?
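To put rough numbers on that comparison, here is a quick back-of-the-envelope sketch using only the figures quoted above. The one assumption is that "days" means at least one full day (24 hours); the real elapsed time was not measured.

```python
# Figures quoted in the post above.
old_size_mb = 360   # SILVA 128 file size
new_size_mb = 698   # SILVA 138.1 file size
old_hours = 2       # upper end of the remembered 1-2 hour training time

# Assumption (not measured): "days" means at least one full day.
new_hours_lower_bound = 24

size_ratio = new_size_mb / old_size_mb
time_ratio = new_hours_lower_bound / old_hours

print(f"input grew about {size_ratio:.2f}x")
print(f"runtime grew at least {time_ratio:.0f}x")
```

So the input roughly doubled, while the run time grew by an order of magnitude or more, which is why it looks out of proportion.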
If you are running fit-classifier-naive-bayes on the raw files without any data reduction, e.g. dereplication or quality filtering, then it will certainly require lots of time and memory. That's 2,224,740 sequences!
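If you want to check how many sequences a reference file contains before training, counting the FASTA header lines is enough. Below is a minimal sketch in plain Python; the function name `count_fasta_records` and the tiny demo file are made up for illustration, and it assumes an uncompressed FASTA file.

```python
import os
import tempfile


def count_fasta_records(path):
    """Count sequences in an uncompressed FASTA file.

    Every record starts with a header line beginning with '>',
    so counting those lines gives the number of sequences.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Tiny made-up FASTA file, only to show the function working.
demo_fasta = (
    ">seq1 Bacteria;Example\n"
    "ACGU\n"
    ">seq2 Bacteria;Example\n"
    "ACGG\n"
    "UUAA\n"
    ">seq3 Archaea;Example\n"
    "GGCC\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(demo_fasta)
    demo_path = tmp.name

demo_count = count_fasta_records(demo_path)
os.remove(demo_path)

print(demo_count)  # 3
```

Running `count_fasta_records` on the decompressed SILVA file before and after each curation step shows how much each step actually shrinks the training set.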
I'd highly suggest you at least run qiime rescript dereplicate ... with the --p-mode 'uniq' flag. If that does not help, consider running some of the earlier steps first, e.g. qiime rescript cull-seqs ..., as outlined in the tutorial. The full database contains quite a few low-quality sequences that should be removed, as well as redundant sequence data.
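To give a feel for why dereplication shrinks the training set, here is a toy sketch of the idea as I understand the 'uniq' mode: identical sequences are collapsed only when they also carry the same taxonomy. This is not RESCRIPt's actual implementation, which works differently under the hood; the function name `dereplicate_uniq` and the example records are hypothetical.

```python
def dereplicate_uniq(seqs, taxa):
    """Keep one record per distinct (sequence, taxonomy) pair.

    seqs: dict mapping sequence ID -> sequence string
    taxa: dict mapping sequence ID -> taxonomy string
    Returns the retained sequences and their taxonomy, keeping the
    first ID seen for each distinct pair.
    """
    seen = set()
    kept_seqs, kept_taxa = {}, {}
    for seq_id, seq in seqs.items():
        key = (seq, taxa[seq_id])
        if key in seen:
            continue  # exact duplicate with the same taxonomy: drop it
        seen.add(key)
        kept_seqs[seq_id] = seq
        kept_taxa[seq_id] = taxa[seq_id]
    return kept_seqs, kept_taxa


# Made-up records: a and b are identical, c has the same sequence
# but a different taxonomy, d is a different sequence.
seqs = {"a": "ACGT", "b": "ACGT", "c": "ACGT", "d": "GGCC"}
taxa = {
    "a": "Bacteria;X",
    "b": "Bacteria;X",
    "c": "Bacteria;Y",
    "d": "Bacteria;X",
}

kept_seqs, kept_taxa = dereplicate_uniq(seqs, taxa)
print(sorted(kept_seqs))  # ['a', 'c', 'd']
```

Only `b` is dropped here, because `c` shares the sequence but not the taxonomy. On a reference with millions of records, removing exact duplicates like this is the cheapest way to cut training time without losing taxonomic information.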
Just as a reference, I can train and use the full-length SILVA 138.1 classifier on my M1 Max with 64 GB RAM after performing some of the curation steps outlined in the tutorial. I think it takes about 3-4 hours.
But it appears you have the memory for this... so you'll likely have to wait a long time. I can't really give an estimate, but perhaps days, as you've noted?
Hey @SoilRotifer, thank you so much for the insights!
I will try to clean the raw data as you suggested. Perhaps it will take less time. I think 3-4 hours is what I'm aiming for.
RAM is not the issue here. I was really worried about it taking much more time than I remembered. I guess a significant proportion of SILVA data is garbage then. So let's clean it up!