I am working with a full-length 16S database (from Metasquare, which merges several collections) and I have already completed the import steps in QIIME 2. The problem is that training the classifier is taking a very long time—it has been running since the day before yesterday—and I couldn't find any option to parallelize the fit-classifier-naive-bayes command.
Given that I currently have 1 TB of RAM available, I would like to know:
Does anyone know if there is a publicly available .qza file with a pre-trained classifier using full-length 16S sequences?
If it doesn't exist, what strategies do you recommend to optimize or speed up training in environments with abundant RAM, even though QIIME 2 does not allow direct parallelization?
Are there examples of users who have split their reference data (e.g., by region or into subsets) and then combined classifiers, or any other workarounds that have worked well?
I appreciate any guidance, examples of workflows, or links to similar resources.
Yes, it can take a while. However, it often helps to remove redundant and low-quality sequences prior to training. Often there are many identical sequences with the same taxonomy. Removing these, and performing other quality control steps prior to training will help reduce the database size and memory footprint, which will enable faster training.
I suggest looking through this RESCRIPt tutorial for general ideas, skipping ahead to the dereplication and cull-seqs steps. You can also drastically speed things up by building an amplicon-specific classifier: extract the amplicon region, dereplicate the extracted sequences, remove low-quality sequences, then train. A rough sketch of these steps is below.
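Roughly, the curation steps look like this (file names are placeholders, the parameter values are the ones used in the RESCRIPt tutorial, and the primer pair in the optional extraction step is just the common V4 515F/806R example; check --help for your RESCRIPt/QIIME 2 version):

```
# Collapse identical sequences that share the same taxonomy
qiime rescript dereplicate \
  --i-sequences ref-seqs.qza \
  --i-taxa ref-taxonomy.qza \
  --p-mode 'uniq' \
  --o-dereplicated-sequences ref-seqs-derep.qza \
  --o-dereplicated-taxa ref-taxonomy-derep.qza

# Remove low-quality sequences (too many degenerate bases or long homopolymers)
qiime rescript cull-seqs \
  --i-sequences ref-seqs-derep.qza \
  --p-num-degenerates 5 \
  --p-homopolymer-length 8 \
  --o-clean-sequences ref-seqs-cleaned.qza

# Optional: make an amplicon-specific reference by extracting your primer region,
# then dereplicate again before training (primers shown are the V4 515F/806R pair)
qiime feature-classifier extract-reads \
  --i-sequences ref-seqs-cleaned.qza \
  --p-f-primer GTGYCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACNVGGGTWTCTAAT \
  --o-reads ref-seqs-amplicon.qza
```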
Keep in mind the tutorial is not necessarily structured as a standard operating procedure (SOP). Its main purpose is to provide command examples. You can carry out the commands in many different ways. Feel free to alter the order of the commands, etc…
I am wondering if increasing the --p-classify--chunk-size will help with training speed? Can anyone else provide insight on this, or other ideas?
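For reference, the parameter is passed like this; the value 40000 is just an illustration (the default is 20000, if I recall correctly), and I have not benchmarked whether raising it actually speeds up training, since it mainly trades memory for fewer chunks. With 1 TB of RAM it may be worth a try:

```
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs-derep.qza \
  --i-reference-taxonomy ref-taxonomy-derep.qza \
  --p-classify--chunk-size 40000 \
  --o-classifier classifier.qza
```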
SoilRotifer, thank you for your suggestion. After doing some research, and in the interest of saving time, I have decided that the fastest approach for me is not to use a pre-trained classifier for full-length 16S, but instead to assign taxonomy with VSEARCH (which seems faster and more stable for full-length 16S). Before that, of course, I need to import my sequences in FASTA format and my taxonomy file to get the .qza artifacts required to run qiime feature-classifier classify-consensus-vsearch (this way, a separate classifier .qza is not needed).
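For anyone following along, this is roughly what I am running (file names are placeholders, the thread count is illustrative, and you can drop --input-format if your taxonomy TSV has a header row):

```
# Import the reference sequences and taxonomy into .qza artifacts
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path ref-seqs.fasta \
  --output-path ref-seqs.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path ref-taxonomy.tsv \
  --output-path ref-taxonomy.qza

# Assign taxonomy directly against the reference; no trained classifier needed
qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs.qza \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --p-threads 16 \
  --output-dir vsearch-taxonomy
```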
I highly suggest this paper, it has some nice benchmarks.
I should point out that when comparing vsearch to sklearn one should only compare the classification time. Yes, training the classifier can take a while, but once it has been trained it can be as fast or faster than vsearch for classification. Again, this all depends on how the database has been curated and prepared.
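That is, once the classifier artifact exists, the classification step itself can be parallelized, something along these lines (file names are placeholders and the parameter values are illustrative):

```
qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads rep-seqs.qza \
  --p-n-jobs 8 \
  --p-reads-per-batch 10000 \
  --o-classification taxonomy.qza
```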
classify-consensus-vsearch can be computationally intensive, especially for large datasets. It can be slower than classify-sklearn for large-scale taxonomic assignment, especially during the consensus-determination step.
I've experienced this exact situation myself on several occasions with large reference taxonomy databases. That is, I gained back the time lost to sklearn training by using the classifier across many projects. Train once, use many times.
Again, your mileage may vary depending on the size and scope of the reference data, and how it has been curated. I like to avoid making general statements about classification times between methods.
Also, the other suggestions I mentioned above still apply. Curating your reference database prior to use in classification is important. For example, a reference database that has an over-abundance of representative sequences from some groups over others, i.e. bias, can be problematic...
Okay, enough of my rambling.
Anyway, it seems like you're off to a good start. Keep us posted!
Thank you Colinbrislawn, I had already reviewed the Silva database for classify-sklearn. I know that Silva is one of the most popular databases, but it has certain limitations. That's why I wanted to explore the possibility of using the Metasquare database, which is a compilation that integrates several databases.
Hi, and thank you again.
Yes, I reduced the redundancy of the sequences I used to train the classifier. It has been training for over a week and still hasn't generated any files. So here I am, with my coffee (and a good dose of patience), waiting for it to finish.
I understand that your methodologies and approaches are very different.
On the other hand, I am now running VSEARCH with my sequences. It has also taken quite a while (it's been 2 days), even though I used more than 70 processors. As an extra note, I should mention that it is consuming a lot of RAM (over 500 GB); it's a real memory hog, haha.
Given this situation, I decided to return to the Silva database I had previously dismissed. I just ran it, and as you mentioned, SoilRotifer, it is very fast; I already have my relative abundance table per sample. However, I will keep waiting for the training on the other database. And yes, I agree: train once, use many times.
Yeah, consensus vsearch can use a lot of resources.
I often try to find a small computer with a fast processor and ~32-64 GB of RAM to train my classifiers. I am able to train a full SILVA DB on my M1 Max MacBook Pro in a few hours. The HPC that I have access to has very old, slow CPUs... so it usually takes a couple of days to perform the same task there.
Have you tried the GTDB database? You can use RESCRIPt to download release 226 and see how it performs.
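Something like this should pull it down; I'm writing the flags from memory, so double-check qiime rescript get-gtdb-data --help for the exact parameter and output names, and whether release 226 is available in your RESCRIPt version:

```
qiime rescript get-gtdb-data \
  --p-version '226.0' \
  --p-domain 'Both' \
  --p-db-type 'SpeciesReps' \
  --o-gtdb-taxonomy gtdb-taxonomy.qza \
  --o-gtdb-sequences gtdb-seqs.qza
```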
The same thing happens to me; I also work on an old, slow machine. My laptop is also old, and although I keep it updated, I don't think it makes much difference. I work in a low-budget lab and there are no resources to upgrade the equipment, haha.
I started researching RESCRIPt and liked it, so I'm going to try it with a large database like the one I want to use.
Yes, I'm working with full-length nanopore sequences for the first time...noticeable, right?
I've been reviewing how to do the taxonomic assignment with classify-consensus-blast, and I've noticed that it's as slow as classify-consensus-vsearch, which is a disadvantage at the moment. With classify-consensus-vsearch I used the parameters --p-perc-identity 0.9, --p-min-consensus 0.6, and --p-query-cov 0.8, in addition to the default k-mer word length. I didn't realize, until I saw your post, that sequence length (mine are over 1,500 bp) and the --wordlength setting significantly affect performance, which explains the high RAM consumption.
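For context, the call I ran looked roughly like this (file names are placeholders; the parameters match what I listed above and the 70 threads I mentioned earlier):

```
qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs.qza \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --p-perc-identity 0.9 \
  --p-query-cov 0.8 \
  --p-min-consensus 0.6 \
  --p-threads 70 \
  --output-dir vsearch-full-length
```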
Additionally, I noticed that this approach does not provide assignment probabilities, which leads me to rule it out. I think I'll go with the strategy of training my own classifier on a database curated with RESCRIPt.