Hi! I am using QIIME 2 v.2024.2 on a supercomputing cluster (conda module). I've tried to find a way to set a seed to generate reproducible results for the qiime feature-classifier classify-sklearn function, but I cannot find any discussion on the topic and there's nothing stated in the function's help page. Is there a way to make the results reproducible?
I would also like some input on how I could optimize the classifier, since I get a recurring memory overflow error from the computing cluster whenever I give the classifier anything more than approximately 15k reads at a time. My entire dataset is 4.2 million reads, so in 15k-read chunks it'll take me forever... (I understand that this is mostly an issue with the cluster, but maybe there's a way to optimize this script for wiser use of resources too.)
The exact commands I am using are:
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref_sequences.qza \
  --i-reference-taxonomy taxonomy.qza \
  --o-classifier bayesian_classifier_sklearn.qza
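The classification step itself is then essentially the standard classify-sklearn call, roughly like the sketch below (the artifact names and parameter values are illustrative rather than my exact ones; as I understand it, --p-reads-per-batch and --p-n-jobs are the main knobs that affect how much is held in memory at once):

# batching / parallelism knobs: higher values generally mean more memory in use at once
qiime feature-classifier classify-sklearn \
  --i-classifier bayesian_classifier_sklearn.qza \
  --i-reads rep_seqs.qza \
  --p-reads-per-batch 15000 \
  --p-n-jobs 1 \
  --o-classification taxonomy_assignments.qza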
I don't have an answer for setting a seed, but I do have a question about your data:
As I read this, I'm shocked that you have 4.2M ASV sequences that need to be classified. Are you possibly doing metagenomics? In that case, something like q2-shogun or one of the other metagenomics tools will serve you better.
Are you working on raw reads? If that's the case, I would recommend quality filtering and dereplicating at a minimum, and would probably suggest denoising. If you have the same read 10,000 times across your dataset, there's no reason to classify it individually when you could classify it once.
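If it helps, a minimal dereplication sketch, assuming your reads are already imported as a SampleData[Sequences] artifact (the file names are placeholders):

# collapses identical reads into unique sequences plus a per-sample count table,
# so each unique sequence only needs to be classified once
qiime vsearch dereplicate-sequences \
  --i-sequences seqs.qza \
  --o-dereplicated-table derep_table.qza \
  --o-dereplicated-sequences derep_seqs.qza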
I feel like a 15,000-read chunk is pretty on par with my bigger (1000s of samples) datasets, although someone like the EMP team could probably give better details of how many ASVs they saw in 2017.
I understand if this number surprises you, but what I have sequenced is ITS amplicons from 141 eDNA samples, sequenced on Nanopore. Since Nanopore has such high error rates, clustering is less efficient. Before dereplication I had about 13M+ reads, and before removing chimeras I had 7.2M reads. So filtering, dereplication, and chimera removal have already been done. Due to Nanopore's high error rates, most bioinformatics tools are also poorly suited to this data.
Thanks for that. I will try using a 15k-read chunk and see where that lands me.
I can answer the first part of your question, to add to @jwdebelius's great answers about the second part.
This classifier is already deterministic, and the results should be fully reproducible when the same inputs are used and the read orientation is the same. There is no option for setting a random seed because there is no randomization step.
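If you want to take the orientation handling out of the picture entirely, one option is to pin --p-read-orientation explicitly when you classify, something like the sketch below (artifact names are placeholders):

# --p-read-orientation accepts 'same', 'reverse-complement', or 'auto' (the default);
# fixing it means orientation handling does not depend on the input reads
qiime feature-classifier classify-sklearn \
  --i-classifier bayesian_classifier_sklearn.qza \
  --i-reads rep_seqs.qza \
  --p-read-orientation same \
  --o-classification taxonomy_assignments.qza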