Hopefully this is a novel question.
I am currently using qiime2 2022.2.1, both in a VirtualBox VM locally and on a computing core.
For background, I have around 5.8 million reads and want to assess read quality to choose truncation parameters. The default of 10,000 reads is therefore quite low relative to the total. I ran the command below (at 5 million and at 1 million reads) on my local VirtualBox, and the output was 'Killed'. I assumed this was due to limited local computing power, so I connected to my computing core, loaded qiime2 2022.2, and ran it again at both 5 million and 1 million reads. The output in both cases was again 'Killed'. Is there a maximum threshold for demux summarize?
(In addition, I ran the following to verify that my file was not corrupt:
qiime tools validate demux-single-end.qza
output:
Result demux-single-end.qza appears to be valid at level=max.)
Additionally, here is the verbose command I ran on the computing core:
code:
qiime demux summarize \
  --i-data demux-single-end.qza \
  --p-n 1000000 \
  --verbose \
  --o-visualization demux_deep.qzv
I would hazard a guess that you ran out of memory with the larger subsample sizes. Otherwise, there's no fundamental threshold on that, just a point of strongly diminishing returns per joule of energy burned.
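If you want to confirm the memory hypothesis, one generic option is to re-run under GNU time to see peak memory use, or to check the kernel log for OOM-killer messages afterwards. (This is just a sketch assuming a standard Linux environment; /usr/bin/time -v and the dmesg check are assumptions about your setup, not anything specific to QIIME 2, and dmesg may require elevated privileges on some systems.)
code:
# Report peak RAM for the run ('Maximum resident set size', in KB).
/usr/bin/time -v qiime demux summarize \
  --i-data demux-single-end.qza \
  --p-n 1000000 \
  --o-visualization demux_deep.qzv

# If the process was OOM-killed, the kernel usually logs it:
dmesg | grep -i 'out of memory'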
It might be worth mentioning that the goal (and only purpose of that param) is to see what the distribution of quality scores looks like in your run. For that, you really don't need a stupendous amount of the data, as the reads tend to follow the same profile on average. If you feel that 10k isn't enough to gauge the q-scores, you might try 50k or even 100k; see the example below. But I think 1M and 5M exceed what we expected when that code was written.
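For instance, a run with a larger-but-still-modest subsample might look like this (--p-n is the same subsampling parameter you already used; demux-100k.qzv is just a placeholder output name):
code:
# Subsample 100k reads for the quality plot; usually plenty to see the profile.
qiime demux summarize \
  --i-data demux-single-end.qza \
  --p-n 100000 \
  --o-visualization demux-100k.qzv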
Thanks for the response! The reason I am looking to plot such a large number of reads is that I have also plotted them in R (for a subset of samples, i.e. 10 samples). When doing this, I noticed that my quality profiles were quite different across samples. I plan to plot all the samples individually in R, but I also wanted a quick look at the majority of my reads in one place. I assumed that the computing core would have enough memory, but you may be correct. Thanks for the help; I'll try a smaller subset (100k), which should run without trouble.