Does QIIME 2 support partial data analysis?

hah606 · December 16, 2021, 12:16pm

Hi, I'm studying QIIME 2. Part of the analysis is taking too long, so I decided to use parallel computing and process each sample separately. Does QIIME 2 support partial data analysis? For example, if I import 20 samples, can I run dada2 denoise-paired on only the first sample? Is there an argument I can change to do this? Thanks so much.

Keegan-Evans · December 20, 2021, 4:48pm

@hah606,

You can subset your data. However, the error model is built per sequencing run, so the more samples from your run that you keep, the more accurate the error model will be. Also, generating the error model is a fairly time-consuming step, and you would repeat it for every sample if you ran a separate denoising job per sample.

If you are only interested in a few of the samples, you can use demux filter-samples (docs). However, if you are interested in all of your samples, I think you are better off optimizing a single run. First, make sure the pooling method is set to independent: --p-pooling-method independent. This should be the default, but it is worth confirming that it is actually set.
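As a rough sketch of the subsetting route: one way is to write a small metadata file that lists only the sample IDs to keep and pass it to the filter action. The file names (demux.qza, demux-subset.qza, samples-to-keep.tsv) and the sample IDs are placeholders, not taken from the question, and the option names are as I understand the demux plugin, so check `qiime demux filter-samples --help` for your release.

```shell
# Hypothetical sample IDs to keep; replace them with IDs from your own run.
# The file is a minimal QIIME 2 metadata file: an ID header plus one ID per line.
printf 'sample-id\n%s\n%s\n' 'sample-1' 'sample-2' > samples-to-keep.tsv

# Wrapped in a function so nothing runs until you call it.
subset_demux() {
  qiime demux filter-samples \
    --i-demux demux.qza \
    --m-metadata-file samples-to-keep.tsv \
    --o-filtered-demux demux-subset.qza
}
```

Calling `subset_demux` should then write demux-subset.qza, which can be passed to the denoising step in place of the full artifact.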

Next, I would increase the number of threads used for denoising, which allows more than one sample to be denoised at the same time. The default is 1. Passing 0 uses all of the cores on your machine, but this can bog things down a bit, so it is usually best to specify a number that is 1 or 2 less than the number of cores you have. For an 8-core machine: --p-n-threads 6.
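A minimal sketch of that thread calculation, assuming the reads are in a file named demux.qza. The output file names are placeholders, and the truncation lengths of 0 (no truncation) are stand-ins only; pick real values from your own quality plots.

```shell
# Count the online cores; fall back to 1 if the count cannot be determined.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Leave two cores free for the rest of the system, but never go below 1 thread.
threads=$(( cores > 2 ? cores - 2 : 1 ))
echo "Using $threads of $cores cores"

# Wrapped in a function so nothing runs until you call it with a thread count.
denoise() {
  qiime dada2 denoise-paired \
    --i-demultiplexed-seqs demux.qza \
    --p-trunc-len-f 0 \
    --p-trunc-len-r 0 \
    --p-n-threads "$1" \
    --o-table table.qza \
    --o-representative-sequences rep-seqs.qza \
    --o-denoising-stats denoising-stats.qza
}
```

Running `denoise "$threads"` then starts a single denoising job over all samples, so the error model is only learned once.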

If you have a large number of samples and these suggestions are not enough, you could look into getting time on a more powerful machine. If you are at a university, there may be a high-performance computing cluster or similar available to you. Alternatively, you could use a cloud service like AWS.

hah606 · December 21, 2021, 9:51am

Thanks for your response. Yes, I noticed that we can use --p-n-threads with dada2 denoise-paired. But what about feature-classifier classify-consensus-blast? There seems to be no such option to make it faster, and the BLAST step usually takes a long time.

Keegan-Evans · January 3, 2022, 5:04pm

@hah606,

Unfortunately, the version of BLAST that feature-classifier currently uses cannot use multiple threads when given a "subject" (that is, the database sequences). Since classify-consensus-blast does supply a "subject", it is not possible to increase the number of threads.

classify-consensus-vsearch and classify-sklearn do provide the option of multiple threads. With classify-sklearn this can be very memory intensive, as each thread loads the entire classifier into memory, so realistically it is only an option on a machine or cluster with a lot of memory.
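For illustration, the two multi-threaded alternatives might be invoked roughly like this. All file names are placeholders, and the thread and job counts are assumed values. As far as I recall, the relevant options are --p-threads for the vsearch classifier and --p-n-jobs for the sklearn classifier, but the names and the set of required outputs can vary between releases, so confirm with `--help` first. The vsearch call uses --output-dir so that it does not depend on the exact list of outputs.

```shell
threads=4   # assumed value; match it to the cores you can spare

# VSEARCH consensus classifier: threads share one reference database.
classify_vsearch() {
  qiime feature-classifier classify-consensus-vsearch \
    --i-query rep-seqs.qza \
    --i-reference-reads ref-seqs.qza \
    --i-reference-taxonomy ref-taxonomy.qza \
    --p-threads "$threads" \
    --output-dir vsearch-classification
}

# sklearn classifier: every job loads the full classifier into memory,
# so keep the job count low unless the machine has plenty of RAM.
classify_sklearn() {
  qiime feature-classifier classify-sklearn \
    --i-reads rep-seqs.qza \
    --i-classifier classifier.qza \
    --p-n-jobs 2 \
    --o-classification taxonomy-sklearn.qza
}
```

Call whichever function matches the classifier you want, for example `classify_vsearch`.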

You might want to check out the overview section on taxonomy classification for more information on the different methods available.

system · February 3, 2022, 11:06pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.