Previously I brought up this topic, but it could not be resolved:
I am quite new to QIIME 2 and seem to have run into a problem with qiime feature-classifier extract-reads.
I am experiencing a very lengthy extract-reads step, which I have not experienced with my other primers (18S) and other databases (PR2 and SILVA). I am running my pipeline on a high-performance computer (so computing power is not a problem; I am currently using mem-per-cpu=10GB and cpus-per-task=16), and the step generally takes ~2 hours to run. The database I am using now (Midori) is double the size of the others, but I do not understand how 48 hours is not sufficient for it to run.
The memory is fine; the Slurm output says the job just ran out of time. I have been running other jobs since then, and there are no problems regarding memory.
It is just very frustrating when I am tweaking the pipeline and have to wait for the end result to see how it influences the results.
Update: it did not finish running in a week either.
I have found that if I skip the feature-classifier extract-reads step, I can get results within 24 hours for COI. As 18S worked with the previous method, I tested and compared the results, and they do differ (generally only once down to genus/species, so for 90% of cases I can see how they could potentially be compared). So I would just like advice on whether it is advisable to do this, as I see no other option.
Hi @Aimee, extract-reads is definitely not necessary, and the advantages that we see for 16S may not generalize to other marker genes (as we note here). It gives a small boost in accuracy, but that is not worth the wait time you are experiencing for your COI database.
I would definitely recommend just proceeding without trimming; at worst, there will be a slight accuracy decrease at species level.
I just wanted to let you know I’m in a somewhat similar situation, building a classifier based on a very large (1.3M) COI reference database. Extract reads DID run for me, in under an hour, using a fairly standard desktop machine. However, my primers have only 3 degeneracies in total.
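To compare primer sets on this point, here is a minimal Python sketch that counts degeneracies. It assumes "degeneracies" means positions holding an ambiguous IUPAC code, and it also reports how many concrete sequences the primer expands to. The function name degeneracy is made up for illustration, and whether degeneracy actually explains the runtime difference in this thread is only a guess.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer):
    """Return (ambiguous positions, number of concrete sequences) for a primer.

    Hypothetical helper: raises KeyError on characters outside the IUPAC
    table above (for example inosine, "I").
    """
    primer = primer.upper()
    positions = sum(1 for base in primer if len(IUPAC[base]) > 1)
    variants = 1
    for base in primer:
        variants *= len(IUPAC[base])
    return positions, variants


if __name__ == "__main__":
    # Placeholder primer, not one used by anyone in this thread.
    print(degeneracy("ACRYN"))
```

Running it on both the forward and reverse primer gives a quick way to see how the COI primers compare with the 18S ones.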
If you’re still interested in extracting using your primers, maybe you could try the process on a severely subsampled test database (10 sequences) just to see if it is possible? Also, could the cutadapt plugin (or just standalone) be used as an alternative?
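For the subsampling test, something like the sketch below would do. It is plain Python with no QIIME 2 dependency; the function names, file names, and the default of 10 sequences are placeholders. It uses reservoir sampling so the full reference FASTA never has to sit in memory.

```python
import random


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file, joining wrapped lines."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subsample_fasta(in_path, out_path, n=10, seed=0):
    """Write a random sample of up to n records to out_path; return the count."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(read_fasta(in_path)):
        if i < n:
            reservoir.append(record)
        else:
            # Reservoir sampling: keep each later record with probability n/(i+1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = record
    with open(out_path, "w") as out:
        for header, seq in reservoir:
            out.write(f">{header}\n{seq}\n")
    return len(reservoir)
```

The subsampled file could then be imported as FeatureData[Sequence] with qiime tools import and run through extract-reads with the same primers, to see whether the step finishes at all and how the runtime scales.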
Great suggestion. Standalone cutadapt could be used as an alternative, and it's not a big issue since you would be trimming these sequences before importing to QIIME 2 anyway. That step would not be stored in provenance, but it's not as much of an inconvenience as, say, needing to export a file to process and then re-import. q2-cutadapt is not designed to work on these specific data types and will not work (currently it is only designed to handle FASTQ data).
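As a rough sketch of the standalone route, the Python below builds a cutadapt command that trims a reference FASTA to the amplicon between two primers. It assumes cutadapt's linked-adapter syntax (-g FWD...REVRC, which as far as I understand requires both primers to be found) together with --discard-untrimmed, so please check both against the documentation for your cutadapt version. The helper names, primer strings, and file names are placeholders, and this is not what extract-reads does internally.

```python
import subprocess

# IUPAC-aware complement table (A<->T, C<->G, R<->Y, K<->M, B<->V, D<->H).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a primer, keeping IUPAC ambiguity codes valid."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def cutadapt_command(fwd_primer, rev_primer, in_fasta, out_fasta):
    """Build (but do not run) a cutadapt call for in-silico amplicon trimming."""
    # On the forward strand of the reference, the reverse primer appears
    # as its reverse complement at the 3' end of the amplicon.
    linked = f"{fwd_primer}...{reverse_complement(rev_primer)}"
    return [
        "cutadapt",
        "-g", linked,            # linked adapter: 5' primer ... 3' primer
        "--discard-untrimmed",   # drop references where the primers are not found
        "-o", out_fasta,
        in_fasta,
    ]


def run(cmd):
    """Run the command; requires cutadapt to be installed and on PATH."""
    subprocess.run(cmd, check=True)
```

The trimmed FASTA would then be imported into QIIME 2 as usual. One caveat: reference sequences stored in the reverse orientation would not match this pattern and would need separate handling.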
This was one of the first things I tried and I was just not happy with how long it took using the sub-sampled database.
I am happy with the results I obtained without using extract-reads, as I compare the results across more than one database. For me the main objective was to alter my pipeline as little as possible, so taking out a step was easier than putting more in. But yes, to correct my earlier statement, there are other options, and there are definitely more ways to get around the problem I was having.