I am currently using version 2020.11 of QIIME 2, installed through WSL and Ubuntu. I am running extract-reads on the Midori reference database and have run into a problem similar to these topics. It has been running for around 48 hours now. I’m very new to bioinformatics and was worried I had done something wrong at this step. I am attempting to trim the database with the following primers:
It seems like the high level of degeneracy in the primers may be contributing to the long runtime, especially with a database as large as Midori. Has anyone else had success with trimming this database? Or found this step to later improve accuracy for the CO1 region? I was just wondering if there had been any updates with this.
How large is Midori? My guess is “very large,” because this method can take some time on very large databases.
Absolutely! Longer primers and more degeneracy will increase runtime even more.
You can use the --p-n-jobs parameter to run this in parallel, reducing the wait (though 48 hours in, I don’t know whether it’s worth restarting versus just waiting until completion, or how much longer you have left to wait).
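A minimal sketch of what that could look like, assuming file names and primer sequences that are placeholders (substitute your own values):

```shell
# Hypothetical sketch: running extract-reads with multiple worker
# processes via --p-n-jobs. The .qza file names and the primer
# sequences below are placeholders, not real values.
qiime feature-classifier extract-reads \
  --i-sequences midori-refseqs.qza \
  --p-f-primer YOURFORWARDPRIMER \
  --p-r-primer YOURREVERSEPRIMER \
  --p-n-jobs 4 \
  --o-reads midori-trimmed-reads.qza
```

Note that --p-n-jobs only helps if you start the job with it; it won’t speed up a run that is already in progress.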
Generally trimming is beneficial: there are a few publications (for 16S) showing improved classification accuracy, and it will also reduce runtime downstream, so it is a worthwhile step to perform if this is a trimmed database that you will use again and again (for classification, alignment, filtering, or whatever).
Thank you for the quick response! Midori is around 1.3 million sequences. It still hasn’t finished, and it has been close to a week now. I ran the top command in another window to double-check, and it seems like it’s still running. I plan on using the database for classification, so I guess I should just stick it out at this point if trimming will help me later. Hopefully it finishes soon!
Hi there! Just as a quick update, the job is still running. Does this seem normal? None of my previous steps have taken this long, so I’m not sure if something is wrong with the job or what I should do. Thank you for your time and help with this.
I hope so! I’m a bit worried about how long any downstream jobs may take using this database after this. Based on this runtime, do you think attempting to train the classifier is worth it? Or is it likely it will take just as long?
Runtimes will be shorter than with the untrimmed database.
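For reference, training the classifier on the trimmed reads could look something like this sketch (file names here are placeholders):

```shell
# Hypothetical sketch: training a naive Bayes classifier on the
# trimmed reference reads plus the matching taxonomy artifact.
# File names are placeholders, not real values.
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads midori-trimmed-reads.qza \
  --i-reference-taxonomy midori-taxonomy.qza \
  --o-classifier midori-classifier.qza
```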
However, the size of the database will still cause some lag downstream in general compared to smaller databases… Is this database clustered into OTUs to reduce redundancy? If not, you might want to check out RESCRIPt to cluster/dereplicate the database, saving loads of time downstream.
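One redundancy-reduction option is RESCRIPt's dereplicate action; a rough sketch, assuming the RESCRIPt plugin is installed and with placeholder file names:

```shell
# Hypothetical sketch: dereplicating the reference database with
# RESCRIPt to remove redundant sequence/taxonomy pairs.
# File names are placeholders; --p-mode 'uniq' retains unique
# sequence + taxonomy combinations.
qiime rescript dereplicate \
  --i-sequences midori-trimmed-reads.qza \
  --i-taxa midori-taxonomy.qza \
  --p-mode 'uniq' \
  --o-dereplicated-sequences midori-derep-seqs.qza \
  --o-dereplicated-taxa midori-derep-taxa.qza
```

Dereplication is shown here as one way to shrink the database; depending on your goals, clustering at a lower percent identity would reduce it further at some cost in resolution.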