Feature-classifier extract-reads

Hi!

I am using QIIME 2 version 2020.11, installed on Ubuntu through WSL. I am running extract-reads on the Midori reference database and have run into a problem similar to the one in these topics: the job has been running for around 48 hours now. I’m very new to bioinformatics and am worried I have done something wrong at this step. I am attempting to trim the database with the following primers:

f-primer: GGWACWGGWTGAACWGTWTAYCCYCC
r-primer: TANACYTCNGGRTGNCCRAARAAYCA

It seems like the high level of degeneracy in the primers may be contributing to the long runtime, especially with a database as large as Midori. Has anyone else had success trimming this database, or found that this step later improves accuracy for the CO1 region? I was just wondering if there had been any updates on this.

Any insight is much appreciated!

How large is midori? My guess is "very large" because this method can take some time on very large databases :grin:

Absolutely! Longer primers and more degeneracy will increase runtime even more.
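To put a rough number on "degeneracy", here is a minimal Python sketch that counts how many non-degenerate sequences each primer from the original post can stand for, using the standard IUPAC nucleotide codes. The `degeneracy` helper is just an illustration I made up, and the count is only a rough indicator of how ambiguous the primers are, not a model of how extract-reads actually searches.

```python
# Number of bases each IUPAC nucleotide code can stand for.
IUPAC_OPTIONS = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def degeneracy(primer: str) -> int:
    """Count the distinct non-degenerate sequences a primer encodes."""
    total = 1
    for base in primer.upper():
        total *= IUPAC_OPTIONS[base]
    return total


f_primer = "GGWACWGGWTGAACWGTWTAYCCYCC"
r_primer = "TANACYTCNGGRTGNCCRAARAAYCA"

print(len(f_primer), degeneracy(f_primer))  # 26 128
print(len(r_primer), degeneracy(r_primer))  # 26 2048
```

So both primers are 26 nt long, and the reverse primer alone covers 2048 possible sequences, mostly because of its three N positions.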

You can use the --p-n-jobs parameter to run this in parallel, reducing the wait (though at 48 hr in, I don't know whether it's worth restarting or just waiting until completion, or how much longer you have left to wait :man_shrugging:)
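As a sketch of what a parallel run might look like, the Python snippet below just assembles the command line and prints it. The file names and the job count of 4 are placeholders I made up, so swap in your own artifacts and a number that fits your machine.

```python
import shlex


def extract_reads_command(sequences, f_primer, r_primer, output, n_jobs=4):
    """Build the argument list for `qiime feature-classifier extract-reads`."""
    return [
        "qiime", "feature-classifier", "extract-reads",
        "--i-sequences", sequences,
        "--p-f-primer", f_primer,
        "--p-r-primer", r_primer,
        "--p-n-jobs", str(n_jobs),
        "--o-reads", output,
    ]


cmd = extract_reads_command(
    sequences="midori-seqs.qza",          # hypothetical input artifact
    f_primer="GGWACWGGWTGAACWGTWTAYCCYCC",
    r_primer="TANACYTCNGGRTGNCCRAARAAYCA",
    output="midori-extracted-reads.qza",  # hypothetical output artifact
    n_jobs=4,                             # assumed number of parallel jobs
)

# Print the command so it can be checked before anything is launched.
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list keeps the primers and paths safely quoted, and you can paste the printed line straight into your terminal if you prefer.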

Generally, trimming is beneficial: there are a few publications (for 16S) showing improved classification accuracy. It will also reduce runtime downstream, so it is a worthwhile step to perform if this is a trimmed database that you will use again and again (for classification, alignment, filtering, or whatever).

For COI specifically, @devonorourke tested this here:

and describes steps to reproduce that database here:

Thank you for the quick response! Midori is around 1.3 million sequences. The job still hasn’t finished after close to a week. I ran the top command in another window to double-check, and it seems to still be running. I plan on using the database for classification, so I guess I should just stick it out at this point if trimming will help me later. Hopefully it finishes soon! :woman_shrugging:

Hi there! Just as a quick update, the job is still running. Does this seem normal? None of my previous steps have taken this long, so I’m not sure if something is wrong with the job or what I should do. Thank you for your time and help with this.

Wow. This is unusually long! But as you said, midori is huge and lots of degeneracy will just slow things down… so this is longer than usual but perhaps expected.

As long as top suggests the job is still using resources then it must be slowly progressing…
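If you would rather have a number than watch top, here is a minimal Linux-only sketch (it should also work under WSL, assuming /proc is available) that reads the CPU time a process has used from /proc/<pid>/stat, waits, and reads it again. The function names and the PID in the usage comment are hypothetical. It only counts the process you point it at, so worker processes started by a parallel run would need to be checked separately.

```python
import os
import time


def cpu_ticks_from_stat(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so only split the part that follows the last ')'.
    fields = stat_line[stat_line.rfind(")") + 2:].split()
    # fields[0] is the process state (field 3 of the file), which puts
    # utime (field 14) and stime (field 15) at indexes 11 and 12.
    return int(fields[11]) + int(fields[12])


def cpu_seconds_used(pid: int, interval: float = 5.0) -> float:
    """CPU seconds consumed by `pid` during `interval` wall-clock seconds."""
    def read_ticks() -> int:
        with open(f"/proc/{pid}/stat") as handle:
            return cpu_ticks_from_stat(handle.read())

    before = read_ticks()
    time.sleep(interval)
    after = read_ticks()
    return (after - before) / os.sysconf("SC_CLK_TCK")


# Usage (12345 is a hypothetical PID; take the real one from top):
# print(cpu_seconds_used(12345))
```

A value close to the interval means the process kept one core busy the whole time, while a value near zero over several samples would suggest it is stalled or waiting.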

Sorry you are experiencing a delay! Hopefully it will finish soon.

I hope so! I’m a bit worried about how long any downstream jobs may take using this database after this. Based on this runtime, do you think attempting to train the classifier is worth it? Or is it likely it will take just as long?

Runtimes will be shorter than with the untrimmed database.

However, the size of the database will still cause some lag downstream compared to smaller databases. Is this database clustered into OTUs to reduce redundancy? If not, you might want to check out RESCRIPt to cluster the database, saving loads of time downstream.
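To illustrate why reducing redundancy helps, here is a toy Python sketch that collapses exactly identical sequences. This is not RESCRIPt's implementation, only the simplest form of the idea; the records are made up, and a real tool also has to decide what to do with the taxonomy of the sequences it collapses, which this ignores.

```python
from collections import defaultdict


def dereplicate(records):
    """Collapse identical sequences, keeping the IDs that shared each one."""
    ids_by_sequence = defaultdict(list)
    for seq_id, sequence in records:
        # Compare case-insensitively so 'acgt' and 'ACGT' count as the same.
        ids_by_sequence[sequence.upper()].append(seq_id)
    return dict(ids_by_sequence)


# Made-up toy records: (identifier, sequence)
records = [
    ("seq1", "ACGTACGT"),
    ("seq2", "ACGTACGT"),
    ("seq3", "ACGTTCGT"),
    ("seq4", "acgtacgt"),
]

unique = dereplicate(records)
print(len(records), "->", len(unique))  # 4 -> 2
```

Every downstream step then works on the smaller set of unique sequences, which is where the time savings would come from.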