Training a large Custom Database

URB · May 3, 2021, 3:41pm

Hello,

I have created a custom DB that combines curated 16S reads for Bacteria and Archaea from SILVA and RDP with fungal ITS reads from UNITE. I curated each source separately, extracted the common and unique reads based on family nomenclature, and removed reads that do not have proper nomenclature. The goal is to reduce the chance of false positives.

I formatted the FASTA and taxonomy files, and successfully ran the classifier on the Archaea reads (SILVA + RDP) and on the UNITE reads individually.

However, when I run the combined bacterial reads from SILVA and RDP (2,615,591 reads) for testing purposes, the classifier takes too long to complete. The Archaea set also combines RDP and SILVA reads and ran properly, so formatting should not be the issue.

The job has been running on my organization's HPC high_mem node for the past 4 days. I am worried about how long the complete custom DB, with 3,320,193 reads, will take.

Do you have any suggestions for getting it done in time, or at least for checking whether it is running properly and whether this is a feasible way of doing it?

Here is the command I used:
#!/bin/bash -l

# -cwd # -q highmem.q
module load qiime2/latest

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads Bac_all.qza \
  --i-reference-taxonomy Bac_all_tax.qza \
  --o-classifier Bac_all_classifier.qza

SoilRotifer · May 3, 2021, 4:12pm

Hi @URB, welcome to QIIME 2!

That is an awful lot of reference reads. I'd suggest dereplicating the reference data to make the database much smaller. In fact, this is a great task for RESCRIPt. You can find more details here:

In particular, read the part about "Dereplication of sequences and taxonomy". If you know which amplicon region you are using, you can shrink the reference database even further by extracting that region from your reference sequences and then dereplicating the output. This will save you a lot of memory and run time.
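A rough sketch of that workflow, assuming RESCRIPt is installed alongside QIIME 2. The output file names, the primer placeholders, and the `DRY_RUN` switch are all made up for illustration; only the input names come from the command posted above. With `DRY_RUN=1` (the default here) the script only prints the commands, so you can check them before committing HPC time:

```shell
#!/bin/bash
# Sketch only: output names and primers are placeholders, not values from this thread.
set -e  # stop at the first failing step, since later steps need earlier outputs

# Hypothetical switch: 1 = print each command, anything else = execute it.
DRY_RUN="${DRY_RUN:-1}"
FWD_PRIMER="${FWD_PRIMER:-YOUR_FORWARD_PRIMER}"
REV_PRIMER="${REV_PRIMER:-YOUR_REVERSE_PRIMER}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Dereplicate sequences and taxonomy together (RESCRIPt 'uniq' mode).
# $1 = sequences .qza, $2 = taxonomy .qza, $3 = output prefix
dereplicate() {
  run qiime rescript dereplicate \
    --i-sequences "$1" \
    --i-taxa "$2" \
    --p-mode uniq \
    --o-dereplicated-sequences "${3}_seqs.qza" \
    --o-dereplicated-taxa "${3}_tax.qza"
}

# Trim the reference sequences to the amplified region.
# $1 = sequences .qza, $2 = forward primer, $3 = reverse primer, $4 = output .qza
extract_region() {
  run qiime feature-classifier extract-reads \
    --i-sequences "$1" \
    --p-f-primer "$2" \
    --p-r-primer "$3" \
    --o-reads "$4"
}

# Train the naive Bayes classifier.
# $1 = sequences .qza, $2 = taxonomy .qza, $3 = output classifier .qza
train() {
  run qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads "$1" \
    --i-reference-taxonomy "$2" \
    --o-classifier "$3"
}

dereplicate Bac_all.qza Bac_all_tax.qza Bac_all_derep
extract_region Bac_all_derep_seqs.qza "$FWD_PRIMER" "$REV_PRIMER" Bac_all_region.qza
dereplicate Bac_all_region.qza Bac_all_derep_tax.qza Bac_all_region_derep
train Bac_all_region_derep_seqs.qza Bac_all_region_derep_tax.qza Bac_all_region_classifier.qza
```

Set `DRY_RUN=0` and real primer sequences once the printed commands look right. If you do not know the amplicon region, the first `dereplicate` call followed by `train` on its output is the shorter path.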

-Mike

URB · May 10, 2021, 7:48pm

Hello Mike, I tried to install the minimal RESCRIPt environment using conda on the HPC, but I got the following error:

Preparing transaction: done
Verifying transaction: done
Executing transaction: failed

ERROR conda.core.link:_execute(502): An error occurred while installing package 'conda-forge::async_generator-1.10-py_0'.
FileNotFoundError(2, "No such file or directory: '/scixxx/home-pure/xxx/anaconda3/envs/rescript/bin/python3.6' ")
Attempting to roll back.

Rolling back transaction: done

FileNotFoundError(2, "No such file or directory: '/scixxx/home-pure/xxx/anaconda3/envs/rescript/bin/python3.6' ")

Do you have any suggestions, or is there an alternative way to install RESCRIPt?

Thank you - URB

SoilRotifer · May 10, 2021, 8:11pm

It looks like you are trying to install into the main HPC conda system? If so, you'll likely not be able to do this unless you have file permissions, and will have to ask the IT admins.

You can try installing conda into your user account, or at least allowing the HPC conda setup to read and use your own conda environments. This usually requires running a conda init command. I'd ask your IT admins for help on this.
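A quick way to check the permissions side before retrying is sketched below. The helper `can_create_env_at` and the default `ENV_DIR` path are hypothetical names chosen for illustration; the conda commands in the comments are standard ones, but how they interact with your cluster's module setup is something only your admins can confirm:

```shell
#!/bin/bash
# Hypothetical helper: succeeds if a conda environment could be created at $1,
# i.e. the closest existing ancestor of that path is a writable directory.
can_create_env_at() (
  dir=$1
  # Walk up until we reach a path that already exists.
  while [ ! -e "$dir" ]; do
    parent=$(dirname "$dir")
    [ "$parent" = "$dir" ] && break
    dir=$parent
  done
  [ -d "$dir" ] && [ -w "$dir" ]
)

# Placeholder location under the home directory; change it to suit your account.
ENV_DIR="${ENV_DIR:-$HOME/conda-envs/rescript}"

if can_create_env_at "$ENV_DIR"; then
  echo "writable: $ENV_DIR"
  # Possible next steps (not executed here):
  #   conda init bash                      # one-time shell setup, then re-login
  #   conda create --prefix "$ENV_DIR" ... # package list per the RESCRIPt install instructions
  #   conda activate "$ENV_DIR"
else
  echo "not writable: $ENV_DIR (pick a path you own or ask the HPC admins)"
fi
```

Installing with `--prefix` into a directory you own avoids touching the shared conda installation at all.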

I just double-checked both install options for RESCRIPt, and both are working as intended.

-Mike
