Training a large Custom Database

SoilRotifer · May 3, 2021, 4:12pm

Hi @URB, welcome to the forum!

That is an awful lot of reference reads. I'd suggest dereplicating the reference data to make the database much smaller. In fact, this is a great task for RESCRIPt; you can find more details in the RESCRIPt tutorial.

In particular, read the part about "Dereplication of sequences and taxonomy". If you know which amplicon region you are using, you can shrink the reference database even further by extracting that region from your reference database and then dereplicating the output. Sequences that differ only outside the amplicon become identical once trimmed, so they collapse into a single entry. This will save you a lot of memory and run time.
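
To make the idea concrete, here is a minimal Python sketch of what amplicon extraction followed by dereplication does conceptually. It is not RESCRIPt's implementation: the helper names (`extract_amplicon`, `dereplicate`, `lca`), the toy primers, and the toy sequences are all made up for illustration, primers are matched exactly (no mismatches or degenerate bases), and conflicting taxonomies for identical sequences are resolved by keeping the ranks they share, which is only one possible strategy.

```python
from collections import defaultdict


def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def extract_amplicon(seq, fwd_primer, rev_primer):
    """Return the region between exact primer matches, or None if a primer is absent."""
    start = seq.find(fwd_primer)
    if start == -1:
        return None
    start += len(fwd_primer)
    end = seq.find(reverse_complement(rev_primer), start)
    if end == -1:
        return None
    return seq[start:end]


def lca(taxonomies):
    """Keep only the leading ranks shared by all semicolon-delimited taxonomy strings."""
    shared = []
    for ranks in zip(*(t.split(";") for t in taxonomies)):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return ";".join(shared)


def dereplicate(records):
    """Collapse identical sequences; records maps id -> (sequence, taxonomy)."""
    groups = defaultdict(list)
    for seq_id, (seq, tax) in records.items():
        groups[seq].append((seq_id, tax))
    # The first id seen for each unique sequence becomes its representative.
    return {
        members[0][0]: (seq, lca([tax for _, tax in members]))
        for seq, members in groups.items()
    }


# Hypothetical toy data: four distinct full-length references.
FWD, REV = "ACGT", "TTGG"
reference = {
    "ref1": ("GGGACGTTTTAAACCAAC", "k__Bacteria;g__A;s__x"),
    "ref2": ("TTACGTTTTAAACCAAGG", "k__Bacteria;g__A;s__y"),
    "ref3": ("ACGTGGGCCCCCAA", "k__Bacteria;g__B;s__z"),
    "ref4": ("GGGGGGGG", "k__Bacteria;g__C;s__w"),  # no primer sites
}

# Step 1: extract the amplicon region, dropping references without both primers.
amplicons = {}
for seq_id, (seq, tax) in reference.items():
    region = extract_amplicon(seq, FWD, REV)
    if region is not None:
        amplicons[seq_id] = (region, tax)

# Step 2: dereplicate the extracted regions.
trimmed = dereplicate(amplicons)

print(len(dereplicate(reference)), "unique full-length sequences")
print(len(trimmed), "unique amplicon sequences")
```

In this toy set, `ref1` and `ref2` differ only in their flanking regions, so they stay separate at full length but merge after extraction, and `ref4` drops out because it has no primer sites. The same effect at the scale of a real reference database is where the memory and run-time savings come from.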

-Mike