MemoryError When Training UNITE ver7 01.12.2017 Classifier

Nicholas_Bokulich · January 28, 2018, 2:36am

Hi @Sydney_Morgan,
Thanks for posting!

For bacterial 16S rRNA reads, we see a performance boost when the feature classifier is trained on extracted reads (reference sequences trimmed to the region your primers amplify) rather than on the near-full-length 16S rRNA gene sequences. For fungal ITS reads, we see a performance decrease upon extraction.

This is primarily because the reference database is composed of ITS sequences amplified with a range of different primers, so the sequences do not all cover the same region. Depending on the primers you choose, many reference sequences will fail to extract simply because the primer sequence is not present in those particular reference sequences, not necessarily because the primer does not amplify that species. It also does not help that UNITE trims its sequences to remove flanking rRNA gene regions (which contain primer sites); you must use the "developer" version of the database to retrieve the full-length reads (which still suffer from the issue I've described above).
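To make the extraction problem concrete, here is a minimal Python sketch of primer-based extraction on toy data. It is not the QIIME 2 implementation; the primers, sequences, and function names are all made up for illustration, and it only does exact (IUPAC-aware) matching with no mismatches allowed.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a sequence, including degenerate bases."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Compile a primer (possibly degenerate) into a regex."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def extract_amplicon(seq, fwd_primer, rev_primer):
    """Return the region between the two primer sites, or None if
    either primer site is missing from the reference sequence."""
    fwd = primer_regex(fwd_primer).search(seq)
    if fwd is None:
        return None
    rev = primer_regex(reverse_complement(rev_primer)).search(seq, fwd.end())
    if rev is None:
        return None
    return seq[fwd.end():rev.start()]


# Hypothetical toy primers and references (not real ITS primers).
FWD, REV = "GGAAGT", "TTCCAT"
refs = {
    "full_length": "CCGGAAGTACACACACATGGAAGG",    # both primer sites present
    "flanks_trimmed": "ACACACAC",                 # primer sites trimmed away
    "other_primer_set": "CCGGAAGTACACACACTTTT",   # reverse site absent
}

kept = {}
for name, seq in refs.items():
    amplicon = extract_amplicon(seq, FWD, REV)
    if amplicon is not None:
        kept[name] = amplicon

print(f"{len(kept)} of {len(refs)} reference sequences survived extraction")
```

In this toy set, only the reference that still carries both primer sites survives. The other two are dropped even though they contain the same internal region, which is the loss of reference coverage described above.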

How much memory are you allocating to the virtual machine? If you can allocate more, do. In my experience, it does not take much memory (< 8 GB) to train a UNITE classifier. An even lower chunk size may help; a different machine may be the last resort if all else fails. You could also check out these previous forum posts (here and here) to see if others have offered additional solutions.
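As a rough intuition for why a lower chunk size can help, here is a generic Python sketch of processing sequences in chunks. It is only an illustration of the idea, not how the classifier is actually implemented; the function names and the k-mer counting task are assumptions chosen for the example.

```python
from itertools import islice


def chunked(records, chunk_size):
    """Yield successive lists of at most chunk_size records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def count_kmers_in_chunks(seqs, k, chunk_size):
    """Count k-mers chunk by chunk.

    Returns the k-mer totals and the largest number of sequences
    held in memory at any one time."""
    totals = {}
    peak_in_memory = 0
    for chunk in chunked(seqs, chunk_size):
        peak_in_memory = max(peak_in_memory, len(chunk))
        for seq in chunk:
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                totals[kmer] = totals.get(kmer, 0) + 1
    return totals, peak_in_memory


seqs = ["ACGTAC", "GTACGT", "TTACGA"]  # toy sequences
small_totals, small_peak = count_kmers_in_chunks(seqs, k=3, chunk_size=1)
large_totals, large_peak = count_kmers_in_chunks(seqs, k=3, chunk_size=100)
print(small_totals == large_totals, small_peak, large_peak)
```

The result is the same for either chunk size; only the number of sequences held at once changes, which is why reducing the chunk size trades speed for a lower peak memory footprint.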

I hope that helps! Good luck!