Hi, I am trying to train a classifier on the complete RDP database in the QIIME 2 VM (QIIME 2 Core 2018.6):
I took the unaligned Bacteria 16S FASTA file from here (3.8 GB).
To start the workflow, I need the database split into a taxonomy file and an OTU (sequence) file, which I do with a Python script (a sketch follows the format examples below). It separates the single RDP file into two files in the expected format: an OTU file (3.2 GB) and a taxonomy file (333 MB).
Example taxonomy line:
494589 k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Acidimicrobiales; f__Acidimicrobiaceae; g__Acidimicrobium; s__
Corresponding OTU entry:
>494589
GCGGCGTGCTACACATGCAGTCGTACGCGGTGGCACACCGAGTGGCGAACGGGTGCGTAAC....
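The script itself is a single pass over the FASTA file. A simplified sketch (the header parsing on the marked line depends on the exact layout of the RDP export, so treat that line as an assumption to adapt):

# split_rdp.py -- split the RDP release into a sequence file and a taxonomy file
with open("rdp_bacteria_16S.fasta") as src, \
     open("rdp_otus.fasta", "w") as otus, \
     open("rdp_taxonomy.tsv", "w") as taxa:
    for line in src:
        if line.startswith(">"):
            # assumed header layout: ">ID<TAB>k__...; p__...; ...; s__"
            seq_id, lineage = line[1:].rstrip("\n").split("\t", 1)
            otus.write(">" + seq_id + "\n")
            taxa.write(seq_id + "\t" + lineage + "\n")
        else:
            otus.write(line)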
Importing these files into QIIME 2 artifacts works just fine.
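For reference, the import commands look roughly like this (note that 2018.6 still uses --source-format; newer releases renamed it to --input-format):

qiime tools import --type 'FeatureData[Sequence]' --input-path rdp_otus.fasta --output-path rdp_otus.qza
qiime tools import --type 'FeatureData[Taxonomy]' --source-format HeaderlessTSVTaxonomyFormat --input-path rdp_taxonomy.tsv --output-path rdp_taxa.qza

With both artifacts in place, I am able to start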
qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads rdp_otus.qza --i-reference-taxonomy rdp_taxa.qza --o-classifier rdp_classifier.qza
After a while I get the following error (both on screen and in the log file):
indices and data should have the same size
To make sure I didn't mess anything up in the conversion step, I checked that the IDs are identical in both files.
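The check boils down to a set comparison over the two files; roughly:

ids_fasta = {line[1:].strip() for line in open("rdp_otus.fasta") if line.startswith(">")}
ids_taxa = {line.split("\t", 1)[0] for line in open("rdp_taxonomy.tsv")}
print(ids_fasta == ids_taxa)          # True for my files
print(len(ids_fasta), len(ids_taxa))  # the counts match as well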
Furthermore, I tried the same command on subsets of the files (for example one third of the records, or five sixths, ...), and those runs completed without problems (see the subsetting sketch below).
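The subsetting keeps the two files in sync; a simplified sketch (N is an arbitrary cutoff, file names as above):

# take the first N taxonomy records and the matching sequences
N = 500000  # arbitrary cutoff
keep = set()
with open("rdp_taxonomy.tsv") as src, open("subset_taxonomy.tsv", "w") as dst:
    for i, line in enumerate(src):
        if i == N:
            break
        keep.add(line.split("\t", 1)[0])
        dst.write(line)
with open("rdp_otus.fasta") as src, open("subset_otus.fasta", "w") as dst:
    write = False
    for line in src:
        if line.startswith(">"):
            write = line[1:].strip() in keep
        if write:
            dst.write(line)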
It seems to me that the amount of data is simply getting too big, but there is no explicit memory error (and my RAM appears to be sufficient).
Has anyone seen this kind of error message before?
What am I doing wrong?
Is there some limit on the maximum file size or the number of sequences?
Thank you for your help!