Hi, I’m a newbie with QIIME2 and would like to analyse the diversity of a microbiome. In my particular case, I am interested not only in bacteria but also in fungi and other microorganisms. For this reason, I merged the fasta files from the Greengenes 13_8 (99%), UNITE and SILVA_128 (99%) databases, so that I could capture as much as possible. This huge (~800 MB) fasta file was then imported with qiime:
You should not merge these databases. They cover different marker genes (16S rRNA, fungal ITS, 18S rRNA), but you will almost certainly be amplifying/sequencing a single marker gene at a time, so keeping them separate will increase the diagnostic power of each. For a given marker gene, you only want to classify against a database for that marker gene; otherwise the results at the other end may be garbage. For example, 16S rRNA gene primers should not amplify fungal ITS, so if you get hits to fungal ITS sequences for whatever reason, those results are meaningless. Merging reference datasets adds unnecessary noise, potentially decreasing accuracy.
Use these databases separately on the appropriate marker genes. E.g., you are probably sequencing ITS and 16S amplicons separately: use UNITE on the ITS dataset and Greengenes or SILVA on the 16S dataset.
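As a rough sketch of what "separately" could look like in practice, the Python snippet below builds one `qiime feature-classifier classify-sklearn` command per marker gene, each pointing at its own classifier. All file names here are hypothetical placeholders, not files shipped with QIIME 2, and the snippet only prints the commands rather than running them.

```python
import shlex

# Hypothetical file names: one classifier per marker gene, never a merged one.
CLASSIFIERS = {
    "16S": "silva-16S-classifier.qza",
    "ITS": "unite-ITS-classifier.qza",
}


def classify_command(marker, rep_seqs, output):
    """Build the classify-sklearn command for a single marker-gene dataset."""
    if marker not in CLASSIFIERS:
        raise ValueError(f"no classifier configured for marker {marker!r}")
    return [
        "qiime", "feature-classifier", "classify-sklearn",
        "--i-classifier", CLASSIFIERS[marker],
        "--i-reads", rep_seqs,
        "--o-classification", output,
    ]


# Print one command per dataset (input/output names are placeholders too).
for marker in CLASSIFIERS:
    cmd = classify_command(
        marker, f"rep-seqs-{marker}.qza", f"taxonomy-{marker}.qza"
    )
    print(shlex.join(cmd))
```

Each amplicon dataset is then classified only against the reference for its own marker gene, and asking for a marker with no matching classifier fails loudly instead of falling back to an unrelated database.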
Yes, even the SILVA database on its own is often too big for some users to train on their personal computers (Greengenes and UNITE are usually fine on a laptop). We provide pre-trained classifiers for Greengenes and SILVA to save some users the trouble (and memory requirements) of training their own. This is yet another reason not to merge multiple disparate reference datasets: there will be lots of redundant information (e.g., between SILVA and Greengenes), increasing computational demands while actually decreasing the quality of your results.