Combining different reference sequence databases

jacorvar · February 3, 2018, 2:43pm

Hi, I'm a newbie with QIIME2 and would like to analyse the diversity of a microbiome. In my particular case, I am interested not only in bacteria but also in fungi and other microorganisms. For this reason, I merged the FASTA files from the Greengenes 13_8 (99%), UNITE, and SILVA_128 (99%) databases so that I could capture as many taxa as possible. This huge (~800 MB) FASTA file was then imported with qiime:

qiime tools import \
        --type 'FeatureData[Sequence]' \
        --input-path $refSeqs/otus.fasta \
        --output-path otus.qza

I did the same for the "taxonomy" files from the same databases.

This ran OK, but when I then ran the classifier training, after a while I got the following:

$ qiime feature-classifier fit-classifier-naive-bayes \
        --i-reference-reads ref-seqs.qza \
        --i-reference-taxonomy ref-taxonomy.qza \
        --verbose \
        --o-classifier classifier.qza
Plugin error from feature-classifier: [Errno 28] No space left on device

Could this be happening because the input files containing the taxonomy and reference sequences are too big? If so, should I train a classifier for each database separately?

Thanks

Nicholas_Bokulich · February 3, 2018, 2:53pm

Hi @jacorvar,

You should not merge these databases. They cover different marker genes (16S rRNA, fungal ITS, 18S rRNA), but you will almost certainly be amplifying and sequencing a single marker gene at a time, so keeping them separate will increase the diagnostic power of each. For a given marker gene, you only want to classify against a database for that marker gene; otherwise the results at the other end may be garbage. For example, 16S rRNA gene primers should not amplify fungal ITS, so if you get hits to fungal ITS sequences for whatever reason, those results are meaningless. Merging reference datasets adds unnecessary noise, potentially decreasing accuracy.

Use these databases separately on the appropriate marker genes. For example, if you are sequencing ITS and 16S amplicons separately, use UNITE on the ITS data and Greengenes or SILVA on the 16S data.
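As a rough sketch of what "separately" could look like in practice, the snippet below trains one classifier per reference database, reusing the `fit-classifier-naive-bayes` command from the question. The file-name prefixes (`gg-16S`, `unite-ITS`) and the `train_classifier` helper are hypothetical placeholders, not names used by QIIME 2, and each pair of reference artifacts is assumed to have been imported already. By default it only prints the commands.

```shell
# Sketch: train one classifier per marker gene instead of one merged classifier.
# All file names are hypothetical placeholders.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

train_classifier() {
    # $1 = label used in the file names, e.g. "gg-16S" or "unite-ITS"
    name="$1"
    # Build the command as the function's positional parameters.
    set -- qiime feature-classifier fit-classifier-naive-bayes \
        --i-reference-reads "${name}-ref-seqs.qza" \
        --i-reference-taxonomy "${name}-ref-taxonomy.qza" \
        --o-classifier "${name}-classifier.qza"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One classifier per reference database, each used only on its own amplicons.
train_classifier gg-16S
train_classifier unite-ITS
```

Each resulting classifier would then be applied only to the amplicon dataset for its own marker gene.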

Yes. Even the SILVA database on its own is often too big for users to train on a personal computer (Greengenes and UNITE are usually fine on a laptop). We provide pre-trained classifiers for Greengenes and SILVA to save users the trouble (and memory requirements) of training their own. This is yet another reason not to merge multiple disparate reference datasets: there will be lots of redundant information (e.g., between SILVA and Greengenes), increasing computational demands while actually decreasing the quality of your results.
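For illustration, using a pre-trained classifier skips the training step entirely: you pass the downloaded artifact straight to `classify-sklearn`. This is a minimal sketch; the file names and the `classify_reads` helper are hypothetical, and it assumes the downloaded classifier matches your installed QIIME 2 release. By default it only prints the command.

```shell
# Sketch: classify reads with a downloaded pre-trained classifier (no training step).
# All file names are hypothetical placeholders.
# DRY_RUN=1 (the default here) only prints the command; set DRY_RUN=0 to run it.
DRY_RUN="${DRY_RUN:-1}"

classify_reads() {
    classifier="$1"   # pre-trained classifier artifact (.qza)
    reads="$2"        # representative sequences to classify (.qza)
    output="$3"       # where to write the taxonomy assignments (.qza)
    set -- qiime feature-classifier classify-sklearn \
        --i-classifier "$classifier" \
        --i-reads "$reads" \
        --o-classification "$output"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Classify 16S representative sequences against a pre-trained SILVA classifier.
classify_reads silva-pretrained-classifier.qza 16S-rep-seqs.qza 16S-taxonomy.qza
```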

I hope that helps!
