feature-classifier fit-classifier-naive-bayes optimisation

Yikes! Midori must be a massive database. You may want to consider dereplicating or clustering it to remove redundant records. QIIME 2 does not really have a way to do this yet (you can cluster the sequences, but not their taxonomies), but the discussion in this post has some ideas:

If you are interested in doing that, I actually just thought of a way to do this with QIIME 2:

  1. Use q2-vsearch to cluster the reference sequences (or dereplicate)
  2. Use q2-feature-classifier's classify-consensus-vsearch to assign a consensus taxonomy to those clustered sequences, using the original database as the reference. If you use the same percent identity for clustering and for taxonomy assignment, the consensus taxonomy will be built from more or less the same sequences that were clustered together.
  3. You have your new reference sequences and taxonomy to use for classification!
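
To make the idea behind steps 1 and 2 concrete, here is a rough pure-Python sketch of what "collapse redundant records and give each one a consensus taxonomy" means. It is not the QIIME 2 or vsearch implementation: it only merges exact duplicate sequences (a stand-in for clustering at a percent identity), the function names `consensus_taxonomy` and `dereplicate` are made up for illustration, and the 0.51 agreement threshold is an assumed value.

```python
from collections import Counter, defaultdict


def consensus_taxonomy(lineages, min_consensus=0.51):
    """Return the deepest lineage prefix that at least `min_consensus`
    of the input lineages agree on (hypothetical helper)."""
    split = [[rank.strip() for rank in lin.split(";")] for lin in lineages]
    consensus = []
    depth = 0
    while True:
        # Count whole prefixes (not single ranks) so agreement is lineage-aware.
        prefixes = Counter(tuple(s[: depth + 1]) for s in split if len(s) > depth)
        if not prefixes:
            break  # no lineage is annotated this deep
        best, count = prefixes.most_common(1)[0]
        if count / len(split) < min_consensus:
            break  # not enough agreement at this rank; stop here
        consensus = list(best)
        depth += 1
    return "; ".join(consensus) if consensus else "Unassigned"


def dereplicate(seqs, taxonomy, min_consensus=0.51):
    """Merge identical sequences and assign each group a consensus taxonomy.

    `seqs` maps record ID -> sequence, `taxonomy` maps record ID -> lineage.
    Exact matching is a simplification of percent-identity clustering.
    """
    groups = defaultdict(list)
    for seq_id, seq in seqs.items():
        groups[seq.upper()].append(seq_id)

    new_seqs, new_tax = {}, {}
    for seq, ids in groups.items():
        rep = ids[0]  # first record seen represents the group
        new_seqs[rep] = seq
        new_tax[rep] = consensus_taxonomy(
            [taxonomy[i] for i in ids], min_consensus
        )
    return new_seqs, new_tax


# Toy data, invented for this example.
seqs = {"r1": "ACGT", "r2": "acgt", "r3": "ACGT", "r4": "TTGA"}
taxonomy = {
    "r1": "k__A; p__B; g__X",
    "r2": "k__A; p__B; g__Y",
    "r3": "k__A; p__B; g__X",
    "r4": "k__A; p__C",
}
new_seqs, new_tax = dereplicate(seqs, taxonomy)
```

In the toy data, r1, r2 and r3 collapse into one record; two of the three agree on the genus, so the group keeps it. If only r1 and r2 were present, the genus would be dropped and the lineage would stop at the phylum, which is the trade-off you accept when collapsing records with conflicting annotations.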

It will take time, but this process can be parallelized.
