Taxonomy assignment issues

Hello Everyone,

I am a new QIIME 2 user. I tried to assign taxonomy to my features, and so far I have tried:
sklearn, but at a certain point it crashes (a memory problem: I have only 8 GB on my machine and I allocated 4 GB to my VM).
I have also tried BLAST and vsearch. With BLAST I was obliged to quit the process after one hour (too long), and the same happened with vsearch.

Is there any other way to get my taxonomy assignment? Could I use QIIME 1 (when I used QIIME 1 last year, it worked)?

Thanks in advance.


Hi @BiOMan,

Runtime is really going to depend on:

  1. the number of query sequences (features)
  2. the length of query/reference sequences
  3. the number of reference sequences

1 hour is really not much time for many datasets, depending on the upstream analysis steps. I would recommend letting those jobs run a little longer.
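
To get a feel for how those three factors interact, here is a rough back-of-the-envelope sketch. It assumes brute-force all-vs-all pairwise alignment, which neither BLAST nor vsearch actually performs (both use heuristics to skip most comparisons), so treat it only as a crude upper bound on relative work. The function name and all of the numbers are hypothetical.

```python
def alignment_work(n_queries: int, n_refs: int, query_len: int, ref_len: int) -> int:
    """Crude upper bound: alignment-matrix cells for all-vs-all pairwise alignment."""
    return n_queries * n_refs * query_len * ref_len


# Hypothetical dataset: 5,000 features of 250 nt against 300,000 full-length references.
full = alignment_work(5_000, 300_000, 250, 1_500)
# Same references trimmed to the amplicon region (250 nt instead of 1,500 nt).
trimmed = alignment_work(5_000, 300_000, 250, 250)
# A smaller reference database with a third as many sequences.
smaller_db = alignment_work(5_000, 100_000, 250, 1_500)

print(f"trimmed references: {full / trimmed:.0f}x less work")
print(f"smaller database:   {full / smaller_db:.0f}x less work")
```

The absolute numbers mean nothing; the point is only that each factor multiplies the others, so a modest change in any one of them can change runtime a lot.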

You could also try trimming your reference sequences to the same length/primer sites as the query sequences by using qiime feature-classifier extract-reads. That would speed up blast and vsearch.
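
If it helps to see what the trimming step does conceptually, here is a minimal pure-Python sketch. It is not the QIIME 2 implementation: it only handles exact matches of (possibly degenerate) primers, whereas real tools typically tolerate some mismatches. The function names, primers, and sequence are made up for illustration.

```python
import re
from typing import Optional

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a sequence, including degenerate IUPAC codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer: str) -> str:
    """Turn a degenerate primer into a regular expression."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_amplicon(reference: str, f_primer: str, r_primer: str) -> Optional[str]:
    """Return the region between the primers (primers excluded), or None if not found."""
    pattern = re.compile(
        primer_regex(f_primer) + "(.*?)" + primer_regex(reverse_complement(r_primer))
    )
    match = pattern.search(reference.upper())
    return match.group(1) if match else None


# Toy example with invented primers and a made-up reference sequence.
print(extract_amplicon("TTTTACGTACGGGCCCAAAGATTCCCCCC", "ACGTRC", "GGAATC"))  # GGGCCCAAA
```

After trimming, every reference sequence is only as long as the amplified region, so each comparison against a query is cheaper and the classifier only sees the part of the gene your reads actually cover.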

Using a smaller reference set, e.g., greengenes instead of silva, would also help.

Again, runtime is going to depend on several factors; when you did this a year ago you may have had fewer or shorter query sequences.

The uclust-based taxonomy classifier in qiime1 is much faster than other methods; you could give that a try if 1 hr is really too long to wait.

Good luck!

Hi @Nicholas_Bokulich

Thank you for your response, sorry I did not give you the details.

*My data set has only 9 samples (the number of query sequences is 380 759).

*For the reference sequences I am using SILVA-199-99-nb-classifier.qza. My mentors insist that I use the SILVA database because we are working on the gut microflora of Crohn’s patients, and in their view Greengenes is not appropriate because it is used more for ecology (can you confirm this statement?).

*For the third point, I did not really get it: do you mean that, for example, since I am working on the V3-V3 region, I need to trim the reference sequences to this region only instead of using the whole sequences in SILVA-199-99-nb-classifier.qza?

  • For the runtime, at the beginning I thought it was a bug, but it is clear to me now after your explanation.

  • We figured out a solution in my lab: I will run the sklearn classifier on a cluster (so I hope there will be no more memory issues), because after that I will have a data set of 50 patients.

Thanks a lot @Nicholas_Bokulich


Greengenes was not designed specifically for non-host-associated samples, nor was SILVA designed specifically for host-associated samples. Both are quite general, so I would disagree with that statement on the grounds of “they are used more for X”. However, there are other reasons folks might choose one over the other — SILVA has been updated much more recently, so taxonomy IDs may be better updated for your system of interest.

You do not need to trim the reference sequences, but doing so will reduce runtime and will (slightly) increase classification accuracy. See here for more details (we also have pre-trained V4 classifiers here if those are useful to you).

Excellent! Note that by using a cluster you can also take advantage of multiprocessing with the --p-n-jobs parameter. This will reduce runtime if you can harness multiple CPUs on your cluster.
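
As a rough sketch of how that might look when scripted on the cluster, the snippet below only assembles the command line and prints it. The file names (classifier.qza, rep-seqs.qza, taxonomy.qza), the helper function, and the choice of using half the available cores are all hypothetical; adjust them to your own setup.

```python
import os
import shlex


def classify_sklearn_command(classifier: str, reads: str, output: str, n_jobs: int = 1) -> list:
    """Build the argument list for qiime feature-classifier classify-sklearn."""
    return [
        "qiime", "feature-classifier", "classify-sklearn",
        "--i-classifier", classifier,
        "--i-reads", reads,
        "--p-n-jobs", str(n_jobs),
        "--o-classification", output,
    ]


# Hypothetical choice: use half of the cores visible to this process.
n_jobs = max(1, (os.cpu_count() or 1) // 2)
cmd = classify_sklearn_command("classifier.qza", "rep-seqs.qza", "taxonomy.qza", n_jobs)
print(shlex.join(cmd))

# To actually run it (requires an activated QIIME 2 environment):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

One caveat: more parallel jobs will probably also mean more memory in use at once, so if memory was the original problem it may be worth starting with a small number of jobs and increasing it gradually.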

Hi @BiOMan,

Just to add to @Nicholas_Bokulich's reply: if you check Extended Data Figure 2 from "A communal catalogue reveals Earth’s multiscale microbial diversity", you will see in panel b a comparison of how well the two databases recover/match sequences in closed-reference analysis, which could be used as a guideline for which database might be better for which kind of environment. In panel c, you will see that SILVA recruits more observed OTUs, while Greengenes recovers more phylogenetic diversity. This is due to how these references are created; if you are interested, I suggest reading more about how SILVA and Greengenes are actually built.

