I am a new QIIME 2 user. I am trying to assign taxonomy to my features, and so far I have tried:
sklearn, but at a certain point it crashes (a memory problem: I have only 8 GB on my machine and I allocated 4 GB to my VM).
I have also tried BLAST and VSEARCH. With BLAST I was obliged to quit the process after one hour (too long), and the same happened with VSEARCH.
Is there any other way to get my taxonomy assignment? Could I use QIIME 1 (when I used QIIME 1 last year, it worked)?
Thank you for your response, and sorry I did not give you the details.
*My data set is only 9 samples (the number of query sequences is 380,759).
*For the reference sequences I am using SILVA-199-99-nb-classifier.qza (my mentors insist that I use the SILVA database because we are working on the gut microflora of Crohn’s patients, and in their view Greengenes is not an appropriate database because it is used more for ecology; can you confirm this statement?).
*I did not really get the third point: do you mean that, for example, since I am working on the V3-V3 region, I need to trim the reference sequences to only this region instead of using the whole sequences in SILVA-199-99-nb-classifier.qza?
As for the runtime, at the beginning I thought it was a bug, but it is clear to me now after your explanation.
We figured out a solution in my lab: I will run the sklearn algorithm on a cluster (so I hope there will be no more memory issues), because after that I will have a data set of 50 patients.
Greengenes was not designed specifically for non-host-associated samples, nor was SILVA designed specifically for host-associated samples. Both are quite general, so I would disagree with that statement on the grounds of “they are used more for X”. However, there are other reasons folks might choose one over the other — SILVA has been updated much more recently, so taxonomy IDs may be better updated for your system of interest.
You do not need to trim the reference sequences to your region, but doing so will reduce runtime and will (slightly) increase classification accuracy. See here for more details (we also have pre-trained V4 classifiers here if those are useful to you).
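To make the trimming step concrete, here is a minimal sketch in Python that only assembles and prints the two QIIME 2 commands involved (extract-reads, then fit-classifier-naive-bayes). The file names and the output prefix are hypothetical, and the primers are the commonly used 341F/805R pair for V3-V4, included purely as placeholders; substitute the primers from your own sequencing run. Note that this starts from the reference sequences and taxonomy as separate artifacts, since an already-trained classifier file cannot itself be trimmed.

```python
import shlex

# Assumed placeholder primers (341F / 805R, V3-V4); replace with your own.
FWD_PRIMER = "CCTACGGGNGGCWGCAG"
REV_PRIMER = "GACTACHVGGGTATCTAATCC"


def build_training_commands(ref_seqs, ref_taxonomy, fwd, rev, prefix="silva-trimmed"):
    """Return two commands: trim the reference to the amplicon, then train on it."""
    trimmed = f"{prefix}-ref-seqs.qza"
    classifier = f"{prefix}-nb-classifier.qza"
    # Step 1: cut each reference sequence down to the region between the primers.
    extract = [
        "qiime", "feature-classifier", "extract-reads",
        "--i-sequences", ref_seqs,
        "--p-f-primer", fwd,
        "--p-r-primer", rev,
        "--o-reads", trimmed,
    ]
    # Step 2: train a naive Bayes classifier on the trimmed reference reads.
    fit = [
        "qiime", "feature-classifier", "fit-classifier-naive-bayes",
        "--i-reference-reads", trimmed,
        "--i-reference-taxonomy", ref_taxonomy,
        "--o-classifier", classifier,
    ]
    return [extract, fit]


commands = build_training_commands(
    "silva-ref-seqs.qza",      # hypothetical file name
    "silva-ref-taxonomy.qza",  # hypothetical file name
    FWD_PRIMER,
    REV_PRIMER,
)
for cmd in commands:
    # Print a copy-pasteable shell command; nothing is executed here.
    print(shlex.join(cmd))
```

To actually run the steps rather than print them, each list could be passed to subprocess.run(cmd, check=True) in an environment where QIIME 2 is installed.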
Excellent! Note that by using a cluster you can also take advantage of multiprocessing with the --p-n-jobs parameter. This will reduce runtime if you can use multiple CPUs on your cluster.
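As a rough illustration of the --p-n-jobs suggestion, the sketch below builds a classify-sklearn command in Python and prints it without running anything. The artifact file names are hypothetical, and the job count of 4 is only an example; on a shared cluster it should match the number of CPUs your scheduler actually gave you. Keep in mind that memory use tends to grow with the number of jobs, so more jobs is not always better on a memory-limited node.

```python
import os
import shlex


def build_classify_command(classifier, rep_seqs, output, n_jobs=None):
    """Assemble a classify-sklearn command that uses n_jobs parallel workers."""
    if n_jobs is None:
        # Fall back to the CPU count Python can see; this may exceed what a
        # cluster scheduler has allocated, so passing n_jobs explicitly is safer.
        n_jobs = os.cpu_count() or 1
    return [
        "qiime", "feature-classifier", "classify-sklearn",
        "--i-classifier", classifier,
        "--i-reads", rep_seqs,
        "--p-n-jobs", str(n_jobs),
        "--o-classification", output,
    ]


cmd = build_classify_command(
    "classifier.qza",  # hypothetical: your trained classifier
    "rep-seqs.qza",    # hypothetical: your representative sequences
    "taxonomy.qza",    # hypothetical: output file name
    n_jobs=4,
)
# Print a copy-pasteable shell command; nothing is executed here.
print(shlex.join(cmd))
```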
Just to add to @Nicholas_Bokulich's reply: if you check Extended Data Figure 2 of "A communal catalogue reveals Earth’s multiscale microbial diversity", panel b compares how well the two databases recover/match sequences when doing closed-reference picking, which could serve as a guideline for which database might be better for which kind of environment. Panel c shows that SILVA recruits more observed OTUs, while Greengenes recovers more phylogenetic diversity; this is due to how these references are created. If you are interested, I suggest reading more about how SILVA and Greengenes are actually created.