So I have been following the tutorial "Using RESCRIPt's 'extract-seq-segments' to extract reference sequences without PCR primer pairs" in order to curate my reference database. In theory everything went okay. However, when I tried to train a classifier on my curated database, the code crashed. It tells me that I do not have enough memory, even though I am running the analysis on quite a powerful laptop (output of the free -h --giga command):
I am running QIIME 2 on WSL2 and I installed the latest version of it. Is there anything I could do to run the analysis? Or do you recommend using a computing facility?
Since I could not train my classifier, I opted to try alignment-based methods like VSEARCH; however, the results were weird. Most of my rep sequences were not assigned to any taxonomic rank. Why is this happening? For this I used the reference database and the associated taxonomy that I obtained in the tutorial mentioned above. Can both methods be combined? Only 6 out of 643 sequences were assigned to a taxonomic rank. Moreover, those sequences were assigned to ferns, and in my study I expect angiosperms from Africa.
Regarding the classifier, I will use the computing facilities of my university and I will tell them that I need the RAM you proposed.
Regarding the taxonomic assignment:
I think I uploaded the incorrect file, but the workflow is the same. I am interested in flowering plants. In our study we are trying to reconstruct the network of interactions between plants and their pollinating birds (i.e. African sunbirds). To do this, we sequence pollen collected from the birds' bills, so I expect African plants in my results. For this purpose I downloaded Streptophyta (txid35493) from GenBank and then followed all the curation steps from the tutorial. Since I did not have enough computing power, I performed the taxonomic assignment with VSEARCH, with the following code:
This is the correct file to produce the curated reference library: Reference library.py (7.3 KB)
However, only six out of 643 rep-seqs were assigned to any taxonomic rank, and these were ferns, so these results are totally meaningless for us. Any ideas why this is happening? Can the curated database from RESCRIPt be combined with feature-classifier classify-consensus-vsearch?
Have you tried BLAST on any of these sequences? If so, do they match the assignments of the sequences that were identified by the classifier? For the others that were not assigned, what do they return as?
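To decide which sequences to BLAST, it can help to split the feature IDs into assigned and unassigned groups first. Below is a minimal Python sketch, assuming the taxonomy has been exported to a tab-separated file with "Feature ID" and "Taxon" columns and that unclassified features are labeled "Unassigned". The function name and the three example rows are made up for illustration and are not taken from your data.

```python
import csv
import io


def split_by_assignment(taxonomy_tsv):
    """Return (assigned, unassigned) lists of feature IDs from a taxonomy table.

    `taxonomy_tsv` is any iterable of lines, e.g. an open file handle.
    """
    assigned, unassigned = [], []
    reader = csv.DictReader(taxonomy_tsv, delimiter="\t")
    for row in reader:
        feature_id = row["Feature ID"]
        # Skip comment/directive rows such as "#q2:types", if present.
        if feature_id.startswith("#"):
            continue
        taxon = row["Taxon"].strip()
        if taxon == "" or taxon.lower() == "unassigned":
            unassigned.append(feature_id)
        else:
            assigned.append(feature_id)
    return assigned, unassigned


# Made-up example rows (hypothetical feature IDs and taxonomy strings).
example = io.StringIO(
    "Feature ID\tTaxon\tConsensus\n"
    "feat1\tk__Viridiplantae; p__Streptophyta\t1.0\n"
    "feat2\tUnassigned\t1.0\n"
    "feat3\tUnassigned\t1.0\n"
)
assigned, unassigned = split_by_assignment(example)
print(f"{len(assigned)} assigned, {len(unassigned)} unassigned")
```

With a real file you would pass an open handle instead of the in-memory example, then look up the unassigned IDs in your rep-seqs and BLAST those.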
Would you be able to DM me a portion of your data, along with the sequence and taxonomy files you are using to classify your reads? You may have to share them via Dropbox or a similar service, as they might be too big to attach to a private message. I'd like to look into this further.
I am wondering if we need to increase the number of iterations, and/or the --p-perc-identity of extract-seq-segments? There have been times I've had to run anywhere from 3-5 iterations. I only performed two iterations in the tutorial to keep things short and simple.
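As a rough illustration of what running more iterations could look like, here is a small Python sketch that only builds and prints the command lines and does not run anything. It assumes each pass reuses the previous pass's extracted segments as the reference segments and searches the full input again. The file names, the helper function, and the example identity value of 0.8 are hypothetical, and the flags are worth double-checking against `qiime rescript extract-seq-segments --help` for your installed version.

```python
import shlex


def build_iteration_commands(input_seqs, seed_segments, n_iterations,
                             perc_identity, threads=4):
    """Build one extract-seq-segments command (as an argument list) per pass."""
    commands = []
    reference = seed_segments
    for i in range(1, n_iterations + 1):
        # Hypothetical output names, numbered by iteration.
        extracted = f"extracted-segments-{i:02d}.qza"
        unmatched = f"unmatched-{i:02d}.qza"
        commands.append([
            "qiime", "rescript", "extract-seq-segments",
            "--i-input-sequences", input_seqs,
            "--i-reference-segment-sequences", reference,
            "--p-perc-identity", str(perc_identity),
            "--p-threads", str(threads),
            "--o-extracted-sequence-segments", extracted,
            "--o-unmatched-sequences", unmatched,
        ])
        # Assumption: the next pass uses this pass's extracted segments
        # as its reference segment sequences.
        reference = extracted
    return commands


# Hypothetical input names; three passes at an example identity of 0.8.
for cmd in build_iteration_commands("genbank-seqs.qza", "seed-segments.qza", 3, 0.8):
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to check the chaining between passes before committing to a long run.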
I will BLAST the sequences individually to compare the results, as you suggest! I will also adjust the pipeline as you suggested, increasing the perc-identity parameter of extract-seq-segments!
I will prepare the files and send them to you!