Since I could not train my classifier, I opted to try an alignment-based method, VSEARCH, but the results were strange: most of my rep-seqs were not assigned to any taxonomic rank. Why is this happening? I used the reference database and the associated taxonomy that I obtained by following the tutorial mentioned above. Can both methods be combined? Only 6 out of 643 sequences were assigned to a taxonomic rank. Moreover, those sequences were assigned to ferns, whereas in my study I expect angiosperms from Africa.
Regarding the classifier: I will use my university's computing facilities and tell them that I need the amount of RAM you suggested.
Regarding the taxonomic assignment:
I think I uploaded the wrong file, but the workflow is the same. I am interested in flowering plants. In our study we are trying to reconstruct the network of interactions between plants and their pollinating birds (i.e. African sunbirds), so we sequence pollen collected from the birds' bills. Hence, I expect African plants in my results. For this purpose I downloaded Streptophyta (txid35493) from GenBank and then followed all the curation steps from the tutorial. Since I did not have enough computing power, I performed the taxonomic assignment with VSEARCH, with the following code:
However, only six out of 643 rep-seqs were assigned to any taxonomic rank, and these were ferns, so the results are meaningless for us. Any idea why this is happening? Can the curated database from RESCRIPt be combined with feature-classifier classify-consensus-vsearch?
Have you tried running BLAST on any of these sequences? If so, do the BLAST hits match the assignments of the sequences that were identified by the classifier? And for the sequences that were not assigned, what does BLAST return?
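In case it helps with the BLAST check, here is a minimal Python sketch for pulling the unassigned rep-seqs into their own FASTA file. It assumes the taxonomy and rep-seqs artifacts have been exported to a tab-separated table with `Feature ID` and `Taxon` columns and to a plain FASTA file, and that unassigned features carry the label `Unassigned`. The demo records below are made up purely for illustration; swap in your own exported files.

```python
import csv
import io


def unassigned_ids(taxonomy_handle):
    """Return feature IDs whose Taxon column is exactly 'Unassigned'."""
    reader = csv.DictReader(taxonomy_handle, delimiter="\t")
    return [row["Feature ID"] for row in reader
            if row["Taxon"].strip() == "Unassigned"]


def read_fasta(handle):
    """Yield (id, sequence) pairs; wrapped sequence lines are joined."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Keep only the first token of the header as the ID.
            seq_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def subset_fasta(fasta_handle, wanted_ids):
    """Return FASTA text containing only the requested IDs."""
    wanted = set(wanted_ids)
    return "".join(f">{seq_id}\n{seq}\n"
                   for seq_id, seq in read_fasta(fasta_handle)
                   if seq_id in wanted)


# Made-up demo data (hypothetical IDs, taxa, and sequences).
DEMO_TAXONOMY = (
    "Feature ID\tTaxon\tConsensus\n"
    "seq1\tStreptophyta; Polypodiopsida\t1.0\n"
    "seq2\tUnassigned\t1.0\n"
    "seq3\tUnassigned\t1.0\n"
)
DEMO_FASTA = ">seq1\nACGTACGT\n>seq2 some description\nGGCC\nTTAA\n>seq3\nTTTT\n"

ids_to_blast = unassigned_ids(io.StringIO(DEMO_TAXONOMY))
fasta_to_blast = subset_fasta(io.StringIO(DEMO_FASTA), ids_to_blast)
print(f"{len(ids_to_blast)} unassigned sequences")
print(fasta_to_blast, end="")
```

With real data, you would open the exported files with `open(...)` instead of `io.StringIO(...)` and write `fasta_to_blast` to a file that can be uploaded to BLAST.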
Would you be able to DM me a portion of your data, along with the sequence and taxonomy files you are using to classify your reads? You may have to share them via Dropbox or a similar service, as they might be too big to attach to a DM. I'd like to look into this further.
I am wondering whether we need to increase the number of iterations and/or the --p-perc-identity of extract-seq-segments. There have been times when I've had to run anywhere from 3 to 5 iterations. I only performed two iterations in the tutorial to keep things short and simple.
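To make the iteration idea concrete, here is a rough Python sketch that only builds the command lines for several rounds of `qiime rescript extract-seq-segments` without running them. The chaining is an assumption on my part: each round searches the previous round's unmatched sequences against the previous round's extracted segments. The file names, the default identity of 0.7, and the thread count are hypothetical placeholders, and the option names should be checked against `qiime rescript extract-seq-segments --help` for your installed version.

```python
def extract_segment_commands(input_seqs, seed_segments,
                             iterations=3, perc_identity=0.7, threads=4):
    """Build one extract-seq-segments command (as an argument list) per round.

    Assumption: round N+1 uses round N's unmatched sequences as input and
    round N's extracted segments as the reference.
    """
    commands = []
    query, reference = input_seqs, seed_segments
    for i in range(1, iterations + 1):
        extracted = f"extracted-segments-{i:02d}.qza"  # hypothetical name
        unmatched = f"unmatched-{i:02d}.qza"           # hypothetical name
        commands.append([
            "qiime", "rescript", "extract-seq-segments",
            "--i-input-sequences", query,
            "--i-reference-segment-sequences", reference,
            "--p-perc-identity", str(perc_identity),
            "--p-threads", str(threads),
            "--o-extracted-sequence-segments", extracted,
            "--o-unmatched-sequences", unmatched,
        ])
        # Feed this round's outputs into the next round.
        query, reference = unmatched, extracted
    return commands


# Print the commands for inspection; nothing is executed here.
for cmd in extract_segment_commands("streptophyta-seqs.qza",
                                    "seed-segments.qza"):
    print(" ".join(cmd))
```

The extracted segments from each round would still need to be merged into a single reference afterwards, and the matching taxonomy filtered to the same IDs, before classification.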
I will BLAST the sequences individually to compare the results, as you suggest! I will also adjust the pipeline as you suggested, increasing the perc-identity parameter of extract-seq-segments!
I will prepare the files and send them to you!