I am trying to remove unwanted features from my feature table based on taxonomy.
I classified my features with the Greengenes classifier and want to remove features that have no assigned taxonomy or are assigned only at the domain level.
I managed to remove most of them using
However, my taxa barplot still shows taxa with the taxonomy 'd__Bacteria;p__;c__;o__;f__;g__;s__'.
How can I remove those? I have tried --p-exclude 'd__Bacteria;p__;c__;o__;f__;g__;s__' with both the 'exact' and 'contains' mode, but neither removed those taxa.
Note that when a taxonomic level is present in an annotation, its prefix appears in the string. For example, every feature with at least a phylum-level assignment contains the string p__ (note the double underscore). When a taxonomic assignment lacks phylum-level information, that string is absent. We can use that for filtering:
The option --p-include p__ means that all features whose annotation is not at least as specific as the phylum level are discarded. We need --p-mode contains so that the search term is matched against any part of the taxonomy string rather than the whole string (see also the filtering tutorial for more info and references).
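To make the matching behaviour concrete, here is a minimal Python sketch of an include/exclude text filter. It is only an illustration of substring versus exact matching, not QIIME 2's actual implementation; the function name filter_by_taxonomy, the feature IDs and the example taxonomy strings are all made up for this example.

```python
def filter_by_taxonomy(taxonomy, include=None, exclude=None, mode="contains"):
    """Return the feature IDs kept by a simple include/exclude text filter.

    taxonomy: dict mapping a feature ID to its taxonomy string.
    include / exclude: lists of search terms, or None to skip that step.
    mode: "contains" (substring match) or "exact" (whole-string match).
    """
    def matches(label, term):
        if mode == "exact":
            return label == term
        return term in label

    kept = []
    for feature_id, label in taxonomy.items():
        # A feature must match at least one include term, if any are given.
        if include is not None and not any(matches(label, t) for t in include):
            continue
        # A feature is dropped if it matches any exclude term.
        if exclude is not None and any(matches(label, t) for t in exclude):
            continue
        kept.append(feature_id)
    return kept


# Hypothetical annotations, for illustration only.
taxonomy = {
    "f1": "d__Bacteria; p__Firmicutes; c__Bacilli",
    "f2": "d__Bacteria",
    "f3": "Unassigned",
    "f4": "d__Bacteria; p__; c__; o__; f__; g__; s__",
}

kept = filter_by_taxonomy(taxonomy, include=["p__"], mode="contains")
print(kept)  # ['f1', 'f4']
```

In this sketch f2 and f3 are dropped because their strings contain no p__, while f4 is kept: its phylum has no name, but the p__ prefix is still present in the string.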
Thank you for the answer. Unfortunately, it doesn't work.
To give a bit more detail, before filtering I had two types of Bacteria-taxa that didn't have taxonomy assigned at the phylum level. In the phylum-level taxa barplot those were shown as
'd__Bacteria;' and 'd__Bacteria;p'
The first one is removed by --p-include p__ , but the second one is not.
The first one is indeed a feature without phylum level classification (and the command I provided to you removes them). The second one is a feature that is actually classified at the phylum level, but it doesn't have a name. You can read more about this in this post.
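The difference between the two cases can be checked directly on the taxonomy strings. The following Python sketch is only an illustration; the helper name phylum_status and the example strings are hypothetical, and it assumes ranks are separated by semicolons and that the phylum rank carries the p__ prefix.

```python
def phylum_status(label):
    """Classify a taxonomy string by its phylum-level annotation.

    Returns "missing" if there is no p__ rank at all, "unnamed" if the
    p__ prefix is present without a name, and "named" otherwise.
    """
    # Split on semicolons and ignore surrounding whitespace and empty parts.
    ranks = [part.strip() for part in label.split(";") if part.strip()]
    for rank in ranks:
        if rank.startswith("p__"):
            return "named" if len(rank) > len("p__") else "unnamed"
    return "missing"


print(phylum_status("d__Bacteria"))                  # missing
print(phylum_status("d__Bacteria; p__; c__; o__"))   # unnamed
print(phylum_status("d__Bacteria; p__Firmicutes"))   # named
```

A filter on p__ alone cannot separate "unnamed" from "named", because both contain the prefix.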
If you also want to get rid of those, one idea that comes to mind is to use RESCRIPt first to edit the taxonomy:
Hi,
Thanks for the explanation. I have managed to filter them by using --p-exclude 'd__Bacteria; p__; c__; o__; f__; g__; s__'. The space after each ; is what had been missing in my previous attempts, for which I had copied the taxon "name" directly from the taxa barplot. As I found the assignment to a "nameless" taxon a bit weird, I also pulled out the sequences and reclassified them using consensus-blast, with very different results: some were "Unassigned", but the majority were clearly assigned at least at the phylum level (some even down to species), and to quite different phyla as well.
I'm contemplating using consensus-blast instead of classify-sklearn for my whole dataset now.
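For anyone hitting the same spacing problem, this short Python sketch shows why the string copied from the barplot did not match in either mode. It is plain string comparison, not QIIME 2 code; the helper name normalize_label is hypothetical, and it assumes the stored taxonomy separates ranks with a semicolon followed by one space, as in the --p-exclude string that worked.

```python
stored = "d__Bacteria; p__; c__; o__; f__; g__; s__"   # as stored in the taxonomy
copied = "d__Bacteria;p__;c__;o__;f__;g__;s__"         # as copied from the barplot


def normalize_label(label):
    """Rewrite a taxonomy string so ranks are separated by '; '."""
    return "; ".join(part.strip() for part in label.split(";"))


exact_match = copied == stored            # whole-string comparison
contains_match = copied in stored         # substring comparison
fixed_match = normalize_label(copied) == stored

print(exact_match, contains_match, fixed_match)  # False False True
```

Without the spaces the copied string is neither equal to nor a substring of the stored one, so both 'exact' and 'contains' find nothing to exclude.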