Problem filtering features from a table

Hei,

I am trying to remove unwanted features from my feature table based on taxonomy.
I have classified my features with the Greengenes classifier and want to remove features with no assigned taxonomy or assigned only at the domain level.
I managed to remove most of them using

qiime taxa filter-table --i-table my_table.qza --i-taxonomy my_taxonomy.qza --p-exclude 'Unassigned','d__Bacteria','d__Archaea' --p-mode 'exact' --o-filtered-table filtered_table.qza

However, my taxa barplot still shows taxa with the taxonomy 'd__Bacteria;p__;c__;o__;f__;g__;s__'.

How can I remove those? I have tried --p-exclude 'd__Bacteria;p__;c__;o__;f__;g__;s__' with both the 'exact' and 'contains' mode, but neither removed those taxa.

Hi @k.kujala and welcome to the QIIME 2 forum!

So you want to keep features like:

  • d__Bacteria;p__Firmicutes;;;;;
  • d__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Neisseriales;f__Neisseriaceae;g__Neisseria;s__subflava

And remove features annotated like:

  • d__Bacteria;;;;;;
  • Unassigned;;;;;;

Note that when a taxonomic level is present in an annotation, its prefix appears in the string. For example, every feature with at least a phylum-level assignment contains the string p__ (note the double underscore). When an assignment lacks phylum-level information, that string is absent. We can use that for filtering:

qiime taxa filter-table \
  --i-table my_table.qza \
  --i-taxonomy my_taxonomy.qza \
  --p-include p__ \
  --p-mode contains \
  --o-filtered-table filtered_table.qza

The option --p-include p__ means that all features without an annotation at least as specific as phylum level will be discarded. We need --p-mode contains so that p__ is matched as a substring anywhere in the taxonomy string rather than against the whole string (see also the filtering tutorial for more info and references).
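To preview what a substring match on p__ keeps and drops, here is a minimal shell sketch that imitates the matching with grep. The sample annotations and the file name sample_taxonomy.txt are made up for illustration, and grep only mimics the idea of a "contains" match; it is not how QIIME 2 implements the filter.

```shell
# Hypothetical sample annotations, one per line (not real data).
cat > sample_taxonomy.txt <<'EOF'
d__Bacteria; p__Firmicutes
d__Bacteria; p__Proteobacteria; c__Betaproteobacteria
d__Bacteria
Unassigned
EOF

# Count lines containing the literal substring "p__"
# (analogous to --p-include p__ with --p-mode contains).
kept=$(grep -F -c 'p__' sample_taxonomy.txt)

# Count lines without that substring; these would be discarded.
dropped=$(grep -F -v -c 'p__' sample_taxonomy.txt)

echo "kept: $kept, dropped: $dropped"
```

With these four sample lines, the two annotations that carry a phylum prefix are kept and the other two are dropped.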

Best,

Sergio


Hei,

thank you for the answer. Unfortunately, it doesn't work.

To give a bit more detail, before filtering I had two types of Bacteria-taxa that didn't have taxonomy assigned at the phylum level. In the phylum-level taxa barplot those were shown as
'd__Bacteria;' and 'd__Bacteria;p__'
The first one is removed by --p-include p__ , but the second one is not.

The first one is indeed a feature without phylum level classification (and the command I provided to you removes them). The second one is a feature that is actually classified at the phylum level, but it doesn't have a name. You can read more about this in this post.

If you also want to get rid of those, one idea that comes to my mind is to use RESCRIPt first to edit the taxonomy:

qiime rescript edit-taxonomy \
    --i-taxonomy my_taxonomy.qza \
    --p-search-strings 'p__;' \
    --p-replacement-strings 'p__ExcludeMe;' \
    --o-edited-taxonomy my_taxonomy_edited.qza

And then exclude them directly with another qiime taxa filter-table (this time using --p-exclude):

qiime taxa filter-table \
  --i-table filtered_table.qza \
  --i-taxonomy my_taxonomy_edited.qza \
  --p-exclude p__ExcludeMe \
  --p-mode contains \
  --o-filtered-table double_filtered_table.qza
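The logic of this two-step approach (tag the empty phylum label, then exclude the tag) can be sketched on plain text with sed and grep. The sample annotations and file names below are assumptions for illustration only, and the tools stand in for the QIIME 2 commands rather than reproduce them.

```shell
# Hypothetical annotations (not real data); the second has an empty phylum label.
cat > sample_taxonomy2.txt <<'EOF'
d__Bacteria; p__Firmicutes; c__Bacilli
d__Bacteria; p__; c__; o__; f__; g__; s__
EOF

# Step 1: tag empty phylum labels, mimicking the replacement of
# "p__;" with "p__ExcludeMe;". Named phyla such as "p__Firmicutes;"
# do not contain "p__;" and are left untouched.
sed 's/p__;/p__ExcludeMe;/' sample_taxonomy2.txt > sample_taxonomy2_edited.txt

# Step 2: drop every line containing the tag, mimicking
# --p-exclude p__ExcludeMe with --p-mode contains.
remaining=$(grep -F -v 'p__ExcludeMe' sample_taxonomy2_edited.txt)

echo "$remaining"
```

Only the annotation with a named phylum survives. The trailing semicolon in the search string is what distinguishes an empty p__ from a named one.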

Best,

Sergio

Hei,
thanks for the explanation. I have managed to filter them by using --p-exclude 'd__Bacteria; p__; c__; o__; f__; g__; s__' (the space after each ; is what had been missing in my previous attempts, for which I had copied the taxon "name" directly from the taxa barplot). As I found the assignment to a "nameless" taxon a bit weird, I also pulled out the sequences and reclassified them using consensus-blast, with very different results: some were "Unassigned", but the majority were clearly assigned at least at the phylum level (some even down to species), and to quite different phyla as well.
I'm contemplating using consensus-blast instead of classify-sklearn for my whole dataset now.
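The effect of the missing space can be reproduced with a plain string comparison. This is a minimal sketch: the two strings are the ones quoted in this thread, while the variable names are made up for illustration.

```shell
# Annotation as stored in the taxonomy, with a space after each semicolon.
stored='d__Bacteria; p__; c__; o__; f__; g__; s__'

# Label as copied from the taxa barplot, without spaces.
copied='d__Bacteria;p__;c__;o__;f__;g__;s__'

# The copied label does not match the stored string.
if [ "$stored" = "$copied" ]; then match_raw=yes; else match_raw=no; fi

# Re-insert a space after each semicolon, then compare again.
fixed=$(printf '%s' "$copied" | sed 's/;/; /g')
if [ "$stored" = "$fixed" ]; then match_fixed=yes; else match_fixed=no; fi

echo "raw: $match_raw, fixed: $match_fixed"
```

The raw label fails to match and the re-spaced one succeeds. Exporting the taxonomy artifact (for example with qiime tools export) and reading the Taxon column is one way to check the exact stored strings before writing a filter.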

Thanks,
Katharina

Hi again @k.kujala

I'm happy you got it working!

If you want more information in order to make a decision, you may want to check this post:

Best,

Sergio