How to retain only species- or genus-level features in a feature table

Hello! I’m trying to use the qiime taxa filter-table command to retain only species-level or only genus-level features. This is the command from the tutorial:
qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-exclude mitochondria \
  --o-filtered-table table-no-mitochondria.qza
The tutorial also describes retaining only features that contain a phylum-level annotation. How can I filter the table to retain only features annotated at genus or species level? Please write the command.
Thank you

Hi!
I think you can try this command:

qiime taxa filter-table \
  --i-table full-table.qza \
  --i-taxonomy taxonomy.qza \
  --p-include g__ \
  --o-filtered-table filtered-table.qza

This worked for me with reads annotated against SILVA 138. If you are using another database, check which prefix it uses for the genus level and replace ‘g__’ with that prefix.
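One way to check the prefix is to look at the exported taxonomy file. The sketch below is only an illustration: the sample rows are made up, and `list_rank_prefixes` is a hypothetical helper name. It assumes a tab-separated `taxonomy.tsv` with the taxonomy string in the second column and ranks separated by semicolons.

```shell
# Hypothetical sample standing in for an exported taxonomy.tsv (made-up IDs and values).
# A real file can be produced with:
#   qiime tools export --input-path taxonomy.qza --output-path exported-taxonomy
taxonomy_tsv="$(mktemp)"
printf 'Feature ID\tTaxon\tConfidence\n' > "$taxonomy_tsv"
printf 'feat1\td__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia\t0.98\n' >> "$taxonomy_tsv"
printf 'feat2\td__Bacteria; p__Proteobacteria\t0.91\n' >> "$taxonomy_tsv"

# List the distinct rank prefixes (letters followed by "__") found in the Taxon column.
list_rank_prefixes() {
  awk -F'\t' 'NR > 1 {
    n = split($2, ranks, /; */)
    for (i = 1; i <= n; i++) {
      if (match(ranks[i], /^[A-Za-z]+__/)) print substr(ranks[i], 1, RLENGTH)
    }
  }' "$1" | sort -u
}

list_rank_prefixes "$taxonomy_tsv"
```

If `g__` shows up in the list, the command above should apply as written; otherwise substitute whatever prefix your database uses for genus.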


Hi @Rm733,

I’m going to build on @timanix’s excellent answer. I think that command will retain anything containing a g__ string, even if the full string is g__; (meaning the genus is unlabeled). You could also add the --p-exclude "g__;" flag, which I think should drop features whose genus is unannotated as well.
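To make the difference concrete, here is a rough sketch that mimics the two filters with plain substring matching in `grep`. It is an approximation for illustration, not how QIIME 2 implements the filter; the taxonomy strings are made up in a Greengenes-like style, and the function names are hypothetical.

```shell
# Made-up Greengenes-style taxonomy strings, one per line.
taxa='k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Blautia; s__
k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__; s__
k__Bacteria; p__Proteobacteria'

# Approximates --p-include g__ : keep strings that contain "g__".
include_only() {
  printf '%s\n' "$taxa" | grep -F 'g__'
}

# Approximates --p-include g__ combined with --p-exclude "g__;" :
# additionally drop strings whose genus slot is empty ("g__;").
include_and_exclude() {
  include_only | grep -v -F 'g__;'
}

echo "Kept by include alone:"
include_only
echo "Kept by include plus exclude:"
include_and_exclude
```

With include alone, the string with the empty `g__;` slot is still kept; adding the exclude term leaves only the feature with a named genus. Note that this only helps for databases that write empty ranks explicitly; if your database simply omits unannotated ranks, the include term on its own may already be enough.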

However, I think there’s another issue: just because you can and you have a command, should you? (I think there’s a Jurassic Park quote in here somewhere :t_rex:). Your ability to do this effectively depends on a lot of things. You need to work in an environment where the reference databases are well annotated at genus and species level. (Specialized environmental databases may be more helpful here, but can come with drawbacks for things like naive Bayesian classification.)

With more general databases (like Greengenes), a lot of reference sequences are actually unannotated in the database but are still biologically meaningful. (f. Lachnospiraceae and f. Christensenellaceae both come to mind.) Filtering to exclude these sequences might discard important information about your features. Certain organisms are really hard to resolve biologically (E. coli/Shigella is a classic example) and have been historically annotated at family level because taxonomy is a mess in general. You also have databases where the species assignment is simply unreliable (Silva) because it’s not curated.

And, overall, the deeper you get into the database and phylogeny, the more disagreement there is between different sets of names. Taxonomy is one more major frontier. It doesn’t mean that a :rose: called f__Rosaceae would smell less sweet.

Best,
Justine

