In a dataset I am analyzing there are host mitochondrial DNA reads that are not being labeled by taxonomy assignment. I am trying to filter these out, but because they are not given a specific identifier I cannot filter using taxonomy-based filtering. I also cannot use metadata-based filtering because I just want to filter out features, not samples.
I identified the feature IDs that I want to remove using NCBI, for example, feature db0df656e84add2d179708d75b96b2b0. I just can’t find a tag that I can use in the filter-table command to indicate that I want this feature removed.
There are a couple of ways to do this, but none of them are perfect.
One option is to use a plug-in method like deblur denoise-16S, which uses a ‘positive filter’ that throws out any reads that don’t look like 16S reads.
Another option would be to filter your fastq / fasta files before you import them into a QIIME 2 artifact. For example, you could use bbduk2 or vsearch to rapidly scan all your reads, removing any that match your host mitochondrial DNA.
I’m sure there are more ways to do this. Let’s see what folks recommend.
What taxonomy labels are these features being given? I’m wondering if you’d be able to use the ‘qiime taxa filter-seqs’ option as per the tutorial here.
I haven’t used this myself, but I wonder if you could use the --p-include option to retain only features whose taxonomy contains, let’s say, ‘p__’, which would filter out everything that is not assigned at least a phylum. Alternatively, use the --p-exclude option to remove specific features.
Thanks @colinbrislawn! The problem is that these sequences are technically 16S, since they’re associated with mitochondria, so they were not recognized as host by deblur. I haven’t heard of bbduk2, I will definitely look into that because it seems like a useful tool.
@Mehrbod_Estaki, the sequences are annotated as unknown bacteria by the taxonomy classifier. So, using the taxonomy filter hasn’t helped because there isn’t any unique identifier to filter by.
deblur just does a rough filter, pretty much to toss out anything that does not look like 16S, so mitochondrial SSU will not get filtered out. You could use exclude-seqs (in the quality-control plugin) to filter out anything that aligns closely to mitochondrial reference sequences. Based on @colinbrislawn’s description, that sounds like the same functionality that bbduk2 would provide, and it would keep everything in QIIME 2 (to preserve provenance).
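To make the exclude-seqs suggestion concrete, here is a minimal sketch that only assembles the command line so you can inspect it before running anything. The artifact names (rep-seqs.qza, mito-ref.qza, hits.qza, misses.qza) are hypothetical placeholders, and the 0.97 thresholds are illustrative rather than recommended values; you would need to import your own mitochondrial reference sequences first.

```python
import shlex


def exclude_seqs_command(query, reference, hits, misses,
                         perc_identity=0.97, perc_query_aligned=0.97):
    """Build a `qiime quality-control exclude-seqs` command as an argv list.

    All file names are supplied by the caller; the thresholds are
    illustrative assumptions and should be tuned for your data.
    """
    return [
        "qiime", "quality-control", "exclude-seqs",
        "--i-query-sequences", query,          # your representative sequences
        "--i-reference-sequences", reference,  # host mitochondrial references
        "--p-method", "vsearch",
        "--p-perc-identity", str(perc_identity),
        "--p-perc-query-aligned", str(perc_query_aligned),
        "--o-sequence-hits", hits,      # sequences that aligned to the reference
        "--o-sequence-misses", misses,  # sequences that did not align
    ]


if __name__ == "__main__":
    cmd = exclude_seqs_command("rep-seqs.qza", "mito-ref.qza",
                               "hits.qza", "misses.qza")
    # Print a copy-pasteable shell command instead of executing it.
    print(shlex.join(cmd))
```

Here the misses output would hold the sequences that did not align to the mitochondrial reference. Note that this step works on sequences only, so the feature table still has to be filtered separately to match.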
You may want to filter out anything that is unclassifiable anyway, so taxonomy-based filtering should work; see the example here for filtering out anything that does not have a phylum-level label (or alternatively just use --p-exclude Unclassified).
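As a rough sketch of that taxonomy-based filter, the snippet below builds a `qiime taxa filter-table` command that keeps only features with a phylum-level label. The file names are hypothetical, and the ‘p__’ search term assumes your reference taxonomy uses Greengenes-style rank prefixes; adjust it if your classifier labels ranks differently.

```python
import shlex


def taxa_filter_table_command(table, taxonomy, output,
                              include=None, exclude=None):
    """Build a `qiime taxa filter-table` command as an argv list.

    `include` and `exclude` are lists of search terms; QIIME 2 takes
    them as comma-separated strings.
    """
    cmd = ["qiime", "taxa", "filter-table",
           "--i-table", table,
           "--i-taxonomy", taxonomy]
    if include:
        cmd += ["--p-include", ",".join(include)]
    if exclude:
        cmd += ["--p-exclude", ",".join(exclude)]
    cmd += ["--o-filtered-table", output]
    return cmd


if __name__ == "__main__":
    # Keep only features annotated to at least phylum level
    # (assumes 'p__' appears in phylum-level labels).
    cmd = taxa_filter_table_command("table.qza", "taxonomy.qza",
                                    "table-with-phyla.qza", include=["p__"])
    print(shlex.join(cmd))
```

Passing exclude=["Unclassified"] instead would give the --p-exclude variant mentioned above.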
But to answer your original question, it is possible to filter out specific feature IDs. Follow the same approach described at that link for filter-samples, but use filter-features instead; your metadata file will be a list of the feature IDs that you want to exclude. Pass --p-exclude-ids so that the listed features are removed rather than retained.
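A minimal sketch of that ID-based approach, assuming hypothetical file names (features-to-exclude.tsv, table.qza, filtered-table.qza): it writes a one-column metadata file with a feature-id header and then assembles the matching `qiime feature-table filter-features` command. The feature ID is the one from the original question.

```python
import csv
import shlex
from pathlib import Path


def write_feature_id_file(feature_ids, path):
    """Write a one-column, tab-separated metadata file listing feature IDs."""
    path = Path(path)
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["feature-id"])  # ID column header QIIME 2 recognizes
        for feature_id in feature_ids:
            writer.writerow([feature_id])
    return path


def filter_features_command(table, ids_file, output):
    """Build a command that drops the listed features from a feature table."""
    return [
        "qiime", "feature-table", "filter-features",
        "--i-table", str(table),
        "--m-metadata-file", str(ids_file),
        "--p-exclude-ids",  # remove the listed IDs instead of keeping them
        "--o-filtered-table", str(output),
    ]


if __name__ == "__main__":
    ids = ["db0df656e84add2d179708d75b96b2b0"]
    ids_file = write_feature_id_file(ids, "features-to-exclude.tsv")
    cmd = filter_features_command("table.qza", ids_file, "filtered-table.qza")
    # Print a copy-pasteable shell command instead of executing it.
    print(shlex.join(cmd))
```

If you also want the representative sequences to match the filtered table, they would need to be filtered in a separate step.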