Notice how seq1 has some taxa assigned from phylum through to species, but seq2 is lacking species-level information, and seq3 is missing family, genus, and species-rank information?
It seems like the taxonomy-based filtering described here is the tool I’m looking for, but I wanted to confirm the appropriate syntax. The examples there show how to include features annotated at a given taxonomic rank (--p-include p__) and how to exclude a taxon that is present but unwanted (--p-exclude chloroplast).
In my case, I want to drop features using a prefix that isn’t distinct. I can’t just use --p-include s__ because that keeps everything, right? Is there support for a wildcard perhaps? Something like --p-exclude s__$? I’d probably need to be careful and also build in some kind of trailing whitespace handling too…
I don’t have an answer for you, just a philosophical question about your database. When I see your seq2 annotation, it suggests to me that your database (as formatted) doesn’t have a species-level assignment for that organism. I know you’re doing COI and I don’t know the completeness of the database, but I’d be curious about how many are there.
As you describe, a sequence can get assigned by a reference that was incomplete in the first place. To your question, yes, I have created COI databases where species information is lacking, so this is absolutely one possibility.
However, another situation can arise even if you filter that database ahead of time and require all references to have species information. This occurs when you apply an LCA consensus approach to the assigned taxonomy. If a sequence has two top hits with different species names, the resulting classification will be ambiguous.
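To make the LCA case concrete, here is a minimal sketch in plain Python. It is not the code any particular classifier uses; the function name, the example lineages, and the choice to keep the bare rank prefix (e.g. s__) where hits disagree are all assumptions for illustration.

```python
def lca_consensus(lineages):
    """Keep each rank while all hits agree; afterwards keep only the rank prefix."""
    split = [lineage.split(";") for lineage in lineages]
    consensus = []
    agreed = True
    for ranks in zip(*split):
        if agreed and len(set(ranks)) == 1:
            consensus.append(ranks[0])
        else:
            agreed = False
            # Assumption: rank labels start with a 3-character prefix such as "s__"
            consensus.append(ranks[0][:3])
    return ";".join(consensus)


# Hypothetical top hits that differ only at species level
hits = [
    "o__Diptera;f__Culicidae;g__Aedes;s__Aedes vexans",
    "o__Diptera;f__Culicidae;g__Aedes;s__Aedes cinereus",
]
consensus = lca_consensus(hits)
print(consensus)
```

Both references carry species names, yet the consensus ends in an empty s__ label.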
In either case I want to filter these records out. While my example was using species-rank filtering, I’m actually hoping to apply this to records that lack family-rank information. I just gave the species-level example as it seemed simpler.
For what purpose? Are you hoping to measure taxonomic completeness? Or do you want to count the number of features missing family level? Or do you want to wash those unclassified OTUs right out of your hair?
You could probably do something like `--p-exclude 'f__;g__'` if the taxonomies do not terminate at the rank where they are unclassified (as I believe is the case in your LCA example).
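As a rough illustration of why that substring works, here is a plain-Python sketch of substring exclusion. It only mimics the idea and is not the q2-taxa implementation; the taxonomy strings and variable names are made up, and the delimiter (with or without a space after the semicolon) has to match whatever your own taxonomy file uses.

```python
# Hypothetical taxonomy strings; empty ranks are kept as bare prefixes
taxonomy = {
    "seq1": "o__Diptera;f__Culicidae;g__Aedes;s__Aedes vexans",
    "seq2": "o__Diptera;f__Culicidae;g__Aedes;s__",
    "seq3": "o__Diptera;f__;g__;s__",
}

exclude = "f__;g__"  # an empty family label followed directly by the genus label

# Keep only features whose taxonomy string does not contain the substring
kept = sorted(fid for fid, taxon in taxonomy.items() if exclude not in taxon)
print(kept)
```

Only seq3 is dropped here, because an empty family is immediately followed by the next rank label. If a string simply stopped at f__, this substring would not catch it.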
Immediate purpose is to filter out features missing family level. After looking at this kind of data for a while, I've yet to come across an instance where an ASV or OTU feature was lacking at least family-level information and was itself highly abundant or occurred in repeated samples. I could filter using frequency (of detection)-based or abundance (of sequence)-based thresholds, and would likely filter out a similar grouping of features, but my motivation in looking at these samples was to retain features whose taxonomic identity I am more certain of.
The secondary purpose was to find out if wildcard filtering was even a thing. I didn't see anything in the filtering documentation that mentioned it, but perhaps that easter egg is in there somewhere... I think your example will certainly work in my specific use case, but I'm still unclear how I might filter out species-level ambiguous taxa. Not a problem for me right now, but just wanted to raise that question.
I did a bit of digging and I think yes, certain wildcard expressions are allowed, so you could do something like --p-include s__? to include only features with a species label (in this case I think ? will match any single character, whereas * will match zero or more so s__* will still match empty species).
q2-taxa is using an sqlite query under the hood:
So it looks like any sqlite wildcard other than _ can be used for pattern matching.
So I think I was mixing up my LIKEs and GLOBs when suggesting this:
Since filter-seqs (a) uses LIKE and (b) escapes _ characters, I think the short answer to this question:
is that filter-seqs does have wildcard filtering, but it is pretty limited since you can only use the % wildcard. That will not work for your unclassified species example.
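For anyone curious about the LIKE vs. GLOB difference, here is a small sketch using Python's built-in sqlite3 module. The table, column names, and queries are invented for the demonstration and are not the actual q2-taxa query; the only point is how SQLite treats the two operators.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxonomy (id TEXT, taxon TEXT)")
conn.executemany(
    "INSERT INTO taxonomy VALUES (?, ?)",
    [
        ("seq1", "f__Culicidae;g__Aedes;s__Aedes vexans"),
        ("seq2", "f__Culicidae;g__Aedes;s__"),
    ],
)


def matching_ids(sql, pattern):
    """Run a one-parameter query and return the matching ids."""
    return [row[0] for row in conn.execute(sql, (pattern,))]


like_sql = "SELECT id FROM taxonomy WHERE taxon LIKE ? ESCAPE '\\' ORDER BY id"
glob_sql = "SELECT id FROM taxonomy WHERE taxon GLOB ? ORDER BY id"

# LIKE with escaped underscores: "s__" is literal and % matches anything,
# so the empty species label matches too
like_hits = matching_ids(like_sql, "%s\\_\\_%")

# GLOB: "_" is literal and "?" requires exactly one character after "s__"
glob_hits = matching_ids(glob_sql, "*s__?*")

print(like_hits, glob_hits)
```

With LIKE and only % available, both rows match, so the empty species label cannot be singled out. The GLOB pattern would separate them, but that is not the operator being used.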
So maybe RESCRIPt can help you out here... it has a filter-taxa method that uses regular expressions (including the familiar ? and * wildcards) for including/excluding taxa. You can use that method to filter your taxonomy, then filter your sequences with:
to only include seqs that are found in the taxonomy file.
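To show the kind of regular expression that would catch an empty rank label, here is a sketch using Python's re module directly. It does not call RESCRIPt and says nothing about how filter-taxa treats its inputs; the patterns, names, and taxonomy strings are assumptions for illustration only.

```python
import re

# Hypothetical taxonomy strings with "; " delimiters
taxonomy = {
    "seq1": "o__Diptera; f__Culicidae; g__Aedes; s__Aedes vexans",
    "seq2": "o__Diptera; f__Culicidae; g__Aedes; s__",
    "seq3": "o__Diptera; f__; g__; s__",
}

# "s__" followed only by optional whitespace before the end of the string
empty_species = re.compile(r"s__\s*$")
# "f__" followed by optional whitespace and then a delimiter or the end
empty_family = re.compile(r"f__\s*(;|$)")

with_species = sorted(k for k, v in taxonomy.items() if not empty_species.search(v))
with_family = sorted(k for k, v in taxonomy.items() if not empty_family.search(v))

print(with_species)
print(with_family)
```

The `\s*` takes care of the trailing whitespace concern raised earlier, and the `$` anchor is what a plain substring match cannot express.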
To be honest, I have not used wildcards for filtering with that method yet — it was designed and tested for specific substring filtering — but as far as I can tell wildcards are allowed, so this may be an opportunity to chase a few bugs out of the woodwork and tighten up regex filtering!