filtering sequences by sequence header

A slightly different request than a previous post of mine.

I have a .qza object of sequence data (type ‘FeatureData[AlignedSequence]’).

I’d like to remove a subset of these sequences based on their sequence identifier. This would work exactly like what @SoilRotifer has set up in step #6 of this pipeline based on this python script.

I’m trying to use QIIME-specific features rather than external scripts wherever possible. My data don’t have any feature table to work with when filtering, and I’ve struggled to find an equivalent tool in QIIME to perform the same task as the above Python script. Perhaps metadata-based filtering would work? I’m wondering what the structure of that metadata file input would look like. Maybe I can fake it with a 2 column file listing the sequences I want to keep (the list would contain only those sequences I want), and some other prefix for the SQL search to work with?

SeqID    Status
00001    keep
00002    keep
...       ...
10000    keep

The docs on metadata filtering all point to filtering samples, not sequences though. And the docs on sequence filtering use taxonomic information or a frequency table. I just want to use the simple case where I know what the sequence identifier is to include/exclude.

Thanks for any info as to whether this is possible. It seems trivial, which makes me think it must be in the documentation somewhere and I just can’t find it!

Hi @devonorourke

All you need to do is make a file with the feature-ids you’d like to keep or discard and run:

qiime feature-table filter-seqs \
     --i-data seqs.qza \
     --o-filtered-data seqs-filt.qza \
     --m-metadata-file seq-ids-to-keep.txt

You can toggle the --p-exclude-ids / --p-no-exclude-ids depending on what is in your metadata file. Which, in this case, should look something like this:

feature-id
AY846380.1.2583
AY909584.1.2313
AY929372.1.1770
...
...
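
If your IDs are already sitting in a plain one-per-line list, a minimal sketch for adding the header might look like the following. The file name ids.txt is hypothetical, and the two IDs are simply borrowed from the example above to make the snippet self-contained.

```shell
# ids.txt is a hypothetical plain list of sequence IDs, one per line.
printf 'AY846380.1.2583\nAY909584.1.2313\n' > ids.txt

# Prepend the 'feature-id' header so the list matches the layout shown above.
{ printf 'feature-id\n'; cat ids.txt; } > seq-ids-to-keep.txt

# Inspect the result.
cat seq-ids-to-keep.txt
```

The resulting seq-ids-to-keep.txt can then be passed to --m-metadata-file in the filter-seqs command.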

-Best
-Mike

Awesome.
I suppose if the list is the features I want to keep, then the switch is --p-no-exclude-ids? How is that not just named --p-include-ids?
Thanks!

In my example above, the default is False, i.e. --p-no-exclude-ids.

So, any IDs in my seq-ids-to-keep.txt are kept in my output. Use --p-exclude-ids to remove those IDs, and keep everything else.
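
As a rough analogy only (this is not how QIIME 2 does it internally), the two modes behave like grep with and without -v over a list of IDs. The file names all_ids.txt and listed.txt and the IDs A to D are made up for illustration.

```shell
# Toy stand-ins: all_ids.txt plays the role of the IDs in seqs.qza,
# listed.txt plays the role of the IDs in the metadata file (header omitted).
printf 'A\nB\nC\nD\n' > all_ids.txt
printf 'B\nD\n' > listed.txt

# Like the default (--p-no-exclude-ids): keep only the listed IDs.
kept=$(grep -F -x -f listed.txt all_ids.txt)

# Like --p-exclude-ids: drop the listed IDs and keep everything else.
remaining=$(grep -v -F -x -f listed.txt all_ids.txt)

echo "kept: $kept"
echo "remaining: $remaining"
```

With the real data, the exclude case is the same filter-seqs command as above with --p-exclude-ids added.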

-Mike

QIIME 2 user interfaces are generated dynamically, based on the plugin registration. So, the relevant bit of plugin code is here:

You'll see, the parameter is a boolean (True/False) named "exclude_ids".

When q2cli gets its hands on this method, it automatically generates a few parameter flags for you to use - --p-exclude-ids & --p-no-exclude-ids, to cover the True and False cases, respectively. Admittedly though, that terminology is a bit weird, and is often surprising, so about a year ago @ebolyen extended q2cli to support --p-exclude-ids True / --p-exclude-ids False, which can be a bit more clear.

Hope that explains the weird naming convention a bit.

That’s way too thorough an explanation for my complaint :smile: - appreciate the insight.

Maybe programmers think better in the double negative then? My brain would have thought you’d have these two flags as:

--p-include-ids
--p-no-include-ids

Once you extend support to include TRUE/FALSE statements, that seems even simpler. But then again, perhaps this all comes down to what users are doing more often. Maybe they are excluding more often than including, which is what I’m guessing is the case. Which of course then makes what you’ve set up perfectly natural!

Thanks again for the reply

Here’s a relevant issue raised on this topic, if you wanted to be even more confused :stuck_out_tongue:

Nice!
thanks @Mehrbod_Estaki.
