Filtering FeatureData[Sequence] using metadata file

Hi there, I'm interested in this question too!

When using the filter-seqs command, where do we get the information to put in the metadata file of sequences to exclude?

I'm guessing the process of filtering out sequences found in blank samples would start off like this:

# Create a sequences artifact with only the blank samples
qiime feature-table filter-seqs \
    --i-data rep_seqs.qza \
    --m-metadata-file map.txt \
    --p-where "Control='Y'" \
    --o-filtered-data control_seqs.qza

And then once you have your artifact of sequences from the control samples, you could tabulate the sequences to get them into a format you can view ...

qiime feature-table tabulate-seqs --i-data control_seqs.qza --o-visualization control_seqs.qzv

And then you could open your .qzv with qiime tools view and see something kind of like this:

What's the equivalent of downloading the frequency per feature .csv in @thermokarst's recommended workflow above? Would you download the fasta file, save it as a TSV and use that as your metadata file for removing the 'blank' sequences from the original rep seqs file?

Thanks!

Hi @Matilda_H-D,
Thanks for posting your question! The metadata input to filter-seqs can only consist of feature metadata, e.g., a sequence file or taxonomy file. We do not yet have functionality for removing sequences that are found in a specific sample in a single command (multi-step details are below).

We plan to add tutorials describing this process in this month’s release of QIIME 2, so check back then for more details.

For now, this forum post still describes the best workaround for removing features from a feature table that are detected in a specific sample. What filter-features allows us to do is to also filter our sequences file by passing in features-to-filter.tsv (see the forum post for how that file is generated) as metadata.

Note, however, that removing all sequences found in a blank may not be a good approach. Many of these sequences may in fact be cross-contaminants (real sequences that leaked into the blank from other samples) rather than exogenous contaminants (e.g., from reagents), and removing them could eliminate valid features from other samples.

In the future we plan to add methods for contaminant detection that more directly address this issue.

I hope that answers your questions! Please let us know if you have additional questions/concerns.


Hi @Nicholas_Bokulich

Thank you for your reply!

What filter-features allows us to do is to also filter our sequences file by passing in features-to-filter.tsv (see the forum post for how that file is generated) as metadata.

So do I understand correctly that when you use the feature-table filter-features command, both the feature table and the representative sequences file are filtered according to the metadata file that is passed? I had another look at the filter-features documentation page and it only mentions a filtered table as output, not a filtered sequences file.

What I'm interested in is what Fernando Stuart mentioned in this post: filtering the rep seqs file in order to build a phylogenetic tree containing only the sequences that remain after filtering out those found in lab controls.

Looking at the feature-table filter-seqs documentation, it seems like maybe this could be achieved using this command and passing in the same metadata file (i.e. list of features, not of actual sequences, to exclude) that would be used in filter-features? Is that right?

Thanks for your help!

No: filter-features only removes those features from the feature table. It does not modify the sequences file.

Correct. The process of generating a list of features that you can pass as metadata is mentioned further down in the same thread. The features-to-filter.tsv file described in that thread (the same list of feature IDs used to filter the feature table) can be passed to filter-seqs to remove those same features (sequences) from your FeatureData[Sequence] artifact, using the following command:

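# --p-exclude-ids drops the listed features instead of keeping them
# (assumes a release of filter-seqs that supports this flag)
qiime feature-table filter-seqs \
    --i-data rep_seqs.qza \
    --m-metadata-file features-to-filter.tsv \
    --p-exclude-ids \
    --o-filtered-data filtered_rep_seqs.qza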
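
Since filter-features and filter-seqs each touch only one artifact, it can help to run them as a pair so the table and the sequences stay in sync. Below is a rough sketch of that idea, not an official workflow. The function name, the output file names, and the QIIME variable (used here only to allow a dry run) are made up for illustration, and it assumes a QIIME 2 release in which both commands accept --p-exclude-ids.

```shell
# Hypothetical helper: drop the listed features from both artifacts.
# Usage: remove_listed_features TABLE.qza SEQS.qza FEATURES.tsv OUT_PREFIX
# Set QIIME="echo qiime" to print the commands instead of running them.
remove_listed_features() {
    table="$1"; seqs="$2"; features="$3"; prefix="$4"

    # Remove the listed features from the feature table.
    ${QIIME:-qiime} feature-table filter-features \
        --i-table "$table" \
        --m-metadata-file "$features" \
        --p-exclude-ids \
        --o-filtered-table "${prefix}_table.qza" || return 1

    # Remove the same features from the representative sequences.
    ${QIIME:-qiime} feature-table filter-seqs \
        --i-data "$seqs" \
        --m-metadata-file "$features" \
        --p-exclude-ids \
        --o-filtered-data "${prefix}_rep_seqs.qza"
}
```

Something like `remove_listed_features table.qza rep_seqs.qza features-to-filter.tsv filtered` would then write filtered_table.qza and filtered_rep_seqs.qza, and the filtered sequences could be used for tree building.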

Generating features-to-filter.tsv is described in this post. In the future we may support a more direct method for generating a file that lists the features found in a single sample or collection of samples; I have raised an issue here to track progress.
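
For anyone following along, here is a minimal sketch of one way to turn a downloaded frequency-per-feature CSV into a TSV that can be passed as feature metadata. It is not taken from the linked post. It assumes the CSV has two comma-separated columns (feature ID, then frequency) and no header row. The function name, the file names, and the `feature-id` and `Frequency` column headers are illustrative choices, so check them against the metadata requirements of your QIIME 2 release.

```shell
# Hypothetical helper: convert a two-column CSV (feature ID, frequency)
# into a tab-separated file with a header line.
# Usage: csv_to_feature_metadata IN.csv OUT.tsv
csv_to_feature_metadata() {
    in_csv="$1"
    out_tsv="$2"
    {
        # Header line; the first column name marks this as feature metadata.
        printf 'feature-id\tFrequency\n'
        # Strip Windows line endings, skip lines with fewer than two fields,
        # and rewrite the first two fields as tab-separated.
        awk -F',' 'NF >= 2 { gsub(/\r/, ""); printf "%s\t%s\n", $1, $2 }' "$in_csv"
    } > "$out_tsv"
}
```

For example, `csv_to_feature_metadata blank-frequencies.csv features-to-filter.tsv`, where blank-frequencies.csv is a placeholder name for the CSV downloaded from the summary of the blank-only table.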

Please let me know if that answers your question!
