Filtering samples by metadata: also filter rep-seqs.qza?

ben · June 6, 2018, 6:00pm

Quick question: filtering a large number of samples out of a merged data table is straightforward based on sample ID (filtering table.qza with a new metadata table).

Do we also need to filter the merged rep-seqs.qza sequences file? I was trying to figure this out, since it would seem that the sequences would just filter appropriately given that they're already assigned per sample?

I have consulted:

https://docs.qiime2.org/2018.4/tutorials/filtering/

thermokarst · June 7, 2018, 2:51pm

Hey there @ben:

Generally, no, but it depends on what it is you want to do with this downstream. Want to give us a more concrete example?

Thanks! :qiime2:

ben · June 7, 2018, 3:12pm

Hi Matt, forgive me, here is the example:

I have a merged table.qza and rep-seqs.qza file for 7 runs. I now want to pull out only certain samples from the table to work on, in order to: a) make a tree, b) compute more diversity metrics, c) export data to phyloseq to make graphs/charts.
The table should be filtered by sample name, but should the rep-seqs.qza file be filtered for the same samples as well? I think my main confusion is this: since we merged the rep-seqs together, should those be pulled out for the samples we want to work with? We want to generate a tree only for the samples of interest, not for the whole merged table.

Thank you Matt. Ben

edit: How do I create a tree from a subset of samples? and this `filter-seqs` found in a feature table · Issue #152 · qiime2/q2-feature-table · GitHub

This is my main question: should I filter the rep-seqs.qza file to create a new tree for the subset of samples? Maybe what I really mean to ask is: can I filter rep-seqs.qza based on sample ID rather than feature ID?

edit:edit: Filtering sequences without taxonomic annotation - #5 by Nicholas_Bokulich

This is what I wanted to do! Keep only the sequences that match the samples in a table. Thanks.

thermokarst · June 11, 2018, 12:57pm

Hey there @ben!

Thanks for the examples, that makes a lot of sense!

I think either approach (filtering your FeatureData[Sequence] or not) makes sense. If you were staying within QIIME 2, leaving it unfiltered wouldn't cause any kind of mechanical issues with things like diversity metrics or taxonomic assignment, since features that aren't present in your FeatureTable[Frequency] would just be dropped from the tree or sequences. I am not 100% sure how this will work in phyloseq (whether extra tree tips will cause a problem or not). Perhaps it is worth running things both ways for a subsample and comparing the results?

The other aspect that might be worth looking into is the actual tree-building process: you will most likely see different trees depending on the features present and on the tree-building method used. Right now the q2-phylogeny plugin uses FastTree, but @SoilRotifer added RAxML support, which should be out later this month; @Stefan also has a q2-fragment-insertion plugin, which uses a fragment-insertion technique instead of a de novo one to build a tree.

Anyway, it looks like you found a few resources that will help you with the actual process of filtering FeatureData[Sequence] using a FeatureTable[Frequency]. Thanks for linking to those here! Keep us posted, and let us know if you have any more questions! :qiime2:
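
For anyone landing here later, the logic discussed above (filter the table by sample ID, then keep only the sequences whose feature IDs are still in that table) can be sketched in plain Python. This is only a conceptual illustration and not the QIIME 2 API: the function names, the toy feature table, and the toy sequences are all made up for the example, and it assumes that features with zero counts across the retained samples should be dropped.

```python
from typing import Dict, Set

# Hypothetical toy structure: feature ID -> {sample ID -> count}
Table = Dict[str, Dict[str, int]]


def filter_samples(table: Table, keep_samples: Set[str]) -> Table:
    """Keep only the chosen samples, then drop features left with no counts."""
    filtered: Table = {}
    for feature_id, counts in table.items():
        kept = {s: c for s, c in counts.items() if s in keep_samples}
        # Assumption: a feature absent from every retained sample is removed.
        if sum(kept.values()) > 0:
            filtered[feature_id] = kept
    return filtered


def filter_seqs_by_table(seqs: Dict[str, str], table: Table) -> Dict[str, str]:
    """Keep only sequences whose feature ID is still present in the table."""
    return {fid: seq for fid, seq in seqs.items() if fid in table}


# Made-up data standing in for a merged table and its representative sequences.
table = {
    "feat1": {"S1": 10, "S2": 0, "S3": 4},
    "feat2": {"S1": 0, "S2": 7, "S3": 0},
    "feat3": {"S1": 3, "S2": 2, "S3": 0},
}
rep_seqs = {"feat1": "ACGT", "feat2": "TTGA", "feat3": "GGCA"}

sub_table = filter_samples(table, {"S1", "S3"})
sub_seqs = filter_seqs_by_table(rep_seqs, sub_table)
print(sorted(sub_seqs))
```

The point is that sequences are keyed by feature ID, not sample ID, so "filtering sequences by sample" really means filtering the table by sample first and then using the surviving feature IDs to select sequences. A tree built from `sub_seqs` would then contain tips only for features observed in the samples of interest.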