"All features were filtered out of the data"

#1

Hello
I am running the following code to filter features based on total frequency:

qiime feature-table filter-features --i-table 16s-merged-table.qza --p-min-frequency 3 --o-filtered-table 16s-table-gt3.qza
qiime feature-table summarize --i-table 16s-table-gt3.qza --o-visualization 16s-table-gt3.qzv

qiime feature-table filter-seqs --i-data 16s-merged-seqs.qza --i-table 16s-table-gt3.qza --o-filtered-data 16s-merged-seqs-gt3.qza
qiime feature-table tabulate-seqs --i-data 16s-merged-seqs-gt3.qza --o-visualization 16s-merged-seqs-gt3.qzv

The first part, where I filter features from my feature table, seems to work fine, but when I try to filter my sequences using the table, I get this error:

All features were filtered out of the data.

I have checked the filtered table, and it has data.
Any ideas on what’s going on? I read this post first, but since the user is filtering based on metadata and I am not, I don’t think the solution applies.

Here is my filtered table: 16s-table-gt3.qzv (2.1 MB)

Here are the sequences that I am trying to filter: 16s-merged-seqs.qzv (800.0 KB)

Thanks for your help!
Laura

(Mehrbod Estaki) #2

Hi @LauraMason,
Thanks for providing us with your code and artifacts.
What appears to be happening is exactly as the message says. You have simply filtered all of the sequences out of your data.

Breaking down this command:

qiime feature-table filter-seqs --i-data 16s-merged-seqs.qza  --i-table 16s-table-gt3.qza --o-filtered-data 16s-merged-seqs-gt3.qza

You are saying that you want to take the feature IDs from 16s-table-gt3.qza and subtract those IDs from 16s-merged-seqs.qza. When I look at your visualizations of these files, the sequences file has 6,382 sequences whereas the table you are using for filtering has 13,801 features. What is likely happening is that all the features are being filtered because they all exist in the table.

#3

Hi
Yes - my plan was to remove all features with a frequency less than three from my table, and then to use the table to filter my sequences. The first step seems to work fine, but the second step does not generate the file I need: the sequences file with reads that appear three or more times. If, like you said, the reads that I am trying to filter are present in the sequence file, do I even need to do this step?

It is also odd to me (now that you point it out) that I have so few reads in these files. Any thoughts?

Thanks for your help
Laura

(Mehrbod Estaki) #4

Hi @LauraMason,
So after staring at your provenance files, which resembled a jumbled spiderweb :spider_web:, for far too long, I think I’ve figured out what went wrong. I’ll outline the problem and then offer a solution.

  1. You’ve imported 8 different runs and denoised them with DADA2, which gives you a feature table + rep-seqs file for each.
  2. You imported a SILVA database and used it as a positive filter to filter each rep-seqs file separately at 97% similarity and 97% alignment (by the way in my opinion these values are way too high/strict and you can get away with much less, say 60-80%. It would also be way faster for very similar results, but that’s for a different discussion).
  3. Here is where I think the error comes in. You then take all your hits rep-seqs files and filter those from their corresponding feature-tables. This is because you have set the exclude-ids parameter to true. That means all your feature tables are now left with sequences that you DON’T want. :broken_heart: I always have trouble with the exclude-ids parameter myself, it’s a bit confusing for sure.
  4. Then you merge those filtered tables and filter out features occurring fewer than 3 times. This table has 62+ million reads made up of 13,801 features.
  5. You then also combine all your hit rep-seqs. These are the features you actually want. There are 6,382 features total in this file.
  6. So now you have a combined table with all the features you don’t want, and a combined rep-seqs file with the ones you do want. These 2 files don’t share any features, and since the default setting for filter-seqs is to retain the features found in your table, you are basically left with nothing.
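
To make point 6 concrete, here is a toy sketch that uses plain text lists of made-up feature IDs rather than real QIIME 2 artifacts. It imitates the two filtering steps with grep: first removing the hits from the table (what exclude-ids does), then keeping only the sequences whose IDs are still in that table. The IDs, the file names and the use of grep are all just for illustration.

```shell
# Toy illustration with made-up feature IDs; these are NOT QIIME 2 artifacts.
workdir=$(mktemp -d)

# All features in one run's table, and the subset that hit the reference.
printf '%s\n' f1 f2 f3 f4 f5 > "$workdir/all_ids.txt"
printf '%s\n' f1 f2 f3 > "$workdir/hits_ids.txt"

# Table filtered by the hits with exclude-ids on: the hits are removed,
# so only the unwanted features remain in the table.
grep -vxFf "$workdir/hits_ids.txt" "$workdir/all_ids.txt" > "$workdir/table_ids.txt"

# Hit sequences filtered by that table: keep only IDs present in the table.
# grep exits non-zero when nothing matches, hence the "|| true".
grep -xFf "$workdir/table_ids.txt" "$workdir/hits_ids.txt" > "$workdir/kept_ids.txt" || true

table_left=$(tr '\n' ' ' < "$workdir/table_ids.txt")
seqs_kept=$(wc -l < "$workdir/kept_ids.txt" | tr -d ' ')
echo "IDs left in table: $table_left"
echo "Sequences kept: $seqs_kept"
```

The table ends up holding only f4 and f5, none of which are in the hits list, so no sequences survive the second step.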

The good news is the solution is fairly straightforward.

  1. Merge all your unfiltered feature-tables
  2. Use this table as input in filter-features, this time using your 16s-merged-seqs.qza as the --m-metadata-file parameter.
  3. Now you can filter this new table to remove features appearing fewer than 3 times
  4. Finally, to get an exact rep-seqs match for this filtered-table (since we probably removed some rare features), simply use your 16s-merged-seqs.qza as input in filter-seqs with the new filtered-table in the --i-table slot.

You’ll know everything worked out correctly if both of these output files have the same number of features.
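
The four steps above could look roughly like the commands below. This is a sketch, not a tested pipeline: the per-run table names (run1-table.qza, run2-table.qza) and the output names are placeholders, and the repeated --i-tables option assumes a QIIME 2 release where merge accepts more than two tables. The small run helper is also my addition: it only prints each command unless RUN=1 is set, so you can check the commands before executing them.

```shell
# Prints each command by default; set RUN=1 to execute (requires QIIME 2).
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# 1. Merge the unfiltered per-run tables (placeholder names; add one
#    --i-tables line per run).
run qiime feature-table merge \
  --i-tables run1-table.qza \
  --i-tables run2-table.qza \
  --o-merged-table merged.qza

# 2. Keep only features whose IDs are in the merged hit sequences.
#    No exclude-ids here, so the listed IDs are retained.
run qiime feature-table filter-features \
  --i-table merged.qza \
  --m-metadata-file 16s-merged-seqs.qza \
  --o-filtered-table 16s-filtered-table.qza

# 3. Drop features with a total frequency below 3.
run qiime feature-table filter-features \
  --i-table 16s-filtered-table.qza \
  --p-min-frequency 3 \
  --o-filtered-table 16s-table-gt3.qza

# 4. Keep only the sequences that are still in the filtered table.
run qiime feature-table filter-seqs \
  --i-data 16s-merged-seqs.qza \
  --i-table 16s-table-gt3.qza \
  --o-filtered-data 16s-merged-seqs-gt3.qza

# Visualizations for comparing the two feature counts.
run qiime feature-table summarize \
  --i-table 16s-table-gt3.qza \
  --o-visualization 16s-table-gt3.qzv
run qiime feature-table tabulate-seqs \
  --i-data 16s-merged-seqs-gt3.qza \
  --o-visualization 16s-merged-seqs-gt3.qzv
```

If everything lines up, the feature count in the table summary should match the number of sequences in the tabulate-seqs visualization.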

This one took a bit to figure out so hopefully this works out, let us know!

#5

Whoa. Thanks so much! I was way off! I’ll let you know how this goes
Laura

#6

Actually (and this may need to be a separate post) my merged 16s file seems to be missing a lot of reads. Each of the 8 lanes had nearly 4,000 reads before merging, so ~6,000 seems way too low, especially considering that the merge-seqs command does not get rid of multiples.
Any thoughts? (again, sorry if this is cross posting!)

(Matthew Ryan Dillon) #8

No worries! Cross-posting is when you ask the same question in multiple threads - doesn’t look like you have done that here.

(Mehrbod Estaki) #9

Hi @LauraMason,

By reads I believe you mean features? And depending on the samples being surveyed here this may or may not be totally fine. Keep in mind that even though the 8 runs had ~4,000 features each, many of these were identical across the runs, so when you merge the rep-seqs files we don’t expect the numbers to add up but rather to compile only the unique ones. Remember, a rep-seqs file is just the list of the unique features compiled; it holds no information about abundances or sample-sources of those features.
I also suspect you are losing a lot of potential features because of the very high values you used in your positive filter (see point 2 from the previous post). Maybe re-run one of your individual runs with lower values - you can look through the forum for some examples/recommendations - and see how much of a difference it makes. You can decide what to do based on that.
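
A toy sketch of why merging does not multiply the feature count, again with made-up feature IDs in plain text files rather than real rep-seqs artifacts: two runs with four features each, three of them shared.

```shell
# Toy illustration with made-up feature IDs; these are NOT QIIME 2 artifacts.
tmpdir=$(mktemp -d)

# Two runs with 4 features each; f2, f3 and f4 appear in both runs.
printf '%s\n' f1 f2 f3 f4 > "$tmpdir/run1_ids.txt"
printf '%s\n' f2 f3 f4 f5 > "$tmpdir/run2_ids.txt"

# Naive expectation: add up the per-run counts.
summed=$(cat "$tmpdir/run1_ids.txt" "$tmpdir/run2_ids.txt" | wc -l | tr -d ' ')

# What a merged list of unique features holds: each ID once.
unique=$(cat "$tmpdir/run1_ids.txt" "$tmpdir/run2_ids.txt" | sort -u | wc -l | tr -d ' ')

echo "Summed per-run counts: $summed"
echo "Unique features after merging: $unique"
```

Here the per-run counts add up to 8, but only 5 distinct IDs remain after merging. The more the runs overlap, the closer the merged count stays to the count of a single run.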

#11

Hi!
Yes - that seemed to work. Here is the code:

qiime feature-table filter-features --i-table merged.qza --m-metadata-file 16s-merged-seqs60.tsv --o-filtered-table 16s-filtered-table60.qza
qiime feature-table summarize --i-table 16s-filtered-table60.qza --o-visualization 16s-filtered-table60.qzv

qiime feature-table filter-features --i-table 16s-filtered-table60.qza --p-min-frequency 3 --o-filtered-table 16s-table-gt360.qza
qiime feature-table summarize --i-table 16s-table-gt360.qza --o-visualization 16s-table-gt360.qzv

qiime feature-table filter-seqs --i-data 16s-merged-seqs60.qza --i-table 16s-table-gt360.qza --o-filtered-data 16s-merged-seqs-gt360.qza
qiime feature-table tabulate-seqs --i-data 16s-merged-seqs-gt360.qza --o-visualization 16s-merged-seqs-gt360.qzv

First though, I checked the number of representative sequences obtained at multiple levels of percent identity, and I chose 60%. I then merged all 8 lanes of 60% identity 16S rep seqs together, exported the file, converted it to a tsv and imported it again to use as metadata.

Thanks for all your help!

(Mehrbod Estaki) #12

Hi @LauraMason,
Glad you got it sorted out and thanks for updating us.
60% seems like a good threshold, I actually use 60-65% in my own pipelines too so I’m glad you’ve come to a similar conclusion :stuck_out_tongue:
In the future you can save yourself some time by merging your files first and then running the filtering step only once on the compiled table. You also don’t need to export/re-import these files; they can be passed directly as metadata in the filtering actions (see point 2 from earlier).

Either way, glad you got it all sorted! Happy Qiimin’
