"All features were filtered out of the data"

LauraMason · May 2, 2019, 5:55pm

Hello
I am running the following code to filter features based on total frequency:

qiime feature-table filter-features --i-table 16s-merged-table.qza --p-min-frequency 3 --o-filtered-table 16s-table-gt3.qza
qiime feature-table summarize --i-table 16s-table-gt3.qza --o-visualization 16s-table-gt3.qzv

qiime feature-table filter-seqs --i-data 16s-merged-seqs.qza  --i-table 16s-table-gt3.qza --o-filtered-data 16s-merged-seqs-gt3.qza
qiime feature-table tabulate-seqs --i-data 16s-merged-seqs-gt3.qza --o-visualization 16s-merged-seqs-gt3.qzv

The first part, where I filter features from my feature table seems to work fine, but when I try to filter my sequences using the table, I get this error:

All features were filtered out of the data.

I have checked the filtered table, and it has data.
Any ideas on what's going on? I read this post first, but since the user is filtering based on metadata and I am not, I don't think the solution applies.

Here is my filtered table: 16s-table-gt3.qzv (2.1 MB)

Here are the sequences that I am trying to filter: 16s-merged-seqs.qzv (800.0 KB)

Thanks for your help!
Laura

Mehrbod_Estaki · May 2, 2019, 9:45pm

Hi @LauraMason,
Thanks for providing us with your code and artifacts.
What appears to be happening is exactly as the message says. You have simply filtered all of the sequences out of your data.

Breaking down this command:

qiime feature-table filter-seqs --i-data 16s-merged-seqs.qza  --i-table 16s-table-gt3.qza --o-filtered-data 16s-merged-seqs-gt3.qza

You are saying that you want to take the feature IDs from 16s-table-gt3.qza and subtract those IDs from 16s-merged-seqs.qza. When I look at your visualizations of these files, the sequences file has 6,382 sequences whereas the table you are using for filtering has 13,801 features. What is likely happening is that all the features are being filtered because they all exist in the table.

LauraMason · May 5, 2019, 11:14pm

Hi
Yes - my plan was to remove all features with a frequency less than three from my table, and then to use the table to filter my sequences. The first step seems to work fine, but the second step does not generate the file I need: the sequences file with reads that appear three or more times. If, like you said, the reads that I am trying to filter are present in the sequence file, do I even need to do this step?

It is also odd to me (now that you point it out) that I have so few reads in these files. Any thoughts?

Thanks for your help
Laura

Mehrbod_Estaki · May 6, 2019, 9:21am

Hi @LauraMason,
So after staring at your provenance files, which resembled a jumbled spiderweb, for far too long, I think I've figured out what went wrong. I'll outline the problem and then offer a solution.

1. You've imported 8 different runs and denoised them with DADA2, which gives you a feature table and a rep-seqs file for each.
2. You imported a SILVA database and used it as a positive filter to filter each rep-seqs file separately at 97% similarity and 97% alignment (by the way, in my opinion these values are way too high/strict and you can get away with much less, say 60-80%. It would also be way faster for very similar results, but that's for a different discussion).
3. Here is where I think the error comes in. You then take all your hits rep-seqs files and filter those from their corresponding feature tables. This is because you have set the exclude-ids parameter to true. That means all your feature tables are now left with sequences that you DON'T want. I always have trouble with the exclude-ids parameter myself, it's a bit confusing for sure.
4. Then you merge those filtered tables and filter out features occurring fewer than 3 times. This table has 62+ million reads made up of 13,801 features.
5. You then also combine all your hits rep-seqs. These are the features you actually want. There are 6,382 features total in this file.
6. So now you have a combined table with all the features you don't want, and a combined rep-seqs file with the ones you do want. These 2 files don't share any features, and since the default setting for filter-seqs is to retain the features found in your table, you are basically left with nothing.
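
To make the mismatch concrete, here is a minimal shell sketch that works on plain lists of feature IDs rather than on real artifacts. The IDs, file names and the `ids_kept_by_table` helper are all made up for illustration; it only mimics the "retain the IDs found in the table" behaviour described above and is not how QIIME 2 implements it.

```shell
#!/bin/sh
# Toy illustration only: the feature IDs below are invented, not taken from the real artifacts.

# ids_kept_by_table: print the sequence IDs that also appear in the table.
#   $1 = file of sequence IDs, $2 = file of table feature IDs (one ID per line)
ids_kept_by_table() {
    # -F fixed strings, -x whole-line match, -f read the patterns from a file;
    # "|| true" keeps the exit status at 0 when nothing matches
    grep -Fxf "$2" "$1" || true
}

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf 'hitA\nhitB\nhitC\n' > "$workdir/seq_ids.txt"    # rep-seqs: the 16S hits
printf 'missX\nmissY\n'     > "$workdir/table_ids.txt"  # table: only the non-hits are left

kept=$(ids_kept_by_table "$workdir/seq_ids.txt" "$workdir/table_ids.txt" | wc -l | tr -d ' ')
echo "features kept: $kept"
```

Because the two ID lists share nothing, no IDs are retained, which is the situation behind the "All features were filtered out of the data" message.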

The good news is the solution is fairly straightforward.

1. Merge all your unfiltered feature tables.
2. Use this table as input to filter-features, this time passing your 16s-merged-seqs.qza to the --m-metadata-file parameter.
3. Now you can filter this new table to remove features appearing fewer than 3 times.
4. Finally, to get an exact rep-seqs match for this filtered table (since we probably removed some rare features), simply use your 16s-merged-seqs.qza as input in filter-seqs with the new filtered table in the --i-table slot.

You'll know everything worked out correctly if the output of both these files has the same # of features.
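
The steps above could look roughly like the following sketch. It assumes two per-run tables with hypothetical names (run1-table.qza, run2-table.qza) and hypothetical intermediate file names; only 16s-merged-seqs.qza and the final output names come from this thread. The `run` helper and the `DRY_RUN` switch are illustrative additions so that the commands are printed rather than executed by default.

```shell
#!/bin/sh
# Sketch of the four steps above. By default the commands are only printed (dry run);
# set DRY_RUN=0 to execute them in an environment where QIIME 2 is installed.
# run1-table.qza, run2-table.qza, merged-table.qza and 16s-table.qza are hypothetical names.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"      # print the command instead of running it
    else
        "$@"
    fi
}

filter_pipeline() {
    # 1. Merge the unfiltered per-run tables (repeat --i-tables once per run)
    run qiime feature-table merge \
        --i-tables run1-table.qza --i-tables run2-table.qza \
        --o-merged-table merged-table.qza
    # 2. Keep only features whose IDs are in the hits rep-seqs (artifact passed as metadata)
    run qiime feature-table filter-features \
        --i-table merged-table.qza \
        --m-metadata-file 16s-merged-seqs.qza \
        --o-filtered-table 16s-table.qza
    # 3. Drop features with a total frequency below 3
    run qiime feature-table filter-features \
        --i-table 16s-table.qza --p-min-frequency 3 \
        --o-filtered-table 16s-table-gt3.qza
    # 4. Keep only the sequences still present in the filtered table
    run qiime feature-table filter-seqs \
        --i-data 16s-merged-seqs.qza --i-table 16s-table-gt3.qza \
        --o-filtered-data 16s-merged-seqs-gt3.qza
}

filter_pipeline
```

Note that no exclude-ids option is used in step 2, so the features listed in the metadata are the ones that are kept.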

This one took a bit to figure out so hopefully this works out, let us know!

LauraMason · May 6, 2019, 1:59pm

Whoa. Thanks so much! I was way off! I'll let you know how this goes
Laura

LauraMason · May 6, 2019, 2:18pm

Actually (and this may need to be a separate post) my merged 16s file seems to be missing a lot of reads. Each of the 8 lanes had nearly 4,000 reads before merging, so ~6,000 seems way too low, especially considering that the merge-seqs command does not get rid of multiples.
Any thoughts? (again, sorry if this is cross posting!)

thermokarst · May 6, 2019, 3:23pm

No worries! Cross-posting is when you ask the same question in multiple threads - doesn't look like you have done that here.

Mehrbod_Estaki · May 6, 2019, 6:38pm

Hi @LauraMason,

By reads I believe you mean features? Depending on the samples being surveyed here, this may or may not be totally fine. Keep in mind that even though the 8 runs had ~4,000 features each, many of these were identical across the runs, so when you merge the rep-seqs files we don't expect the numbers to add up; rather, only the unique ones are compiled. Remember, a rep-seqs file is just the list of the unique features compiled; it holds no information about abundances or sample sources of those features.
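
As a small illustration of why merged feature counts do not add up, the sketch below merges two made-up lists of feature IDs with standard shell tools. The IDs and file names are invented, and this only imitates the "unique features" idea rather than what merge-seqs does internally.

```shell
#!/bin/sh
# Toy illustration: the feature IDs are invented.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf 'f1\nf2\nf3\n' > "$workdir/run1_ids.txt"
printf 'f2\nf3\nf4\n' > "$workdir/run2_ids.txt"

# Sum of features over the runs (a shared feature is counted once per run)
total=$(cat "$workdir"/run*_ids.txt | wc -l | tr -d ' ')
# Unique features after merging (a shared feature is counted only once)
unique=$(sort -u "$workdir"/run*_ids.txt | wc -l | tr -d ' ')

echo "sum over runs: $total, unique after merge: $unique"
```

Here f2 and f3 occur in both runs, so the merged list is shorter than the sum of the two runs.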
I also suspect you are losing a lot of potential features because of the very high values you used in your positive filter (see point 2 from the previous post). Maybe re-run one of your individual runs with lower values - you can look through the forum for some examples/recommendations - and see how much of a difference it makes. You can decide what to do based on that.

LauraMason · May 10, 2019, 6:25pm

Hi!
Yes - that seemed to work. Here is the code:

qiime feature-table filter-features --i-table merged.qza --m-metadata-file 16s-merged-seqs60.tsv --o-filtered-table 16s-filtered-table60.qza
qiime feature-table summarize --i-table 16s-filtered-table60.qza --o-visualization 16s-filtered-table60.qzv

qiime feature-table filter-features --i-table 16s-filtered-table60.qza --p-min-frequency 3 --o-filtered-table 16s-table-gt360.qza
qiime feature-table summarize --i-table 16s-table-gt360.qza --o-visualization 16s-table-gt360.qzv

qiime feature-table filter-seqs --i-data 16s-merged-seqs60.qza  --i-table 16s-table-gt360.qza --o-filtered-data 16s-merged-seqs-gt360.qza
qiime feature-table tabulate-seqs --i-data 16s-merged-seqs-gt360.qza --o-visualization 16s-merged-seqs-gt360.qzv

First though, I checked the number of representative sequences obtained at multiple levels of percent identity, and I chose 60%. I then merged all 8 lanes of 60% identity 16S rep seqs together, exported the file, converted it to a tsv and imported it again to use as metadata.
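
For readers wondering what the FASTA-to-TSV conversion might look like, here is one possible sketch using awk. It is not the script that was actually used in this thread; the `fasta_to_tsv` helper is hypothetical, and the header names simply follow the "#OTUID" and "Sequences" headings mentioned in this thread.

```shell
#!/bin/sh
# fasta_to_tsv: convert a FASTA file into a two-column TSV (feature ID, sequence).
# Sequences wrapped over several lines are joined into one.
fasta_to_tsv() {
    awk '
        BEGIN { print "#OTUID\tSequences" }
        /^>/ {
            if (id != "") print id "\t" seq   # flush the previous record
            id = substr($1, 2)                # header up to the first space, without ">"
            seq = ""
            next
        }
        { seq = seq $0 }                      # append sequence lines
        END { if (id != "") print id "\t" seq }
    ' "$1"
}

# Small demonstration on an invented FASTA file
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '>feat1 some description\nACGT\nTTGA\n>feat2\nGGCC\n' > "$workdir/toy.fasta"
fasta_to_tsv "$workdir/toy.fasta"
```

On an exported rep-seqs artifact this would be called on the exported FASTA file, for example `fasta_to_tsv sequences_exported/dna-sequences.fasta > 16s-merged-seqs60.tsv`, where the directory name is an assumption.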

Thanks for all your help!

Mehrbod_Estaki · May 10, 2019, 7:27pm

Hi @LauraMason,
Glad you got it sorted out and thanks for updating us.
60% seems like a good threshold. I actually use 60-65% in my own pipelines too, so I'm glad you've come to a similar conclusion.
In the future you can save yourself some time by merging your files first and then running the filtering step only once on the compiled table. You also don't need to export/re-import these files; they are actually passable as metadata in the filtering actions (see point 2 from earlier).

Either way, glad you got it all sorted! Happy Qiimin'

LauraMason · May 29, 2019, 3:08pm

Hi again!
Really strangely running into this issue again when I am repeating the same procedure with a 99% OTU data set. I am repeating the same code exactly after positive filtering for 16S sequences using the SILVA 99% OTUs database. Any thoughts?

Mehrbod_Estaki · May 29, 2019, 7:09pm

Hi @LauraMason,
Could you share with us the final .qza you are using? Did you double-check to make sure exclude-ids is not set to true again in your filtering? Also, are you using the 99% SILVA database as a positive filter? That is unnecessarily high for a positive filter. What are the % identity and % alignment parameters? If those are too high as well, you may just be losing all your reads because they are failing to hit the database.
But we'll need the exact commands, errors and the .qza to re-troubleshoot.

LauraMason · May 29, 2019, 7:48pm

First, I used the SILVA 132 16S 99% database with 60% identity @ 97% alignment to split the entire data set into 16S and "not" 16S, and I merged the outputs of each set together. It would have been easier to merge first and then split into 16S and not 16S, but I did not think of that at the time.

    qiime quality-control exclude-seqs --i-query-sequences Lane_1/rep-seqs-dada2.qza --i-reference-sequences silva_16s_ref99.qza --p-method blast --p-perc-identity 0.60 --p-perc-query-aligned 0.97 --o-sequence-hits Lane_1/16s-rep-seqs60_99.qza --o-sequence-misses Lane_1/not16S-rep-seqs60_99.qza 
    qiime feature-table tabulate-seqs --i-data Lane_1/16s-rep-seqs60_99.qza --o-visualization Lane_1/16s-rep-seqs60_99.qzv 
    qiime feature-table tabulate-seqs --i-data Lane_1/not16s-rep-seqs60_99.qza --o-visualization Lane_1/not16s-rep-seqs60_99.qzv

    qiime quality-control exclude-seqs --i-query-sequences Lane_3/rep-seqs-dada2.qza --i-reference-sequences silva_16s_ref99.qza --p-method blast --p-perc-identity 0.60 --p-perc-query-aligned 0.97 --o-sequence-hits Lane_3/16s-rep-seqs60_99.qza --o-sequence-misses Lane_3/not16S-rep-seqs60_99.qza 
    qiime feature-table tabulate-seqs --i-data Lane_3/16s-rep-seqs60_99.qza --o-visualization Lane_3/16s-rep-seqs60_99.qzv 
    qiime feature-table tabulate-seqs --i-data Lane_3/not16s-rep-seqs60_99.qza --o-visualization Lane_3/not16s-rep-seqs60_99.qzv 

    qiime quality-control exclude-seqs --i-query-sequences Lane_4/rep-seqs-dada2.qza --i-reference-sequences silva_16s_ref99.qza --p-method blast --p-perc-identity 0.60 --p-perc-query-aligned 0.97 --o-sequence-hits Lane_4/16s-rep-seqs60_99.qza --o-sequence-misses Lane_4/not16S-rep-seqs60_99.qza 
    qiime feature-table tabulate-seqs --i-data Lane_4/16s-rep-seqs60_99.qza --o-visualization Lane_4/16s-rep-seqs60_99.qzv 
    qiime feature-table tabulate-seqs --i-data Lane_4/not16s-rep-seqs60_99.qza --o-visualization Lane_4/not16s-rep-seqs60_99.qzv 

    qiime quality-control exclude-seqs --i-query-sequences Lane_5/rep-seqs-dada2.qza --i-reference-sequences silva_16s_ref99.qza --p-method blast --p-perc-identity 0.60 --p-perc-query-aligned 0.97 --o-sequence-hits Lane_5/16s-rep-seqs60_99.qza --o-sequence-misses Lane_5/not16S-rep-seqs60_99.qza 
    qiime feature-table tabulate-seqs --i-data Lane_5/16s-rep-seqs60_99.qza --o-visualization Lane_5/16s-rep-seqs60_99.qzv 
    qiime feature-table tabulate-seqs --i-data Lane_5/not16s-rep-seqs60_99.qza --o-visualization Lane_5/not16s-rep-seqs60_99.qzv 

    qiime quality-control exclude-seqs --i-query-sequences Lane_6/rep-seqs-dada2.qza --i-reference-sequences silva_16s_ref99.qza --p-method blast --p-perc-identity 0.60 --p-perc-query-aligned 0.97 --o-sequence-hits Lane_6/16s-rep-seqs60_99.qza --o-sequence-misses Lane_6/not16S-rep-seqs60_99.qza 
    qiime feature-table tabulate-seqs --i-data Lane_6/16s-rep-seqs60_99.qza --o-visualization Lane_6/16s-rep-seqs60_99.qzv 
    qiime feature-table tabulate-seqs --i-data Lane_6/not16s-rep-seqs60_99.qza --o-visualization Lane_6/not16s-rep-seqs60_99.qzv 

    qiime quality-control exclude-seqs --i-query-sequences Lane_7/rep-seqs-dada2.qza --i-reference-sequences silva_16s_ref99.qza --p-method blast --p-perc-identity 0.60 --p-perc-query-aligned 0.97 --o-sequence-hits Lane_7/16s-rep-seqs60_99.qza --o-sequence-misses Lane_7/not16S-rep-seqs60_99.qza 
    qiime feature-table tabulate-seqs --i-data Lane_7/16s-rep-seqs60_99.qza --o-visualization Lane_7/16s-rep-seqs60_99.qzv 
    qiime feature-table tabulate-seqs --i-data Lane_7/not16s-rep-seqs60_99.qza --o-visualization Lane_7/not16s-rep-seqs60_99.qzv 

    qiime quality-control exclude-seqs --i-query-sequences Lane_8/rep-seqs-dada2.qza --i-reference-sequences silva_16s_ref99.qza --p-method blast --p-perc-identity 0.60 --p-perc-query-aligned 0.97 --o-sequence-hits Lane_8/16s-rep-seqs60_99.qza --o-sequence-misses Lane_8/not16S-rep-seqs60_99.qza 
    qiime feature-table tabulate-seqs --i-data Lane_8/16s-rep-seqs60_99.qza --o-visualization Lane_8/16s-rep-seqs60_99.qzv 
    qiime feature-table tabulate-seqs --i-data Lane_8/not16s-rep-seqs60_99.qza --o-visualization Lane_8/not16s-rep-seqs60_99.qzv  

    qiime feature-table merge-seqs --i-data Lane_1/16s-rep-seqs60_99.qza --i-data Lane_2/16s-rep-seqs60_99.qza --i-data Lane_3/16s-rep-seqs60_99.qza --i-data Lane_4/16s-rep-seqs60_99.qza --i-data Lane_5/16s-rep-seqs60_99.qza --i-data Lane_6/16s-rep-seqs60_99.qza --i-data Lane_7/16s-rep-seqs60_99.qza --i-data Lane_8/16s-rep-seqs60_99.qza  --o-merged-data 16s-merged-seqs60_99.qza
    qiime feature-table tabulate-seqs --i-data 16s-merged-seqs60_99.qza --o-visualization 16s-merged-seqs60_99.qzv

    qiime feature-table merge-seqs --i-data Lane_1/not16S-rep-seqs60_99.qza --i-data Lane_2/not16S-rep-seqs60_99.qza --i-data Lane_3/not16S-rep-seqs60_99.qza --i-data Lane_4/not16S-rep-seqs60_99.qza --i-data Lane_5/not16S-rep-seqs60_99.qza --i-data Lane_6/not16S-rep-seqs60_99.qza --i-data Lane_7/not16S-rep-seqs60_99.qza --i-data Lane_8/not16S-rep-seqs60_99.qza  --o-merged-data not16s-merged-seqs60_99.qza
    qiime feature-table tabulate-seqs --i-data not16s-merged-seqs60_99.qza --o-visualization not16s-merged-seqs60_99.qzv

This gave me these files
16s-merged-seqs60_99.qzv (1.2 MB)
not16s-merged-seqs60_99.qzv (1.1 MB)

Then I converted these files into tsv metadata files, and used these as the filter for my merged tables from dada2.

#16s
qiime tools export --input-path 16S-merged-seqs60_99.qza --output-path sequences_exported_99/
#this gives a fasta file that must be converted into a tsv with proper headings (#OTUID; Sequences)
qiime feature-table filter-features --i-table merged-table.qza --m-metadata-file 16s-merged-seqs60_99.tsv --o-filtered-table 16s-filtered-table60_99.qza
qiime feature-table summarize --i-table 16s-filtered-table60_99.qza --o-visualization 16s-filtered-table60_99.qzv


#not 16S
qiime tools export --input-path not16s-merged-seqs60_99.qza --output-path sequences_exported_ITS_99/
#this gives a fasta file that must be converted into a tsv with proper headings (#OTUID; Sequences)
qiime feature-table filter-features --i-table merged.qza --m-metadata-file not16s-merged-seqs60_99.tsv --o-filtered-table not16s-filtered-table60_99.qza
qiime feature-table summarize --i-table not16s-filtered-table60_99.qza --o-visualization not16s-filtered-table60_99.qzv

This is where something went wrong - I have no reads left!
not16s-filtered-table60_99.qzv (724.9 KB)
16s-filtered-table60_99.qzv (758.1 KB)

This is the exact same code that I ran the first time, and the only difference is that I used the 99% reference database from SILVA. Am I just missing something?

Thanks in advance!
Laura

Mehrbod_Estaki · May 30, 2019, 11:54pm

Hi @LauraMason,
I'm not too sure what is going on here yet. Can you try running filter-features without the step of converting to a .tsv, changing headings, etc.? As I mentioned before, you don't actually need to do this step, as the --m-metadata-file parameter in filter-features will accept your hits.qza and misses.qza sequence files as is. I'm wondering if something is going wrong in that process. If that doesn't work, would you be willing to share your merged table and those misses/hits artifacts? You can DM those to me if you'd rather not post them publicly.

LauraMason · June 3, 2019, 1:37pm

Hi
I think that worked - thanks for your help! The provenance file looks kind of wild, but the tables and sequence files have the same number of features.
Any ideas why using the .qza file worked for this and the .tsv worked for my original 97% pass?
As always,
I appreciate the help!

not16S-merged-seqs-2gt360_99_June3.qzv (1.5 MB)
not16s-merged-seqs-gt360_June3.qzv (1.8 MB)
16s-table-merged_gt3_June3.qzv (1.4 MB)
16s-merged-seqs-gt360_99b_June3.qzv (1.6 MB)

thermokarst · June 3, 2019, 1:46pm

@LauraMason, can you provide an example of your converted TSV files?