filter-features not filtering features based on a metadata file

Mehrbod_Estaki · March 8, 2019, 7:18am

Hi again,
Picking up from my last issue: I'm trying to run filter-features with a metadata file (the output of a vsearch-based exclude-seqs run), and nothing is being filtered.

Right from the beginning:
I export a text feature table and a FASTA rep-seqs file from R, then convert the line endings of both with dos2unix so I can run them in my QIIME 2 VM.

I successfully import them into qiime:

echo -n "#OTU Table" | cat - seqtab-nochim.txt > unfiltered-table.txt

biom convert -i unfiltered-table.txt -o unfiltered-table.biom --table-type='OTU table' --to-hdf5

qiime tools import \
--input-path unfiltered-table.biom \
--type 'FeatureTable[Frequency]' \
--input-format BIOMV210Format \
--output-path unfiltered-table.qza

qiime tools import \
  --input-path rep-seqs.fna \
  --output-path rep-seq.qza \
  --type 'FeatureData[Sequence]'

unfiltered-table.qza (341.2 KB) | unfiltered-table.qzv
rep-seqs.qza (153.4 KB) | rep-seqs.qzv

The aim is to get rid of a bunch of host-contamination sequences so:

qiime quality-control exclude-seqs \
      --i-query-sequences rep-seq.qza \
      --i-reference-sequences ../88_otus.qza \
      --p-method vsearch \
      --p-perc-identity 0.65 \
      --p-perc-query-aligned 0.60 \
      --p-threads 6 \
      --o-sequence-hits hits.qza \
      --o-sequence-misses misses.qza \
      --verbose

Everything works as expected, no problem so far. The feature counts in the hits and misses files add up to the total number of features I started with, great.
hits.qza (88.1 KB) | hits.qzv
misses.qza (70.6 KB) | misses.qzv

So then I try to filter the sequences in misses.qza out of my feature table:

qiime feature-table filter-features \
  --i-table unfiltered-table.qza \
  --m-metadata-file misses.qza \
  --p-exclude-ids \
  --o-filtered-table exclude-misses-table.qza \
  --verbose

exclude-misses-table.qza (416.3 KB) | exclude-misses-table.qzv
The command runs successfully without an error but the new filtered table has not been filtered. It is the same as its unfiltered version.

Even weirder, when I do the reverse and filter my table to retain only the features in hits.qza, everything is filtered out!

qiime feature-table filter-features \
  --i-table unfiltered-table.qza \
  --m-metadata-file hits.qza \
  --o-filtered-table include-hits-table.qza \
  --verbose

include-hits-table.qza (98.9 KB) | include.hits.qzv

I used these same scripts/methods in 2018.11 with no problem when I was doing everything from within QIIME 2. I initially thought there was still something wrong with my import from R, but everything else looks good up until this last step, so I'm not sure what I'm missing. Hope it's another simple fix I'm foolishly overlooking.

timanix · March 9, 2019, 9:32am

Hi! Thank you for updating.
Actually, I did the same thing and the sequences I wanted to filter out were still in the table.
I had a lot of features that were unassigned or identified only to the Bacteria level, so I wanted to filter them out. Before that, I manually checked some of them with online BLAST at NCBI and found that the majority are mitochondrial DNA (I had already filtered out mitochondrial and chloroplast DNA before this), and only a few belong to new bacteria (not yet included in the SILVA database) or are completely new.
So I tried the same steps you described and still couldn't filter out these features.
In the end I decided to download a CSV file, removed all the assigned features from it, keeping only those that were unassigned or assigned only to the Bacteria level, and then used it to exclude these features.

thermokarst · March 11, 2019, 4:28pm

The imported files have mismatched feature IDs. The feature IDs in the table are nucleotide strings, while the feature IDs in the rep seqs are not.

What do the source IDs look like? I am wondering if they are being mangled on import.
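
One rough way to check for this kind of mismatch before importing is to compare the IDs in the two exported files directly. The sketch below is only an illustration with standard Unix tools: the tiny demo files, the temp directory, and the assumption that feature IDs sit in the first column of a tab-separated table are all hypothetical stand-ins, not the actual files from this thread. It also mimics the set logic behind both symptoms: when no IDs are shared, retaining the listed IDs keeps nothing, and excluding them removes nothing.

```shell
export LC_ALL=C  # keep sort and comm on the same collation order

# Hypothetical demo files standing in for the exported table and rep-seqs.
workdir=$(mktemp -d)
printf '#OTU Table\tS1\tS2\nACGT\t5\t0\nGGCC\t2\t7\n' > "$workdir/table.txt"
printf '>sq1;size=5\nACGT\n>sq2;size=9\nGGCC\n' > "$workdir/rep-seqs.fna"

# Feature IDs from the table: first column, skipping '#' header lines
# (assumes features are rows; adjust if your export is transposed).
grep -v '^#' "$workdir/table.txt" | cut -f1 | sort -u > "$workdir/table-ids.txt"

# Feature IDs from the FASTA: header lines without the leading '>'.
grep '^>' "$workdir/rep-seqs.fna" | sed 's/^>//' | sort -u > "$workdir/seq-ids.txt"

total=$(wc -l < "$workdir/table-ids.txt" | tr -d ' ')

# Table IDs that also appear in the FASTA (what an "include" filter would keep).
kept_include=$(comm -12 "$workdir/table-ids.txt" "$workdir/seq-ids.txt" | wc -l | tr -d ' ')

# Table IDs that do not appear in the FASTA (what an "exclude" filter would keep).
kept_exclude=$(comm -23 "$workdir/table-ids.txt" "$workdir/seq-ids.txt" | wc -l | tr -d ' ')

echo "features in table: $total"
echo "kept when retaining listed IDs: $kept_include"
echo "kept when excluding listed IDs: $kept_exclude"
```

If the first count equals the third and the second is zero, the table and the sequences share no feature IDs at all, which matches what was reported above.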

Mehrbod_Estaki · March 11, 2019, 9:54pm

That was totally it!
I went back and noticed that in R my rep-seqs export command was using the default settings, which assign generated ID names such as sq1;size=9172. I simply changed the ids setting so that the actual sequences are used as the IDs instead, and everything worked perfectly from there on.

uniquesToFasta(seqtab.nochim, fout='rep-seqs.fna', ids=colnames(seqtab.nochim))
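
For anyone wanting to double-check a re-exported file, a quick sanity check along these lines might help. It is a sketch only: the demo FASTA is a hypothetical stand-in, and it assumes each record is a header line followed by a single unwrapped sequence line.

```shell
# Hypothetical demo file standing in for the re-exported rep-seqs.fna.
workdir=$(mktemp -d)
printf '>ACGT\nACGT\n>GGCC\nGGCC\n' > "$workdir/rep-seqs.fna"

# Remember each header (minus '>'), then count sequence lines that differ from it.
# Assumes one sequence line per record (no line wrapping).
mismatches=$(awk '/^>/ {id = substr($0, 2); next} $0 != id {n++} END {print n + 0}' "$workdir/rep-seqs.fna")

echo "records whose ID differs from the sequence: $mismatches"
```

A count of zero suggests the FASTA IDs are the sequences themselves, so they should line up with a table whose feature IDs are nucleotide strings.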

Thanks @thermokarst!