Problem while trying to filter out unassigned taxa as well as bacteria

Since I had unassigned taxa and Bacteria at the domain level (D_0__), I tried filtering these out using the command:
qiime taxa filter-table \
  --i-table table-demux.qza \
  --i-taxonomy taxonomy.qza \
  --p-exclude "D_0__Bacteria;;;;" \
  --o-filtered-table table-filt.qza

However, the taxa barplot still shows Unassigned and D_0__Bacteria: taxa-bar-plots1.qzv (382.6 KB)

When I used "Unassigned;;;;" I got an error message saying: "All features were filtered, resulting in an empty table." Prior to this, I also had other contaminants such as D_0__Bacteria;D_1__Cyanobacteria;D_2__Oxyphotobacteria;D_3__Chloroplast;;. But I removed such unwanted sequences using:
qiime taxa filter-table \
  --i-table table-demux.qza \
  --i-taxonomy taxonomy.qza \
  --p-mode exact \
  --p-exclude "Unassigned;;;;,D_0__Bacteria;;;;,D_0__Bacteria;D_1__Cyanobacteria;D_2__Oxyphotobacteria;D_3__Chloroplast;__,D_0__Bacteria;D_1__Cyanobacteria;D_2__Oxyphotobacteria;D_3__Chloroplast;D_4__Vicia faba (fava bean)" \
  --o-filtered-table table-filtered.qza

It seems that trying to remove

  1. Unassigned gives me an error saying the feature table is empty
  2. D_0__Bacteria does not actually remove it, as seen from the taxa barplot
    (however, the taxa barplot suggests that removing Unassigned should not leave an empty feature table).

What else can I do, could you please let me know?

I have already trained the classifier on the V4 region using the SILVA 132 database and my forward primers (single-end reads).

Also please note: I have looked through the forum for similar problems, especially this, and my commands above were based on that thread, yet the problem remains.
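To make the two filter-table attempts above easier to reason about, here is a minimal Python sketch of how a search term might be compared with a taxonomy string in "contains" mode (substring, case-insensitive) versus "exact" mode (whole string must match). This is a toy stand-in, not the q2-taxa implementation: the `matches` and `exclude` helpers and the three annotations are hypothetical, and it assumes that domain-only annotations are stored without trailing semicolons, which should be checked against the actual taxonomy.qza.

```python
def matches(taxon: str, term: str, mode: str = "contains") -> bool:
    """Toy comparison of one search term with one taxonomy string."""
    if mode == "exact":
        # The whole annotation must equal the term.
        return taxon == term
    # "contains": case-insensitive substring test.
    return term.lower() in taxon.lower()


def exclude(taxonomy: dict, terms: str, mode: str = "contains") -> list:
    """Return the feature IDs that survive after excluding comma-separated terms."""
    term_list = terms.split(",")
    return [
        fid
        for fid, taxon in taxonomy.items()
        if not any(matches(taxon, term, mode) for term in term_list)
    ]


# Hypothetical annotations (assumption: no trailing semicolons are stored).
toy_taxonomy = {
    "f1": "Unassigned",
    "f2": "D_0__Bacteria",
    "f3": "D_0__Bacteria;D_1__Proteobacteria",
}

# A term padded with semicolons is not a substring of any annotation here,
# so nothing is removed.
print(exclude(toy_taxonomy, "D_0__Bacteria;;;;"))

# Exact mode with unpadded terms removes only the domain-level features.
print(exclude(toy_taxonomy, "Unassigned,D_0__Bacteria", mode="exact"))

# Contains mode with the unpadded term removes every bacterial feature.
print(exclude(toy_taxonomy, "D_0__Bacteria"))
```

Under these assumptions, the padded term removes nothing, while the unpadded term in contains mode would also remove well-classified bacteria, which is why exact mode is the safer choice for dropping only domain-level annotations.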

I’m curious, what kind of samples are these? Ben

These are low biomass samples from insects.

Did you BLAST any of these sequences?

I primarily evaluate lung/bronchoalveolar lavage samples. They appear very similar to your samples. The unclassified reads ended up being eukaryotic DNA.

What I did was a strict filter with a qiime quality control step with 99% homology to Greengenes, and this filtered out all unclassified DNA and bacterial DNA classified only at the kingdom level.

Please see what this post did:

Yes, I checked a few of my sequences using BLAST, and there was plenty of host DNA. I had assumed that I would be able to remove the host DNA using qiime taxa filter-table, but it seems this didn’t work. And I used SILVA at 97% homology.

“What I did was a strict filter with a qiime quality control step with 99% homology to Greengenes”
I was not using Greengenes because it is not updated very regularly. But I think I might give this a try, now that I don’t know how to proceed.

Sorry, see the previous post: they did a quality control step with 99% homology, and I’m sure SILVA will work as well. You can likely filter to get hits.qza and misses.qza, then filter your table/rep-seqs with that pass. Ben

Just as a suggestion: in the post linked I used the 99% database because I was much more naive back then. As was mentioned in that link, you can achieve basically the same quality in much less time by using 88_gg and reducing your identity thresholds to something like 65-85. The idea behind the positive filter is just to discard really foreign-looking sequences. I doubt the choice between GG and SILVA in this step would really matter. I could be naive still though!


Yeah, I would say that this quality control step probably takes the longest. My pipeline without the step would take 3-4 hours from import to taxonomy bar plot creation on an HPC, but the quality control step with 99% homology adds 4-5 hours to that run. Ben

This is because BLAST is not particularly fast, and that is all this command is doing: using BLAST to align your query sequences against the reference sequences.

Indeed, this is the reason to use a small database, e.g., the 88% OTUs as @Mehrbod_Estaki proposes. The goal of this step is just to do a rough filter if you are trying to separate host from bacterial DNA.
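The rough positive filter discussed above can be sketched in a few lines of Python. The idea is that each query sequence is aligned to the reference set, and it counts as a "hit" only if its best alignment meets both an identity threshold and a query-coverage threshold; everything else is a "miss". The `Alignment` type, the `split_hits` helper, the feature names, and all the numbers below are hypothetical and only illustrate the thresholding, not the actual alignment performed by the plugin.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Alignment(NamedTuple):
    identity: float       # fraction of identical positions in the best alignment
    query_aligned: float  # fraction of the query covered by that alignment


def split_hits(
    best: Dict[str, Optional[Alignment]],
    perc_identity: float,
    perc_query_aligned: float,
) -> Tuple[List[str], List[str]]:
    """Split feature IDs into hits and misses given both thresholds."""
    hits, misses = [], []
    for fid, aln in best.items():
        ok = (
            aln is not None
            and aln.identity >= perc_identity
            and aln.query_aligned >= perc_query_aligned
        )
        (hits if ok else misses).append(fid)
    return hits, misses


# Hypothetical best alignments against a small 16S reference set.
best_alignments = {
    "bacterial": Alignment(0.97, 1.00),
    "divergent": Alignment(0.72, 0.95),
    "host_like": Alignment(0.55, 0.60),
    "no_alignment": None,
}

# A loose threshold keeps anything that looks roughly like 16S.
print(split_hits(best_alignments, 0.65, 0.65))

# A stricter threshold also discards the divergent sequence.
print(split_hits(best_alignments, 0.85, 0.85))
```

With these made-up values, the looser setting keeps the divergent sequence and the stricter one drops it, so the threshold is a trade-off between removing host reads and losing unusual bacteria.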


This time I used just a subset of samples instead of all my samples to try to get rid of the features labelled Unassigned and D_0__Bacteria.
My initial plot was testtaxa-bar-plots.qzv (331.5 KB)

After following the steps with hits.qza and misses.qza as described in "Too many unassigned or only at kingdom level features", I see from the barplot that many reads assigned to Unassigned and Bacteria were removed; however, some still remain. Any ideas on where I can go next with this?

Commands used were:
qiime quality-control exclude-seqs \
  --i-query-sequences /data/p281301/test/repseqssingle.qza \
  --i-reference-sequences /data/p281301/test/SILVA_132_97_16S-v4-v6-ref-seqs.qza \
  --p-method vsearch \
  --p-perc-identity 0.50 \
  --p-perc-query-aligned 0.50 \
  --p-threads 6 \
  --o-sequence-hits /data/p281301/test/hits1.qza \
  --o-sequence-misses /data/p281301/test/misses1.qza

qiime feature-table filter-features \
  --i-table /data/p281301/test/tabledada2single.qza \
  --m-metadata-file /data/p281301/test/misses1.qza \
  --o-filtered-table /data/p281301/test/no-miss-table-dada2.qza

qiime taxa barplot \
  --i-table /data/p281301/test/no-miss-table-dada2.qza \
  --i-taxonomy /data/p281301/test/taxonomyhits1.qza \
  --m-metadata-file /data/p281301/test/metadatatest.tsv \
  --o-visualization /data/p281301/test/taxa-barplot1-nomiss.qzv

Also, I lowered the perc-identity to 0.5 this time.
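A side note on the filter-features call above, offered as something to check, not as a diagnosis: as far as I understand `qiime feature-table filter-features`, the IDs passed via `--m-metadata-file` are retained by default and are only dropped when `--p-exclude-ids` is set. The toy Python sketch below illustrates that difference. The `filter_features` helper, the table, and the IDs are all hypothetical; it only mimics the retain-versus-exclude semantics.

```python
def filter_features(table: dict, ids, exclude_ids: bool = False) -> dict:
    """Keep the listed feature IDs, or drop them when exclude_ids is True."""
    id_set = set(ids)
    return {
        fid: counts
        for fid, counts in table.items()
        # (fid in id_set) != exclude_ids is True for listed IDs when retaining,
        # and True for unlisted IDs when excluding.
        if (fid in id_set) != exclude_ids
    }


# Hypothetical feature table: feature ID -> per-sample counts.
toy_table = {
    "f1": [10, 0],
    "f2": [3, 7],
    "f3": [0, 5],
}
toy_misses = ["f3"]  # hypothetical IDs found in a misses file

# Default behaviour: only the listed IDs (the misses) are kept.
print(filter_features(toy_table, toy_misses))

# With exclusion switched on: the misses are removed and the rest is kept.
print(filter_features(toy_table, toy_misses, exclude_ids=True))
```

If the intent is a table without the misses, the second call is the behaviour to aim for; filtering on the hits file with the default retain behaviour would be an equivalent route.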


Do 0.80 for both and see if you get better filtering. Ben

