Problem while trying to filter out unassigned taxa as well as bacteria

microme · September 30, 2019, 12:52am

Since I had unassigned taxa and taxa classified only as Bacteria at the domain level (D_0__), I tried filtering these out using the command:
qiime taxa filter-table \
  --i-table table-demux.qza \
  --i-taxonomy taxonomy.qza \
  --p-exclude "D_0__Bacteria;;;;" \
  --o-filtered-table table-filt.qza

However, the taxa barplot still shows Unassigned and D_0__Bacteria. taxa-bar-plots1.qzv (382.6 KB)

When I used "Unassigned;;;;" I got an error message saying: "All features were filtered, resulting in an empty table." Prior to this, I also had other contaminants such as D_0__Bacteria;D_1__Cyanobacteria;D_2__Oxyphotobacteria;D_3__Chloroplast;;. I removed those unwanted sequences using:
qiime taxa filter-table \
  --i-table table-demux.qza \
  --i-taxonomy taxonomy.qza \
  --p-mode exact \
  --p-exclude "Unassigned;;;;,D_0__Bacteria;;;;,D_0__Bacteria;D_1__Cyanobacteria;D_2__Oxyphotobacteria;D_3__Chloroplast;__,D_0__Bacteria;D_1__Cyanobacteria;D_2__Oxyphotobacteria;D_3__Chloroplast;D_4__Vicia faba (fava bean)" \
  --o-filtered-table table-filtered.qza

It seems that trying to remove:
- Unassigned gives me an error about an empty feature table (although, judging from the taxa barplot, removing the unassigned features should not leave an empty table);
- D_0__Bacteria does not actually remove it, as seen in the taxa barplot.

What else can I do, could you please let me know?

I have already trained the classifier on the V4 region of the SILVA 132 database using my forward primers (single-end reads).

Also, please note: I have searched the forum for similar problems, especially this, and my commands above were based on this forum, yet the problem remains.
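
One possible reason the first command leaves D_0__Bacteria in place is that the search term has to match the taxonomy string as it is stored in taxonomy.qza, and that string may be stored without the trailing semicolons typed in the search term. The Python sketch below is a simplified model of the `contains` and `exact` settings of `--p-mode`, not the q2-taxa code. The feature IDs and stored labels are hypothetical, and the case-insensitive substring match is an assumption.

```python
# Simplified model of the two matching modes; NOT the q2-taxa implementation.
def matches(taxon, term, mode="contains"):
    if mode == "exact":
        # exact: the whole taxonomy string must equal the search term
        return taxon == term
    # contains: substring match (assumed here to be case-insensitive)
    return term.lower() in taxon.lower()


def features_kept(taxonomy, exclude, mode="contains"):
    """Return the feature IDs that survive a comma-separated exclude list."""
    terms = exclude.split(",")
    return [
        fid
        for fid, taxon in taxonomy.items()
        if not any(matches(taxon, term, mode) for term in terms)
    ]


# Hypothetical features; labels are assumed to be stored without padding.
taxonomy = {
    "f1": "Unassigned",
    "f2": "D_0__Bacteria",
    "f3": "D_0__Bacteria;D_1__Proteobacteria",
}

padded = features_kept(taxonomy, "D_0__Bacteria;;;;")  # nothing matches
exact = features_kept(taxonomy, "Unassigned,D_0__Bacteria", mode="exact")
loose = features_kept(taxonomy, "D_0__Bacteria")  # removes every bacterium
print(padded, exact, loose)
```

In this model the padded term removes nothing, the bare terms in exact mode remove only the two unwanted features, and the bare term in contains mode removes all bacteria. Inspecting the Taxon column of taxonomy.qza (for example with qiime metadata tabulate) would show which strings are actually stored.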

ben · September 30, 2019, 1:05am

I'm curious, what kind of samples are these? Ben

microme · September 30, 2019, 1:19am

These are low biomass samples from insects.

ben · September 30, 2019, 1:21am

Did you BLAST any of these sequences?

I primarily evaluate lung/bronchoalveolar lavage samples. They appear very similar to your samples. The unclassified sequences ended up being eukaryotic DNA.

What I did was a strict filter with a qiime quality control step with 99% homology to Greengenes, and this filtered out all unclassified DNA as well as bacterial DNA classified only at the kingdom level.

Please see what this post did:

microme · September 30, 2019, 1:28am

Yes, I checked a few of my sequences using BLAST, and there was plenty of host DNA. I had assumed that I would be able to remove the host DNA using qiime taxa filter-table, but it seems this didn't work. I used SILVA at 97% homology.

"What I did was a strict filter with a qiime quality control step with 99% homology to Greengenes"
I was not using Greengenes because it is not updated regularly. But I think I might give this a try, now that I don't know how else to proceed.

ben · September 30, 2019, 1:29am

Sorry, see the previous post: they did a quality control step with 99% homology, and I'm sure SILVA will work as well. You can likely do a filter to get hits.qza and misses.qza, then filter your table/rep-seqs with that pass. Ben

Mehrbod_Estaki · September 30, 2019, 2:04am

Just as a suggestion: in the linked post I used the 99% database because I was much more naive back then. As mentioned in that link, you can achieve basically the same quality in much less time by using 88_gg and reducing your identity thresholds to something like 65-85%. The idea behind the positive filter is just to discard really foreign-looking sequences. I doubt the choice between GG and SILVA would really matter in this step. I could still be naive though!

ben · September 30, 2019, 2:08am

Yeah, I would say that this quality control step probably takes the longest. My pipeline without the step would take 3-4 hours from import to taxonomy barplot creation on an HPC, but the quality control step with 99% homology adds 4-5 hours to that run. Ben

Nicholas_Bokulich · September 30, 2019, 2:41am

This is because BLAST is not particularly fast, and that is all this command is doing: using BLAST to align your query sequences against the reference sequences.

Indeed, this is the reason to use a small database, e.g., the 88% OTUs, as @Mehrbod_Estaki proposes. The goal of this step is just to do a rough filter if you are trying to separate host from bacterial DNA.
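
The rough positive filter described above can be pictured with the toy Python sketch below. It only illustrates the thresholding idea and is not the exclude-seqs implementation: the Alignment record, the feature IDs, and the scores are all hypothetical, and each query is assumed to have at most one best alignment against the reference.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical best alignment of one query against the reference."""
    feature_id: str
    perc_identity: float       # fraction of identical aligned positions
    perc_query_aligned: float  # fraction of the query covered


def split_hits_misses(all_ids, best_alignments, min_identity, min_aligned):
    """Queries meeting both thresholds are hits; everything else is a miss."""
    hits = {
        a.feature_id
        for a in best_alignments
        if a.perc_identity >= min_identity
        and a.perc_query_aligned >= min_aligned
    }
    return hits, set(all_ids) - hits


# Made-up data: a 16S-like read, a host-like read, and a read with no alignment.
all_ids = ["asv_16s", "asv_host", "asv_noalign"]
best = [
    Alignment("asv_16s", 0.93, 1.00),
    Alignment("asv_host", 0.55, 0.60),
]

strict_hits, strict_misses = split_hits_misses(all_ids, best, 0.80, 0.80)
loose_hits, loose_misses = split_hits_misses(all_ids, best, 0.50, 0.50)
print(sorted(strict_misses), sorted(loose_misses))
```

With these made-up numbers the stricter thresholds send the host-like read to the misses, while the looser ones let it through as a hit. That is the trade-off to watch when lowering the thresholds.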

ben · October 7, 2019, 9:00pm

Do 0.80 for both and see if you get better filtering. Ben

microme · October 21, 2019, 7:00pm

Update and edit: [For the moderators: please don't queue this. I was trying to edit this text but found I had to either reply or delete the entire message. Of the two, I thought it was better to delete the previous post and reply again. I did this because I recently realized that I had included a few of my user ID details in the deleted post, and I have now removed that information. The commands are the same as in the deleted post, with only the personal IDs removed. Apologies for the queueing notification.]

This time I used just a subset of samples instead of all my samples to try to get rid of the features labelled Unassigned and D_0__Bacteria.
My initial plot was testtaxa-bar-plots.qzv (331.5 KB)

After following the steps with hits.qza and misses.qza, as mentioned in "Too many unassigned or only at kingdom level features", I see from the barplot that many reads assigned to Unassigned and D_0__Bacteria were removed; however, some still remain. Any ideas on where I can go next with this?

The commands used were:
qiime quality-control exclude-seqs \
  --i-query-sequences /test/repseqssingle.qza \
  --i-reference-sequences /test/SILVA_132_97_16S-v4-v6-ref-seqs.qza \
  --p-method vsearch \
  --p-perc-identity 0.50 \
  --p-perc-query-aligned 0.50 \
  --p-threads 6 \
  --o-sequence-hits /test/hits1.qza \
  --o-sequence-misses /test/misses1.qza \
  --verbose

qiime feature-table filter-features \
  --i-table /test/tabledada2single.qza \
  --m-metadata-file /test/misses1.qza \
  --o-filtered-table /test/no-miss-table-dada2.qza \
  --p-exclude-ids

qiime taxa barplot \
  --i-table /test/no-miss-table-dada2.qza \
  --i-taxonomy /test/taxonomyhits1.qza \
  --m-metadata-file /test/metadatatest.tsv \
  --o-visualization /test/taxa-barplot1-nomiss.qzv

Also, I lowered --p-perc-identity to 0.5 this time.
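
The filter-features step above amounts to dropping every feature ID listed in misses1.qza from the table. A toy Python sketch of that logic follows, with hypothetical feature IDs, counts, and taxonomy labels. It also shows one way Unassigned features could survive: a feature that passes permissive thresholds counts as a hit, so it stays in the table whatever its taxonomy label is.

```python
def exclude_ids(table, ids_to_drop):
    """Keep only the features whose ID is not in ids_to_drop."""
    return {
        fid: counts
        for fid, counts in table.items()
        if fid not in ids_to_drop
    }


# Hypothetical feature table: feature ID -> counts per sample.
table = {"asv_a": [120, 98], "asv_b": [15, 0], "asv_c": [7, 3]}

# Hypothetical taxonomy labels for the same features.
taxonomy = {
    "asv_a": "D_0__Bacteria;D_1__Firmicutes",
    "asv_b": "Unassigned",
    "asv_c": "Unassigned",
}

# Assume only asv_c failed the alignment thresholds.
misses = {"asv_c"}

filtered = exclude_ids(table, misses)
still_unassigned = sorted(f for f in filtered if taxonomy[f] == "Unassigned")
print(sorted(filtered), still_unassigned)
```

Here asv_b remains in the filtered table even though it is Unassigned, because it was never a miss. Removing it would take either stricter alignment thresholds or a separate taxonomy-based filter.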

Nicholas_Bokulich · October 21, 2019, 7:06pm

Just for clarification, were you able to resolve the issue using @ben's recommendation?

microme · October 21, 2019, 7:13pm

Nope, not with SILVA. Following a discussion with my colleague, I have been using RDP with an identity of 0.50, and this problem seems to go away. I should mention that I now get Archaea, but there is no longer any such issue with D_0__ or unclassified features; I then removed them using qiime taxa filter-table.

I am still trying to figure out the rest. In case I discover more, I will post here.

Nicholas_Bokulich · October 21, 2019, 7:15pm

Oh I see, so perhaps the RDP database you had been using was bacteria only, or at least missing the Archaea that you have in your samples. Please let us know when you have a final resolution, or if you run into any more issues.
