Filtering unidentified taxa from feature table

wlandesman · October 30, 2018, 4:11pm

I recently made the switch from Qiime 1 to Qiime 2 and am enjoying the format and new ways of analyzing data. In one of my datasets I have taxa identified as "k__Fungi;;;;;;" and "k__Fungi;p__Ascomycota;;;;;__" that I would like to remove from the feature table. I tried the following with no luck:

qiime taxa filter-table --i-table feature-table.qza --i-taxonomy taxonomy.qza --p-exclude "k__Fungi;;;;;;","k__Fungi;p__Ascomycota;;;;;__" --o-filtered-table output-table

I also tried spaces between semicolons, no quotes, and one taxon at a time. Thanks for any help!

Bill

Nicholas_Bokulich · October 30, 2018, 5:04pm

Hi @wlandesman,
Drop the empty semicolons — these appear when you view a barplot at a specific taxonomic rank (e.g., level 7) but those are not in the underlying taxonomy file (export the data if you want to see for yourself). You probably want to do something like this instead:

qiime taxa filter-table \
    --i-table feature-table.qza \
    --i-taxonomy taxonomy.qza \
    --p-include "p__" \
    --p-exclude "p__Ascomycota" \
    --o-filtered-table output-table

That will keep only sequences that have a phylum-level classification, except those classified to Ascomycota. Build a barplot afterwards to make sure you filtered everything you wanted!
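
To preview which features a filter like this would keep, one option is to export the taxonomy and imitate the substring matching with standard shell tools. The sketch below is only an approximation under stated assumptions: it uses a small made-up table (`demo_taxonomy.tsv`) in place of a real export, assumes the exported file is tab-separated with `Feature ID`, `Taxon` and `Confidence` columns, and mimics `--p-include`/`--p-exclude` with plain substring tests in `awk` rather than calling QIIME 2.

```shell
# Made-up stand-in for an exported taxonomy.tsv; IDs and taxa are hypothetical.
{
  printf 'Feature ID\tTaxon\tConfidence\n'
  printf 'f1\tk__Fungi\t0.99\n'
  printf 'f2\tk__Fungi;p__Ascomycota\t0.95\n'
  printf 'f3\tk__Fungi;p__Basidiomycota;c__Agaricomycetes\t0.91\n'
  printf 'f4\tk__Fungi;p__Ascomycota;c__Sordariomycetes\t0.88\n'
} > demo_taxonomy.tsv

# List the taxonomy strings as stored (no trailing empty ranks), with counts.
cut -f2 demo_taxonomy.tsv | tail -n +2 | LC_ALL=C sort | uniq -c

# Approximate --p-include "p__" --p-exclude "p__Ascomycota":
# keep rows whose Taxon contains "p__" but does not contain "p__Ascomycota".
awk -F'\t' 'NR > 1 && index($2, "p__") && !index($2, "p__Ascomycota") { print $1 }' \
    demo_taxonomy.tsv > demo_kept_ids.txt
cat demo_kept_ids.txt
```

In this imitation the hypothetical feature `f4` is dropped as well, because "p__Ascomycota" is a substring of its annotation even though it is classified to class level. That is worth checking against the barplot if the goal is to remove only the features with no resolution below phylum.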

Good luck!

wlandesman · October 31, 2018, 1:03pm

That did the trick - thanks! As a follow-up, is there a way to get a list of the rep seq IDs for the sequence(s) placed in one of the unassigned taxa (i.e., the fungi;;;;;;)? I am curious to see whether this unassigned taxon is a single "OTU" or many different ones. I suspect the latter but would like to confirm. I see that I can manually BLAST each rep seq ID, but this would be a bit time-consuming. Thanks again.

Bill

Nicholas_Bokulich · October 31, 2018, 1:11pm

See this tutorial. That will merge your sequences and their taxonomy classifications in a searchable visualization.
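
If a plain list of IDs is all that is needed, another possible route is to filter an exported taxonomy file directly. This is a minimal sketch, not the method from the tutorial: it assumes the export is a tab-separated file with the feature ID in column 1 and the taxonomy string in column 2, and the file names, IDs and taxa below are made up for illustration.

```shell
# Made-up stand-in for an exported taxonomy.tsv; IDs and taxa are hypothetical.
{
  printf 'Feature ID\tTaxon\tConfidence\n'
  printf 'f1\tk__Fungi\t0.99\n'
  printf 'f2\tk__Fungi;p__Ascomycota\t0.95\n'
  printf 'f3\tk__Fungi;p__Basidiomycota;c__Agaricomycetes\t0.91\n'
  printf 'f4\tk__Fungi;p__Ascomycota;c__Sordariomycetes\t0.88\n'
  printf 'f5\tk__Fungi\t0.97\n'
} > demo_taxonomy.tsv

# Taxonomy string to look up, written as it appears in the file
# (without the empty ranks shown in the barplot).
target='k__Fungi'

# Print the feature ID of every row whose Taxon equals the target exactly.
awk -F'\t' -v t="$target" 'NR > 1 && $2 == t { print $1 }' \
    demo_taxonomy.tsv > demo_unassigned_ids.txt

# Number of distinct features behind that one label, then the IDs themselves.
wc -l < demo_unassigned_ids.txt
cat demo_unassigned_ids.txt
```

More than one ID in the output means the single label in the barplot covers several distinct features.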

Good luck!

wlandesman · November 3, 2018, 8:58pm

Thanks for the suggestion. This leads to more questions, which I think are still relevant to this topic. Within the rep-seqs that are classified to "k__Fungi;p__Ascomycota__", but with no further taxonomic resolution, there are 185 rep seqs. Visually inspecting the entire phylogenetic tree of all data, I see that these taxa are found in different parts of the tree. Looking at what other rep-seqs are in close proximity, I see that there are rep-seqs nearby for which there is more detailed resolution, including to genus level. Thus it seems that there should be greater taxonomic resolution for these unknown p_ascomycota. Does this have something to do with how the classifier works? Any thoughts on how to handle this data?

I am reluctant to discard the sequences because a) they are abundant and b) they are driving a highly significant treatment effect. Therefore, I would like to know if these "k__;p_ascomycota__" are "real" genera/species or just some artifact of the bioinformatics that should be ignored (and then justifiably deleted). Knowing that they are true genera/species/etc would be helpful, even without knowing the detailed taxonomy.

Thank you again for your help!

Nicholas_Bokulich · November 3, 2018, 9:13pm

Are these ITS sequences? Just want to make sure because ITS is a non-coding, non-phylogenetically-informative domain, and hence the tree will not reliably indicate phylogenetic relationships. If you are using 18S, forget what I said.

Greater resolution does not necessarily follow, since ITS is not a phylogenetic marker. Yes, the classifier could be a factor, but the cause is more likely the lack of a close match in the reference database, or an issue with the query sequences themselves (e.g., too short). A sequence will not classify as one of these "near" neighbors because (a) ITS is highly divergent, so genus-level may not be close enough, or (b) other aspects of the sequence (e.g., conserved flanking rRNA gene sequences) make it look similar to taxa in other parts of the tree.

Do not discard! The reference databases are far from complete, so this really could just be an unknown, or a real sequence that is difficult to classify.

I would recommend two other approaches:

1. Try a different classifier, e.g., an alignment-based classifier like classify-consensus-vsearch, to get a "second opinion".
2. Since I assume you have a short list of unknowns that are associated with the treatment effect, just use NCBI BLAST to get a third opinion. This is useful for confirming, e.g., that these are not non-target DNA (like host DNA) that is absent from your reference database. I recently ran into this same issue with ITS sequences that would only classify as k__Fungi;p__Ascomycota__. BLASTn did not have a better ID, but did confirm that these were fungal, i.e., they had some low-quality fungal hits and no non-fungal hits.
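
To avoid copying sequences one at a time for the BLAST check, the unknowns could be pulled out of an exported FASTA file in one pass. The sketch below is an illustration under assumptions: it supposes the representative sequences are available as a FASTA file whose header lines are `>` followed by the feature ID, that each record is a header followed by its sequence lines, and that the IDs of interest sit one per line in a text file. All file names and sequences here are made up.

```shell
# Made-up stand-ins for an exported FASTA file and a list of IDs of interest.
{
  printf '>f1\nACGTACGT\n'
  printf '>f2\nTTGACCAA\n'
  printf '>f3\nGGCATCGA\n'
} > demo_sequences.fasta
printf 'f1\nf3\n' > demo_ids.txt

# First file: remember the wanted IDs. Second file: at each header decide
# whether the record is wanted, then print every line of wanted records.
awk 'NR == FNR { want[$1]; next }
     /^>/      { keep = (substr($1, 2) in want) }
     keep' demo_ids.txt demo_sequences.fasta > demo_unknowns.fasta

cat demo_unknowns.fasta
```

The resulting file holds only the selected records and can be submitted to BLAST as a single batch.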

wlandesman · November 23, 2018, 4:34pm

Thanks for your response and sorry for the long delay in getting back to you. This is ITS data and so I probably should not be looking at the tree, as you mentioned. However, since the "p__ascomycota" are spread throughout the tree, I think this is a clue that they are different genera/species. Therefore I think I might be better off running my analysis off the feature table, as opposed to the taxonomy table.

But, as you indicated, there are better ways of approaching this, and I will try your suggestions for getting a "second opinion". Thanks again and Happy Thanksgiving!

Bill