Extracting sequences identified as a certain taxa and classifying with another database

Nicholas_Bokulich · November 14, 2019, 6:18pm

Thanks for sharing your files, @c.older! I was able to replicate this issue and I think I found the problem.

So your taxonomy file consists of only one level, the species designation. There is not basal taxonomy or semicolon-delimited lineage. So the vsearch classifier is reporting "unassigned" because it cannot find any consensus among the multiple hits. So let me break down what is happening:

Your query sequence 49792e... is being compared against the reference database and grabs up to N hits, where N = maxaccepts (default 10). These hits must exceed the percent-id and other matching parameters, but you are using quite permissive settings so you will just grab the 10 closest hits (and before you say that you tried higher percent-id settings let me just say: many of your query seqs probably match perfectly to > 1 species in your reference... even some of the reference sequences appear to be identical and have different annotations!).

After grabbing those N hits, q2-feature-classifier attempts to find the consensus taxonomy. So if you have 3 hits with taxonomy labels like this:
Bacteria; blah; blah; blah; Staph; Staph aureus
Bacteria; blah; blah; blah; Staph; Staph condimenti
Bacteria; blah; blah; blah; Staph; Staph aureus

The consensus taxonomy reported would be this:
Bacteria; blah; blah; blah; Staph

However, your taxonomy annotations are not semicolon-delimited and their is no basal taxonomy. Instead they look like this:
Staphylococcus aureus
Staphylococcus condimenti
Staphylococcus aureus

So q2-feature-classifier can find no consensus, since the consensus is based on the entire taxonomy lineage at each level. So the consensus reported is "Unassigned" simply because there is no consensus.

So this leaves you with a few options. In general, I would recommend using semicolon-delimited taxonomies with at least some basal taxonomy to circumvent this problem. So for example you could format your taxonomy lineages like this:
Staphylococcus; Staphylococcus aureus
Staphylococcus; Staphylococcus condimenti
Staphylococcus; Staphylococcus aureus

Though in your specific case since you are taking query seqs that were already classified at genus level as Staph and re-classifying them against an all-Staph reference database, you could just set the --p-unassignable-label option to "Unknown Staphylococcus" or something along those lines.

But I think you can do better here. I recommend reformatting the taxonomy labels as above, and then adjust your classifier options a bit:

Use the --p-top-hits-only option, so that you grab only the best matches instead of N top hits.
Set --p-perc-identity to a reasonably high level, e.g., 0.95

Give that a try and let me know what you find! I think with the top hits option alone you should get more classifications (even without reformatting the database), since you will be finding consensus among near-perfect matches as opposed to 10 top hits, some of which are closer than others. Some may still not classify to species level, since the query sequence may match equally well to more than one species!

Phew you had me worried for a moment there, but seeing your taxonomy annotations made this a lot clearer to me