Extracting sequences identified as a certain taxa and classifying with another database

I’m trying to further classify some sequences that match to a certain taxonomic level. More specifically, I have sequences that match to Staphylococcus and I’d like to see what species these would be classified to with a different custom database.
What would be the best way to do this? I was thinking I could just filter the sequences to have those feature ids matching to this bacteria, and then export those and do as I’ve done before, but I don’t think this would give me every single occurrence of each feature… instead I’d only have the representative ones.
Maybe I could just export those rep seqs, classify them, and then apply this classification to all of the sequences in my sample (basically assign this new taxonomy to the feature-table). But I think this would require me to develop/train a classifier for this and not sure if there’s some simpler way to do this I’m not thinking of.
Any thoughts on how to do this?

1 Like

Perhaps I misunderstand your question, but all you want is the representative sequences, so that each sequence is classified once.

So yes, use filter-seqs to grab the features of interest.

No need to export if you can use your custom reference database with one of QIIME 2’s classifiers.

Yep, you can do this! See qiime feature-table merge-taxa and read carefully how input order determines priority.

Well it all depends what you want to do. If you want to use a custom reference database, fine, but what classification method will you choose? If the classify-sklearn method in q2-feature-classifier, then yes you will need to train a new classifier. But you could also use the classify-consensus-vsearch classifier if you want to, e.g., search by alignment. I might actually recommend this — if you are searching against a custom reference database (which does not contain outgroups), you may want to set a high % identity threshold, maybe even use the exact match or top hit options, to make sure you are hitting the correct species with very high precision (classify-sklearn can also increase precision by increasing the confidence parameter).

1 Like

Thanks for responding so thoroughly! Sorry it took awhile to respond, took a bit of a hiatus from this data set :slight_smile:
I think I’ll go ahead and use the classify-consensus-vsearch as you suggested and just play around with the identity/match settings to make sure I’m getting the right classifications. I’ll be sure to post again if I run into issues with it.
I appreciate the help!


Hi Nicholas,

Had a chance to work on this and am having an issue I can’t seem to get around. I am trying to get this working on a dataset I have previously worked with before I apply it to my real data. I have processed this data in a similar manner years ago in mothur, so would expect to see some differences although not drastic ones. Fortunately, the data looks similar in terms of relative abundance of staphylococcus, so I know these same communities I previously analyzed do exist with the way I have processed that data - probably not surprising, but always reassuring :slight_smile:

However when I try and use my staphylococcus reference sequences and taxonomy to classify my sequences, I’m getting almost all of them as unassigned. I’ve tried messing with the settings (–p-perc-identity, --p-maxaccepts, --p-query-cov) and have minimal changes so I think I’m just not changing the right settings or something…

I extracted rep-seqs and pulled out one of the representative sequences that matches to a species of interest (the species that should account for ~90% of the sequences) and aligned it to the corresponding sequences in the reference alignment and it does match almost entirely, with no gaps just a few bases at the beginning that don’t match and a couple hundred at the end that are in my reference sequences but not my representative sequences:

So I don’t understand why it isn’t being classified when using classify-consensus-vsearch… I did also try blast, just in case this was some weird thing.:woman_shrugging:t2:

The command I’m using is:
qiime feature-classifier classify-consensus-vsearch --i-query FB_staph_only_seqs.qza --i-reference-reads …/Staph_ref_seqs_V13.qza --i-reference-taxonomy …/Staph_species_taxonomy.qza --o-classification FB_V13_test_staph_filtered_tax.qza --verbose --p-perc-identity 0.65

And like I said, I’ve messed with various settings, this is just one example.

I did try trimming this down, which I don’t think would be the issue since you can get good classifications for regions of 16s with the full classifiers and I am getting classifications for some taxa… When I trimmed the reference alignment I still have this issue. Here’s what that taxa barplot look like, which is similar to the all others from various troubleshooting attempts:

The classifications I do have agree with my previous analysis, but obviously I’m missing a lot. Any idea where things might be going wrong?

1 Like

Hey @c.older - I have assigned this to @Nicholas_Bokulich, since he was working with you on this previously. Just a head’s up though, he is out of office for the next week or so, so it might be a while before you hear back. Thanks!

1 Like

No worries. Thanks for the update!

1 Like

This is really strange. So just to be clear: classify-sklearn is classifying these as Staph, then you are extracting those reads and attempting to classify them with your custom database to determine the nearest species?

If so, theoretically classify-sklearn can do the job for you but this is a confusing result.

Is that 49792e… sequence one of the ones that classify-consensus-vsearch is not assigning? Just want to be sure. If is is, could you please share the sequence alignment results for one of the unassigned sequences?

That’s correct, classified with classify-sklearn at Staph, extracting, and then attempting to classify those seqs with classify-consensus-vsearch (also attempted classify-consensus-blast).

I can give classify-sklearn a try next week (travelling right now) and let you know what happens.

Yes, that 49792e… is one of the unclassified sequences.

Please give that a try (if only to help us troubleshoot). Maybe also send me the files necessary to replicate this locally so that I can tinker a bit (feel free to send as a DM if you do not want to make this public). Thanks!

Thanks for sharing your files, @c.older! I was able to replicate this issue and I think I found the problem.

So your taxonomy file consists of only one level, the species designation. There is not basal taxonomy or semicolon-delimited lineage. So the vsearch classifier is reporting “unassigned” because it cannot find any consensus among the multiple hits. So let me break down what is happening:

Your query sequence 49792e… is being compared against the reference database and grabs up to N hits, where N = maxaccepts (default 10). These hits must exceed the percent-id and other matching parameters, but you are using quite permissive settings so you will just grab the 10 closest hits (and before you say that you tried higher percent-id settings let me just say: many of your query seqs probably match perfectly to > 1 species in your reference… even some of the reference sequences appear to be identical and have different annotations!).

After grabbing those N hits, q2-feature-classifier attempts to find the consensus taxonomy. So if you have 3 hits with taxonomy labels like this:
Bacteria; blah; blah; blah; Staph; Staph aureus
Bacteria; blah; blah; blah; Staph; Staph condimenti
Bacteria; blah; blah; blah; Staph; Staph aureus

The consensus taxonomy reported would be this:
Bacteria; blah; blah; blah; Staph

However, your taxonomy annotations are not semicolon-delimited and their is no basal taxonomy. Instead they look like this:
Staphylococcus aureus
Staphylococcus condimenti
Staphylococcus aureus

So q2-feature-classifier can find no consensus, since the consensus is based on the entire taxonomy lineage at each level. So the consensus reported is “Unassigned” simply because there is no consensus.

So this leaves you with a few options. In general, I would recommend using semicolon-delimited taxonomies with at least some basal taxonomy to circumvent this problem. So for example you could format your taxonomy lineages like this:
Staphylococcus; Staphylococcus aureus
Staphylococcus; Staphylococcus condimenti
Staphylococcus; Staphylococcus aureus

Though in your specific case since you are taking query seqs that were already classified at genus level as Staph and re-classifying them against an all-Staph reference database, you could just set the --p-unassignable-label option to “Unknown Staphylococcus” or something along those lines.

But I think you can do better here. I recommend reformatting the taxonomy labels as above, and then adjust your classifier options a bit:

  1. Use the --p-top-hits-only option, so that you grab only the best matches instead of N top hits.
  2. Set --p-perc-identity to a reasonably high level, e.g., 0.95

Give that a try and let me know what you find! I think with the top hits option alone you should get more classifications (even without reformatting the database), since you will be finding consensus among near-perfect matches as opposed to 10 top hits, some of which are closer than others. Some may still not classify to species level, since the query sequence may match equally well to more than one species!

Phew you had me worried for a moment there, but seeing your taxonomy annotations made this a lot clearer to me :sweat_smile:


Fixing the taxonomy file is now giving me classifications for almost all of the sequences! :tada:
Thank you so much for helping me out with this and explaining it so well. I’m so glad you were able to figure out that it was such a simple fix and not some more difficult issue :sweat_smile: