Finding an incorrectly formatted needle in an apparently mostly properly formatted haystack

Hi @devonorourke,
The old and new vsearch barplots appear to have used different reference files — the reference artifacts in the repo are for the new barplot.

What I can tell so far, though, is that those bizarre assignments have annotations down to species level, e.g.:
k__GAATTAGGCCAACCAGGGGCCCTACTCGGAGATGATCAGATTTATAATGTAATTGTCACCGCTCATGCAT;p__;c__;o__;f__;g__;s__

That implies that this really is an entry in the reference taxonomy file that was used for that barplot, unless something really bizarre is going on.

Maybe we can just move on, though, because it looks like you do not get those strange classifications when using the new taxonomy file. I have some thoughts about your parameter settings:

The higher perc-identity that you are using here makes more sense than 0.7. I don't know much about COI, though, so a lower threshold may be appropriate if you expect many query sequences to match the reference db poorly (e.g., if the reference is incomplete).

That maxaccepts setting is bewildering, though, and is sure to reduce the depth of classification. This is what the classifier does:

  1. Each query sequence Q is aligned against the reference sequences to generate N alignments.
  2. Alignments are ranked by score, and only those with identity above perc-identity are retained.
  3. The top maxaccepts alignments are kept, as you would expect. (blast does not actually work this way, btw; I mention that because perhaps you are aware of blast's behavior and are setting a very high maxaccepts to avoid blast-ish behavior. vsearch's maxaccepts works intuitively.)
  4. Taxonomy is assigned based on the consensus taxonomy across all retained alignments. This is performed at each rank, starting at the basal level: the majority taxonomy is assigned, provided its share of the retained hits is ≥ min-consensus (0.51 by default).
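The consensus step (4) can be sketched roughly like this. This is a simplified illustration of the idea, not the actual q2-feature-classifier code; the lineage-string format and the function name are my assumptions:

```python
from collections import Counter

def consensus_taxonomy(hits, min_consensus=0.51):
    """Toy sketch of consensus assignment (NOT the real q2 implementation).

    hits: list of taxonomy strings like "k__Bacteria;p__Firmicutes;...",
    one per retained alignment. Returns the deepest lineage on which at
    least min_consensus of the hits agree, checked rank by rank from the
    basal level.
    """
    if not hits:
        return "Unassigned"
    split = [t.split(";") for t in hits]
    n = len(hits)
    consensus = []
    for rank in range(max(len(s) for s in split)):
        # Count each label at this rank; hits lacking a label at this
        # rank implicitly count against the consensus (denominator is n).
        counts = Counter(s[rank] for s in split if rank < len(s))
        label, count = counts.most_common(1)[0]
        if count / n >= min_consensus:
            consensus.append(label)
        else:
            break  # stop at the first rank without majority agreement
    return ";".join(consensus) if consensus else "Unassigned"
```

For example, with three hits `["k__A;p__X", "k__A;p__X", "k__A;p__Y"]`, kingdom is unanimous and `p__X` has a 2/3 share, so the result is `k__A;p__X`; with only two disagreeing hits, assignment stops at kingdom.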

So if maxaccepts = 50000 and min-consensus = 0.51, and assuming perc-identity does not reduce the number of alignments, a minimum of 50000 × 0.51 = 25500 reference sequences must share the same taxonomy at rank L for taxonomy to be assigned at that rank. That is plausible at basal ranks, but extremely unlikely at genus and species levels, which could be why you never see any genus or species assignments.
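The arithmetic behind that bar is easy to check (the helper name here is mine, just for illustration):

```python
import math

def min_agreeing_refs(maxaccepts, min_consensus=0.51):
    """Minimum number of retained hits that must share a taxon at a rank
    for an assignment to be made, assuming all maxaccepts alignments are
    actually retained."""
    return math.ceil(maxaccepts * min_consensus)
```

With maxaccepts at 50000 you need 25500 agreeing references, but at the default maxaccepts of 10 only 6 hits need to agree, which is far easier to satisfy at genus or species level.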

So I would recommend dialing that setting back to a more reasonable value.
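For example, something along these lines; the file names are placeholders, and you should adjust perc-identity to whatever suits your COI data:

```
qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs.qza \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --p-maxaccepts 10 \
  --p-perc-identity 0.97 \
  --p-min-consensus 0.51 \
  --o-classification taxonomy.qza
```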

This area is actually very ripe for optimization: it would be worth generating simulated reads or other appropriate test data to determine optimal classification settings. If you are interested, you could use the same benchmarks we used to optimize these classifiers for 16S and ITS data; let me know and I can point you in the right direction to get started.

I hope that helps!