Hi @aeriel.belk,
This problem is a good reason to keep features as ASVs in most cases (they are usually more informative), though as you say, you have found that species are better predictors here.
All in all, I think your idea #2 is probably the better way to go about this, and I have given a QIIME 2 solution below.
That's a quirk of greengenes naming conventions, and these really do represent different things. They should not really be collapsed into one.
The bigger issue is that all ASVs classified only to order level will be lumped together, e.g., as k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;__;__;__, when they may represent very diverse lineages (and would classify as such if only you had a little more information...).
QIIME 2 is actually classifying this one to species level: k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__;g__;s__, but that is how the closest matches are actually annotated in greengenes: these are OTUs that represent diverse lineages but very similar 16S sequences. So this is probably distinct enough from the order-level classified ASVs that it should be kept separate.
One is classified to species level (but the species annotation is missing in greengenes). The other cannot be classified at that level, and it is up to you to remove it if that is really what you want to do.
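To make the distinction concrete, here is a minimal Python sketch (not anything QIIME 2 runs internally) that reports the deepest rank actually present in a greengenes-style label. The function name `deepest_assigned_rank` is made up for illustration, and I am assuming that ranks the classifier could not reach are either absent or padded with a bare `__`, while empty-but-present ranks look like `s__`.

```python
# Greengenes-style rank prefixes, from kingdom down to species.
RANKS = ["k", "p", "c", "o", "f", "g", "s"]


def deepest_assigned_rank(taxon: str) -> str:
    """Return the prefix of the deepest rank present in a taxonomy label.

    A field such as "s__" counts as present (classified, but unannotated
    in the reference). A bare "__" or an empty field is treated as padding
    for a rank the classifier did not reach (an assumption of this sketch).
    """
    deepest = ""
    for field in taxon.split(";"):
        prefix, sep, _name = field.strip().partition("__")
        if sep and prefix in RANKS:
            deepest = prefix
    return deepest


# Classified to species level, but greengenes has no names below order.
unannotated = ("k__Bacteria;p__Actinobacteria;c__Actinobacteria;"
               "o__Actinomycetales;f__;g__;s__")
# Only classified to order level (padded form).
order_only = ("k__Bacteria;p__Actinobacteria;c__Actinobacteria;"
              "o__Actinomycetales;__;__;__")

print(deepest_assigned_rank(unannotated))  # s
print(deepest_assigned_rank(order_only))   # o
```

The two labels share the same order, but only the first carries a species-level field, which is why they end up as separate features.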
Sure, qiime taxa filter-table will do this for you. See the example in the qiime2.org filtering tutorial (which shows phylum level; just include s__ instead of p__). This will filter out the underclassified features, but it will not filter out those that are classified to species level but unannotated in greengenes (e.g., k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__;g__;s__). If you want to remove those, you can do a second filter step with the same command to exclude g__;s__.
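If it helps to see what the two filter passes do, here is a rough Python sketch of the logic. It only mimics, as I understand it, the substring matching behind the `--p-include` and `--p-exclude` options of qiime taxa filter-table; it is not the plugin's code. The helper `filter_taxa` and the feature IDs are hypothetical, and the taxonomy table is a toy example.

```python
def filter_taxa(taxonomy, include=None, exclude=None):
    """Keep features whose label contains `include` and lacks `exclude`.

    Matching is a case-insensitive substring test, which is an assumption
    about the default "contains" behaviour, not a copy of the real code.
    """
    kept = {}
    for feature_id, taxon in taxonomy.items():
        label = taxon.lower()
        if include is not None and include.lower() not in label:
            continue
        if exclude is not None and exclude.lower() in label:
            continue
        kept[feature_id] = taxon
    return kept


# Toy taxonomy: feature IDs are invented for illustration.
taxonomy = {
    "asv1": "k__Bacteria;p__Actinobacteria;c__Actinobacteria;"
            "o__Actinomycetales",
    "asv2": "k__Bacteria;p__Actinobacteria;c__Actinobacteria;"
            "o__Actinomycetales;f__;g__;s__",
    "asv3": "k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;"
            "f__Streptococcaceae;g__Streptococcus;s__",
    "asv4": "k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;"
            "f__Lactobacillaceae;g__Lactobacillus;s__iners",
}

# Pass 1: keep only features that reach species level (label contains s__).
step1 = filter_taxa(taxonomy, include="s__")
# Pass 2: drop features with empty genus and species annotations.
step2 = filter_taxa(step1, exclude="g__;s__")

print(sorted(step1))  # ['asv2', 'asv3', 'asv4']
print(sorted(step2))  # ['asv3', 'asv4']
```

Note that asv3 survives the second pass: its species field is empty, but its genus is annotated, so the string g__;s__ never appears in its label.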
It is worth a try, but I am not sure it will really accomplish what you want. It probably wouldn't bias the results, since you are not adding information; you are removing it! So if worst comes to worst, I think you will just damage classification accuracy, since you are losing features. Those features are probably not important anyway, since they presumably represent diverse lineages (as it sounds like you are assuming), so it is worth a try...
No, but to accomplish more or less the same thing you could just cluster your ASVs into OTUs in QIIME 2!