Hi @hulfred,
This sounds like an interesting approach! So you are trying to see how your samples partition based on family-level taxonomic classifications? I recommend that you use qiime taxa collapse as you have done, then use the non-phylogenetic diversity metrics (e.g., with qiime diversity core-metrics). This will compute Bray-Curtis and Jaccard distances instead of UniFrac, which will effectively answer your question.
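To make that concrete, here is a minimal pure-Python sketch of what the two steps do conceptually: sum feature counts by family label, then compute Bray-Curtis and Jaccard between samples. This is not QIIME 2 code. The sample names, feature IDs, counts, and taxonomy strings are all made up, "level 5 = family" assumes a seven-rank taxonomy that starts at kingdom, and padding missing ranks with "__" is just an illustrative choice.

```python
from collections import defaultdict

# Hypothetical toy data: feature ID -> taxonomy string, sample -> feature counts.
taxonomy = {
    "asv1": "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus",
    "asv2": "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae",
    "asv3": "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus",
    "asv4": "k__Bacteria; p__Bacteroidetes",  # not classified to family level
}
table = {
    "sampleA": {"asv1": 10, "asv2": 5, "asv3": 5},
    "sampleB": {"asv1": 2, "asv3": 8, "asv4": 10},
}


def collapse_to_level(table, taxonomy, level):
    """Sum counts per sample over features sharing the first `level` ranks."""
    collapsed = {}
    for sample, counts in table.items():
        summed = defaultdict(int)
        for feature_id, count in counts.items():
            ranks = [r.strip() for r in taxonomy[feature_id].split(";")]
            # Pad shallow classifications so they are kept rather than dropped.
            ranks = (ranks + ["__"] * level)[:level]
            summed["; ".join(ranks)] += count
        collapsed[sample] = dict(summed)
    return collapsed


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count dictionaries."""
    keys = set(a) | set(b)
    numerator = sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)
    denominator = sum(a.get(k, 0) + b.get(k, 0) for k in keys)
    return numerator / denominator if denominator else 0.0


def jaccard(a, b):
    """Jaccard distance based on presence/absence of features."""
    present_a = {k for k, v in a.items() if v > 0}
    present_b = {k for k, v in b.items() if v > 0}
    union = present_a | present_b
    return 1 - len(present_a & present_b) / len(union) if union else 0.0


family_table = collapse_to_level(table, taxonomy, level=5)
print(bray_curtis(family_table["sampleA"], family_table["sampleB"]))
print(jaccard(family_table["sampleA"], family_table["sampleB"]))
```

Note that neither metric needs a tree: both only compare counts under matching feature labels, which is why they keep working after the labels become family names.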
I'm sure you have good reasons for this, but I personally would not go this route. There are many reasons why a sequence might not classify to family level, and failing to classify does not necessarily mean that it is a contaminant or something that should be excluded. Looking at family-level differences is therefore not, by itself, a reason to exclude this information.
These are technically annotations. See here and here for more details.
It sounds like you are aware of the problem: when you collapse your features, the family-level taxonomies become the feature IDs, which are not present in the tree. There is no way to relabel the tree with collapsed taxonomic labels in QIIME 2.
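A tiny sketch of the mismatch, using made-up IDs (real feature IDs and tip names will look different): a phylogenetic metric has to look up every table feature among the tree tips, and after collapsing there is nothing left to look up.

```python
# Hypothetical tip names from a tree built on the original sequence variants.
tree_tips = {"asv1", "asv2", "asv3"}

# Feature IDs before collapsing: the same IDs the tree was built from.
uncollapsed_ids = {"asv1", "asv2", "asv3"}

# Feature IDs after collapsing: taxonomy strings, not sequence IDs.
collapsed_ids = {
    "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae",
    "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae",
}


def shared_ids(table_feature_ids, tip_names):
    """Return the feature IDs that can be located in the tree."""
    return set(table_feature_ids) & set(tip_names)


print(len(shared_ids(uncollapsed_ids, tree_tips)))  # every feature is a tip
print(len(shared_ids(collapsed_ids, tree_tips)))    # no feature is a tip
```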
When you collapse at family level, you lose a good deal of phylogenetic information anyway, so Bray-Curtis and UniFrac would likely yield pretty similar results. My advice is just to go with Bray-Curtis instead of spending time trying to hack together a collapsed phylogeny (years ago I tried something similar, only to find that Bray-Curtis provided very similar results!).
An alternative, instead of collapsing into explicit family-level taxonomic labels, would be to re-cluster your sequence variants at some defined level of similarity (whatever threshold would be considered family-level), align/build a phylogeny on those sequences, and use that for UniFrac. I have also done this recently — effectively to show how clustering changes as phylogenetic information is lost.
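To illustrate the re-clustering idea, here is a toy sketch of greedy, abundance-ordered clustering at a fixed identity threshold. It is only a stand-in for a real clustering tool (in QIIME 2 the q2-vsearch plugin provides de novo clustering). The sequences and counts are invented, identity is a naive position-by-position comparison of equal-length strings rather than a real alignment, and the 0.9 threshold is a placeholder, not a suggestion for what counts as family level.

```python
# Hypothetical toy sequences (equal length) and their total abundances.
seqs = {
    "asv1": "ACGTACGTAC",
    "asv2": "ACGTACGTAA",
    "asv3": "TTTTACGGGG",
    "asv4": "TTTTACGGGA",
}
counts = {"asv1": 100, "asv2": 50, "asv3": 30, "asv4": 10}


def identity(a, b):
    """Fraction of matching positions; a naive stand-in for alignment identity."""
    if len(a) != len(b):
        raise ValueError("toy identity requires equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, counts, threshold):
    """Map each sequence ID to the ID of its cluster centroid.

    Sequences are visited from most to least abundant. Each one joins the
    first existing centroid it matches at >= threshold, otherwise it becomes
    a new centroid.
    """
    order = sorted(seqs, key=lambda sid: (-counts[sid], sid))
    centroids = []
    assignment = {}
    for sid in order:
        for cid in centroids:
            if identity(seqs[sid], seqs[cid]) >= threshold:
                assignment[sid] = cid
                break
        else:
            centroids.append(sid)
            assignment[sid] = sid
    return assignment


clusters = greedy_cluster(seqs, counts, threshold=0.9)  # placeholder threshold
print(clusters)
```

The useful property here is that each cluster is still represented by a real sequence (its centroid), so the centroids can be aligned and placed in a tree, and the clustered table's feature IDs will match that tree's tips.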
I hope that helps!