I was trying to compute alpha and beta diversities at the family level. So I first retained the features that were annotated to at least the family level, and then collapsed the table to the family level. What I have done is:
The first problem I ran into is that some of the resulting features have labels like "f_", and I am not sure whether these should count as annotated to the family level. If not, could you please tell me how I can filter these features out?
This sounds like an interesting approach! So you are trying to see how your samples partition based on family-level taxonomic classifications? I recommend that you use qiime taxa collapse as you have done, then use the non-phylogenetic diversity metrics (e.g., with qiime diversity core-metrics). This will compute Bray-Curtis and Jaccard distances instead of UniFrac, which will effectively answer your question.
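If it helps to see what collapsing plus the non-phylogenetic metrics amounts to, here is a minimal pure-Python sketch. It is not how QIIME 2 implements anything; the function names, the toy data, and the assumption that family is the fifth rank of a Greengenes-style lineage (k, p, c, o, f, g, s) are all mine.

```python
from collections import defaultdict


def collapse_to_level(table, taxonomy, level):
    """Sum feature counts per sample by the first `level` ranks of each lineage."""
    collapsed = {}
    for sample, counts in table.items():
        summed = defaultdict(int)
        for feature_id, count in counts.items():
            ranks = [r.strip() for r in taxonomy[feature_id].split(";")]
            # Assumption of this sketch: pad short lineages with "__" so that
            # every collapsed label has exactly `level` ranks.
            ranks = (ranks + ["__"] * level)[:level]
            summed[";".join(ranks)] += count
        collapsed[sample] = dict(summed)
    return collapsed


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two {taxon: count} samples."""
    taxa = set(a) | set(b)
    num = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    den = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    return num / den if den else 0.0


def jaccard(a, b):
    """Jaccard distance on presence/absence of taxa."""
    present_a = {t for t, c in a.items() if c > 0}
    present_b = {t for t, c in b.items() if c > 0}
    union = present_a | present_b
    return 1 - len(present_a & present_b) / len(union) if union else 0.0


# Hypothetical toy data: three features, two samples.
taxonomy = {
    "f1": "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus",
    "f2": "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Pediococcus",
    "f3": "k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides",
}
table = {
    "S1": {"f1": 10, "f2": 5, "f3": 5},
    "S2": {"f1": 0, "f2": 5, "f3": 15},
}

family_table = collapse_to_level(table, taxonomy, level=5)
bc = bray_curtis(family_table["S1"], family_table["S2"])
jd = jaccard(family_table["S1"], family_table["S2"])
print(bc, jd)  # 0.5 0.0
```

Note that the two Lactobacillaceae features are merged before any distance is computed, so differences between genera within a family no longer contribute to either metric.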
I’m sure you have good reasons for this, but I personally would not go this route. There are many reasons why a sequence might not classify to family level, and failing to do so does not necessarily mean that it is a contaminant or something that should be excluded. So looking at family-level differences is not, on its own, a reason to exclude this information.
Labels like "f_" are technically annotations. See here and here for more details.
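If you do still want to tell the empty family labels apart from the named ones, a small sketch like the following could flag them. It assumes Greengenes-style rank prefixes in which an unnamed family appears as a bare "f__"; the prefix and the function names are assumptions for illustration, not part of QIIME 2.

```python
def family_label(taxonomy, family_prefix="f__"):
    """Return the family-rank label of a lineage string, or None if that rank is absent."""
    for rank in taxonomy.split(";"):
        rank = rank.strip()
        if rank.startswith(family_prefix):
            return rank
    return None


def has_named_family(taxonomy, family_prefix="f__"):
    """True only if the family rank is present and carries a name after the prefix."""
    label = family_label(taxonomy, family_prefix)
    return label is not None and label != family_prefix


# Hypothetical lineages: named family, empty family label, and no family rank at all.
lineages = [
    "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae",
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__",
    "k__Bacteria; p__Firmicutes",
]
flags = [has_named_family(t) for t in lineages]
print(flags)  # [True, False, False]
```

The second and third cases are kept distinct by family_label (a bare prefix versus None), in case you want to treat "annotated but unnamed" differently from "not classified that deep".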
It sounds like you are aware of the problem: when you collapse your features, the family-level taxonomies become the feature IDs, and those IDs are not present in the tree. There is no way to relabel the tree with collapsed taxonomic labels in QIIME 2.
When you collapse at family level, you lose a good deal of phylogenetic information anyway, so Bray-Curtis and UniFrac would likely yield pretty similar results. My advice is to just go with Bray-Curtis instead of spending time trying to hack together a collapsed phylogeny (years ago I tried something similar to what you are doing, only to find that Bray-Curtis gave very similar results!).
An alternative, instead of collapsing into explicit family-level taxonomic labels, would be to re-cluster your sequence variants at some defined level of similarity (whatever threshold would be considered family-level), align and build a phylogeny on those sequences, and use that for UniFrac. I have also done this recently, effectively to show how clustering changes as phylogenetic information is lost.
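To make the re-clustering step concrete, here is a toy sketch of greedy, abundance-ordered centroid clustering. It only illustrates the idea: it assumes the sequences are already aligned to equal length and uses a simple per-position identity, whereas a real workflow would use a dedicated clustering tool and then align and build the tree separately. All names and data are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, counts, threshold):
    """Map each sequence ID to a centroid ID.

    Sequences are visited from most to least abundant; each one joins the first
    existing centroid it matches at >= threshold identity, or becomes a new centroid.
    """
    order = sorted(seqs, key=lambda sid: counts[sid], reverse=True)
    centroids = []
    assignment = {}
    for sid in order:
        for cid in centroids:
            if identity(seqs[sid], seqs[cid]) >= threshold:
                assignment[sid] = cid
                break
        else:
            # No centroid was close enough, so this sequence seeds a new cluster.
            centroids.append(sid)
            assignment[sid] = sid
    return assignment


# Hypothetical 20 bp "variants": b differs from a at 1 position, c at 3 positions.
seqs = {
    "a": "ACGTACGTACGTACGTACGT",
    "b": "ACGTACGTACGTACGTACGA",
    "c": "TTTTACGTACGTACGTACGT",
}
counts = {"a": 100, "b": 10, "c": 50}

loose = greedy_cluster(seqs, counts, threshold=0.94)
strict = greedy_cluster(seqs, counts, threshold=0.97)
print(loose)   # {'a': 'a', 'c': 'c', 'b': 'a'}
print(strict)  # {'a': 'a', 'c': 'c', 'b': 'b'}
```

Lowering the threshold merges more variants into each centroid, which is the loss of resolution described above; the centroid sequences are what you would then align and build a phylogeny from.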
These percentage thresholds are fairly arbitrary and are really based on full-length 16S rRNA genes, not subregions. For example, 97% is often considered genus or species level (big difference, right? We’re off to a good start!), but some species have more than 99% similarity, while others have ~95% similarity. Some subregions are also more or less similar for different clades, and so on. So I’m not sure there is a good answer to this last question.
Googling dredged up this discussion, which might give you a very rough rule of thumb. The thresholds they quote are:
97% ~ Genus
94% ~ Family
88% ~ Order
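
If you want to plug these cutoffs into a script, a lookup like the one below is about all they can support. It just encodes the three numbers quoted above as a very rough guide; the names are hypothetical, and the mapping should not be read as a real taxonomic assignment, especially for 16S subregions.

```python
# Rough identity cutoffs from the list above, finest rank first.
RULE_OF_THUMB = [(0.97, "genus"), (0.94, "family"), (0.88, "order")]


def approximate_rank(identity):
    """Return the finest rank whose cutoff the given fractional identity meets."""
    for cutoff, rank in RULE_OF_THUMB:
        if identity >= cutoff:
            return rank
    return "above order"


for value in (0.98, 0.95, 0.90, 0.80):
    print(value, approximate_rank(value))
```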