Hi @alicias1,
You have a lot going on! It sounds like a super cool study. I think my favorite (and it's relatively old) paper on heritability and the microbiome is Human Genetics Shape the Gut Microbiome. However, it's a 2014 publication and we've figured out more about the microbiome since, so treat it as a starting place. I'd recommend balancing it with Microbiome Datasets Are Compositional: And This Is Not Optional, the DOI of which will probably be on my wedding announcement because of how often I cite it. I would also recommend doing your own search of the literature, because I know there are a lot of heritability papers that I'm missing.
I'm also going to recommend straight away that you look into qiime2R for all your artifact-to-phyloseq needs.
With that caveat, let me see if I can try and answer your questions.
I agree with this completely! But I think it depends on two factors. First, how do you plan to model, and second, how many features do you lose? I tend to work in the gut, where I like to filter to features present in at least 10% of my samples. I do this because I sometimes run prevalence (presence/absence) based models, and those are happy with a 90/10 split but throw a fit when I try to push it to 95/5. Sparsity is an important feature of the microbiome, so I think you should keep even somewhat sparse features. At the same time, a feature present in a single individual is insufficiently powered to allow analysis.
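To make the 10% prevalence filter concrete, here is a minimal sketch in pandas. It assumes a feature table with samples as rows and features as columns; the function name `filter_by_prevalence` and the toy data are hypothetical, and this is not the QIIME 2 or phyloseq implementation, just the same idea written out.

```python
import pandas as pd


def filter_by_prevalence(table: pd.DataFrame, min_prevalence: float = 0.10) -> pd.DataFrame:
    """Keep features (columns) detected in at least `min_prevalence` of samples (rows)."""
    present = table > 0                 # presence/absence per sample
    prevalence = present.mean(axis=0)   # fraction of samples where each feature is seen
    return table.loc[:, prevalence >= min_prevalence]


# Toy table (hypothetical data): 20 samples, 3 features.
counts = pd.DataFrame(
    {
        "feature_common": [5, 3, 0, 8, 2, 0, 7, 1, 4, 0, 6, 2, 0, 9, 3, 0, 1, 5, 0, 2],  # 14/20 samples
        "feature_sparse": [0, 0, 4, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0],  # 3/20 samples
        "feature_rare":   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11], # 1/20 samples
    },
    index=[f"sample_{i}" for i in range(20)],
)

filtered = filter_by_prevalence(counts, min_prevalence=0.10)
print(list(filtered.columns))
```

With this toy table, the feature seen in a single sample falls below the threshold and is dropped, while the somewhat sparse feature is kept.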
I'm going to refer you to Microbiome Datasets Are Compositional again. DESeq2 doesn't perform well against compositional models, either. I don't know if your heritability can be modeled using something like OLS (it's been ages since I did ACE calculations), but if it can be, then songbird might be a good option to run.
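If it helps to see what "compositional" means in practice, here is a small sketch of a centered log-ratio (CLR) transform in NumPy. This is a swapped-in illustration, not what songbird or DESeq2 does internally; the function name `clr` and the pseudocount of 1 for handling zeros are assumptions for the example.

```python
import numpy as np


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform; rows are samples, columns are features.

    The pseudocount is an assumption used here to avoid log(0).
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    log_x = np.log(x)
    # Subtract each sample's mean log abundance (the log of its geometric mean).
    return log_x - log_x.mean(axis=1, keepdims=True)


# Toy data (hypothetical): two samples, three features, no zeros.
example = np.array([[10.0, 20.0, 70.0],
                    [5.0, 5.0, 90.0]])

transformed = clr(example, pseudocount=0.0)

# Doubling the sequencing depth leaves the CLR values unchanged,
# because only the ratios between features carry information.
deeper = clr(example * 2, pseudocount=0.0)
print(np.allclose(transformed, deeper))
```

Each transformed row sums to zero, which is why log-ratio values have to be interpreted relative to the other features in the sample rather than as absolute abundances.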
This is to some degree a personal taste thing. I like feature and maybe genus. I abhor species-level descriptions in amplicon data because species-level annotation is funky, and I'm making the political stand here that we should abandon species for amplicon IDs. I also don't think you get much out of anything above the genus or family level. Collapsing makes the data more tractable, but my dog, my cat, and my theoretical ferret play different roles in an ecosystem even though they all belong to the same order.
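For reference, collapsing to genus is just summing the counts of every feature that shares a genus label. Here is a rough pandas sketch; the feature IDs, the `genus_of` mapping, and the counts are all hypothetical, and in a real workflow the labels would come from your taxonomy assignment.

```python
import pandas as pd

# Toy feature table (hypothetical): samples as rows, amplicon features as columns.
table = pd.DataFrame(
    {"asv_1": [10, 0, 3], "asv_2": [5, 2, 0], "asv_3": [0, 7, 1]},
    index=["sample_a", "sample_b", "sample_c"],
)

# Hypothetical genus label for each feature.
genus_of = {
    "asv_1": "g__Bacteroides",
    "asv_2": "g__Bacteroides",
    "asv_3": "g__Prevotella",
}

# Group the features by genus label and sum their counts within each sample.
genus_labels = table.columns.map(genus_of)
collapsed = table.T.groupby(genus_labels).sum().T
print(collapsed)
```

Note that `asv_1` and `asv_2` disappear into a single column here, which is the trade-off described above: the table gets smaller, but any difference in how those two features behave is no longer visible.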
If you're doing heritability, I assume you're working with twins? Could you use paired distances somehow? Maybe look at what's possible with q2-longitudinal?
It's what I tend to use on my models.
Best,
Justine