Hi @rparadiso,
I am pinging @cjone228 to describe her pipeline decisions better than I can! But here are my thoughts:
I think the issue with using mafft here is that the amplicons target different regions of the 16S gene, so mafft will not be able to align them properly. fragment-insertion instead inserts those fragments into a reference tree built on full-length 16S sequences, so it accommodates this multi-amplicon design.
I think you've already seen @cjone228's description above (I'm just quoting here to give context for others following along!)
Having multiple amplicons means that counting ASVs will lead to highly inflated measures of richness (e.g., if you have 7 different amplicons in the pool, then richness scores will be 7X the actual richness). A phylogenetic alpha-diversity method like Faith's PD would be unaffected by this issue, but all other metrics will be affected. A few workarounds:
- You could collapse your feature table on taxonomy and then calculate alpha diversity metrics. Ideally, this will "collapse" amplicons that hit the same reference sequence but at different sections of the 16S. It could still run into issues if some amplicons classify to a deeper taxonomic level than others (we know this to be a problem!), so alpha diversity could still be inflated. One possibility is to use a taxonomy assignment method like classify-consensus-blast, take only the top hit, and use these results only for calculating alpha diversity. As with sepp, the hope is that the amplicons would "splice" into the same reference taxonomy.
- Use closed-reference OTU clustering to cluster sequences against full-length reference sequences. Once more, we are "splicing" the amplicons from different regions into single full-length sequences. The problem: closed-reference OTU clustering has its own issues that you are probably already familiar with (and if not, you can read more about them in the tutorials at qiime2.org).
- Lazy methods: the multiple-amplicon issue will affect all samples evenly, so your richness estimates will be highly inflated in all samples, presumably at an even rate. If sequencing depth is high enough, you could just carry on as normal, on the assumption that this bias impacts all samples equally. That would let you compare samples within the same study using conventional alpha diversity metrics, but not between studies (since the richness is inflated!). Another option: try dividing observed richness by the number of amplicons to get an estimate of the richness you'd observe with a single amplicon. You could also try filtering the table down to each individual amplicon and checking the richness you get from each one alone, to see how well this method works... getting some external validation (comparing to other studies) is also recommended if you go that route.
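To make the inflation and two of the workarounds above concrete, here is a minimal toy sketch in plain Python. It is not a QIIME 2 workflow: the ASV names, counts, taxonomy labels, and the helper functions `observed_richness` and `collapse_by_taxonomy` are all made up for illustration, and it assumes every amplicon classifies to the same taxonomic depth (which, as noted above, is often not the case with real data).

```python
from collections import defaultdict


def observed_richness(counts):
    """Count the features with a non-zero abundance in one sample."""
    return sum(1 for c in counts.values() if c > 0)


def collapse_by_taxonomy(counts, taxonomy):
    """Sum the counts of all ASVs that share the same taxonomy label."""
    collapsed = defaultdict(int)
    for asv, count in counts.items():
        collapsed[taxonomy.get(asv, "Unassigned")] += count
    return dict(collapsed)


# Hypothetical sample: two real taxa, each targeted by three amplicons.
# One amplicon (V4) fails to pick up the second taxon in this sample.
sample = {
    "asv_V2_a": 40, "asv_V4_a": 35, "asv_V8_a": 12,
    "asv_V2_b": 20, "asv_V4_b": 0, "asv_V8_b": 9,
}

# Hypothetical taxonomy assignments, all at genus level.
taxonomy = {
    "asv_V2_a": "g__Bacteroides", "asv_V4_a": "g__Bacteroides",
    "asv_V8_a": "g__Bacteroides",
    "asv_V2_b": "g__Prevotella", "asv_V4_b": "g__Prevotella",
    "asv_V8_b": "g__Prevotella",
}

n_amplicons = 3

# Counting ASVs directly: each taxon is counted once per amplicon.
raw_richness = observed_richness(sample)

# Workaround 1: collapse on taxonomy first, then count.
collapsed_richness = observed_richness(collapse_by_taxonomy(sample, taxonomy))

# "Lazy" workaround: divide the raw richness by the number of amplicons.
scaled_richness = raw_richness / n_amplicons

print("raw:", raw_richness)
print("collapsed on taxonomy:", collapsed_richness)
print("raw / n_amplicons:", round(scaled_richness, 2))
```

In this toy case the taxonomy collapse recovers the two underlying taxa, while the simple division gives a non-integer estimate because not every amplicon detects every taxon. That is one reason checking each amplicon on its own is worth doing before trusting the scaled number.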
Just a few ideas to think over! Interested to hear what @cjone228 and others think. I hope that helps.