Hi everyone!!
@rparadiso First of all, thanks for the kind words and for contributing to our discussion!! Funnily enough, we were talking about similar issues in our lab meeting this morning, and some related questions came up. We understand that because we are sequencing multiple variable regions at once, we must incorporate phylogeny in our core metrics analyses. We also understand that richness in alpha diversity is artificially inflated if you do not include phylogeny, but we were all a bit confused about how using an insertion tree accounts for this:
@Nicholas_Bokulich Why is this? We were having trouble articulating the why to our PI.
After reading your reply to @rparadiso, several other questions came to mind. Much of what you explained covered concepts we had thought of in our non-bioinformatician brains but hadn't been able to fully articulate...
So does this mean that sequences from different variable regions but from the same bacterium will be placed together on the reference tree when using sepp-fragment-insertion? Even though the ASVs themselves are very different in sequence?
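To make sure we are asking the right question, here is a toy sketch of how we currently picture it. It assumes (and this is exactly the part we would like confirmed) that V2 and V4 fragments from the same organism end up as sibling tips with very short branches. The tree, tip names, and branch lengths are all made up for illustration, and `faith_pd` is our own minimal helper, not a QIIME 2 function.

```python
def faith_pd(tree, observed_tips):
    """Sum branch lengths over the union of tip-to-root paths.

    `tree` maps each node to (parent, branch_length); the root has no entry.
    """
    visited = set()
    total = 0.0
    for tip in observed_tips:
        node = tip
        # Walk towards the root, counting each branch only once.
        while node in tree and node not in visited:
            parent, length = tree[node]
            visited.add(node)
            total += length
            node = parent
    return total


# Hypothetical reference tree with inserted ASV fragments as tips.
tree = {
    "genusA": ("root", 0.30),
    "genusB": ("root", 0.40),
    "asv_A_v2": ("genusA", 0.01),  # same organism, V2 fragment
    "asv_A_v4": ("genusA", 0.01),  # same organism, V4 fragment
    "asv_B_v4": ("genusB", 0.02),
}

one_region = ["asv_A_v4", "asv_B_v4"]
two_regions = ["asv_A_v2", "asv_A_v4", "asv_B_v4"]

# Observed richness jumps from 2 to 3 features...
richness_one, richness_two = len(one_region), len(two_regions)
# ...while phylogenetic diversity only grows by one short branch.
pd_one, pd_two = faith_pd(tree, one_region), faith_pd(tree, two_regions)
```

If that picture is right, the extra ASV from the second region adds a whole feature to richness but only a tiny branch to the phylogenetic metric, which is how we would try to explain it to our PI.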
As for the workarounds, we have questions/comments for each:
Is this essentially saying to do taxonomic classification early in the analysis pipeline (i.e., sort of in lieu of phylogeny) and use it to take care of the richness inflation issue? And if so, when collapsing taxonomy, what would one do if one variable region identifies down to the species level and another only down to the genus level? Would you collapse only to genus so that as much information as possible is incorporated? Or when you say "take only the top hit", do you mean the top % ID on BLAST?
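To make the genus-versus-species question concrete, this is roughly what we imagine "collapse to genus" doing. It is only a sketch under our own assumptions: the lineage strings, feature IDs, and counts are invented, and `collapse_to_rank` is a hypothetical helper rather than anything from QIIME 2 (as we understand it, `qiime taxa collapse` with `--p-level 6` would be the real equivalent).

```python
from collections import defaultdict


def collapse_to_rank(taxonomy, counts, level=6):
    """Sum feature counts after truncating each lineage to `level` ranks."""
    collapsed = defaultdict(int)
    for feature_id, lineage in taxonomy.items():
        ranks = [rank.strip() for rank in lineage.split(";")]
        ranks = ranks[:level]
        # Pad shallow assignments so every key has the same depth.
        ranks += ["__"] * (level - len(ranks))
        collapsed["; ".join(ranks)] += counts.get(feature_id, 0)
    return dict(collapsed)


# Invented example: the V4 ASV resolves to species, the V2 ASV only to genus.
taxonomy = {
    "asv_v4": "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
              "f__Lactobacillaceae; g__Lactobacillus; s__reuteri",
    "asv_v2": "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
              "f__Lactobacillaceae; g__Lactobacillus",
}
counts = {"asv_v4": 10, "asv_v2": 7}

genus_table = collapse_to_rank(taxonomy, counts, level=6)
species_table = collapse_to_rank(taxonomy, counts, level=7)
```

At level 6 the two ASVs merge into a single genus-level feature, whereas at level 7 they stay separate (one with a species label, one padded), which is the inflation we are worried about.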
Putting all of the variable regions together into one big 16S gene is something we had considered a while back but didn't really know how to accomplish. We had considered SMURF, but @jwdebelius had replied to us and said it basically wasn't a viable option:
In our current pipeline, we use DADA2 denoise-pyro, which as we understand it creates ASVs as opposed to OTUs. Therefore, would we even be in a position to be able to perform closed-reference OTU picking, or would we need to use a different method of denoising that creates OTUs? Also, is closed-reference OTU clustering another way of accomplishing what we wanted to do with SMURF (put all variable regions together into one consensus 16S sequence), and if so, how does one do that?
Using Ion Reporter software, which breaks down results per primer, we have noticed that the breadth and depth of taxonomic classification vary quite dramatically between variable regions (i.e., V9 does very poorly, V4 does well, etc.). Therefore, we unfortunately aren't sure that this workaround would really be sound...
Finally, the above comment implies separating the different variable regions, if we are interpreting this correctly. Is there a way to do this without knowing the exact primer sequences (as is the case in our situation)?
Sorry for the monstrously long post, and thanks in advance for your valuable advice!