UniFrac and Phylogenetic Methods with Full-length 16S-ITS-23S Sequences

Hi all,

I have a question that I'm guessing most here haven't had to deal with. It concerns how to handle the analysis for sequencing data that contains the ITS sequence between the 16S and 23S rRNA genes (the full rRNA operon).

Because most bacteria have multiple copies of the rRNA operon, and each copy can differ from one another, sequencing beyond just the 16S rRNA gene can give more granular information on the bacteria that are present. Here, the ITS region can be helpful in distinguishing different strains within the same species, as it can be highly variable. The ITS region can also encode things like tRNAs. However, not all copies of the rRNA operon will have an ITS that is the same length.

Which leads to my question. Is it still appropriate to use the full operon to generate phylogenies, ie for use with UniFrac? I ask because the presence of a tRNA in one copy of the operon but not another from the same bacterium doesn't really reflect an evolutionary history between the two sequences like a tree would suggest. Because of that, would it be improper to use this tree? The alternative is to do an in silico PCR to extract only the 16S region, and create a tree based only on that. This eliminates the problem posed by the ITS, but also eliminates a lot of data and the benefits of sequencing this larger amplicon instead of just the full 16S gene.

I would greatly appreciate any insight and thoughts from others in the community!

This is a fascinating question!

Is this hypothetical or have you sequenced this? What are your primers and sequencing platform? :face_with_peeking_eye:

Yeah, this feels like it would break something, but does it?
Steps: reads -> MSA -> tree building
Programs: nanopore -> mafft -> fasttree

The missing regions would show up as gaps in the MSA, especially dangerous because you would still get an MSA, but it may be totally useless for inferring relatedness. Would a clever scoring matrix for the MSA solve this problem?

You may have found this already, but it not, check out this long discussion about multi-region sequencing. All their regions are separate, which is equivalent to your proposal of cutting out each region from the full length read. They don't know what regions are connected together, but that's not an issue for you!

Thank you for bringing this question to the forums.

Hi Colin, thanks for your reply!

I haven't sequenced the entire rRNA operon, but I have been using primers from Intus Bioscience (formerly Shoreline Biome) to amplify the entire 16S and ITS regions, alongside a portion of the 23S (~2,500 bp total). The forward primer is just the classic 27F, but the reverse is a custom primer for "StrainID" (the sequence is AGTACYRHRARGGAANGR). We've done all of our sequencing runs on a PacBio Sequel IIe.

I hadn't seen that thread, but that could be helpful! I'll be sure to look through it. Thank you!

Another thought - my original idea was to just trim down to the 16S region (I've found I have to do this anyways for the taxonomic assignment), but maybe it would be better to just excise the ITS, and keep both the 16S and 23S. I would need a forward primer sequence for the 23S rRNA, but maybe that could help keep more info in... Lot's to think about.

Thanks again!

In my mind, I think the biggest issues are consistency in the MSA and that's hard because

  • the region is very long
  • some parts are more variable/conserved than others
  • we expect big gaps due to variable length

Removing ITS may help with gaps, but the other problems remain. A lot of the problems with multiple sequence alignment and phylogenetic inference are hidden when using a single hypervariable region.

This may be a good question to bring to the MUSCLE developer. He's been thinking about MSA forever and he's on the forums from time to time (@.Robert_Edgar). You could also reach him by email.