Best practices for generating a custom SeppReferenceDatabase qza?

Hi!

I'm trying to create a custom SeppReferenceDatabase. I know from this discussion that to generate a custom SEPP database, I need three files: a fasta alignment, a Newick tree, and a RAxML info file.

Progress so far: I generated a reference tree using 5 genes (4 protein-coding genes plus full-length 16S). The final alignment was almost 12k bp. (I'm using multiple genes because I want to differentiate among subclades within a single bacterial genus.) I used PartitionFinder to estimate base substitution rates for each of the 3 codon positions for each of the four protein-coding gene, and I used a single partition for the entire 16S gene.

In the post linked above, @dethlefs and @Stefan agreed that it's preferable to trim the reference alignment to just the fragment region being placed, because this should significantly decrease runtime. So: I've trimmed my alignment to the amplicon region.

My question: If I use the~460 bp trimmed alignment, will the RAxML_info file generated for my full phylogeny (based on ~12k bp, with multiple partitions) be useless? If SEPP is using the base substitution rates from the RAxML_info file, then how would SEPP know which base substitution rates belong to my amplicon region? Is my intuition correct that this will be a problem?

Possible solution: To correctly estimate the substitution rates for my amplicon region, I'm considering building a new phylogeny with just the trimmed alignment, and using the resulting RAxML_info file plus my original phylogeny and the trimmed alignment to generate the custom SEPP database. Is this a viable course of action? ...Or should I abandon the attempt to be efficient and just use the full 12k alignment plus the full tree's info file, and let SEPP run for a longer time?

Thank you for reading!

Version of QIIME 2: qiime2-amplicon-2024.2 installed via conda

Hi Alison,

Siavash is the expert for this topic, I am only guessing here. From what I know, the placement of a query sequence is only determined hierarchically via sequence similarity (chain of hmmsearch queries against shrinking alignments of sub-clades in the reference tree). Once a placement is found, the new tip is inserted into the reference and I am not totally sure how the original branch length is divided to position the new internal node. From what I recall, this it is just split by half. Therefore, I figure substitution rates do not impact the insertion tree at all and you don't have to worry about fiddling with the info file.

But you might want to double check with members of Siavash's team
Stefan

Thank you Stefan! When I have a moment I will message Siavash and post relevant answers here, since it could be useful information for others too.

1 Like