Hi!
I'm trying to create a custom SeppReferenceDatabase. I know from this discussion that to generate a custom SEPP database, I need three files: a fasta alignment, a Newick tree, and a RAxML info file.
Progress so far: I generated a reference tree using 5 genes (4 protein-coding genes plus full-length 16S). The final alignment was almost 12k bp. (I'm using multiple genes because I want to differentiate among subclades within a single bacterial genus.) I used PartitionFinder to estimate base substitution rates for each of the 3 codon positions for each of the four protein-coding gene, and I used a single partition for the entire 16S gene.
In the post linked above, @dethlefs and @Stefan agreed that it's preferable to trim the reference alignment to just the fragment region being placed, because this should significantly decrease runtime. So: I've trimmed my alignment to the amplicon region.
My question: If I use the~460 bp trimmed alignment, will the RAxML_info file generated for my full phylogeny (based on ~12k bp, with multiple partitions) be useless? If SEPP is using the base substitution rates from the RAxML_info file, then how would SEPP know which base substitution rates belong to my amplicon region? Is my intuition correct that this will be a problem?
Possible solution: To correctly estimate the substitution rates for my amplicon region, I'm considering building a new phylogeny with just the trimmed alignment, and using the resulting RAxML_info file plus my original phylogeny and the trimmed alignment to generate the custom SEPP database. Is this a viable course of action? ...Or should I abandon the attempt to be efficient and just use the full 12k alignment plus the full tree's info file, and let SEPP run for a longer time?
Thank you for reading!
Version of QIIME 2: qiime2-amplicon-2024.2 installed via conda