Trouble generating SeppReferenceDatabase qza

dethlefs · October 12, 2020, 9:59pm

Sorry for the slow reply. You're right that phylogenies made with short fragments are likely to be less accurate, and maybe downright misleading. With SEPP, neither the output tree nor the reference tree is calculated de novo from short fragments...the topology of the tree is taken from the reference. In this case, it's a trimmed version of the Silva 138 SSU 16S guide tree, which has a complicated history that I wouldn't be able to explain as well as the Silva folks, although it's derived from a full-length 16S maximum likelihood tree and has received expert curation from specialists in certain taxa.

Given that you have a reference tree that you want to use, and an alignment of the sequences associated with the leaves of that tree, SEPP's job is to make the best placement of new fragments into that tree. It has to align fragments to the existing alignment, and for that step having the reference alignment trimmed is helpful. While @Stefan and Siavash think alignment accuracy will be good for fragments against the full length alignment (and I generally agree), my experience with automated aligners in this context is that bases near the end of the fragment can get spread out to columns outside the intended region.
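To make "trimming the reference alignment" concrete, here is a minimal sketch in plain Python. It is not the script I used; the function name `trim_alignment`, the toy sequences, and the column coordinates are all made up for illustration. In practice the start and end columns would come from locating your PCR primers in the reference alignment.

```python
def trim_alignment(alignment, start, end):
    """Keep only columns start..end-1 (0-based, end exclusive) of every row.

    `alignment` maps a sequence name to its aligned (gapped) string.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return {name: seq[start:end] for name, seq in alignment.items()}


# Toy reference alignment (hypothetical data, 17 columns).
reference = {
    "ref1": "ACGT-GCAGCCTTAGGA",
    "ref2": "ACGTAGCAGCCTTAG-A",
}

# Hypothetical amplicon region: columns 8 through 14.
trimmed = trim_alignment(reference, start=8, end=15)
print(trimmed)
```

Every row is cut at the same columns, so the trimmed result is still a valid alignment that matches the leaves of the reference tree.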

Imagine that a fragment has GCA for the first 3 bases, and the first 3 columns that we know to be appropriate (based on our PCR primers) are consistently GCC in the reference seqs most similar to that fragment. Imagine further that the triplet just prior to those columns in the similar reference seqs is GCA. The aligner doesn't know what PCR primers we used to obtain the fragments, and its mismatch and gap penalties might result in an optimal score when the fragment GCA is aligned just outside the appropriate area, with a gap spanning the next three columns. In effect, trimming the reference alignment is telling the aligner what we know about our PCR primers, in this example forcing it to choose GCA aligned to GCC with a mismatch inside the appropriate area of the reference alignment.
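The GCA/GCC scenario can be worked through with a toy scoring function. This is only a sketch: the scoring values (match +2, mismatch -7, gap open -4, gap extend -1) are invented so that the effect shows up, and real aligners use profile HMMs or other schemes rather than this simple sum. The sequences and the name `score_alignment` are also hypothetical.

```python
def score_alignment(ref_row, frag_row, match=2, mismatch=-7,
                    gap_open=-4, gap_extend=-1):
    """Score one fragment row against one reference row of equal length.

    Gaps before the first and after the last fragment base are free,
    since a fragment is expected to cover only part of the reference.
    """
    if len(ref_row) != len(frag_row):
        raise ValueError("rows must have the same length")
    positions = [i for i, base in enumerate(frag_row) if base != "-"]
    start, end = positions[0], positions[-1]
    score = 0
    in_gap = False
    for i in range(start, end + 1):
        if frag_row[i] == "-":
            # First gap column pays gap_open, later ones pay gap_extend.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            in_gap = False
            score += match if frag_row[i] == ref_row[i] else mismatch
    return score


# Untrimmed reference: GCA sits just outside the amplicon, GCC just inside.
untrimmed_ref = "GCAGCCTTAG"
outside = score_alignment(untrimmed_ref, "GCA---TTAG")  # GCA placed outside, 3-column gap
inside = score_alignment(untrimmed_ref, "---GCATTAG")   # GCA vs GCC, one mismatch

# Trimmed reference: the outside columns no longer exist.
trimmed_ref = "GCCTTAG"
forced = score_alignment(trimmed_ref, "GCATTAG")

print(outside, inside, forced)  # 8 5 5
```

With these made-up penalties the "outside" placement scores 8 against 5 for the intended one, so an aligner given the untrimmed reference would prefer it. After trimming, the only available placement is the intended one.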

But that's not likely to have a huge impact since it will make a small difference in the alignment scores of probably a small number of fragments, which might or might not even affect their placement. @Stefan and Siavash point out the more important benefit for those lacking free access to supercomputers: if your aligner doesn't have to waste time for each fragment testing alignments that include the majority of columns outside the appropriate region, it can more quickly find the best alignment for each fragment within the appropriate region.

I've glossed over some steps...prior to fragment placement, the Silva guide tree (and perhaps most big reference trees) will have to have polytomies cleaned up, and the branch lengths of the guide tree will have to be re-estimated, along with the evolutionary parameters that go into maximum likelihood modeling of the phylogeny. I suppose if someone were to use the SEPP output tree to make branch length dependent arguments about something or other, or treat the associated ML evolutionary parameters as precise estimates, that could be bad. (And would be foolish.)
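For anyone unsure what "cleaning up polytomies" means: a polytomy is a node with more than two children, and the cleanup turns each one into a series of two-way splits. Below is a minimal sketch on a tree written as nested tuples. It ignores branch lengths (real tools typically add zero-length branches for the new internal nodes) and the pairing order is arbitrary; the function name and toy tree are hypothetical and this is not the tool I used.

```python
def resolve_polytomies(node):
    """Return a copy of the tree in which no node has more than two children.

    A leaf is a string; an internal node is a tuple of child nodes.
    """
    if isinstance(node, str):
        return node
    children = [resolve_polytomies(child) for child in node]
    # Repeatedly join the first two children under a new internal node.
    while len(children) > 2:
        children = [(children[0], children[1])] + children[2:]
    return tuple(children)


# Toy tree with a four-way polytomy at the root (hypothetical taxa).
tree = ("A", "B", ("C", "D", "E"), "F")
binary_tree = resolve_polytomies(tree)
print(binary_tree)
```

The new splits carry no real phylogenetic information, which is one more reason not to read much into the fine structure of those parts of the output tree.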

But usually, all we want is the most likely placement of our fragments into a tree topology that we trust (or at least are familiar with). I'd argue that the re-estimated branch lengths and parameters (following trimming) provide the best chance of getting the fragments placed accurately on the fixed-topology tree (because they're the best estimates for fragment alignment columns), even if estimates of average mutation rates and transition probabilities etc. are expected to be more precise when using longer sequences.

Regards,
Les

P.S. I haven't forgotten about sharing my work...still haven't gotten around to generating a README file, but I hope to soon.