Hello,
I used the qiime phylogeny align-to-tree-mafft-fasttree pipeline to construct phylogenetic trees from fungal ITS1 ASVs (trimmed by ITSxpress). I opened the resulting MAFFT alignment in a third-party sequence analysis software (Geneious) and was quite surprised by the poor alignment (see below).
I then aligned the same representative sequences in Geneious using the MAFFT plugin (v. 1.5.0) and obtained a much better alignment (see below).
I tested MAFFT in q2 using qiime alignment mafft and obtained the same poor alignment as with the q2 pipeline (not surprising, since the pipeline calls that same command). According to conda list, MAFFT is present in q2 in version 7.526, but I am not sure whether this reliably indicates the version actually in use.
I tried to extract most of the parameters from q2-mafft (verbose output); this is probably the most important part:
inputfile = orig
133 x 202 - 77 d
nthread = 1
nthreadpair = 1
nthreadtb = 1
ppenalty_ex = 0
stacksize: 8192 kb
generating a scoring matrix for nucleotide (dist=200) ... done
Gap Penalty = -1.53, +0.00, +0.00
The (default) parameters used for the MAFFT plugin in Geneious are:
Algorithm: Auto
Scoring Matrix: 200PAM / k=2
Gap open penalty: 1.53
Offset value: 0.123
The parameters seem to be very similar (scoring matrix, gap open penalty), but I am not perfectly sure...
I have two questions:
Is there any possibility to adjust parameters in q2 alignment mafft in order to get better alignments?
What is the exact order of q2 commands in align-to-tree-mafft-fasttree, so that I can re-import an alignment created by an external MAFFT run, create a tree artifact, and visualize it with q2 empress community-plot? This will break q2 provenance, but I would accept that for the moment.
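For the second question, here is a minimal sketch of the steps that follow the alignment in align-to-tree-mafft-fasttree (mask, FastTree, midpoint rooting), preceded by an import of the external alignment. It only assembles the command lines so they can be inspected before running. The file names and the helper name tree_commands are made up for illustration, and it assumes the FASTA headers in the external alignment are the same feature IDs as in the feature table.

```python
import shlex

def tree_commands(aligned_fasta, prefix="external"):
    """Return the QIIME 2 CLI calls that follow the alignment step.

    aligned_fasta: aligned FASTA exported from the external MAFFT run
                   (hypothetical file name, headers assumed to be feature IDs).
    prefix:        hypothetical prefix for the output artifacts.
    """
    aligned = f"{prefix}-aligned.qza"
    masked = f"{prefix}-masked.qza"
    unrooted = f"{prefix}-unrooted-tree.qza"
    rooted = f"{prefix}-rooted-tree.qza"
    return [
        # 1. Import the external alignment as an artifact.
        ["qiime", "tools", "import",
         "--type", "FeatureData[AlignedSequence]",
         "--input-path", aligned_fasta,
         "--output-path", aligned],
        # 2. Mask highly variable / gap-rich columns.
        ["qiime", "alignment", "mask",
         "--i-alignment", aligned,
         "--o-masked-alignment", masked],
        # 3. Build the unrooted tree with FastTree.
        ["qiime", "phylogeny", "fasttree",
         "--i-alignment", masked,
         "--o-tree", unrooted],
        # 4. Root the tree at its midpoint.
        ["qiime", "phylogeny", "midpoint-root",
         "--i-tree", unrooted,
         "--o-rooted-tree", rooted],
    ]

if __name__ == "__main__":
    # Print the commands; paste them into an activated QIIME 2 environment.
    for cmd in tree_commands("geneious-mafft.fasta"):
        print(shlex.join(cmd))
```

The rooted tree artifact from the last step is the one to pass on to q2 empress community-plot together with the feature table and sample metadata.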
On a side note, it's kind of hard to visually compare these alignments because their output order has changed. The Geneious one does have 'better vibes' I think because it's sorted to place similar features close together.
How are you numerically measuring the quality of the MSA?
(The Geneious one is shorter, which should mean fewer gaps, which could be good...)
Hi @colinbrislawn
Thank you very much for your detailed comments, which should allow me to create a q2 tree using an externally created MSA.
One criterion is the alignment length, as you already stated; another one is the identity profile, visible on top of the graphs. I consider a higher proportion of conserved 'blocks' within ITS regions, separated by various insertions in between, as a good MSA. I would expect this due to the evolution of higher sequence length variability in ITS1, likely caused by insertion/deletion events rather than single nucleotide mutations. I hope I am not wrong with this assumption.
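To put numbers on these criteria, here is a minimal sketch that reads an aligned FASTA and reports the alignment length, the overall gap fraction, the mean per-column identity, and the number of fully conserved columns. The definition of identity used here (share of all sequences carrying the most common base in a column, with gaps counting against it) is my own assumption and not necessarily what Geneious plots in its identity profile.

```python
from collections import Counter

GAP = "-"

def read_fasta(text):
    """Parse FASTA text into a dict of id -> sequence (uppercased)."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records

def msa_summary(records):
    """Summarize an alignment given as a dict of id -> aligned sequence."""
    seqs = list(records.values())
    if not seqs:
        raise ValueError("no sequences found")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences differ in length; not an alignment")

    gaps = sum(s.count(GAP) for s in seqs)
    identities = []
    for column in zip(*seqs):
        bases = [c for c in column if c != GAP]
        # Count of the most common base; 0 for an all-gap column.
        top = Counter(bases).most_common(1)[0][1] if bases else 0
        identities.append(top / len(seqs))

    return {
        "length": length,
        "gap_fraction": gaps / (length * len(seqs)),
        "mean_identity": sum(identities) / length,
        "conserved_columns": sum(1 for i in identities if i == 1.0),
    }

if __name__ == "__main__":
    # Toy alignment for illustration only, not data from this thread.
    demo = ">a\nACGT\n>b\nAC-T\n>c\nACGA\n"
    print(msa_summary(read_fasta(demo)))
```

Running this on both the q2 and the Geneious alignment would give values that can be compared directly, independent of the order in which the sequences are displayed.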
Sorry for the hard-to-read images. I attached two 'better' sections of the alignment, available after the fasttree calculation; here the order of the sequences is defined by genetic distance, and I selected a few feature ids from both alignments to clarify my point. The subtrees are not identical due to the different positions of gaps in the alignments.
After reading your comments, I am no longer sure whether I should expect similar (not identical) alignments by MAFFT in q2 versus Geneious.
Best,
Thank you for sharing that new figure. That does make sense to me.
Changes in scoring settings will change the alignment, like you showed in your first post. And some programs may expose different options and use different defaults, making a like-for-like comparison harder.
Nevertheless, because both programs are calling MAFFT, replicating the results of one using the other is a good idea and should be possible!
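One way to test this outside of both tools is to run MAFFT directly with the settings listed for Geneious. The sketch below is a starting point only: it assumes that "Algorithm: Auto" corresponds to MAFFT's --auto option, that the gap open penalty and offset value map to --op and --ep, and that the representative sequences have been exported to a FASTA file. The file names and function names are hypothetical.

```python
import subprocess

def mafft_command(input_fasta, gap_open=1.53, offset=0.123, threads=1):
    """Build a MAFFT call mirroring the Geneious defaults quoted above.

    Assumption: Geneious 'Auto' is equivalent to MAFFT's --auto.
    """
    return [
        "mafft",
        "--auto",                 # let MAFFT pick the strategy from data size
        "--op", str(gap_open),    # gap open penalty
        "--ep", str(offset),      # offset value
        "--thread", str(threads),
        input_fasta,
    ]

def run_mafft(input_fasta, output_fasta, **kwargs):
    """Run MAFFT; it writes the alignment to stdout, so redirect to a file."""
    with open(output_fasta, "w") as handle:
        subprocess.run(mafft_command(input_fasta, **kwargs),
                       stdout=handle, check=True)

if __name__ == "__main__":
    # Hypothetical file name; only prints the command, does not run MAFFT.
    print(" ".join(mafft_command("rep-seqs.fasta")))
```

If the result matches the Geneious alignment, the difference from q2 most likely comes from the strategy or options MAFFT is called with rather than from the MAFFT version.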