Hi @jwdebelius ,
Thanks for reporting this. I did some testing this morning and could replicate this, but could also see that the aligned rep seqs from GG are actually shorter (after degapping) than their unaligned counterparts, as are the amplicons generated from the degapped alignment (I also confirmed via visual inspection) — so I think in part the issue is that the alignments are masked, accounting for the lost length.
That is not to say that trim-alignment
is blameless, just that your observations might be due to the length differences in raw vs. aligned+masked sequences.
I also see that these primers do not align (with extract-reads
) to a rather large number of the GG seqs, presumably because the start position is downstream of the forward primer.
Here's a snipped (of GG 77%, and dropping all seqs that do not align to one or the other primer). These are the sequence lengths before and after trimming (with extract-reads
). All aligned seqs are degapped.
|
seqs |
aligned seqs (degapped) |
trimmed_seqs |
trimmed_degapped |
576962 |
1487 |
1352 |
223.0 |
188.0 |
573196 |
1447 |
1275 |
237.0 |
191.0 |
539547 |
1424 |
1205 |
278.0 |
190.0 |
300695 |
1390 |
1228 |
229.0 |
191.0 |
294612 |
1414 |
1234 |
238.0 |
190.0 |
... |
... |
... |
... |
... |
4459468 |
1472 |
1272 |
221.0 |
189.0 |
4460175 |
1523 |
1243 |
219.0 |
190.0 |
4463866 |
1388 |
1201 |
222.0 |
191.0 |
2909029 |
2109 |
1216 |
243.0 |
215.0 |
4271527 |
1380 |
1252 |
223.0 |
215.0 |
I do see that trim-alignment
is producing even shorter seqs for most of these, but the sense I am getting (from attempting to align the raw GG seqs with mafft and then trimming) is that that region is very gappy in the GG seqs, so mafft
might be having some trouble finding the correct binding site (esp. if it is being mangled by the mask) — the issue in this case seems to be some garbage sequences in GG that are leading to bad alignments (because the primers are exact matches to many seqs, ~10% of the 88% otus?), though the primers also might not be great. Manually inspecting the alignments, I see that this appears to be the case: that mafft is inserting a very large number of gaps within the primer sequence when aligning against GG... so much so that in some cases the forward primer actually aligns downstream of the reverse primer! But even in the best case scenario the final position (which is where trim-alignment
trims) is much further along than it should be.
So I guess my impression at the "end of the day" (or rather a long morning spent debugging) is that:
- these primers do not align well to some seqs in GG, which looks like an issue with a few "bad apples", but could also be the primers (I guess no "universal" primer is universal, it is known)***.
-
trim-alignment
is not really an in silico PCR method, it really just finds the alignment positions of short fragments vs. a larger alignment and trims arbitrarily. So we might say that it has the fragile assumption that the input primers are good, well-validated primers, and the input reference does not contain any garbage seqs (even a few can obviously spoil the picnic)
- hence, this might be considered an edge case where
trim-alignment
simply does not work well when the input is not well-scrubbed (db, primers, or both).
So this seems to boil down to the old story in bioinformatics/life of "garbage in, garbage out".
The solutions:
- some garbage collection is probably in order, to remove sequences in GG (or whatever input reference) with high numbers of ambiguous characters or that align poorly.
- maybe this should be better documented? i.e., that users should manually inspect these alignments?
-
trim-alignment
could raise a warning if the primers align with a large number of internal gaps.
-
trim-alignment
could output a visualizer to allow direct inspection, i.e., to manually confirm and diagnose issues like this.
I am opening a few issues in RESCRIPt related to the last points, but right now this looks more like a feature (of the inputs), not a bug 
but we can also explore some more to make sure, when we get a chance to loop back to these issues in GH.
***since the question here is rather about whether there is a bug in rescript and not the primers, I did not really look into primer quality using a proper in silico PCR tool to assess coverage etc. But quickly looking at the SMURF paper (where these come from?), it looks like they were maybe designed by just finding the primers with the most exact matches to the GG database? This seems to leave a very large proportion that do not align perfectly (but how badly they align I did not measure). Hence I do not want to pass judgment one way or another — i.e., whether it is a "few bad seeds" in GG, or low-coverage primers. Either way, it seems like this is the "edge case" that users should inspect when using trim-alignment
: this method uses a brute force approach and does not drop reference seqs that poorly align to a primer (since the point of this method is to trim ragged sequences that might only align to one primer but not both). It assumes that all seqs align but does not inspect quality of alignment automatically (the onus rests on the user).