Trim alignment produces shorter reads than fragment extraction


I'm trying to compare regions extracted using RESCRIPt's trim-alignment and q2-feature-classifier's extract-region. My expectation is that I should get similar length sequences out of both techniques. Instead, I'm letting longer reads out of feature-classifier (about 130nt) and short reads (60nt) out of rescript. I expect the longer read length.

I'm using TGGCGGACGGGTGAGTAA and CTGCTGCCTCCCGTAGGA as my forward and reverse primers, repectively. (Both 5'-3'); I've also tried the reverse complement of the reverse on greengenes 13_8 88%. It's not clear to me why the performance is so different, but I'm concerned that trim-alignment with primers isn't doing quite what it's supposed to, and I'm not sure how to coerce it to pkay nicely.


Hi @jwdebelius ,

Thanks for reporting this. I did some testing this morning and could replicate this, but could also see that the aligned rep seqs from GG are actually shorter (after degapping) than their unaligned counterparts, as are the amplicons generated from the degapped alignment (I also confirmed via visual inspection) — so I think in part the issue is that the alignments are masked, accounting for the lost length.

That is not to say that trim-alignment is blameless, just that your observations might be due to the length differences in raw vs. aligned+masked sequences.

I also see that these primers do not align (with extract-reads) to a rather large number of the GG seqs, presumably because the start position is downstream of the forward primer.

Here's a snipped (of GG 77%, and dropping all seqs that do not align to one or the other primer). These are the sequence lengths before and after trimming (with extract-reads). All aligned seqs are degapped.

seqs aligned seqs (degapped) trimmed_seqs trimmed_degapped
576962 1487 1352 223.0 188.0
573196 1447 1275 237.0 191.0
539547 1424 1205 278.0 190.0
300695 1390 1228 229.0 191.0
294612 1414 1234 238.0 190.0
... ... ... ... ...
4459468 1472 1272 221.0 189.0
4460175 1523 1243 219.0 190.0
4463866 1388 1201 222.0 191.0
2909029 2109 1216 243.0 215.0
4271527 1380 1252 223.0 215.0

I do see that trim-alignment is producing even shorter seqs for most of these, but the sense I am getting (from attempting to align the raw GG seqs with mafft and then trimming) is that that region is very gappy in the GG seqs, so mafft might be having some trouble finding the correct binding site (esp. if it is being mangled by the mask) — the issue in this case seems to be some garbage sequences in GG that are leading to bad alignments (because the primers are exact matches to many seqs, ~10% of the 88% otus?), though the primers also might not be great. Manually inspecting the alignments, I see that this appears to be the case: that mafft is inserting a very large number of gaps within the primer sequence when aligning against GG... so much so that in some cases the forward primer actually aligns downstream of the reverse primer! But even in the best case scenario the final position (which is where trim-alignment trims) is much further along than it should be.

So I guess my impression at the "end of the day" (or rather a long morning spent debugging) is that:

  1. these primers do not align well to some seqs in GG, which looks like an issue with a few "bad apples", but could also be the primers (I guess no "universal" primer is universal, it is known)***.
  2. trim-alignment is not really an in silico PCR method, it really just finds the alignment positions of short fragments vs. a larger alignment and trims arbitrarily. So we might say that it has the fragile assumption that the input primers are good, well-validated primers, and the input reference does not contain any garbage seqs (even a few can obviously spoil the picnic)
  3. hence, this might be considered an edge case where trim-alignment simply does not work well when the input is not well-scrubbed (db, primers, or both).

So this seems to boil down to the old story in bioinformatics/life of "garbage in, garbage out".

The solutions:

  1. some garbage collection is probably in order, to remove sequences in GG (or whatever input reference) with high numbers of ambiguous characters or that align poorly.
  2. maybe this should be better documented? i.e., that users should manually inspect these alignments?
  3. trim-alignment could raise a warning if the primers align with a large number of internal gaps.
  4. trim-alignment could output a visualizer to allow direct inspection, i.e., to manually confirm and diagnose issues like this.

I am opening a few issues in RESCRIPt related to the last points, but right now this looks more like a feature (of the inputs), not a bug :wink:

but we can also explore some more to make sure, when we get a chance to loop back to these issues in GH.

***since the question here is rather about whether there is a bug in rescript and not the primers, I did not really look into primer quality using a proper in silico PCR tool to assess coverage etc. But quickly looking at the SMURF paper (where these come from?), it looks like they were maybe designed by just finding the primers with the most exact matches to the GG database? This seems to leave a very large proportion that do not align perfectly (but how badly they align I did not measure). Hence I do not want to pass judgment one way or another — i.e., whether it is a "few bad seeds" in GG, or low-coverage primers. Either way, it seems like this is the "edge case" that users should inspect when using trim-alignment: this method uses a brute force approach and does not drop reference seqs that poorly align to a primer (since the point of this method is to trim ragged sequences that might only align to one primer but not both). It assumes that all seqs align but does not inspect quality of alignment automatically (the onus rests on the user).

1 Like

Thanks @Nicholas_Bokulich! This is really helpful.

I was hoping to wrap this into Sidle as a way to handle both primerless reads and reads with a primer in a single pipeline; I may need to re-think that. I may also need to think about finding an unmasked reference for those reads or maybe generating a masked reference once the data is filtered. (Currently the recommendation is to filter before database generation, in cases where the aligned reads are involved, it needs to be filter -> align -> trim); I may need to re-think this.

I'll think about what kind of visualizer could be used for the alignment; it may be past the scope of what I can help implement. But, yes, the conversation on GitHub would be good.


1 Like