Hi @SoilRotifer,
A couple of questions about the alignment work:
First, I might be doing something terribly wrong, but it looks like the first thing I need to do with my fasta file is align it. I tried just running a subset of my COI sequences that had gaps removed and were homopolymer/ambiguous N's removed (COI_subsetSeqs.fasta), but this failed...
mafft \
> --addfragments FolmerCOI_primers.fasta \
> --mapout COI_subsetSeqs.fasta > FolmerCOI_primers_aln_map.txt
nadd = 2
Not aligned!
I thought maybe this is because my input isn't an alignment file itself, but simply fasta sequences? There are 10,000 sequences in the COI_subsetSeqs.fasta
file, in this kind of format:
>10001951
GACCCTATACTTCCTTCTAGGTATTTGATCCGGACTAACCGGCACATCCATAAGACTTTT...{and more ATCG}...
>10004549
AACACTTTATTTTATTTTTGGAATTTGAGCAGGAATAGTAGGATCATCCTTAAGATTACT...{and more ATCG}...
It looks like you were running your examples using a dataset that was previously aligned. I figured that I should then just try aligning this subset of sequences then, so I ran those 10k COI sequences through MAFFT:
mafft --thread 4 --threadtb 5 --threadit 0 --inputorder --anysymbol \
--auto COI_subsetSeqs.fasta > COI_subsetSeqs_mafft.out
and then used that alignment output as the input instead for the same (previously failed) mafft bit earlier:
mafft --addfragments FolmerCOI_primers.fasta \
--mapout COI_subsetSeqs_mafft.out > FolmerCOI_primers_aln_map.txt
And then it worked, but that led me to the next question... Can you please elaborate a bit on how I would go about inferring the FolmerCOI_primers.fasta.map
output from mafft?
In your github post, you write:
You can sanity-chack the alignment in your favorite alignment viewer. Anyway, this looks good, lets extract amplicon region from the original alignment. Note the values I use for the start and end position of the alignment
It looks like you've defined the positions in the next code block as the interior flanking position of the primers. What exactly is the sanity check? My output, below, seems to me to suggest that these two primers are spanning > 1000 bp apart... that doesn't seem right. I'd expect an ~ 700 bp fragment:
>Forward_LCO1490_GGTCAACAAATCATAAAGATATTGG
# letter, position in the original sequence, position in the reference alignment
g, 1, 53
g, 2, 54
t, 3, 55
c, 4, 56
a, 5, 57
a, 6, 58
c, 7, 59
a, 8, 60
a, 9, 61
a, 10, 62
t, 11, 63
c, 12, 64
a, 13, 65
t, 14, 66
a, 15, 67
a, 16, 68
a, 17, 69
g, 18, 70
a, 19, 71
t, 20, 72
a, 21, 73
t, 22, 74
t, 23, 75
g, 24, 76
g, 25, 77
>Reverse_HCO2198_TAAACTTCAGGGTGACCAAAAAATCA
# letter, position in the original sequence, position in the reference alignment
t, 1, 1236
a, 2, 1237
a, 3, 1238
a, 4, 1239
c, 5, 1240
t, 6, 1241
t, 7, 1242
c, 8, 1243
a, 9, 1244
g, 10, 1245
g, 11, 1246
g, 12, 1247
t, 13, 1248
g, 14, 1249
a, 15, 1250
c, 16, 1251
c, 17, 1252
a, 18, 1253
a, 19, 1254
a, 20, 1255
a, 21, 1256
a, 22, 1257
a, 23, 1258
t, 24, 1259
c, 25, 1260
a, 26, 1261