Importing BOLD COI sequence data into QIIME 2 / RESCRIPt

SoilRotifer · July 9, 2020, 5:44pm

Yes! This is a very big concern for me, and also relates a little, to what I've mentioned here. We are planning to implement an alignment based extraction approach. One way to do this, is identical to what I've done with my prototype SILVA parser, see step 7.

I have a few other ideas up-my-sleeve...

-Mike

devonorourke · July 10, 2020, 12:27am

Hi @SoilRotifer,

A couple of questions about the alignment work:

First, I might be doing something terribly wrong, but it looks like the first thing I need to do with my fasta file is align it. I tried just running a subset of my COI sequences that had gaps removed and were homopolymer/ambiguous N's removed (COI_subsetSeqs.fasta), but this failed...

mafft \
> --addfragments FolmerCOI_primers.fasta \
> --mapout COI_subsetSeqs.fasta > FolmerCOI_primers_aln_map.txt
nadd = 2
Not aligned!

I thought maybe this is because my input isn't an alignment file itself, but simply fasta sequences? There are 10,000 sequences in the COI_subsetSeqs.fasta file, in this kind of format:

>10001951
GACCCTATACTTCCTTCTAGGTATTTGATCCGGACTAACCGGCACATCCATAAGACTTTT...{and more ATCG}...
>10004549
AACACTTTATTTTATTTTTGGAATTTGAGCAGGAATAGTAGGATCATCCTTAAGATTACT...{and more ATCG}...

It looks like you were running your examples using a dataset that was previously aligned. I figured that I should then just try aligning this subset of sequences then, so I ran those 10k COI sequences through MAFFT:

mafft --thread 4 --threadtb 5 --threadit 0 --inputorder --anysymbol \
--auto COI_subsetSeqs.fasta > COI_subsetSeqs_mafft.out

and then used that alignment output as the input instead for the same (previously failed) mafft bit earlier:

mafft --addfragments FolmerCOI_primers.fasta \
--mapout COI_subsetSeqs_mafft.out > FolmerCOI_primers_aln_map.txt

And then it worked, but that led me to the next question... Can you please elaborate a bit on how I would go about inferring the FolmerCOI_primers.fasta.map output from mafft?

In your github post, you write:

You can sanity-chack the alignment in your favorite alignment viewer. Anyway, this looks good, lets extract amplicon region from the original alignment. Note the values I use for the start and end position of the alignment

It looks like you've defined the positions in the next code block as the interior flanking position of the primers. What exactly is the sanity check? My output, below, seems to me to suggest that these two primers are spanning > 1000 bp apart... that doesn't seem right. I'd expect an ~ 700 bp fragment:

>Forward_LCO1490_GGTCAACAAATCATAAAGATATTGG
# letter, position in the original sequence, position in the reference alignment
g, 1, 53
g, 2, 54
t, 3, 55
c, 4, 56
a, 5, 57
a, 6, 58
c, 7, 59
a, 8, 60
a, 9, 61
a, 10, 62
t, 11, 63
c, 12, 64
a, 13, 65
t, 14, 66
a, 15, 67
a, 16, 68
a, 17, 69
g, 18, 70
a, 19, 71
t, 20, 72
a, 21, 73
t, 22, 74
t, 23, 75
g, 24, 76
g, 25, 77
>Reverse_HCO2198_TAAACTTCAGGGTGACCAAAAAATCA
# letter, position in the original sequence, position in the reference alignment
t, 1, 1236
a, 2, 1237
a, 3, 1238
a, 4, 1239
c, 5, 1240
t, 6, 1241
t, 7, 1242
c, 8, 1243
a, 9, 1244
g, 10, 1245
g, 11, 1246
g, 12, 1247
t, 13, 1248
g, 14, 1249
a, 15, 1250
c, 16, 1251
c, 17, 1252
a, 18, 1253
a, 19, 1254
a, 20, 1255
a, 21, 1256
a, 22, 1257
a, 23, 1258
t, 24, 1259
c, 25, 1260
a, 26, 1261

SoilRotifer · July 10, 2020, 12:49pm

Hi @devonorourke

By performing a

sanity-check the alignment in your favorite alignment viewer.

I basically mean that you should visually inspect how well the primers align to your sequences. To make sure they are being mapped correctly. Often this will not be a problem, but I have encountered cases when my primers did not align well and I have to fiddle with alignment parameters or make manual adjustments with an alignment editor. I typically view my alignments with Unipro UGENE.

Also, visualizing the alignment will help explain why the primers appear 1000 alignment positions apart, i.e. not 1000 bp apart. The numbers on the right-most column denote the alignment column positions, the length of the alignment will be longer than the sequences due to any indels added to make the alignment. If you look at the positional values of the 16S V4 primers in my tutorial you'll see that a single primer sequence spans ~1,800 alignment positions! The SILVA alignment is 50,000 alignment positions, lots of indels.

-Mike

devonorourke · July 10, 2020, 12:56pm

Great, thanks @SoilRotifer.

I realized after fighting a bit with the data last night that I was going to get caught needing to circle back and use an aligned dataset from the outset. I had started my COI database workflow by removing gaps and the leading/trailing N's, but it occurred to me as I was running out of coffee (read: brain power) that I couldn't use the quality-filtered fasta file any more to do this particular work mapping primers and then coordinate-based filtering.

My lingering question/concern is whether or not I can do this coordinate-based filtering in the first place. While it looks like the raw BOLD sequence are aligned to each other, the sequences with gaps aren't even all the same length. So they're probably aligned initially, but then truncated to some region by BOLD. argh.

I think I'll have to go back to the drawing board and figure out a way to invoke this coordinate-based filtering approach with a smaller dataset (unless I can quickly align 3.5 million sequences...?). I suppose I could try to dereplicate these data first? In my experience previous BOLD database work, dereplicating reduced about 3 million records to 1 million. Then perhaps I could try running MAFFT to create a proper alignment file, and then try this coordinate filtering?

Thanks

SoilRotifer · July 10, 2020, 1:12pm

Oh... this is bringing back nightmares.

Yes, quite the mystery.

I think using your ~ 1 million sequence set, for an alignment, might be worth a shot.

-Mike

devonorourke · July 10, 2020, 1:14pm

Any particular recommendations for tools that are used to tackle such a massive data set? Is MAFFT the best approach? I have no experience on this front.

SoilRotifer · July 10, 2020, 1:18pm

I think mafft is great for this. Given the number of sequences in your data set you may need to enable the --p-parttree option.

devonorourke · July 10, 2020, 3:35pm

I'll give it a go @SoilRotifer.
Looking into the mafft manual a bit, and it looks like I can create these alignments fastest with:

mafft --retree 1 --maxiterate 0 --nofft --parttree

There are several solutions that MAFFT recommends with big datasets, so if anyone else has a sense of other parameters that would be sensible for aligning 2.3 million sequences (that's how many are in the dereplicated file!), I'd love to hear it.

My concern is that this number of sequences is going to be impossible to fit into a temporary disk. From a MAFFT page about parallelization:

Size of the data stored in the temporary directory is proportional to N2, where N is the number of sequences.
It also depends on the similarity. In one case, when N∼80,000, the data size ∼150GB. Be careful about disk space and quota.

Maybe parallelization is another option? Trying not to have to run anything on our cluster with MPI... that wasn't fun the last time I tried with genome annotations...

devonorourke · July 24, 2020, 12:59pm

Quick update @SoilRotifer and @Nicholas_Bokulich,
The mafft alignment failed after 12 days. Not clear what to make of the error message (see below). It looks like there was sufficient memory to finish all the alignment tasks, but something failed when it went to create a temporary file... I posted the same log message to mafft online so hopefully they'll have some hints.

Any other thoughts on how to possibly build up an alignment by subsampling? I wondered if maybe I should somehow cluster my sequences at something like 97% (which would reduce them to probably about 200k sequences instead of 1800k) and align just those sequences first. Then, take the remaining ~1600k unaligned sequences that weren't clustered and align them to the aligned dataset. I think this might work with mafft --add?

Thanks for any tips or tricks you can think of!

m Alignment 1818033-890
Done.



----------------------------------------------------------------------------

nseq = 1818923
groupsize = 1818924, partsize=50

----------------------------------------------------------------------------
splittbfast (nuc) Version 7.471
alg=M, model=DNA200 (2), 1.53 (4.59), -0.00 (-0.00), noshift, amax=0.0
1 thread(s)

mv: cannot stat `hat3.seed': No such file or directory

Strategy:
 FFT-NS-DPPartTree-1 (Not tested.)
 ?

If unsure which option to use, try 'mafft --auto input > output'.
For more information, see 'mafft --help', 'mafft --man' and the mafft page.

The default gap scoring scheme has been changed in version 7.110 (2013 Oct).
It tends to insert more gaps into gap-rich regions than previous versions.
To disable this change, add the --leavegappyregion option.

/scratch/dro49/conda/envs/rescript/bin/mafft: line 2687: /tmp/dro49/31595547/mafft.o4dmpecEvW/pre: No such file or directory

SoilRotifer · July 24, 2020, 10:29pm

Hi Devon,

Based on:

Are running this through another script? mv (move) is a unix command, and it is apparently unable to move files.

There are a couple things you can try:

If you want to stay within QIIME 2 then I think your idea of using maft-add is a good start. However, I'd start that off by making your first alignment from a specific clade.
- Then add new sequences to that alignment by clade.
- This will usually help make the process quicker.
External to QIIME 2, you can run mafft (or other tool) manually, to merge separate alignments together (sometimes called profile-profile alignments)
- Again, make separate alignments by clade, then merge them.

There are probably other ways, but those are two off the top of my head.

-Mike

devonorourke · July 28, 2020, 8:26pm

Hey @SoilRotifer,
What do you think about performing the whole process on just clade-specific groups?
So, rather than creating a gigantic alignment file of all animal COI from BOLD, then aligning primers to find the coordinate boundaries, then removing the gaps...

we just perform clade-specific alignments, aligning the primer set to those clade-specific alignments, degapping those alignments... then merging all the clades?

That should in principle be faster, right (parallelization notwithstanding). Wouldn't aligning within clades be faster?

Nicholas_Bokulich · July 28, 2020, 8:44pm

I love this idea! If the goal is just to find primer binding sites based on alignment, a massive global MSA is not necessary.

The only shortcoming I see is if all sequences in a clade fail to match the primer because all sequences are downstream of the primer site (maybe that primer was used to amplify all of those sequences, but was removed from all prior to uploading to BOLD? a distant possibility)

Another possibility: since the issue here is that extract-reads relies on a primer being inside a sequence to trim and keep, what if you use that method as a first pass, then align all failures (by clade) will a representative of the clade that has the primer sequence (pre extract-reads)? This might help you pass a significant number of sequences before resorting to alignment.

devonorourke · July 28, 2020, 8:47pm

That is way, way smarter.
I'll give that a go and report back

SoilRotifer · July 28, 2020, 8:49pm

@devonorourke, I second everything that @Nicholas_Bokulich, outlined. Great idea!

-Mike

devonorourke · August 7, 2020, 5:37pm

It turns out the alignment process can be dramatically simplified without invoking any clade-specific trickery. However, my initial foray into this new strategy has an element of stochasticity to it that I haven't worked out yet. All of these ideas come straight from the mafft developer, and we're working on identifying how to get rid of the randomness (of the success or failure) of this technique. Here's the general approach:

Gather a subset of sequences of the giant reference database, then build a tiny alignment using just that subset. Something like this should work:

## get just 2,000 sequences from the original fasta file:
seqkit sample --rand-seed 101 --number 2000 --two-pass all_COI_refSeqs.fasta -w 0 > subset_COI_refSeqs.fasta

## make the alignment using just those 2,000 sequences
mafft --auto --thread -1 subset_COI_refSeqs.fasta > reference_MSA

That job takes under a minute. !

The next step involves adding the primer sequences to that small reference_MSA alignment file. The anml_primers.fasta file contains the forward primer sequence, and the reverse complement of the reverse primer sequence (in fasta format). We add those two primer sequences to the original reference_MSA alignment file and create a new alignment file with the primers added to it. Why? To make sure our primers are going to map to the reference_MSA file in the right positions. Like this:

mafft --auto --addfragments anml_primers.fasta --keeplength --thread -1 --mapout --reorder reference_MSA > ref_primer

This step also takes only a minute to complete. !

Now the heavy lifting. We add all the remaining sequences to that primer_ref alignment file. Adding the --keeplength term is what makes this whole process work vastly faster. From the developer directly:

The "--keeplength" option is a key. It inserts gaps in an asymmetric manner:

If an insertion is in reference MSA, then the corresponding position in non-reference sequence is normally gapped, as in standard alignment calculation.
If an insertion is in a non-reference sequences, then the insertion is just removed. No gap is inserted into the reference MSA.
As a result, the output, consisting of (reference MSA + ~1.8 million non-reference sequences), has the exactly same length as the reference MSA.

We run this code to get this job done:

## create a list of the sequences to drop:
grep '^>' subset_COI_refSeqs.fasta | sed 's/^>//' > droplist.txt

## filter OUT the sequences to include only those NOT in the droplist.txt file
seqkit grep -v --pattern-file droplist.txt all_COI_refSeqs.fasta > seqsForGiantAlignment.fasta

## massive alignment time!
mafft --auto --addfull seqsForGiantAlignment.fasta --keeplength --thread -1 ref_primer > giant_alignment

This job... also fast! ! Instead of taking 12+ days to complete, it finishes in under an hour

Okay, so the bad news? The primers don't always align exactly where you'd expect them to. When I randomly subset those first 2,000 sequences to create the reference_MSA file, I can use a bunch of different random seeds to generate different sets of 2,000 sequences, and those reference sequences don't always produce a primer_ref alignment file with the primers correctly aligned to where I'd expect them to be. That's the stochastic part I haven't worked out yet.

Nevertheless, the guts of the process are there for users to rapidly resolve these huge alignment issues. Onward and yonward! (we're watching a lot of P.B.S. in the mornings in our household)...

SoilRotifer · August 7, 2020, 6:04pm

Hi @devonorourke,

Sorry for the delayed response. Thanks for sharing the details of your awesome sleuthing process!

About the primers not always landing where they should... this is the key problem I highlighted here, when it comes to using alignment positions for extraction, see the first bullet.

It might be best to make a very small alignment, in which you know the primers will align properly, then start your process of mapping all your other data to this alignment. I've done this before and sometimes with a little manual curation of the primer sequences to the alignment.

Do not try too hard to automate every aspect of this... in fact it is quite common that you will have to manually curate parts of the process. Which is no different from the combined manual and automated curation of other reference databases, e.g. SILVA.

-Mike

devonorourke · August 7, 2020, 10:58pm

New database is built @SoilRotifer @Nicholas_Bokulich !

Now testing it with fit-classifier and evaluate-cross-validate. Interestingly, after all the trimming and dereplicating, there are just under 800k COI sequences left. Compared to that 1.8 million in my first database, trimming to a certain primer region certainly appears to have an impact. More to come soon (I hope)!

One thing I thought you might find interesting - I plotted the distribution of sequence lengths following that coordinate-based trimming. What I found was that the vast majority of my sequences fell within a narrow region of sequence lengths. Note there were about 64k that had zero bases remaining, while about 70% of all sequences (about 1.2 million) contained lengths between 170-184 bp. I ended up trimming at 170 bp, as it seemed like a good balance between allowing for some flexibility in sequence length, while retaining nearly full length COI within that ANML primer region.

Side note: I've updated the shared Google doc to reflect the full workflow used to create this database. I haven't updated the section regarding testing (the fit-classifier, etc.), but the entire sequence of commands used to create this primer-trimmed, dereplicated, super quality filtered database is now done.

system · September 8, 2020, 4:58am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.