Importing BOLD COI sequence data into QIIME 2 / RESCRIPt

Hi @devonorourke,

We do indeed have plans to add multi-threading support to several of the RESCRIPt methods, but we wanted to sync our beta release with the QIIME 2 2020.6 release. Until then, some of these steps will take quite a while to run.

You can carry out the steps in that tutorial in any order you’d like. I think it’d make more sense to trim to your amplicon region of interest first, then proceed to the next steps. So yes, totally sensible!

-Mike

Hello @devonorourke @SoilRotifer ,

I’ve been working through making my own COI classifier for some NGS outputs I’ve generated, and along the way I’ve come across some of the same issues you’re discussing. I’m not sure if any of this will be helpful, but I thought I’d throw in what I’ve done so far to see if it lines up with what you’re working towards.

My seqs are from NY, USA, so I made a 'North America' classifier with just the Canadian and US arthropod seqs exported from BOLD.

I erased all dups from my fasta with:
awk '/^>/{f=!d[$1];d[$1]=1}f' No_Am_seqs.fasta > No_Am_seqs_NoDups.fasta

Then de-gapped my seqs in two steps with:
sed '/^>/! s/-//g' No_Am_seqs_NoDups.fasta > No_Am_seqs_degapped.fasta
which removed gaps from within seqs, and then:
sed '/^>/! s/-$//' No_Am_seqs_degapped.fasta > No_Am_seqs_final.fasta
which removed gaps from the ends of seqs.

All the above steps were done in a bash shell and processed super quick. Everything imported fine into QIIME 2; I trained a naive Bayes classifier on it (~18 hours), and it seems to work like a charm on my seqs and a reference mock community, as far as I can tell.

The one issue I never encountered was non-IUPAC characters other than '-'.

This is the first time I’m coming across RESCRIPt… Seems like it can offer a lot more QC than my pipeline, so I think I’ll start to check this out now too.

Cheers,
Nick


@SoilRotifer @Nicholas_Bokulich
I spent this morning trying to figure out what values might be appropriate for COI length/homopolymer-based filtering.

I was shocked (yet not surprised, if that makes sense) that @SoilRotifer's default homopolymer setting would have been just about perfect. I took the entire raw BOLD COI dataset (not yet dereplicated) and calculated the number of sequences that contained at least one homopolymer of length N (ranging from 5 to 15). A single sequence can contain (and often contains!) multiple homopolymers; I didn't count these separately. They all just get rolled into a binary tally of "yes, there was at least one homopolymer of some A|T|C|G character, of length N", or "nope".
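For anyone who wants to reproduce the tally, something like this grep-based sketch is all it takes (the filename is a placeholder, and it assumes an unwrapped, one-line-per-sequence fasta):

## count sequences containing at least one A/T/C/G homopolymer of length >= 9
grep -v '^>' bold_COI_raw.fasta | grep -c -E 'A{9,}|C{9,}|G{9,}|T{9,}'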

In the plot below, you can see that nearly all of our sequences have at least one 5-, 6-, 7-, or 8-mer homopolymer, but vastly fewer have 9-mers or longer. There's another large drop after 11-mers. Notably, these do not include ambiguous characters or gaps; the only homopolymers counted were runs of A/T/C/G characters.

Curious how you think this information would inform a filtering strategy. Do you think a 9-mer (or longer) threshold is appropriate?

[figure: fraction of sequences containing at least one A/T/C/G homopolymer of length N, for N = 5 to 15]

I also wanted to investigate what kind of length-based filtering parameter would be useful, but I thought that the leading/trailing ambiguous characters might skew any interpretation of sequence length. So I calculated sequence lengths for the raw data, and then again for the data where I removed only the leading/trailing N's (internal N's, like in "AATCGnATC", were retained).
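The trimming itself can be a one-liner; here's a sketch of the idea (GNU sed, unwrapped fasta, placeholder filenames):

## strip leading/trailing N's from sequence lines, leaving internal N's alone
sed -E '/^>/! {s/^[Nn]+//; s/[Nn]+$//}' bold_COI_raw.fasta > bold_COI_Ntrimmed.fasta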

In the figure below, the bulk of the COI records fall into the 500-650 bp range, and the shapes of the N-trimmed and raw untrimmed distributions are similar. @SoilRotifer - I still think it's useful to do this kind of trimming before applying the current method of filtering ambiguous bases, and hopefully that kind of patch could be available within RESCRIPt down the line, even if it's not overly impactful on this COI data.

The dotted vertical line is set at 650: the 500ish spike represents mainly amplicons generated by the Folmer primer set, while the second peak around 650 represents the other set from Hebert/Ivanova and others. BOLD tends to want folks to submit sequences at least 500 bp in length, so this makes sense to me. Curious to see a smattering of longer stuff in there too. This might be from folks submitting the entire full-length COI, or perhaps sequencing before/after/through the gene. It's not clear, because I haven't done any kind of massive alignment of all the data to ensure it's all COI (that's what BOLD is supposed to do!). Makes me think it could also be worth tossing sequences beyond a certain length.

My take would be that filtering at some minimal length is useful, but it's not going to result in a massive loss of sequences. I tabulated how a variety of length thresholds would impact the retention of sequences, and even a 250 bp minimum retains nearly everything. The problem is, I don't know whether those shorter sequences are all specific to a certain COI primer set (maybe all the jellyfish get amplified as 200 bp sequences?). My sense is that tossing out things under 200 or 250 bp is safe. Maybe it's not worth it?

Min_len (bp)    N-trimmed    Untrimmed
50              1.000        1.000
100             1.000        1.000
150             0.999        0.999
200             0.998        0.998
250             0.995        0.995
300             0.992        0.992
350             0.984        0.984
400             0.978        0.978
450             0.963        0.964
500             0.940        0.941
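If anyone wants to reproduce that tabulation, a seqkit/awk sketch along these lines should do it (placeholder filename; swap in whatever threshold you like):

## fraction of sequences retained at a 250 bp minimum-length threshold
seqkit fx2tab --name --length bold_COI_Ntrimmed.fasta | awk -v min=250 '{n++; if ($NF >= min) kept++} END {print kept/n}'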

thanks for your thoughts


This is amazing @devonorourke!

Haha I was close!

I think homopolymers are a weird beast to deal with in COI data. I know that many microfaunal / meiofaunal invertebrates, like rotifers, have homopolymers that are not sequencing artifacts. Much of my (Sanger) sequence data had a legitimate stretch of ~11 Ts in the COI amplicon. This, combined with your plot, strongly suggests to me that the homopolymer length filter should be set to ~12.
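For reference, this is roughly where that filter would plug in once you adjust the default (a hedged sketch; the artifact names are placeholders, and parameter names may differ slightly between RESCRIPt releases):

qiime rescript cull-seqs \
  --i-sequences bold-seqs.qza \
  --p-homopolymer-length 12 \
  --o-clean-sequences bold-seqs-culled.qza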

I completely agree that allowing for additional options to handle leading / trailing IUPAC ambiguity codes would be quite useful. I have some ideas on how to implement this.

Well, there are several ways / orders in which we can go about making classifiers. It might be best to toss out short reads after extracting the amplicon region of interest. But if you want to construct a full-length classifier, it might be worth keeping them.

-Mike


@SoilRotifer - Any concerns about primer-based trimming ending up dropping sequences that were submitted with those primer regions already trimmed off (prior to submission)?

Not sure how likely that will be here. Maybe I need to try it with and without?

Yes! This is a very big concern for me, and it also relates a little to what I’ve mentioned here. We are planning to implement an alignment-based extraction approach. One way to do this is identical to what I’ve done with my prototype SILVA parser; see step 7.

I have a few other ideas up-my-sleeve… :smirk:

-Mike


Hi @SoilRotifer,

A couple of questions about the alignment work:

First, I might be doing something terribly wrong, but it looks like the first thing I need to do with my fasta file is align it. I tried just running a subset of my COI sequences that had gaps removed and homopolymer / ambiguous-N filtering applied (COI_subsetSeqs.fasta), but this failed...

mafft \
> --addfragments FolmerCOI_primers.fasta \
> --mapout COI_subsetSeqs.fasta > FolmerCOI_primers_aln_map.txt
nadd = 2
Not aligned!

I thought maybe this is because my input isn't an alignment file itself, but simply fasta sequences? There are 10,000 sequences in the COI_subsetSeqs.fasta file, in this kind of format:

>10001951
GACCCTATACTTCCTTCTAGGTATTTGATCCGGACTAACCGGCACATCCATAAGACTTTT...{and more ATCG}...
>10004549
AACACTTTATTTTATTTTTGGAATTTGAGCAGGAATAGTAGGATCATCCTTAAGATTACT...{and more ATCG}...

It looks like you were running your examples on a dataset that was previously aligned. I figured I should just try aligning this subset of sequences, so I ran those 10k COI sequences through MAFFT:

mafft --thread 4 --threadtb 5 --threadit 0 --inputorder --anysymbol \
--auto COI_subsetSeqs.fasta > COI_subsetSeqs_mafft.out

and then used that alignment output as the input instead for the same (previously failed) mafft bit earlier:

mafft --addfragments FolmerCOI_primers.fasta \
--mapout COI_subsetSeqs_mafft.out > FolmerCOI_primers_aln_map.txt

And then it worked, but that led me to the next question... Can you please elaborate a bit on how I would go about interpreting the FolmerCOI_primers.fasta.map output from mafft?

In your github post, you write:

You can sanity-check the alignment in your favorite alignment viewer. Anyway, this looks good, let's extract the amplicon region from the original alignment. Note the values I use for the start and end positions of the alignment

It looks like you've defined the positions in the next code block as the interior flanking positions of the primers. What exactly is the sanity check? My output, below, seems to suggest that these two primers span > 1000 bp... that doesn't seem right. I'd expect a ~700 bp fragment:

>Forward_LCO1490_GGTCAACAAATCATAAAGATATTGG
# letter, position in the original sequence, position in the reference alignment
g, 1, 53
g, 2, 54
t, 3, 55
c, 4, 56
a, 5, 57
a, 6, 58
c, 7, 59
a, 8, 60
a, 9, 61
a, 10, 62
t, 11, 63
c, 12, 64
a, 13, 65
t, 14, 66
a, 15, 67
a, 16, 68
a, 17, 69
g, 18, 70
a, 19, 71
t, 20, 72
a, 21, 73
t, 22, 74
t, 23, 75
g, 24, 76
g, 25, 77
>Reverse_HCO2198_TAAACTTCAGGGTGACCAAAAAATCA
# letter, position in the original sequence, position in the reference alignment
t, 1, 1236
a, 2, 1237
a, 3, 1238
a, 4, 1239
c, 5, 1240
t, 6, 1241
t, 7, 1242
c, 8, 1243
a, 9, 1244
g, 10, 1245
g, 11, 1246
g, 12, 1247
t, 13, 1248
g, 14, 1249
a, 15, 1250
c, 16, 1251
c, 17, 1252
a, 18, 1253
a, 19, 1254
a, 20, 1255
a, 21, 1256
a, 22, 1257
a, 23, 1258
t, 24, 1259
c, 25, 1260
a, 26, 1261

Hi @devonorourke

By performing a

sanity-check the alignment in your favorite alignment viewer.

I basically mean that you should visually inspect how well the primers align to your sequences, to make sure they are being mapped correctly. Often this will not be a problem, but I have encountered cases where my primers did not align well and I had to fiddle with alignment parameters or make manual adjustments in an alignment editor. I typically view my alignments with Unipro UGENE.

Also, visualizing the alignment will help explain why the primers appear 1,000 alignment positions apart, i.e. not 1,000 bp apart. The numbers in the right-most column denote alignment column positions, and the alignment will be longer than the sequences because of the indels added to construct it. If you look at the positional values of the 16S V4 primers in my tutorial, you'll see that a single primer sequence spans ~1,800 alignment positions! The SILVA alignment is 50,000 alignment positions wide; lots of indels.

-Mike

Great, thanks @SoilRotifer.

I realized after fighting with the data a bit last night that I was going to get caught needing to circle back and use an aligned dataset from the outset. I had started my COI database workflow by removing gaps and the leading/trailing N's, but it occurred to me as I was running out of coffee (read: brain power) that I couldn’t use the quality-filtered fasta file anymore for this particular work of mapping primers and then coordinate-based filtering.

My lingering question/concern is whether or not I can do this coordinate-based filtering in the first place. While it looks like the raw BOLD sequences are aligned to each other, the sequences with gaps aren’t even all the same length. So they’re probably aligned initially, but then truncated to some region by BOLD. argh.

I think I’ll have to go back to the drawing board and figure out a way to invoke this coordinate-based filtering approach with a smaller dataset (unless I can quickly align 3.5 million sequences…?). I suppose I could try to dereplicate these data first? In my previous BOLD database work, dereplicating reduced about 3 million records to 1 million. Then perhaps I could run MAFFT to create a proper alignment file, and then try this coordinate filtering?

Thanks

Oh... this is bringing back nightmares. :japanese_goblin:

Yes, quite the mystery. :mag_right:

I think using your ~1 million sequence set for an alignment might be worth a shot.

-Mike


Any particular recommendations for tools to tackle such a massive data set? Is MAFFT the best approach? I have no experience on this front.

I think mafft is great for this. Given the number of sequences in your data set you may need to enable the --p-parttree option.
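Within QIIME 2, that would look roughly like this (a sketch; the artifact names are placeholders):

qiime alignment mafft \
  --i-sequences bold-derep-seqs.qza \
  --p-parttree \
  --o-alignment bold-aligned-seqs.qza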

I'll give it a go @SoilRotifer.
I looked into the mafft manual a bit, and it looks like I can create these alignments fastest with:

mafft --retree 1 --maxiterate 0 --nofft --parttree input > output

There are several solutions that MAFFT recommends with big datasets, so if anyone else has a sense of other parameters that would be sensible for aligning 2.3 million sequences (that's how many are in the dereplicated file!), I'd love to hear it.

My concern is that this number of sequences is going to be impossible to fit into a temporary disk. From a MAFFT page about parallelization:

Size of the data stored in the temporary directory is proportional to N², where N is the number of sequences.
It also depends on the similarity. In one case, when N ∼ 80,000, the data size was ∼150 GB. Be careful about disk space and quota.

Maybe parallelization is another option? Trying not to have to run anything on our cluster with MPI... that wasn't fun the last time I tried with genome annotations...


Quick update @SoilRotifer and @Nicholas_Bokulich,
The mafft alignment failed after 12 days. It's not clear what to make of the error message (see below). It looks like there was sufficient memory to finish all the alignment tasks, but something failed when it went to create a temporary file… I posted the same log message to the mafft developers online, so hopefully they’ll have some hints.

Any other thoughts on how to possibly build up an alignment by subsampling? I wondered if maybe I should cluster my sequences at something like 97% (which would reduce them to probably about 200k sequences instead of 1800k) and align just those sequences first. Then take the remaining ~1600k unaligned sequences that aren’t cluster centroids and align them to the aligned dataset. I think this might work with mafft --add?
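To sketch that two-stage idea in commands (vsearch for the clustering step is just my assumption, and all filenames are placeholders):

## cluster at 97% identity and keep only the centroid sequences
vsearch --cluster_fast derep_COI.fasta --id 0.97 --centroids centroids.fasta
## align the much smaller centroid set
mafft --auto --thread -1 centroids.fasta > centroid_MSA
## fold the remaining (non-centroid) sequences into that alignment
mafft --add noncentroid_seqs.fasta --thread -1 centroid_MSA > full_MSA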

Thanks for any tips or tricks you can think of!

m Alignment 1818033-890
Done.



----------------------------------------------------------------------------

nseq = 1818923
groupsize = 1818924, partsize=50

----------------------------------------------------------------------------
splittbfast (nuc) Version 7.471
alg=M, model=DNA200 (2), 1.53 (4.59), -0.00 (-0.00), noshift, amax=0.0
1 thread(s)

mv: cannot stat `hat3.seed': No such file or directory

Strategy:
 FFT-NS-DPPartTree-1 (Not tested.)
 ?

If unsure which option to use, try 'mafft --auto input > output'.
For more information, see 'mafft --help', 'mafft --man' and the mafft page.

The default gap scoring scheme has been changed in version 7.110 (2013 Oct).
It tends to insert more gaps into gap-rich regions than previous versions.
To disable this change, add the --leavegappyregion option.

/scratch/dro49/conda/envs/rescript/bin/mafft: line 2687: /tmp/dro49/31595547/mafft.o4dmpecEvW/pre: No such file or directory  

Hi Devon,

Based on:

mv: cannot stat `hat3.seed': No such file or directory

Are you running this through another script? mv (move) is a unix command, and it is apparently unable to move files.

There are a couple things you can try:

  • If you want to stay within QIIME 2, then I think your idea of using mafft-add is a good start. However, I'd start by making your first alignment from a specific clade.
    • Then add new sequences to that alignment, clade by clade.
    • This will usually help make the process quicker.
  • External to QIIME 2, you can run mafft (or another tool) manually to merge separate alignments together (sometimes called profile-profile alignment); see the sketch below.
    • Again, make separate alignments by clade, then merge them.
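A rough sketch of that second route (mafft-profile ships with the mafft package; filenames are placeholders):

## align each clade separately
mafft --auto cladeA.fasta > cladeA_MSA
mafft --auto cladeB.fasta > cladeB_MSA
## merge the two finished alignments group-to-group
mafft-profile cladeA_MSA cladeB_MSA > merged_MSA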

There are probably other ways, but those are two off the top of my head. :slight_smile:

-Mike


Hey @SoilRotifer,
What do you think about performing the whole process on just clade-specific groups?
So, rather than creating a gigantic alignment file of all animal COI from BOLD, then aligning primers to find the coordinate boundaries, then removing the gaps…

we just perform clade-specific alignments, align the primer set to those clade-specific alignments, and degap those alignments… then merge all the clades?

That should in principle be faster, right (parallelization notwithstanding)? Wouldn’t aligning within clades be quicker?


I love this idea! If the goal is just to find primer binding sites based on alignment, a massive global MSA is not necessary.

The only shortcoming I see is if all sequences in a clade fail to match the primer because they all start downstream of the primer site (maybe that primer was used to amplify all of those sequences, but was trimmed off prior to uploading to BOLD? a distant possibility)

Another possibility: since the issue here is that extract-reads relies on a primer being inside a sequence to trim and keep it, what if you use that method as a first pass, then align all failures (by clade) with a representative of the clade that retains the primer sequence (pre extract-reads)? This might help you pass a significant number of sequences before resorting to alignment.
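For the first pass, something like this (a sketch; the artifact names are placeholders, with the Folmer pair from earlier in the thread standing in for whichever primers you target):

qiime feature-classifier extract-reads \
  --i-sequences bold-derep-seqs.qza \
  --p-f-primer GGTCAACAAATCATAAAGATATTGG \
  --p-r-primer TAAACTTCAGGGTGACCAAAAAATCA \
  --o-reads folmer-extracted-seqs.qza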


That is way, way smarter.
I’ll give that a go and report back


@devonorourke, I second everything that @Nicholas_Bokulich outlined. Great idea!

-Mike


It turns out the alignment process can be dramatically simplified without invoking any clade-specific trickery. However, my initial foray into this new strategy has an element of stochasticity to it that I haven't worked out yet. All of these ideas come straight from the mafft developer, and we're working on identifying how to get rid of the randomness (of the success or failure) of this technique. Here's the general approach:

  1. Gather a subset of sequences from the giant reference database, then build a tiny alignment using just that subset. Something like this should work:
## get just 2,000 sequences from the original fasta file:
seqkit sample --rand-seed 101 --number 2000 --two-pass all_COI_refSeqs.fasta -w 0 > subset_COI_refSeqs.fasta

## make the alignment using just those 2,000 sequences
mafft --auto --thread -1 subset_COI_refSeqs.fasta > reference_MSA

That job takes under a minute. :horse_racing: !

  2. The next step involves adding the primer sequences to that small reference_MSA alignment file. The anml_primers.fasta file contains the forward primer sequence and the reverse complement of the reverse primer sequence (in fasta format; see the example after this step). We add those two primer sequences to the original reference_MSA alignment file and create a new alignment file with the primers included. Why? To make sure our primers map to the reference_MSA file in the right positions. Like this:
mafft --auto --addfragments anml_primers.fasta --keeplength --thread -1 --mapout --reorder reference_MSA > ref_primer

This step also takes only a minute to complete. :racing_car: !
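In case the shape of that primers file isn't obvious, here's an illustration of what anml_primers.fasta should look like. The sequences below are just the Folmer pair quoted earlier in the thread (with HCO2198 reverse-complemented), standing in for the actual ANML primers:

>Forward_primer
GGTCAACAAATCATAAAGATATTGG
>Reverse_primer_revcomp
TGATTTTTTGGTCACCCTGAAGTTTA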

  3. Now the heavy lifting. We add all the remaining sequences to that ref_primer alignment file. Adding the --keeplength flag is what makes this whole process work vastly faster. From the developer directly:

The "--keeplength" option is a key. It inserts gaps in an asymmetric manner:

  • If an insertion is in the reference MSA, then the corresponding position in a non-reference sequence is normally gapped, as in a standard alignment calculation.
  • If an insertion is in a non-reference sequence, then the insertion is just removed. No gap is inserted into the reference MSA.
  • As a result, the output, consisting of (reference MSA + ~1.8 million non-reference sequences), has exactly the same length as the reference MSA.

We run this code to get this job done:

## create a list of the sequences to drop:
grep '^>' subset_COI_refSeqs.fasta | sed 's/^>//' > droplist.txt

## filter OUT the sequences to include only those NOT in the droplist.txt file
seqkit grep -v --pattern-file droplist.txt all_COI_refSeqs.fasta > seqsForGiantAlignment.fasta

## massive alignment time!
mafft --auto --addfull seqsForGiantAlignment.fasta --keeplength --thread -1 ref_primer > giant_alignment

This job... also fast! :racing_car: :horse_racing: :boom: ! Instead of taking 12+ days to complete, it finishes in under an hour :exploding_head:


Okay, so the bad news? The primers don't always align exactly where you'd expect them to. When I use different random seeds to pick different sets of 2,000 sequences for the reference_MSA file, those reference sets don't always produce a ref_primer alignment with the primers correctly placed where I'd expect them to be. That's the stochastic part I haven't worked out yet.
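A quick way to screen for the bad runs without opening an alignment viewer (a sketch, not what I actually ran) is to pull each primer's start coordinate straight from the --mapout file (mafft names it <input>.map, so anml_primers.fasta.map here):

## report the alignment column where each primer's first base landed
awk -F', ' '/^>/ {name=$0} $2 == 1 {print name, "-> alignment column", $3}' anml_primers.fasta.map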

Nevertheless, the guts of the process are there for users to rapidly resolve these huge alignment issues. Onward and yonward! (we're watching a lot of P.B.S. in the mornings in our household)...
