Importing BOLD COI sequence data into QIIME 2 / RESCRIPt

@Nicholas_Bokulich @SoilRotifer,
I have the raw BOLD COI data in hand now, and in following this tutorial I’m finding that a couple of things about manual import differ from what’s done with the Silva dataset.

In this tutorial, you mention the user would import a fasta file into a .qza format. There are two hurdles causing the COI data to be rejected by qiime tools import:

  1. These raw data contain gaps, thus I get this error message:
There was a problem importing fake.fasta:

  fake.fasta is not a(n) DNAFASTAFormat file:

  Invalid character '-' at position 544 on line 4 (does not match IUPAC characters for a DNA sequence).
  2. These raw data contain a few (42 out of over 3 million) instances of exact duplicate sequence/taxonomic records. No idea how they got generated, but I can confirm the taxonomic strings and associated sequences are identical. I can’t import duplicate entries either:
There was a problem importing fake.fasta:

  fake.fasta is not a(n) DNAFASTAFormat file:

  ID on line 7 is a duplicate of another ID on line 1.

Any recommendations on how to instruct users what to do about these kinds of data? Should I just remove the gaps manually? I could also remove the duplicate records manually too prior to importing into QIIME.

Thanks


Hi @devonorourke, let's see if we can get you sorted. :slight_smile:

That error suggests to me that you are importing an alignment, which you can do as outlined here. Note the --type 'FeatureData[AlignedSequence]'. Then run rescript degap-seqs to remove the gaps.
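If it helps to sanity-check the degapping outside of QIIME 2 first, the operation amounts to stripping gap characters from non-header lines. A minimal Python sketch (function name and example records are made up for illustration; this is not RESCRIPt's implementation):

```python
def degap_fasta(lines):
    """Remove alignment gap characters ('-' and '.') from sequence
    lines, leaving FASTA header lines untouched."""
    out = []
    for line in lines:
        if line.startswith(">"):
            out.append(line)
        else:
            out.append(line.replace("-", "").replace(".", ""))
    return out

records = [">seq1", "ACGT--ACG.T", ">seq2", "---ACGT"]
print(degap_fasta(records))  # ['>seq1', 'ACGTACGT', '>seq2', 'ACGT']
```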

Considering that there are only 42 duplicates I'd just remove those manually prior to importing. But it is odd that they exist at all.
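Since the duplicates are exact copies (same ID, same sequence, same taxonomy), keeping only the first record per ID is safe. One way to sketch that in Python (assuming a simple FASTA where the ID is the first token of the header):

```python
def drop_duplicate_records(fasta_text):
    """Keep only the first record for each FASTA ID; drop repeats."""
    seen = set()
    keep, skip = [], False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            rec_id = line[1:].split()[0]
            skip = rec_id in seen
            seen.add(rec_id)
        if not skip:
            keep.append(line)
    return "\n".join(keep)

fasta = ">a\nACGT\n>b\nGGGG\n>a\nACGT"
print(drop_duplicate_records(fasta))  # >a / ACGT / >b / GGGG
```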

-Mike


Thanks for the info Mike, much appreciated.
I’ve resolved the duplicate records and will import the aligned sequence data.

One other bit of info that isn’t pressing, but is something I’ve been wondering about. In addition to the sequence and taxonomy information I pulled from BOLD, I also have information about:

  • associated GenBank ID (when applicable)
  • BOLD BIN number
  • Country origin of sample
  • Institution submitting sample

I was curious how this might be leveraged in the filtering pipeline in a couple of ways:

  1. Users might be interested in discarding/retaining sequences from specific geographic locations. Maybe you want to keep only those COI records from Latvia and Estonia, or maybe you want just the COI data when a Country is known and discard all others…
  2. There is a string in the Institution submitting sample field that mentions if the particular sequence was imported from GenBank. I’d like to be able to count how many of these entries there are. I can do this in R, but wondered if there would be a way to summarize these data once the objects are in a QIIME environment. Something similar to this in R:
library(readr)
library(dplyr)
metadata_df <- read_csv(file="bold_metadata.csv")
metadata_df %>% 
    group_by(Institution) %>%
    tally()
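For comparison, the same tally can be done with just the Python standard library (the `Institution` column name and the demo CSV below are stand-ins, not the real BOLD metadata):

```python
import csv
import io
from collections import Counter

def tally_column(csv_text, column):
    """Count occurrences of each value in one column of a CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)

# toy stand-in for bold_metadata.csv
demo = "id,Institution\n1,Mined from GenBank\n2,Univ A\n3,Mined from GenBank\n"
print(tally_column(demo, "Institution"))
```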

That's a great idea! You should be able to filter the sequences based on metadata with qiime feature-table filter-seqs.

Alas not in QIIME 2... you could use qiime metadata tabulate to create a searchable QZV from the metadata, but not summarize metadata information. Using R or python to manipulate and summarize metadata is your best bet right now.
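Outside of QIIME 2, a country-based filter is a one-liner over the metadata. A sketch (the `id`/`Country` column names and demo rows are hypothetical): select the IDs to keep, then pass them as metadata to qiime feature-table filter-seqs.

```python
import csv
import io

def ids_from_countries(csv_text, countries):
    """Return the set of record IDs whose Country field is in `countries`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["id"] for row in reader if row["Country"] in countries}

demo = "id,Country\ns1,Latvia\ns2,Estonia\ns3,\n"
print(ids_from_countries(demo, {"Latvia", "Estonia"}))  # {'s1', 's2'}
```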


@Nicholas_Bokulich @SoilRotifer

A question about applying the qiime rescript dereplicate option…

In my experience with Arthropod COI data from BOLD, I learned that it was worthwhile to remove sequences that contained little taxonomic information prior to conducting the dereplication step. The thinking behind this was to avoid a situation where I’d have two identical sequences with these associated taxonomy labels:

seq1    k__:animal,p__:chordate,c__:mammal,o__:bat,f__:tiny,g__:big,s__:big_brown
seq2    k__:animal,p__:chordate,c__:mammal,o__:bat,f__:,g__:,s__:

If I used an LCA approach, then the taxonomy would collapse to just include the order, and lose the Family/Genus/Species information.
This gets into that touchy territory of potentially over-classifying something, but I feel like drawing the line at the Family level is sensible for these kinds of data. What do you think?

My sense was I should drop any records that lack at least Family-rank information, then conduct the dereplication via LCA mode, and retain what’s left. I wanted to make sure that isn’t already a feature somewhere in RESCRIPt, and whether it’s something I’d need to do to my dataset prior to importing into QIIME as a .qza file.

Hey @devonorourke,

Good news. The wonderful @SoilRotifer just added a new mode to dereplicate… for lack of a better term, this is called super mode (meaning superstring majority LCA dereplication). We had already implemented this in the merge_taxa method and now it’s an option for dereplicate too.

This mode does exactly what you are after. It performs LCA, but collapses substrings (like the empty rank handles) into superstrings that contain those substrings. AND it looks for majority classifications first, so that LCA is only performed when there is a tie.
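As a toy illustration of that idea (emphatically not RESCRIPt's actual code; rank handling is simplified and ties between equally long superstrings are broken arbitrarily):

```python
from collections import Counter

def strip_empty(ranks):
    """Drop empty trailing rank handles (e.g. 'g__' parsed to '')."""
    ranks = list(ranks)
    while ranks and not ranks[-1]:
        ranks.pop()
    return tuple(ranks)

def super_lca(taxa):
    """Superstring-majority LCA over the taxonomies of one
    dereplicated sequence: a truncated label votes for the longest
    label that begins with it; majority wins; on a tie, fall back
    to the longest common prefix (plain LCA)."""
    labels = [strip_empty(t) for t in taxa]
    uniq = set(labels)
    counts = Counter()
    for lab in labels:
        supers = [o for o in uniq if o[:len(lab)] == lab]
        counts[max(supers, key=len)] += 1
    ranked = counts.most_common()
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]
    tied = [lab for lab, n in ranked if n == ranked[0][1]]
    lca = []
    for level in zip(*tied):
        if len(set(level)) > 1:
            break
        lca.append(level[0])
    return tuple(lca)

full = ["animal", "chordate", "mammal", "bat", "tiny", "big", "big_brown"]
partial = ["animal", "chordate", "mammal", "bat", "", "", ""]
print(super_lca([full, partial]))  # keeps the full seven-rank label
```

With plain LCA the pair above would collapse to the order; here the sparsely annotated record votes for its superstring instead.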

The other option is to filter out sequences that are missing annotations.


Okay guys, another morning, another thing to highlight with processing sequence data!

When I go to run qiime rescript degap-seqs, I hit an error that I forgot about (and resolved a few years ago when doing the Tidybug thing): there are a very few instances in BOLD data where the sequence contains a non-IUPAC character (I), and this breaks the program because it’s an invalid character.

Minor request - it would be great if the particular offending sequence would be flagged in the error message in the .log file.

I was wondering what strategy would make sense here.

  1. Allow the character, but convert it to an N
  2. Delete the sequence

For what it’s worth, there was just a single record that contained this weird character when I did this a few years ago, so I don’t think it’s entirely pervasive, but who knows what kind of strange crap would come up as others use tools to acquire data and import it into QIIME, then try to follow along with these methods.

I think maybe it makes sense to have another tool outside of qiime rescript degap-seqs that would catch these instances. My sense is that qiime tools import should be able to catch these non-IUPAC characters even when importing a feature type of alignment data. In an ideal world, the user would import their data with qiime tools import, and any sequence containing non-IUPAC characters would be sent to a “not imported” folder so that the user can inspect it and better understand what strangeness is going on.
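That good/bad partitioning could be sketched like so (the allowed-character set assumes IUPAC DNA codes plus '-' and '.', since the input is an alignment; record format is simplified to (id, sequence) pairs):

```python
import re

# IUPAC DNA codes plus alignment gap characters ('-' and '.')
VALID = re.compile(r"^[ACGTRYSWKMBDHVN.\-]+$", re.IGNORECASE)

def partition_records(records):
    """Split (id, seq) pairs into IUPAC-clean and offending records."""
    good, bad = [], []
    for rec in records:
        (good if VALID.match(rec[1]) else bad).append(rec)
    return good, bad

recs = [("seq1", "ACGT-N"), ("seq2", "ACIIGT"), ("seq3", "acgtrysw")]
good, bad = partition_records(recs)
print([r[0] for r in good], [r[0] for r in bad])  # ['seq1', 'seq3'] ['seq2']
```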

I’m sorry to keep throwing up roadblocks. Thanks

Thanks @devonorourke,

Yeah I recall our last encounter with this. It still baffles me that a sequence repository would not follow standard IUPAC rules for sequence deposition. This issue crops up from time to time with others too.

It might be worth considering replacing Is with Ns, with a replace_non_iupac_with_Ns method. In the short term, users can try bioawk:

bioawk -c fastx '{gsub("I", "N", $seq); print ">" $name; print $seq}' < infile.fasta > outfile.fasta

-Mike

Thanks @SoilRotifer,
Unfortunately the weirdness may run deeper than a single known character, so I’d recommend partitioning sequences into good/bad IUPAC directories and letting the user figure out the problem thereafter, maybe?
In any event, I’ll proceed in making this BOLD database today without any QIIME tools addressing the odd characters, but I think it’s a nice ‘heads up’ thing if we want a tool that can be used from a variety of different input sources (BOLD is pretty good, but who knows what other users’ sources might be?)

Thanks


qiime tools import just validates the first few lines for most types, in the interest of time. Fully validating some data types can be quite expensive! So it is up to the user to fully validate these types. Please run qiime tools validate and let us know what you find; it should catch the I characters and give you a more informative error message.

fixing the file before import is probably the way to go... the issue with fixing after importing is that users could still run into type errors. So I recommend just making bioawk (or whatever your tool of choice is for cleaning the sequences) another step in your BOLD workflow, at least for now!


Yep, will do.
I’m trying to do everything with common bash utilities so that users won’t have to install anything. I think my ugly fix is working:

## getting the line numbers of the bad sequences
grep -v '^>' aligned-dna-sequences.fasta | grep -n '[^THADGRC.SMBNWVKY-]' > badseqs.txt

## reformatting these line numbers to delete (assumes two lines per record)
for val in $(cut -f 1 -d ':' badseqs.txt); do
    myvar1=$(($val*2))
    myvar2=$(($myvar1-1))
    echo "${myvar2}d"
    echo "${myvar1}d"
done | tr '\n' ';' > droplines.txt

## deleting these lines from the original fasta
myval=$(cat droplines.txt)
sed -e "$myval" aligned-dna-sequences.fasta > cleaned_aligned-dna-sequences.fasta

I'd recommend awk to do a cleaner job... this will convert "I"s to "N"s:

awk '/^>/ {print($0)}; /^[^>]/ {gsub("I","N"); print}' seqswithi.fasta | tr -d ' ' > seqswithouti.fasta

Sure - definitely easier to substitute when you know what you want to replace.

The problem I’m trying to point out is that you don’t always know if it’s going to be an “I” or something totally bizarre. For instance, it might be a string of integers instead of alphabetic characters. We don’t expect weird crap, but given that we know which characters are good to keep, I wanted a method that was a bit more conservative and just removes any sequence with non-IUPAC characters. There were instances with those “I” characters, and a few dozen others where something weird was going on with the R package downloading strings and then parsing some of those fields. In any event, it’s all good now.

Sounds like this is more of a problem with the tools being used rather than issues with the data itself. I do not think we want to get into the business of cleaning up the output from other tools. That'd be a nightmare. :japanese_ogre:

Yay!


@SoilRotifer - any idea how long it takes for degap-seqs to run? The .py script (in my limited understanding of Python) didn’t seem to have any option for multithreading. Is it likely that it’ll take a few hours for a job of 3 million sequences to wrap up?


The thing I wonder about with trimming ambiguous/homopolymer stretches: does it make sense to do that before or after you trim to the amplicon region you’re after?

In the tutorial, I could specify that I’m building a database for a specific primer set, but instruct users how to substitute my primer sequences with their own. If I trim by those regions first, I’m guessing it’ll save a lot of time/memory when working through the remaining N’s and homopolymer regions. Seem sensible?

Hi @devonorourke,

We do indeed have plans to add multi-threading support to several of the RESCRIPt methods. But we wanted to sync our beta release with the QIIME 2020.6 release. It will take quite a while to run.

You can carry out the order of operations in that tutorial any way you’d like. I think it’d make more sense to trim to your amplicon region of interest first, then proceed to the next steps. So yes, totally sensible!

-Mike

Hello @devonorourke @SoilRotifer ,

I’ve been running through making my own COI classifier for some NGS outputs I’ve generated, and as I’ve gone through the process I’ve come across some of the same issues you’re talking about. I’m not sure if any of this will be helpful or not, but I thought I’d throw in what I’ve done so far to see whether it lines up with what you’re working towards.

My seqs are from NY, USA, so I made a ‘North America’ classifier with just Canada and US Arthropod seqs exported from BOLD.

I erased all dups from my fasta with:
awk '/^>/{f=!d[$1];d[$1]=1}f' No_Am_seqs.fasta > No_Am_seqs_NoDups.fasta

Then de-gapped my seqs in two steps with:
sed '/^>/! s/-//g' NoAm_seqs_NoDups.fasta > NoAm_seqs_degapped.fasta
which removed gaps from within seqs, and then:
sed '/^>/! s/-*$//' NoAm_seqs_degapped.fasta > NoAm_seqs_final.fasta
which removed gaps from the ends of seqs.

All the above steps were done in a bash shell, and processed super quick. Everything imported fine into QIIME 2; I trained a naive Bayes classifier on it (~18 hours), and it seems to work like a charm on my seqs and a reference mock community as far as I can tell.

One issue I never encountered was non-IUPAC characters other than ‘-’.

This is the first time I’m coming across RESCRIPt… Seems like it can offer a lot more QC than my pipeline, so I think I’ll start to check this out now too.

Cheers,
Nick


@SoilRotifer @Nicholas_Bokulich
I spent this morning trying to figure out what values might be appropriate for COI length/homopolymer-based filtering.

I was shocked (yet not surprised, if that makes sense) that @SoilRotifer's default homopolymer setting would have been just about perfect. I took the entire raw BOLD COI dataset (this isn't dereplicated yet) and calculated the number of sequences that contained at least one homopolymer of length N (ranging from 5 to 15). Even a single sequence can (and often does!) contain multiple homopolymers - I didn't count these separately - they all just get rolled into a binary tally of "yes, there was at least one homopolymer of some A|T|C|G character, of length N", or "nope".

In the plot below, you can see that nearly all of our sequences have at least one 5,6,7, or 8-mer homopolymer, but vastly fewer have 9+mers. There's another large drop after 11mers. Notably, these do not include ambiguous characters or gaps. The only homopolymers counted were for ATCG characters.

Curious how you think this information would inform a filtering strategy. Would you think the 9-mer (or more) threshold is appropriate?
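The counting step boils down to finding the longest single-base run per sequence. A minimal sketch of that (not the script I actually ran; toy sequences for illustration):

```python
import re

def max_homopolymer(seq):
    """Length of the longest single-base (A/C/G/T) run in a sequence;
    ambiguous characters and gaps are excluded from runs."""
    runs = re.findall(r"A+|C+|G+|T+", seq.upper())
    return max((len(r) for r in runs), default=0)

seqs = {"s1": "A" * 9 + "CGT", "s2": "ACGTACGTNNNN"}
hits = {sid: max_homopolymer(s) for sid, s in seqs.items()}
print(hits)  # {'s1': 9, 's2': 1}
```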

[figure: proportion of sequences containing at least one A/T/C/G homopolymer of length N, for N from 5 to 15]

I also wanted to investigate what kind of length-based filtering parameter was useful, but I thought that maybe having the leading/trailing ambiguous characters might skew any interpretation of sequence length. So, I calculated sequence lengths for the raw data, and then again for the data where I removed only the leading/trailing N's (internal N's like in "AATCGnATC" were retained).

In the figure below, the bulk of the COI records fall into the 500-650 bp range. You can see that the shapes for the N-trimmed data and the raw untrimmed sequences are similar. @SoilRotifer - I still think it's useful to do this kind of trimming first, before applying the current method of filtering ambiguous bases, and hopefully that kind of patch could be available within RESCRIPt down the line, even if it's not overly impactful on this COI data.

The dotted vertical line is set at 650: that 500ish spike represents mainly the amplicons generated by the Folmer primer set, while the second peak around 650ish represents the other set from Hebert/Ivanova and others. BOLD tends to want folks submitting sequences at least 500 bp in length, so this makes sense to me. Curious to see that there was a smattering of longer stuff in there too. This might be from folks submitting either the entire full-length COI, or perhaps sequencing before/after/through the gene. It's not clear, because I haven't done any kind of massive alignment of all the data to ensure it's all COI (that was what BOLD is supposed to do!). Makes me think that it could also be worth tossing sequences beyond a certain length.

My take would be that filtering some minimal length is useful, but it's not going to result in a massive loss of sequences. I tabulated how a variety of length thresholds would impact the retention of sequences, and it looks like most things under even 250 bp are retained. The problem is, I don't know whether or not those shorter sequences are all specific to a certain COI primer set (maybe all the jellyfish get amplified with 200 bp sequences?). My sense is that tossing out things under 200 or 250bp is safe. Maybe it's not worth it?

Min_len(bp)  N-trimmed  untrimmed
50           1.00       1.00
100          1.00       1.00
150          0.999      0.999
200          0.998      0.998
250          0.995      0.995
300          0.992      0.992
350          0.984      0.984
400          0.978      0.978
450          0.963      0.964
500          0.940      0.941
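The retention fractions in a table like the one above can be computed with something along these lines (the lengths here are toy values, not the real BOLD distribution):

```python
def retention(lengths, thresholds):
    """Fraction of sequences at or above each minimum-length threshold."""
    n = len(lengths)
    return {t: sum(l >= t for l in lengths) / n for t in thresholds}

# toy lengths standing in for the real BOLD length distribution
lengths = [120, 480, 510, 640, 650, 658, 700, 950]
print(retention(lengths, [100, 250, 500]))  # {100: 1.0, 250: 0.875, 500: 0.75}
```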

thanks for your thoughts


This is amazing @devonorourke!

Haha I was close!

I think homopolymers are a weird beast to deal with for COI data. I know that many microfaunal/meiofaunal invertebrates, like rotifers, have homopolymers that are not sequencing artifacts. Much of my (Sanger) sequence data had a legitimate stretch of ~11 Ts in my COI amplicon. This, combined with your plot, strongly suggests to me that the homopolymer length filter should be set to ~12.

I completely agree that allowing for additional options to handle leading / trailing IUPAC ambiguity codes would be quite useful. I have some ideas on how to implement this.

Well, there are several ways / orders in which we can go about making classifiers. It might be best to toss out short reads after extracting the amplicon region of interest. But if you want to construct a full-length classifier, it might be worth keeping them.

-Mike


@SoilRotifer - Any concerns about primer-based trimming ending up dropping sequences whose primer regions were already trimmed prior to submission?

Not sure how likely that will be here. Maybe I need to try it with and without?