@SoilRotifer @Nicholas_Bokulich
I spent this morning trying to figure out what values might be appropriate for COI length/homopolymer-based filtering.
I was shocked (yet not surprised, if that makes sense) that @SoilRotifer's default homopolymer setting would have been just about perfect. I took the entire raw BOLD COI dataset (not yet dereplicated) and calculated the number of sequences that contained at least one homopolymer of length N, for N ranging from 5 to 15. A single sequence can (and often does!) contain multiple homopolymers; I didn't count these separately. Each sequence just gets a binary tally: "yes, there was at least one homopolymer of some A|T|C|G character of length N", or "nope".
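For anyone who wants to reproduce this tally, here's a minimal sketch of the kind of counting I mean. The function names are mine, not RESCRIPt's, and I'm counting a sequence for length N if it has a run of *at least* N identical bases (only A/C/G/T; ambiguous characters and gaps are never matched):

```python
import re

def has_homopolymer(seq, n):
    """True if seq contains a run of >= n identical A/C/G/T bases.
    Ambiguous characters (N, etc.) and gaps are never part of a run."""
    pattern = r"([ACGT])\1{" + str(n - 1) + r"}"
    return re.search(pattern, seq.upper()) is not None

def tally_homopolymers(seqs, lengths=range(5, 16)):
    """For each homopolymer length N, count how many sequences contain
    at least one run of length >= N (binary per sequence, not per run)."""
    return {n: sum(has_homopolymer(s, n) for s in seqs) for n in lengths}
```

For example, `tally_homopolymers(["AAAAA", "TTTTTTTTT"], lengths=[5, 9])` counts both sequences at N=5 (a 9-mer run also satisfies ">= 5") but only the second at N=9.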
In the plot below, you can see that nearly all of our sequences have at least one 5,6,7, or 8-mer homopolymer, but vastly fewer have 9+mers. There's another large drop after 11mers. Notably, these do not include ambiguous characters or gaps. The only homopolymers counted were for ATCG characters.
Curious how you think this information would inform a filtering strategy. Would you think the 9-mer (or more) threshold is appropriate?
I also wanted to investigate what kind of length-based filtering parameter would be useful, but I thought that having leading/trailing ambiguous characters might skew any interpretation of sequence length. So, I calculated sequence lengths for the raw data, and then again for the data where I removed only the leading/trailing N's (internal N's, like in "AATCGnATC", were retained).
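The trimming step itself is trivial; a sketch of what I mean (the function name is just illustrative, this isn't a RESCRIPt API):

```python
def trim_terminal_ns(seq):
    """Strip leading and trailing N's (ambiguous bases) from a sequence.
    Internal N's are deliberately retained."""
    return seq.strip("Nn")
```

So `trim_terminal_ns("NNAATCGNATCNN")` drops the terminal N's but keeps the internal one, and the length used for filtering then reflects only the informative bases at the ends.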
In the figure below, the bulk of the COI records fall into the 500-650 bp range. You can see that the shapes for both the N-trimmed data and the raw untrimmed sequences are similar. @SoilRotifer - I still think it's useful to do this kind of trimming first, before applying the current method of filtering ambiguous bases, and hopefully that kind of patch could be available within RESCRIPt down the line, even if it's not overly impactful on this COI data.
The dotted vertical line is set at 650: the 500ish spike represents mainly the amplicons generated by the Folmer primer set, while the second peak around 650ish represents the other set from Hebert/Ivanova and others. BOLD tends to want folks to submit sequences at least 500 bp in length, so this makes sense to me. Curious to see that there was a smattering of longer stuff in there too. This might be from folks submitting either the entire full-length COI, or perhaps sequencing before/after/through the gene. It's not clear, because I haven't done any kind of massive alignment of all the data to ensure it's all COI (that's what BOLD is supposed to do!). Makes me think that it could also be worth tossing sequences beyond a certain length.
My take would be that filtering some minimal length is useful, but it's not going to result in a massive loss of sequences. I tabulated how a variety of length thresholds would impact the retention of sequences, and it looks like most things under even 250 bp are retained. The problem is, I don't know whether or not those shorter sequences are all specific to a certain COI primer set (maybe all the jellyfish get amplified with 200 bp sequences?). My sense is that tossing out things under 200 or 250bp is safe. Maybe it's not worth it?
| Min_len (bp) | N-trimmed | untrimmed |
|---|---|---|
| 50 | 1 | 1 |
| 100 | 1.00 | 1.00 |
| 150 | 0.999 | 0.999 |
| 200 | 0.998 | 0.998 |
| 250 | 0.995 | 0.995 |
| 300 | 0.992 | 0.992 |
| 350 | 0.984 | 0.984 |
| 400 | 0.978 | 0.978 |
| 450 | 0.963 | 0.964 |
| 500 | 0.940 | 0.941 |
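The retention fractions above come from a simple tabulation over the observed sequence lengths; a sketch of the calculation (function name is mine, not from RESCRIPt):

```python
def retention_by_min_length(lengths, thresholds=(50, 100, 150, 200, 250,
                                                 300, 350, 400, 450, 500)):
    """For each minimum-length threshold, return the fraction of
    sequences that would be retained (length >= threshold)."""
    total = len(lengths)
    return {t: sum(l >= t for l in lengths) / total for t in thresholds}
```

Running this once on the N-trimmed lengths and once on the untrimmed lengths gives the two columns of the table.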
Thanks for your thoughts!