Add support for `vsearch --derep_prefix`

colinbrislawn · April 25, 2018, 4:43pm

Good morning,

How should I implement the vsearch --derep_prefix command (#24) so that it is stylistically consistent with QIIME 2?

Should this command be a new method in the vsearch plugin, or an optional flag on dereplicate-sequences?

New method: qiime vsearch dereplicate-sequences-prefix
- ~~Maybe change other method to qiime vsearch dereplicate-sequences-full-length?~~ Or we keep it to signify full-length dereplication as the primary method and prefix as secondary method.
New method: qiime vsearch cluster-features-prefix
- Prefix-dereplication merges non-identical sequences, so maybe we reframe it as a clustering method.
Existing method:
- New option --prefix
- Dereplicate a shorter sequence if it is an exact prefix of a longer sequence.

For sequences that cluster by derep_prefix, should we report the sha1 hash? Currently:

Feature identifiers in the resulting artifacts
will be the sha1 hash of the sequence defining each feature.

I think we need to report sha1 to be compatible with vsearch clustering, but with --derep_prefix, a feature is going to be "sequenceS of decreasing length", and these wouldn't have the same hash. This relates to the possible name cluster-features-prefix.
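To make the hashing concern concrete, here is a minimal sketch using Python's standard `hashlib`. The helper name `feature_id`, the ASCII encoding, and the two example sequences are assumptions for illustration only; this is not the plugin's actual code.

```python
import hashlib


def feature_id(sequence: str) -> str:
    """Return the sha1 hex digest of a sequence (hypothetical helper)."""
    # Assumption: the sequence is hashed as plain ASCII text.
    return hashlib.sha1(sequence.encode("ascii")).hexdigest()


full = "AAACCC"   # made-up longer sequence
prefix = "AAACC"  # made-up exact prefix of the sequence above

# A sequence and its prefix are different strings, so their digests differ.
print(feature_id(full) == feature_id(prefix))  # False
```

So if a prefix is merged into a longer sequence, only one of the two hashes can serve as the feature identifier.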

Thanks for your thoughts,
Colin

thermokarst · April 25, 2018, 4:58pm

Thanks @colinbrislawn!

Reading through @gregcaporaso's original issue text, he indicates that this should be a new option (parameter in QIIME 2-speak) on the method dereplicate-sequences.

Let's see what @gregcaporaso has to say on the matter.

colinbrislawn · April 25, 2018, 5:42pm

If this is a new option, we have to report the sha1 to preserve clustering. So the question is: do we want to change how we present prefix-dereplication, calling it cluster-features-prefix instead?

And that question needs expert advice. Besides Greg, should we ask Rognes and Mahe about this?

gregcaporaso · April 26, 2018, 11:11am

@colinbrislawn, I've been thinking that this should be a parameter of the existing method, and the sha1 of the longest sequence (which is still the sequence defining the feature) would become the feature id. My understanding is that that's what vsearch does. I think this is ok as it wouldn't group sequences together that we know are different from each other, which is consistent with the current behavior. For example, if lower case letters are bases that we haven't sequenced, dereplicate-sequences would group the following two sequences, because as far as we know, they're the same:

s1: AAACCCg
s2: AAACCCc

This is similar to grouping the following when using the prefix parameter:

s1: AAACCCg
s2: AAACCg
s3: AAACCc

As far as we can tell, the sequences are the same, so we group them all. I don't love this behavior, which is why I don't think we should have it enabled by default, but I do think it makes sense. In this case, the sha1 of s1 would become the feature id, and s2 and s3 would be grouped with s1.
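The grouping described above could be sketched roughly as follows. This is a simplified illustration, not vsearch's or q2-vsearch's implementation: the function names `sha1_id` and `derep_prefix` are hypothetical, and the sketch assigns each sequence to the first (longest) matching representative rather than reproducing whatever tie-breaking vsearch applies when a sequence is a prefix of several others.

```python
import hashlib


def sha1_id(seq: str) -> str:
    """sha1 hex digest of a sequence (hypothetical helper)."""
    return hashlib.sha1(seq.encode("ascii")).hexdigest()


def derep_prefix(seqs):
    """Group each sequence under a representative that it is an exact prefix of.

    Returns {feature_id: (representative_sequence, member_count)}.
    """
    clusters = {}  # representative sequence -> number of members
    # Visit the longest sequences first so that representatives are
    # registered before the shorter sequences that may be their prefixes.
    for seq in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if rep.startswith(seq):  # also true for exact duplicates
                clusters[rep] += 1
                break
        else:
            clusters[seq] = 1  # no match: seq starts a new cluster
    return {sha1_id(rep): (rep, n) for rep, n in clusters.items()}


# Only the sequenced (upper case) bases from s1, s2 and s3 above.
observed = ["AAACCC", "AAACC", "AAACC"]
result = derep_prefix(observed)
```

With these three reads, `result` holds a single feature keyed by the sha1 of `AAACCC`, with a member count of 3.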

colinbrislawn · April 26, 2018, 10:05pm

I agree; I don't love this pseudo-dereplication either, but that's what vsearch/usearch does and this maintains compatibility with our OTU picking methods.

It makes sense to compare prefix dereplication to masking. Well said!

Colin

colinbrislawn · July 26, 2018, 5:06pm

PR in progress: ENH: adds --derep_prefix dereplication parameter (Pull Request #54, qiime2/q2-vsearch, by colinbrislawn)

This is a humbling reminder that I know very little about Python.