Hey @Nicholas_Bokulich @SoilRotifer,
When running qiime rescript dereplicate, I was curious what happens when two sequences are identical over their shared length, but one is longer than the other. Are both retained? I’m not even concerned about whether their taxonomic information is similar or different at the moment.
For example:
>seq1
AAATTTCCCGGG
>seq2
AAATTTCCCGGGAAA
If I were to dereplicate these using the default --p-perc-identity 1, would both sequences be retained? My guess is that without an alignment of the two sequences, you’d have to keep both, right?
Awesome - I was reading through the help menu and didn’t see that nugget under --p-mode, where I thought it might apply. Just to confirm: to activate that option, I’d just pass --p-derep-prefix? Or is it a boolean thing where I need to add “TRUE”?
What do you think, @SoilRotifer, about using that option to reduce the number of sequences in the COI dataset? I’m trying to figure out a way to get the number of seqs down below the current 2.3 million…
Nice. Adding that parameter helped.
Starting with about 3.6 million sequences, dereplicating without --p-derep-prefix crunches it down to about 2.3 million. Adding that flag reduces it to about 1.8 million. Having roughly 500,000 fewer sequences to align should help.
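For anyone following along, here is a toy sketch in plain Python of why the seq1/seq2 example above behaves differently in the two modes. This is not RESCRIPt's or VSEARCH's actual code; the function names are made up for illustration, and keeping the longer record when a shorter one is its prefix is an assumption of this sketch. It also ignores taxonomy entirely.

```python
def derep_full_length(seqs):
    """Collapse only records whose sequences are identical over their full length."""
    seen = set()
    kept = {}
    for name, seq in seqs.items():
        if seq not in seen:
            seen.add(seq)
            kept[name] = seq
    return kept


def derep_prefix(seqs):
    """Also collapse a record whose sequence is a prefix of a longer kept sequence.

    Assumption for this sketch: the longer record is the one retained.
    """
    kept = {}
    # Visit the longest sequences first so shorter ones are compared against them.
    by_length = sorted(seqs.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in by_length:
        if not any(longer.startswith(seq) for longer in kept.values()):
            kept[name] = seq
    return kept


# The two records from the example earlier in the thread.
toy = {"seq1": "AAATTTCCCGGG", "seq2": "AAATTTCCCGGGAAA"}

full_length_result = derep_full_length(toy)
prefix_result = derep_prefix(toy)

print(sorted(full_length_result))  # both records survive
print(sorted(prefix_result))       # only the longer record survives
```

The quadratic scan in derep_prefix is only fine for a toy like this; for millions of sequences you would want the real tool.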