dereplicating in RESCRIPt

devonorourke · July 10, 2020, 4:06pm

Hey @Nicholas_Bokulich @SoilRotifer,
When running qiime rescript dereplicate, I was curious what happens when two sequences are identical over the region where they overlap, but one sequence is longer than the other. Are both retained? I'm not even concerned about whether their taxonomic information is similar or different at the moment.

For example:

>seq1
AAATTTCCCGGG
>seq2
AAATTTCCCGGGAAA

If I were to dereplicate these using the default --p-perc-identity 1, would both sequences be retained? My guess is that without an alignment of these two sequences, you'd have to keep both, right?

SoilRotifer · July 10, 2020, 4:14pm

If you check out the help text via:

qiime rescript dereplicate --help

you'll see that the sequences should remain separate unless you use the --p-derep-prefix flag, in which case the shorter sequence will be subsumed into the longer one.
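To make the difference concrete, here is a toy Python sketch of exact versus prefix dereplication applied to the two example sequences from the question. It is only an illustration of the idea, not RESCRIPt's actual implementation: the function names `derep_exact` and `derep_prefix` are made up for this example, taxonomy handling is ignored, and the quadratic prefix check would be far too slow for millions of sequences.

```python
def derep_exact(records):
    """Collapse records whose sequences are exactly identical (same length, same bases)."""
    seen = {}
    for name, seq in records:
        # Keep the first name seen for each unique sequence.
        seen.setdefault(seq, name)
    return [(name, seq) for seq, name in seen.items()]


def derep_prefix(records):
    """Additionally drop any sequence that is a prefix of a longer retained sequence."""
    kept = []
    # Process longest first, so each candidate is only compared against
    # sequences at least as long as itself.
    by_length = sorted(derep_exact(records), key=lambda r: len(r[1]), reverse=True)
    for name, seq in by_length:
        if not any(longer.startswith(seq) for _, longer in kept):
            kept.append((name, seq))
    return kept


# The two sequences from the question (hypothetical toy input).
records = [("seq1", "AAATTTCCCGGG"), ("seq2", "AAATTTCCCGGGAAA")]

print([name for name, _ in derep_exact(records)])   # both sequences survive
print([name for name, _ in derep_prefix(records)])  # only the longer one survives
```

With exact matching, seq1 and seq2 differ in length and are both kept; with the prefix rule, seq1 is absorbed into seq2.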

devonorourke · July 10, 2020, 4:16pm

Awesome - I was reading through the help menu and didn't see that nugget under --p-mode, where I thought it might apply. Just to confirm, to activate that option, I'd just pass --p-derep-prefix? Or is it a boolean thing where I need to add "TRUE"?

@SoilRotifer, what do you think about using that option to reduce the number of sequences in the COI dataset? I'm trying to figure out a way to get the number of seqs down below the current 2.3 million...

Nicholas_Bokulich · July 10, 2020, 4:30pm

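Correct: just pass --p-derep-prefix on its own. QIIME 2 boolean flags take no value, so there is no need to add "TRUE".

Using it to reduce the COI dataset sounds totally reasonable to me.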

devonorourke · July 10, 2020, 6:18pm

Nice. Adding that parameter helped.
Starting with about 3.6 million sequences, dereplicating without --p-derep-prefix crunches it down to about 2.3 million sequences. Adding that flag reduces it to about 1.8 million sequences. Having about 500,000 fewer sequences to align should help.
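For a sense of scale, the reductions can be worked out from the approximate counts above. This is a quick sketch using the rounded figures from this post, so the percentages are rough; the helper name `pct_reduction` is just for illustration.

```python
# Approximate sequence counts quoted in this post.
start = 3_600_000         # before dereplication
after_derep = 2_300_000   # dereplicated without --p-derep-prefix
after_prefix = 1_800_000  # dereplicated with --p-derep-prefix


def pct_reduction(before, after):
    """Percentage of sequences removed going from `before` to `after`."""
    return 100 * (before - after) / before


print(f"exact dereplication:  {pct_reduction(start, after_derep):.1f}% fewer sequences")
print(f"prefix on top of that: {pct_reduction(after_derep, after_prefix):.1f}% fewer sequences")
print(f"overall:               {pct_reduction(start, after_prefix):.1f}% fewer sequences")
```

So the prefix option removes roughly another fifth of the already dereplicated set, and about half of the starting sequences are gone overall.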
