RESCRIPt question: Can you/should you dereplicate NCBI data?


I have used the RESCRIPt plugin to import nifH nucleotide data from NCBI to create a classifier for this gene, and am attempting to quality-filter the data so that it's more accurate when run against my representative sequences.

For the Silva tutorial, I noticed that there is a dereplicate option before and after making an amplicon-specific classifier, using specific commands such as:

--p-rank-handles 'silva' \
--p-mode 'uniq' \

I tried to dereplicate my data, keeping --p-mode the same but switching 'silva' to 'ncbi' in --p-rank-handles, but both commands caused errors, so I'm assuming these options are specific to SILVA, Greengenes, etc. If so, why don't we need to dereplicate with these settings for NCBI data?

I also noticed that this command:

--p-perc-identity 97

also doesn't work when I try to dereplicate. Why is this not necessary/used with NCBI data?

Just trying to understand the ins-and-outs of this stuff for my own sanity. Any clarification would be deeply appreciated. Thank you!!!

What are the errors?

The --p-rank-handles parameter simply sets the expected default prefix style. You can check out the code here:

rank_handles = {
    'silva': [' d__', ' p__', ' c__', ' o__', ' f__', ' g__', ' s__'],
    'greengenes': ['k__', 'p__', 'c__', 'o__', 'f__', 'g__', 's__'],
    'gtdb': ['k__', 'p__', 'c__', 'o__', 'f__', 'g__', 's__'],
    'disable': None,
}

You can, of course, disable these using --p-rank-handles 'disable'.
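To make the role of these prefixes concrete, here is a minimal Python sketch of how a list of rank handles could be used to pad a truncated taxonomy string with empty ranks. This is an illustration only, not RESCRIPt's actual implementation: the `backfill` function and the example taxonomy are made up, and the dictionary copies just two styles plus 'disable' from the excerpt above. It also shows why 'ncbi' is rejected: it is simply not one of the keys.

```python
# Hypothetical illustration; NOT RESCRIPt's real code.
# Only a subset of the styles from the excerpt is reproduced here.
RANK_HANDLES = {
    'silva': [' d__', ' p__', ' c__', ' o__', ' f__', ' g__', ' s__'],
    'greengenes': ['k__', 'p__', 'c__', 'o__', 'f__', 'g__', 's__'],
    'disable': None,
}


def backfill(taxonomy, style):
    """Pad a semicolon-delimited taxonomy with empty rank prefixes."""
    handles = RANK_HANDLES[style]
    if handles is None:
        # 'disable': leave the taxonomy string untouched.
        return taxonomy
    ranks = taxonomy.split(';')
    # Append the bare prefix for every rank missing from the end.
    ranks.extend(handles[len(ranks):])
    return ';'.join(ranks)


# Made-up taxonomy that stops at class level.
truncated = 'k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria'

print(backfill(truncated, 'greengenes'))
print(backfill(truncated, 'disable'))
print('ncbi' in RANK_HANDLES)  # 'ncbi' is not a defined style
```

With 'disable', the string passes through unchanged, which is the safest choice when your NCBI-derived taxonomy does not follow one of the listed prefix conventions.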

Dereplication has always been optional, but it is often recommended to keep the database size small and remove redundant information.
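As a rough picture of what "removing redundant information" means, the sketch below keeps one record per unique sequence/taxonomy pair, which is the general idea behind a 'uniq'-style dereplication. It is a toy example under stated assumptions: `dereplicate_uniq` is a hypothetical helper, the records are invented, and the real plugin works on QIIME 2 artifacts rather than Python tuples.

```python
def dereplicate_uniq(records):
    """Keep the first record for each unique (sequence, taxonomy) pair.

    records: iterable of (seq_id, sequence, taxonomy) tuples.
    """
    kept = {}
    for seq_id, seq, tax in records:
        # setdefault only stores the first ID seen for a given pair.
        kept.setdefault((seq, tax), seq_id)
    return [(seq_id, seq, tax) for (seq, tax), seq_id in kept.items()]


# Invented toy records; sequences are far shorter than real nifH reads.
records = [
    ('A1', 'ATGGCC', 'g__Azotobacter;s__vinelandii'),
    ('A2', 'ATGGCC', 'g__Azotobacter;s__vinelandii'),    # exact duplicate of A1
    ('A3', 'ATGGCC', 'g__Azotobacter;s__chroococcum'),   # same sequence, different label
    ('B1', 'ATGGTT', 'g__Bradyrhizobium;s__japonicum'),
]

for record in dereplicate_uniq(records):
    print(record)
```

Here A2 is dropped as fully redundant, while A3 is kept because its taxonomy differs even though its sequence matches A1.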

Paste the error output here. Have you worked through this COI NCBI tutorial? You'll see that rescript dereplicate is used there.

What ranks did you choose while downloading data from NCBI? Can you list all commands used prior to the dereplication step?


Thanks for the quick response @SoilRotifer !

I'd send the errors to you now but someone else is using the computer the errors are posted on :sweat_smile:

They do roughly state that:

  1. the --p-rank-handles option is not compatible with 'ncbi', so, based on the COI NCBI tutorial, I can keep using 'silva' or just use 'disable'; I think I'll stick with the former

  2. --p-perc-identity 0.99 is not usable as I ran it, which leads me to think that NCBI data isn't formatted so that you can use this option. However, in the NCBI COI tutorial, when they used this option they had to set --p-mode to 'lca' for "computed consensus taxonomy", so I might try that

I will try rerunning the code with just 'silva', adding --p-perc-identity together with --p-mode 'lca', and see if that works...
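For anyone wondering why an identity threshold below 100% would go together with a consensus mode: once near-identical sequences are grouped, members of a group can carry different labels, so a single shared taxonomy has to be computed. The following is a minimal sketch of a least-common-ancestor calculation, assuming semicolon-delimited taxonomy strings. The `lca` function and the example cluster are hypothetical and are not taken from RESCRIPt.

```python
def lca(taxonomies, delimiter=';'):
    """Return the longest rank prefix shared by every taxonomy string."""
    split = [t.split(delimiter) for t in taxonomies]
    shared = []
    # Walk the ranks in parallel and stop at the first disagreement.
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return delimiter.join(shared)


# Invented cluster: two sequences that agree down to genus only.
cluster = [
    'k__Bacteria;p__Proteobacteria;g__Azotobacter;s__vinelandii',
    'k__Bacteria;p__Proteobacteria;g__Azotobacter;s__chroococcum',
]

print(lca(cluster))
```

The consensus stops at genus because the species labels disagree, so the clustered representative would lose its species-level annotation.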
