I used the RESCRIPt plugin to import nifH nucleotide data from NCBI in order to build a classifier for this gene, and I am now trying to quality-filter the data so that the classifier is more accurate when run against my representative sequences.
In the Silva tutorial, I noticed that there is a dereplication step both before and after making an amplicon-specific classifier, using parameters such as:
--p-rank-handles 'silva' \
--p-mode 'uniq' \
I tried to dereplicate my data, keeping the --p-mode parameter the same but swapping 'silva' for 'ncbi' in --p-rank-handles. Both parameters caused errors, so I'm assuming these settings are specific to Silva, Greengenes, etc. If so, why don't we need to dereplicate with these settings for NCBI data?
I also noticed that this parameter:
--p-perc-identity 97
also doesn't work when I try to dereplicate. Why is this not necessary/used with NCBI data?
Just trying to understand the ins-and-outs of this stuff for my own sanity. Any clarification would be deeply appreciated. Thank you!!!
I'd send the errors to you now, but someone else is using the computer the errors are on.
They roughly state that:
The --p-rank-handles parameter does not accept 'ncbi', so in this case, based on the COI NCBI tutorial, I can either keep using 'silva' or set it to 'disable'; I think I'll stick with the former.
The --p-perc-identity 0.99 setting was rejected, which makes me think NCBI data isn't formatted in a way that works with this parameter. However, in the NCBI COI tutorial, when they used this parameter they set --p-mode to 'lca' for "computed consensus taxonomy", so I might try that.
I will try rerunning the code with 'silva' for --p-rank-handles and with --p-perc-identity included under --p-mode 'lca', and see if that works...
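For reference, the rerun I have in mind would look roughly like the sketch below. The file names are placeholders I made up for this post, not my real artifacts, and the 'silva' value for --p-rank-handles only applies to the RESCRIPt version I'm on, so treat this as a guess at the command rather than a confirmed working one. It prints the command first and only runs it if QIIME 2 and the input files are actually present.

```shell
#!/bin/sh
# Placeholder artifact names (assumptions for illustration only).
SEQS="nifH-ncbi-seqs.qza"
TAXA="nifH-ncbi-taxonomy.qza"

# Identity threshold written as a fraction, as in the 0.99 I tried earlier.
PERC_IDENTITY="0.99"

# Assemble the dereplicate call as one string so it can be checked before running.
# 'lca' mode and 'silva' rank handles are the settings I plan to test.
CMD="qiime rescript dereplicate \
--i-sequences $SEQS \
--i-taxa $TAXA \
--p-mode lca \
--p-perc-identity $PERC_IDENTITY \
--p-rank-handles silva \
--o-dereplicated-sequences nifH-ncbi-seqs-derep-lca.qza \
--o-dereplicated-taxa nifH-ncbi-taxonomy-derep-lca.qza"

# Show the command that would be run.
echo "$CMD"

# Only execute when QIIME 2 is on the PATH and both inputs exist.
if command -v qiime >/dev/null 2>&1 && [ -f "$SEQS" ] && [ -f "$TAXA" ]; then
  eval "$CMD"
fi
```

If this still errors out, I'll try the same thing with --p-rank-handles 'disable' and post the exact error text.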