I’m curious whether FeatureIDs might overlap between separate datasets when I pull records from NCBI using different filtering queries. For instance, let’s say I gather all the NCBI data for COI records, using the --p-query parameter to split them into data that does or does not include the "BARCODE"[KYWD] label. Thus, I’d have two datasets:
- NCBI_barcodeYES
- NCBI_barcodeNO
I then run a series of tests in RESCRIPt to evaluate their taxonomic entropy, sequence entropy, etc. Maybe I hog all of my compute cluster building a classifier for a week. It’ll be just fine, because those two datasets are kept separate…
But what if I want to combine those databases, and evaluate if their combination produces a result different from the two original/separated databases? For instance, maybe I wanted to see if my taxonomy/sequence entropy values would change when the two datasets were combined.
Initially, I thought I would just combine them with qiime feature-table merge-seqs, but the challenge here is that in any instance where an identical sequence exists, the outcome is to retain the first of the pair. I don’t want to do that, because it might be that in some instances the barcodeYES taxonomic information contains more of a description than the barcodeNO data, and in other instances it might be the reverse.
What I’d rather do is combine both datasets without any initial filtering for identical sequences, keeping all of the records (barcodeYES and barcodeNO together), and then run qiime rescript dereplicate.
My concern, however, is that I’m not clear how the FeatureIDs are created in the first place. It could be possible, I thought, that duplicate FeatureIDs might be generated in the barcodeYES and barcodeNO datasets that have nothing to do with each other. Maybe FeatureID 1001 in barcodeYES represents a sequence for some butterfly, and FeatureID 1001 in barcodeNO is a totally different sequence that represents some fish?
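To make the worry concrete, here is a minimal Python sketch of the check I have in mind. It assumes the sequences from each dataset have already been exported and loaded into plain {FeatureID: sequence} dictionaries. The function name, the IDs, and the sequences are all made up for illustration and are not real NCBI records or part of any QIIME 2 API.

```python
def compare_feature_ids(seqs_a, seqs_b):
    """Classify FeatureIDs shared by two {feature_id: sequence} mappings.

    Returns two sorted lists: shared IDs whose sequences are identical,
    and shared IDs whose sequences differ (the worrying case).
    """
    shared = set(seqs_a) & set(seqs_b)
    same = sorted(fid for fid in shared if seqs_a[fid] == seqs_b[fid])
    conflicting = sorted(fid for fid in shared if seqs_a[fid] != seqs_b[fid])
    return same, conflicting


# Toy data, invented for illustration only
barcode_yes = {"ID_0001": "ACGTACGT", "ID_0002": "TTGACCAA"}
barcode_no = {"ID_0001": "GGGTTTAA", "ID_0002": "TTGACCAA", "ID_0003": "CCCCAAAA"}

same, conflicting = compare_feature_ids(barcode_yes, barcode_no)
print("shared IDs, identical sequence:", same)
print("shared IDs, different sequence:", conflicting)
```

If the second list were ever non-empty for real data, that would be exactly the butterfly-versus-fish situation described above.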
To conclude, the goal is to:
- combine different get-ncbi-data datasets, without losing any information
- dereplicate those datasets, with the hope that no redundant FeatureIDs were present when the two get-ncbi-data datasets were initially combined
Thanks for the help with this @BenKaehler @Nicholas_Bokulich @SoilRotifer @thermokarst and all other QIIME magicians!