merge sequences without dropping identical sequences

devonorourke · August 10, 2020, 1:51pm

I'm curious if FeatureIDs might overlap between separate datasets when I pull records from NCBI using different query filters. For instance, let's say I was going to gather all the NCBI data for COI records, split by the --p-query parameter into records that do or do not include the "BARCODE"[KYWD] label. Thus, I'd have two datasets:

NCBI_barcodeYES
NCBI_barcodeNO

I then run a series of tests in RESCRIPt to evaluate their taxonomic entropy, sequence entropy, etc. Maybe I hog all of my compute cluster building a classifier for a week. It'll be just fine, because those two datasets are kept separate...

But what if I want to combine those databases, and evaluate if their combination produces a result different from the two original/separated databases? For instance, maybe I wanted to see if my taxonomy/sequence entropy values would change when the two datasets were combined.

Initially, I thought I would just combine them with qiime feature-table merge-seqs, but the challenge here is that any instance where an identical sequence exists, the outcome is to retain the first of the pair. I don't want to do that, because it might be that in some instances the barcodeYES taxonomic information contains more of a description than barcodeNO data, and in other instances it might be the reverse.

What I'd rather have happen is that I combine both datasets without any initial filtering for identical sequences - just keep all the records.

Instead, I'd like to be able to keep all of those combined records (barcodeYES and barcodeNO together), and then run qiime rescript dereplicate.

My concern, however, is that I'm not clear how the FeatureIDs are created in the first place. It could be possible, I thought, that duplicate FeatureIDs might be generated in the barcodeYES and barcodeNO datasets that have nothing to do with each other? Maybe FeatureID 1001 in barcodeYES represents a sequence for some butterfly, and FeatureID 1001 in barcodeNO is a totally different sequence that represents some fish?

To conclude, the goal is to:

combine different get-ncbi-data datasets, without losing any information
dereplicate those datasets with the hope that no redundant FeatureIDs are present when the two get-ncbi-data datasets were initially combined

Thanks for the help with this @BenKaehler @Nicholas_Bokulich @SoilRotifer @thermokarst and all other QIIME magicians!

Nicholas_Bokulich · August 10, 2020, 3:33pm

Hi @devonorourke,

I think the first step is to poke around a little bit to figure out what the feature IDs look like and how likely you are to have a namespace clash. I expect that the situation you describe (one feature ID pointing to two unrelated sequences) should be rare and the accession IDs should be unique, but it's worth taking a look.
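One way to take that first look is sketched below. It assumes the two sequence sets have been exported to FASTA text; the `parse_fasta_ids` helper and the example records are hypothetical stand-ins, not real get-ncbi-data output.

```python
def parse_fasta_ids(text):
    """Return the record IDs (first token after '>') found in FASTA text."""
    return [
        line[1:].split()[0]
        for line in text.splitlines()
        if line.startswith(">") and line[1:].strip()
    ]

# Hypothetical records standing in for the two exported datasets.
barcode_yes = ">AB000001.1\nACGTACGT\n>AB000002.1\nACGTTTTT\n"
barcode_no = ">AB000002.1\nACGTTTTT\n>AB000003.1\nGGGGACGT\n"

ids_yes = parse_fasta_ids(barcode_yes)
ids_no = parse_fasta_ids(barcode_no)

# IDs that appear in both datasets are the candidates for a namespace clash.
shared_ids = sorted(set(ids_yes) & set(ids_no))
print("IDs present in both datasets:", shared_ids)
```

A shared ID is not a problem by itself (the same record may simply match both queries); it only marks the entries worth inspecting.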

But I think I have the solution for you that will accomplish both of your goals, and if there are any namespace clashes (which there shouldn't be!) it will be pretty easy to figure it out based on the output.

The solution is to use rescript merge-taxa. That method will merge the taxonomies, using a variety of "modes" to sort out overlap (e.g., using LCA if there are 2+ entries with the same feature ID). You could then inspect the result to see if any taxonomies were truncated unexpectedly (indicating a significant namespace clash, like a butterfly and a fish with the same ID). If all worked well, just use qiime feature-table merge-seqs and proceed.
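To show why an LCA-style merge exposes a clash, here is a rough sketch of the idea. It is not RESCRIPt's implementation; the `lca` and `merge_taxa_lca` functions, the "; " separator, and the taxonomy strings are all assumptions made for illustration.

```python
def lca(taxonomies, sep="; "):
    """Return the longest leading run of ranks shared by all taxonomy strings."""
    split = [t.split(sep) for t in taxonomies]
    shared = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break  # first rank where the entries disagree
        shared.append(ranks[0])
    return sep.join(shared)

def merge_taxa_lca(*tables):
    """Merge {feature_id: taxonomy} mappings, resolving repeated IDs by LCA."""
    grouped = {}
    for table in tables:
        for fid, tax in table.items():
            grouped.setdefault(fid, []).append(tax)
    return {fid: lca(taxa) for fid, taxa in grouped.items()}

# Hypothetical taxonomies: ID 1001 is a butterfly in one set and a fish in the other.
tax_yes = {
    "1001": "k__Animalia; p__Arthropoda; c__Insecta; o__Lepidoptera",
    "2002": "k__Animalia; p__Chordata; c__Actinopteri",
}
tax_no = {
    "1001": "k__Animalia; p__Chordata; c__Actinopteri",
    "2002": "k__Animalia; p__Chordata; c__Actinopteri; o__Perciformes",
}

merged = merge_taxa_lca(tax_yes, tax_no)
for fid, tax in merged.items():
    print(fid, "->", tax)
```

In this toy input the clashing ID 1001 collapses to the kingdom alone, while ID 2002, whose two entries agree as far as they go, keeps its shared ranks. A taxonomy that is cut that short is the signal to look for.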

devonorourke · August 10, 2020, 3:46pm

Thanks @Nicholas_Bokulich!

Is the idea here that there would be multiple FeatureIDs represented in that merged file? Would I just export the merged taxonomy .qza and then inspect for duplicate FeatureIDs? Or perhaps there is something within QIIME I should be using?

Regarding using rescript merge-taxa:

In your description above, merging with a common featureID means that you're merging things with a common sequence as the feature, correct? Not the numeric identifier like 10001758, right?

As an example, say I know that there is an identical Feature ID name in the combined data set, but those have different sequences. These are named 1001 below. Given the following records, I'd expect both to be retained because even though they have identical Feature ID names, they have different sequence compositions:

>1001
AAAAACAAAAA
>1001
AAAAACTTTTT

Is that correct?

Nicholas_Bokulich · August 10, 2020, 3:56pm

No, it merges on feature ID. rescript dereplicate would merge based on identical (or similar within threshold) sequence.
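The difference between the two keys can be sketched with the two 1001 records from the question. This is only an illustration: the keep-first policy in the ID-keyed merge is an assumption (merge-taxa has other modes, such as LCA), and the sequence-keyed grouping handles exact matches only, with no similarity threshold.

```python
# The two records from the example above: same ID, different sequences.
records = [("1001", "AAAAACAAAAA"), ("1001", "AAAAACTTTTT")]

# Keyed on feature ID: the second record collides with the first.
by_id = {}
for fid, seq in records:
    by_id.setdefault(fid, seq)  # assumed policy: keep the first entry seen

# Keyed on sequence: each distinct sequence is its own group.
by_seq = {}
for fid, seq in records:
    by_seq.setdefault(seq, []).append(fid)

print("entries when keyed on ID:", len(by_id))
print("entries when keyed on sequence:", len(by_seq))
```

Keyed on ID, the two records become one entry; keyed on sequence, both survive. That is why a repeated ID with unrelated sequences has to be caught before anything is merged by ID.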

No, that's bad news... if you have a namespace clash like that it's bad form on the part of the database creators (i.e., whoever created the database should have used more uniquely identifying feature IDs than "1001").

So the hope is that you don't have accidental feature ID repetition like this when merging your taxonomies and sequences. They shouldn't exist... and rescript merge-taxa will allow you to identify any such clashes (because using LCA mode will cause totally disparate taxonomies to be truncated).

What it would fail to detect is if you have identical feature IDs, identical (or similar) taxonomies, but totally different sequences. So hopefully your databases use uniquely identifying feature IDs for unique sequences!
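That blind spot could be covered by comparing the sequences directly. The sketch below assumes each dataset has been loaded into a plain {feature ID: sequence} mapping; the `sequence_conflicts` function and the example data are hypothetical.

```python
def sequence_conflicts(seqs_a, seqs_b):
    """Return IDs found in both mappings whose sequences differ (case-insensitive)."""
    return sorted(
        fid
        for fid in seqs_a.keys() & seqs_b.keys()
        if seqs_a[fid].upper() != seqs_b[fid].upper()
    )

# Hypothetical data: 1001 clashes, 2002 is the same record in both sets.
seqs_yes = {"1001": "AAAAACAAAAA", "2002": "ACGTACGT"}
seqs_no = {"1001": "AAAAACTTTTT", "2002": "acgtacgt", "3003": "GGGGCCCC"}

conflicts = sequence_conflicts(seqs_yes, seqs_no)
print("shared IDs with different sequences:", conflicts)
```

An empty result would suggest that repeated IDs refer to the same sequence, so merging the sequences by ID should not discard anything distinct.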

devonorourke · August 10, 2020, 4:17pm

Perfect, thanks. Hopefully the featureID names when gathering data from NCBI won't conflict. I'll have to wait until at least tomorrow morning to find out though!
