Check for duplicate id with entire line

A_Bennett · April 15, 2021, 5:52pm

QIIME 2 produces an error message when importing a FASTA file if a sequence ID/header is a duplicate. It currently compares only up to the first white space (from what I can tell). This should be changed to compare the entire header line, to allow for sequences from one accession over different ranges. The white space formatting after the accession is important for feeding to NCBI via RESCRIPt.

Here is an example of a format pulled from FunGene that fails to import:

'>JNFD01000014 location=90164..91795,organism=Sphingopyxis sp. LC81'
SEQUENCE
'>JNFD01000014 location=00001..00350,organism=Sphingopyxis sp. LC81'
SEQUENCE

Edited to add: maybe a flag would be ideal, to prevent unintentional uploading of duplicate accessions?

thermokarst · April 15, 2021, 7:09pm

Hi @A_Bennett!

We are using scikit-bio for handling FASTA file parsing (in most places, at least), so we are working off of the skbio interpretation of the FASTA "spec":

http://scikit-bio.org/docs/latest/generated/skbio.io.format.fasta.html#sequence-header

Each sequence header consists of a single line beginning with a greater-than (>) symbol. Immediately following this is a sequence identifier (ID) and description separated by one or more whitespace characters.
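To illustrate how that definition leads to the reported error, here is a minimal sketch in plain Python. It does not call scikit-bio or QIIME 2; `split_header` is a hypothetical helper that only mimics the "ID, then whitespace, then description" rule quoted above, applied to the two headers from the original post.

```python
from collections import Counter


def split_header(line):
    """Split a FASTA header into (id, description) at the first whitespace run.

    Hypothetical helper for illustration; it only mirrors the rule quoted above.
    """
    body = line[1:].strip()          # drop the leading '>' and surrounding whitespace
    parts = body.split(None, 1)      # split once, on any run of whitespace
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description


headers = [
    ">JNFD01000014 location=90164..91795,organism=Sphingopyxis sp. LC81",
    ">JNFD01000014 location=00001..00350,organism=Sphingopyxis sp. LC81",
]

# Both records share the same ID once the description is split off.
ids = [split_header(h)[0] for h in headers]
duplicates = [seq_id for seq_id, count in Counter(ids).items() if count > 1]
print(duplicates)  # ['JNFD01000014']
```

Under this reading the two records differ only in their descriptions, so any check keyed on the ID alone sees them as duplicates.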

So just to make sure I understand, are you proposing that there be an option to include the description field as part of the identifier?

I am concerned that this suggestion might break other things downstream though - if other tools are operating under the assumption that it is >ID DESCRIPTION, this'll just push the error down the road a bit, but, maybe I have misunderstood your suggestion. To be fair, FASTA is one of those squirrely formats with a bunch of different definitions and implementations, so there isn't really a one-size-fits-all approach here, I fear.

Perhaps a solution could be to define a new QIIME 2 format that converts the entire header line into an ID - one way is to replace the spaces with underscores. I can provide a more detailed description of what I have in mind, and why, if you're interested.

Let me know!


A_Bennett · April 16, 2021, 2:03pm

Thank you for the reply @thermokarst

So just to make sure I understand, are you proposing that there be an option to include the description field as part of the identifier?

Yes.

I am concerned that this suggestion might break other things downstream though - if other tools are operating under the assumption that it is >ID DESCRIPTION, this’ll just push the error down the road a bit, but, maybe I have misunderstood your suggestion.

Perhaps a solution could be to define a new QIIME 2 format that converts the entire header line into an ID - one way is to replace the spaces with underscores. I can provide a more detailed description of what I have in mind, and why, if you’re interested.

Fair points.
This functionality is important for handling functional genes where multiple copies are expected from one accession number. My current focus is to build classifiers for such genes. Substituting the white space is plausible, but then plugins like RESCRIPt would need to expect, and parse, the modified line.

I previously spoke with @Nicholas_Bokulich about expanding RESCRIPt to handle multiple copies per accession. My script works through the API; duplicate IDs are present, with differentiating descriptions. Importing a file.fa is the current roadblock for command-line use. (One caveat: I am using .id() from SeqIO in Bio. It too works by retrieving the first word.) @Nicholas_Bokulich, do you have any thoughts or opinions on the topic?

It occurred to me that the multinomial Naive Bayes classifier is in skbio. You are probably right that having duplicate IDs would cause an issue. I have yet to build a classifier with my data output. In light of this, the argument for substitution is stronger.

Nicholas_Bokulich · April 22, 2021, 5:08am

Hi @A_Bennett ,
Just a quick update and apologies for the long silence — this is a QIIME 2 release week so most developers are busy.

We have been discussing this out-of-loop to figure out a long-term solution. I think @thermokarst's answer already sums up some of the different paths we are considering, but this will not be a quick fix. For the time being, as a workaround, I agree with you and @thermokarst that replacing spaces with underscores would be the easiest fix.
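A minimal sketch of that workaround, assuming it is applied to the FASTA file before import. This is not a QIIME 2 or RESCRIPt feature; `flatten_headers` is a hypothetical helper, and the sequences in the example are placeholders standing in for the real data.

```python
import re


def flatten_headers(lines):
    """Yield FASTA lines with whitespace in each header replaced by underscores.

    Hypothetical helper. Raises ValueError if two headers are still identical
    after flattening, since those would remain duplicate IDs.
    """
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Collapse every run of whitespace in the header into one underscore.
            new_id = re.sub(r"\s+", "_", line[1:].strip())
            if new_id in seen:
                raise ValueError(f"duplicate header after flattening: {new_id}")
            seen.add(new_id)
            yield ">" + new_id
        else:
            yield line  # sequence lines pass through unchanged


# Headers from the original post; ACGT and TTGA are placeholder sequences.
example = [
    ">JNFD01000014 location=90164..91795,organism=Sphingopyxis sp. LC81",
    "ACGT",
    ">JNFD01000014 location=00001..00350,organism=Sphingopyxis sp. LC81",
    "TTGA",
]

flattened = list(flatten_headers(example))
for line in flattened:
    print(line)
```

The same generator accepts an open file handle in place of the list. One caveat: accessions and organism names can themselves contain underscores, so recovering the original fields from the flattened ID later may need a stricter convention than this sketch uses.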