Error when training NCBI Refseq - QIIME 2 2020.8

TurboQiimer · November 23, 2020, 1:25pm

Hi,
I downloaded my reference sequences from the NCBI and FunGene websites. I tried to train both with QIIME 2, but each one shows the same error!

qiime tools import --type 'FeatureData[Sequence]' --input-path sequence.fna --output-path Refseq.qza
There was a problem importing sequence.fna:

sequence.fna is not a(n) DNAFASTAFormat file:

Invalid character '
' at position 0 on line 26 (does not match IUPAC characters for a DNA sequence).

I cleaned up the line and position that the error reports, but the import still keeps stopping.
Is it a bug? Is there any solution?

Thanks

andrewsanchez · November 23, 2020, 8:50pm

Hi, @TurboQiimer!

Are you sure you are specifying the correct type? Please see this section of the importing tutorial in which aligned representative FASTA sequences are discussed. Perhaps you need to specify the 'FeatureData[AlignedSequence]' type. If that doesn’t work, would you mind sharing the .fna you are working with?

TurboQiimer · November 25, 2020, 12:02pm

Hi again,
I used the type 'FeatureData[Sequence]' when importing SILVA sequences. Anyway, I changed the type and got the same error, so that did not fix it.

qiime tools import \
  --type 'FeatureData[AlignedSequence]' \
  --input-path sequence.fna \
  --output-path RefsequencesdsrB.qza
There was a problem importing sequence.fna:

sequence.fna is not a(n) AlignedDNAFASTAFormat file:

Invalid character '
' at position 0 on line 8 (does not match IUPAC characters for a DNA sequence).

I tried to attach the file, but the file extension is not accepted.

Any idea?

Thanks

TurboQiimer · November 25, 2020, 12:27pm

I am also sharing a photo of the sequence file format. It shows accession numbers and nucleotide lengths, and it looks OK to me. Maybe it will help you.

andrewsanchez · November 25, 2020, 10:18pm

If you zip the file up first, you will be able to attach it.

What did you do to "clean the line"? Did the error message report an invalid character at a different position after that?

I think it's safe to assume that this is not a bug and that the error message is correct: your fasta file is likely corrupt and will need to be modified. If you search for "does not match IUPAC characters for a DNA sequence" on the forum, you will find that this error can arise for a variety of reasons. The solutions others have found may or may not work for you depending on what's wrong with your file. We will need to diagnose precisely what is wrong with your fasta file before prescribing a fix. The first thing that jumps out to me is the blank lines. If you remove the blank lines, are you able to import the file?
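As a way to check the blank-line suspicion before editing anything, here is a minimal sketch that lists where the empty or whitespace-only lines are. The file demo.fna and the variable names are made up for illustration and stand in for sequence.fna; the commands only read the file and do not change it.

```shell
# Hypothetical demo file standing in for sequence.fna (it contains two blank lines).
demo_dir=$(mktemp -d)
printf '>seq1\nACGT\n\n>seq2\nGGCC\n\n' > "$demo_dir/demo.fna"

# Line numbers of every empty or whitespace-only line.
# [[:space:]] also matches a stray carriage return from Windows line endings.
blank_lines=$(grep -n '^[[:space:]]*$' "$demo_dir/demo.fna" | cut -d: -f1)
echo "$blank_lines"

# Total number of such lines.
blank_count=$(grep -c '^[[:space:]]*$' "$demo_dir/demo.fna")
echo "blank lines: $blank_count"
```

If the line numbers printed here match the ones in the import error, blank lines are very likely what the validator is rejecting.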

TurboQiimer · November 26, 2020, 1:01am

Somewhat, yes. The position was still 0, but the line number changed afterward:

at position 0 on line 8 (does not match IUPAC characters for a DNA sequence).
at position 0 on line 14 (does not match IUPAC characters for a DNA sequence).

TurboQiimer · November 26, 2020, 1:02am

sequence.zip (373.3 KB)

Here you are!

TurboQiimer · November 26, 2020, 1:06am

I removed some blank lines, but I could not remove all of them because there are so many. The result was a change in the line number, as you can see:
at position 0 on line 238 (does not match IUPAC characters for a DNA sequence).
Could that be a clue?

TurboQiimer · November 26, 2020, 1:08am

I searched the forum, but I could not find a solution for my case. I am using the RESCRIPt plugin, while the other people's questions were about DADA2 and so forth.

andrewsanchez · November 26, 2020, 1:36am

You can remove all of the blank lines using the sed command like so: sed '/^$/d' sequence.fna > sequence_clean.fna. This will create a new file without the blank lines. That file is importable as type 'FeatureData[Sequence]'.
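The sed command above only deletes lines that are completely empty. If some "blank" lines actually hold spaces, tabs, or Windows carriage returns, a slightly broader variant may be needed. The following is a sketch under that assumption, run on a made-up demo file (demo.fna and the variable names are hypothetical), with a check that nothing blank is left afterward.

```shell
# Hypothetical demo file with Windows line endings, a whitespace-only line, and empty lines.
demo_dir=$(mktemp -d)
printf '>seq1\r\nACGT\r\n\r\n>seq2\nGGCC\n  \n\n' > "$demo_dir/demo.fna"

# Strip carriage returns, then drop empty and whitespace-only lines.
# The original file is left untouched; the result goes to a new file.
tr -d '\r' < "$demo_dir/demo.fna" | sed '/^[[:space:]]*$/d' > "$demo_dir/demo_clean.fna"

# Verify that no blank lines remain (grep -c exits non-zero on zero matches, hence "|| true").
remaining=$(grep -c '^[[:space:]]*$' "$demo_dir/demo_clean.fna" || true)
echo "blank lines remaining: $remaining"
cat "$demo_dir/demo_clean.fna"
```

For the real file, the same pipeline would read sequence.fna and write sequence_clean.fna. These are ordinary Unix shell tools, not QIIME 2 commands, so they run in the same terminal where the qiime commands are typed.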

TurboQiimer · November 26, 2020, 1:46am

Thanks. Could you please tell me in which environment I should run this command? In the QIIME 2 command-line interface, or somewhere else? I have not seen characters like " sed '/^$/d'" in QIIME 2 commands so far, which is why I wanted to make sure which environment it belongs to.

If it is a QIIME 2 plugin, please tell me which plugin it is and where to find it.

By the way, I have already seen some invalid characters in this file, such as long runs of the digit 3 (33333333333) among the A, G, C, T nucleotides. How can I remove them?

Thanks a lot.
Qiimer
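On the question of stray characters such as runs of digits inside the sequences: a reasonable first step is to locate them before deciding whether to repair or drop the affected records, since deleting characters silently changes the sequence. Here is a minimal sketch, assuming the valid residues are the uppercase IUPAC DNA codes (that character set is an assumption here, and the demo file and variable names are made up).

```shell
# Hypothetical demo file; seq2 contains a run of digits inside the sequence.
demo_dir=$(mktemp -d)
printf '>seq1\nACGTNRY\n>seq2\nACG333333TTA\n>seq3\nGGCC\n' > "$demo_dir/demo.fna"

# Skip header lines (those starting with ">") and report the line number and
# content of any sequence line containing a character outside the assumed
# uppercase IUPAC DNA alphabet.
bad_lines=$(awk '!/^>/ && /[^ACGTRYSWKMBDHVN]/ { print NR ": " $0 }' "$demo_dir/demo.fna")
echo "$bad_lines"
```

Each reported line can then be compared against the original record at NCBI or FunGene to see whether the download was damaged.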

TurboQiimer · November 26, 2020, 11:20am

I ran this command in the QIIME 2 command-line interface, and it worked well.
Thank you very much. The cleaned file was also imported successfully as a .qza file.

Thanks a billion, sir.

Qiimer
