FASTA file in .qza has invalid characters

BruceNash · October 15, 2019, 3:13pm

Hi QIIME2 folks. I am getting an error while using a .qza file to train a classifier.

The error is pretty obvious and I have seen others like it:

Invalid characters on line 3 (does not match IUPAC characters for a DNA sequence).

The fasta file embedded in the data folder of the .qza has empty lines in between the sequences, which I believe isn't supported by q2, but I am puzzled. The .qza was created and used successfully by a colleague to run the same code. I get the same error when trying to rebuild a .qza file to correct the reference sequences. So, did something change in how .qza files are built or how the fasta file is checked?

Also, does q2 simply check the format and, if OK, put it in the data folder within the zipped .qza, or does it rewrite the file (which sounds unwieldy)?

Could this be a version issue, as I am using the latest update and he wasn't?

I imagine I need to have a fasta file without the empty lines, but I am trying to understand why the .qza file doesn't work for me but worked for him.

Also, given the error, can someone suggest a tool to parse my "invalid" fasta file?

Sorry if this falls outside the realm of User Support - trying to teach myself, but clearly not quite there. Feel free to redirect me if this belongs somewhere else.

What I ran (in a Jupyter notebook):

!qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ../nash/Downloads/XXotus.qza \
--i-reference-taxonomy ../nash/Downloads/XXref-taxonomy.qza \
--o-classifier XX12s-classifier-no-extract-reads.qza

Plugin error from feature-classifier:

/var/folders/z2/bjsdm3_n4jz_wv6n2c6n1gfc0000gn/T/qiime2-archive-08clw6ud/26b4aa19-28bb-4376-a4ea-42fc13b6446d/data/dna-sequences.fasta is not a(n) DNAFASTAFormat file:

Invalid characters on line 3 (does not match IUPAC characters for a DNA sequence).

Debug info has been saved to /var/folders/z2/bjsdm3_n4jz_wv6n2c6n1gfc0000gn/T/qiime2-q2cli-err-2bs_ji12.log

The top few lines in the fasta file embedded in the .qza showing the empty lines:

1
ACTATGCACAGCCCTAAACTTTGATAGAAACATTACACCCACTATCCGCCAGGGTACTACGAGCTCTAGCTTAAAATCCAAAGGACTTGGCGGTGCTTTAGACCCAC

2
ACTATGCCTAGCCCTAAACATTGGCAACACAAAACACCCGTTGCCCGCCAGGGCACTACGAGCATTAGCTTAAAACCCAAAGGACTTGGCGGTGCTTTAGACCCAC

4
ACTATGCTTAACTGTAAACAAAGATGATAATACACAAACATCATCCGCCAGGGGATTACGAGCAAAGTTTAAAACCCAAAGGACTTGGCGGTGCCTCAAACCCAC

6
ACTATGCCCTGCCGTAAACTTAGATATTTCAATACAACAAATATCCGCCCGGGGACTACGAGCGCCAGCTTAAAACCCAAAGGACTTGGCGGTGCTTCAGACCCCC

Nicholas_Bokulich · October 15, 2019, 10:30pm

Hi @BruceNash,
Yes, QIIME 2 recently upgraded its type validation for FeatureData[Sequence] artifacts, which is why your colleague was able to import this file but you could not.

That your colleague could successfully use that file without raising an error is down to plain luck. At some point or another that file might have caused problems with one or more plugins.

QIIME 2 and its plugins use several different software packages under the hood, each of which has its own format requirements. We try to keep format requirements as flexible as possible, but we do set them when (a) they are required for smooth operation or (b) they just make sense.

By default, QIIME 2 only validates the first few lines upon import, but you can always use qiime tools validate to validate the entire file (which is generally recommended).
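To see which line a whole-file check would trip on, here is a minimal Python sketch that scans every line rather than just the first few. It is not QIIME 2's own validator: the function names are hypothetical, and the accepted character set (uppercase IUPAC nucleotide codes only) is an assumption that may differ from what DNAFASTAFormat actually allows.

```python
import re

# Assumption: uppercase IUPAC nucleotide codes only. QIIME 2's own
# validator may accept a slightly different set of characters.
IUPAC_DNA = re.compile(r"^[ACGTRYSWKMBDHVN]+$")


def first_invalid_line(lines):
    """Return the 1-based number of the first bad line, or None if all pass."""
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if text.startswith(">"):
            continue  # header line, not sequence data
        if not IUPAC_DNA.match(text):
            return number  # an empty line fails the match, so it lands here
    return None


def check_fasta(path):
    """Run the check over a FASTA file on disk (hypothetical helper)."""
    with open(path) as handle:
        return first_invalid_line(handle)
```

On a file laid out as header, sequence, empty line, this reports line 3, the same line number as in the error message above.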

On import, QIIME 2 just checks the first few lines; it does not read and rewrite the entire file.
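Since the file is not rewritten on import, the empty lines have to be removed from the FASTA itself before re-importing. A minimal Python sketch of one way to do that follows; the function names and any file paths you pass in are hypothetical, and it assumes the empty lines are the only problem in the file.

```python
def strip_blank_lines(lines):
    """Yield only the lines that contain something other than whitespace."""
    for line in lines:
        if line.strip():
            yield line


def clean_fasta(src_path, dst_path):
    """Copy src_path to dst_path, leaving out the blank lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(strip_blank_lines(src))
```

The cleaned file can then be imported again as a FeatureData[Sequence] artifact with qiime tools import and checked with qiime tools validate before training the classifier.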

This is definitely user support!

Keep on QIIMEing
