Hi QIIME2 folks. I am getting an error while using a .qza file to train a classifier.
The error is pretty obvious and I have seen others like it:
Invalid characters on line 3 (does not match IUPAC characters for a DNA sequence).
The fasta file embedded in the data folder of the .qza has empty lines between the sequences, which I believe q2 doesn't support, but I am puzzled: the .qza was created and used successfully by a colleague to run the same code. I get the same error when trying to rebuild a .qza file to correct the reference sequences. So, did something change in how .qza files are built or in how the fasta file is checked?
Also, does q2 simply check the format and, if OK, put the file in the data folder within the zipped .qza, or does it rewrite the file (which sounds unwieldy)?
Could this be a version issue, as I am using the latest update and he wasn't?
I imagine I need a fasta file without the empty lines, but I am trying to understand why the .qza file doesn't work for me when it worked for him.
Also, given the error, can someone suggest a tool to parse my "invalid" fasta file?
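In the meantime, here is a minimal sketch of what I have in mind for cleaning the file, assuming the empty lines are the only problem. The function name strip_blank_lines and the file names in the usage comment are placeholders I made up, not part of QIIME 2.

```python
def strip_blank_lines(in_path, out_path):
    """Copy a FASTA file, dropping empty or whitespace-only lines.

    Returns the number of lines that were dropped.
    """
    dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.strip():
                # Keep the line, normalising trailing whitespace and line endings.
                dst.write(line.rstrip() + "\n")
            else:
                dropped += 1
    return dropped


# Hypothetical usage (file names are placeholders):
# n = strip_blank_lines("dna-sequences.fasta", "dna-sequences.clean.fasta")
# print(f"dropped {n} empty lines")
```

If that works, I assume the cleaned file could then be re-imported with qiime tools import as FeatureData[Sequence], but I have not confirmed that this gets past the format check.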
Sorry if this falls outside the realm of User Support - trying to teach myself, but clearly not quite there. Feel free to redirect me if this belongs somewhere else.
What I ran (in a Jupyter notebook):
!qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ../nash/Downloads/XXotus.qza \
--i-reference-taxonomy ../nash/Downloads/XXref-taxonomy.qza \
--o-classifier XX12s-classifier-no-extract-reads.qza
Plugin error from feature-classifier:
/var/folders/z2/bjsdm3_n4jz_wv6n2c6n1gfc0000gn/T/qiime2-archive-08clw6ud/26b4aa19-28bb-4376-a4ea-42fc13b6446d/data/dna-sequences.fasta is not a(n) DNAFASTAFormat file:
Invalid characters on line 3 (does not match IUPAC characters for a DNA sequence).
Debug info has been saved to /var/folders/z2/bjsdm3_n4jz_wv6n2c6n1gfc0000gn/T/qiime2-q2cli-err-2bs_ji12.log
The top few lines in the fasta file embedded in the .qza showing the empty lines:
1
ACTATGCACAGCCCTAAACTTTGATAGAAACATTACACCCACTATCCGCCAGGGTACTACGAGCTCTAGCTTAAAATCCAAAGGACTTGGCGGTGCTTTAGACCCAC
2
ACTATGCCTAGCCCTAAACATTGGCAACACAAAACACCCGTTGCCCGCCAGGGCACTACGAGCATTAGCTTAAAACCCAAAGGACTTGGCGGTGCTTTAGACCCAC
4
ACTATGCTTAACTGTAAACAAAGATGATAATACACAAACATCATCCGCCAGGGGATTACGAGCAAAGTTTAAAACCCAAAGGACTTGGCGGTGCCTCAAACCCAC
6
ACTATGCCCTGCCGTAAACTTAGATATTTCAATACAACAAATATCCGCCCGGGGACTACGAGCGCCAGCTTAAAACCCAAAGGACTTGGCGGTGCTTCAGACCCCC
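For anyone who wants to check their own artifact, this is a rough sketch of how the empty lines can be located without unzipping by hand. It assumes the .qza is a plain zip archive with the sequences stored at <uuid>/data/dna-sequences.fasta, which is what the path in the error message above suggests. find_blank_lines is a made-up helper name, not a QIIME 2 function.

```python
import zipfile


def find_blank_lines(qza_path, member_suffix="data/dna-sequences.fasta"):
    """Return 1-based line numbers of empty lines in the FASTA inside a .qza.

    Assumes the archive layout <uuid>/data/dna-sequences.fasta.
    """
    with zipfile.ZipFile(qza_path) as archive:
        # The top-level directory is a UUID, so match on the end of the path.
        names = [n for n in archive.namelist() if n.endswith(member_suffix)]
        if not names:
            raise FileNotFoundError(f"no member ending in {member_suffix!r}")
        with archive.open(names[0]) as handle:
            # Lines are read as bytes; an empty or whitespace-only line strips to b"".
            return [i for i, raw in enumerate(handle, start=1) if not raw.strip()]


# Hypothetical usage with the artifact from the command above:
# print(find_blank_lines("../nash/Downloads/XXotus.qza"))
```

If the first number reported is 3, that would match the "Invalid characters on line 3" message.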