Training-feature-classifier 2019.10: Error


I have a quick question regarding training-feature-classifier. I had previously trained a classifier using qiime2-2019.4, but I figure since the new update came out, I might as well train a new classifier. I am doing all the work on the cluster, so it should work, but I am having an error stating:

There was a problem importing ------
------sh_refs_qiime_ver8_dynamic_s_02.02.2019_dev.fasta is not a(n)
DNAFASTAFormat file:
------Invalid characters on line 538 (does not match IUPAC characters for a
DNA sequence).

This is the code that I ran. I had used it before, but for some reason it is not working now.

If you could please let me know how to fix this problem, I would really appreciate it.

Hi! It looks like you should open your FASTA file and check manually which character on line 538 causes the error and does not match the IUPAC characters.
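Spaces, tabs, and carriage returns are easy to miss in a normal editor view, so it can help to print the suspect line in a form that makes them visible. The following is only a minimal sketch: the sample file `example.fasta` and the line number 2 are hypothetical stand-ins for the UNITE file and the line named in the error message.

```shell
# Hypothetical three-line sample; the sequence on line 2 has a leading space.
workdir=$(mktemp -d)
printf '>seq1\n ACGT\n>seq2\n' > "$workdir/example.fasta"

# Print only line 2 in sed's "unambiguous" form: the end of the line is
# marked with $, and tabs or carriage returns are shown as \t or \r.
shown=$(sed -n '2l' "$workdir/example.fasta")
printf '%s\n' "$shown"
```

A gap before the first base or before the closing `$` points to leading or trailing whitespace on that line.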

Okay, I will go ahead and try that, and I will let you know what I find.

After looking at line 548 of the file, I do not see what is wrong with it. All the letters are uppercase and don't appear to have any oddities. Any other suggestions?

Hey @Fabs, did you manage to dodge this problem?

I have a similar issue, on a different line though:

There was a problem importing sh_refs_qiime_ver8_dynamic_02.02.2019_dev.fasta:

sh_refs_qiime_ver8_dynamic_02.02.2019_dev.fasta is not a(n) DNAFASTAFormat file:

Invalid characters on line 53104 (does not match IUPAC characters for a DNA sequence).

I don't see anything other than IUPAC characters on line 53104 (picture below):

I am also waiting for suggestions or similar experiences (or advice on how to get past this problem).


Hello @fabs! Apologies for the delayed reply

I attempted to replicate your issue by running the bash script you posted, and received the following error on a completely different line from you or @veeraku!


When I pop the file open in vim and go to line 67868, I notice there is a leading space.


This makes me wonder if perhaps you have some trailing whitespace characters on line 548 that are being picked up as non-DNA characters, and that's why your import is failing on that line without there being any visibly incorrect characters.
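One way to test the whitespace idea across the whole file, rather than one line at a time, is to list every sequence line that contains anything other than DNA letters. This is a rough sketch only: the letter set `ACGTRYKMSWBDHVN` is an assumption about what counts as an IUPAC DNA character, not a copy of QIIME 2's own validation, and `example.fasta` is a hypothetical sample file.

```shell
# Hypothetical sample: line 4 has a leading space, line 6 a trailing space.
workdir=$(mktemp -d)
printf '>seq1\nACGTN\n>seq2\n ACGT\n>seq3\nACGT \n' > "$workdir/example.fasta"

# Print the line number of every non-header line that is not made up solely
# of the assumed IUPAC DNA letters (empty lines are reported as well).
bad_lines=$(awk '!/^>/ && !/^[ACGTRYKMSWBDHVN]+$/ { print NR }' "$workdir/example.fasta")
printf '%s\n' "$bad_lines"
```

Any line number printed here is a candidate for the line reported in the import error, including lines whose only problem is invisible whitespace or lowercase letters.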

Additionally, when I run the qiime import command without running that awk command on the files, I can successfully import the original file,


and the leading space is not present in the original file.


This leads me to believe there is something in that awk command that is somehow inserting unwanted characters into the file. May I ask why you run that command? Your comment says it is to change the format of the files, but I'm not sure I understand the manner in which it is meant to change the format as the files are already uppercase.

Thank you, @Oddant1

Hi Oddant1

Thank you for your response. :slight_smile:

When I import the files as you suggested, I continue to get an error. I am working with the developer files, so I need to make sure that the files are all in uppercase format, so per a different post, I ran the awk command.

After rerunning the code, I did get the same line number error as you did. I went ahead and removed the leading spaces using sed 's/^[ \t]*//' -i and the file was able to import.

@veeraku (Hope this helps)
This is the code I ran, in case anyone needs it (I was not sure how to pipe it as one line, but it works)

#Create the uppercase file
awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' sh_refs_qiime_ver8_99_s_02.02.2019_dev.fasta > sh_refs_qiime_ver8_99_s_02.02.2019_dev_up.fasta

#Remove the leading white spaces
sed 's/^[ \t]*//' -i sh_refs_qiime_ver8_99_s_02.02.2019_dev_up.fasta > sh_refs_qiime_ver8_99_s_02.02.2019_dev_upperase.fasta
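For anyone wondering how to do both steps in one line: uppercasing and whitespace removal can be combined in a single awk pass. This is a minimal sketch on a hypothetical sample file (`example.fasta`), not a tested recipe for the UNITE release itself.

```shell
# Hypothetical sample with lowercase bases and leading whitespace.
workdir=$(mktemp -d)
printf '>seq1 k__Fungi\n acgtn\n>seq2\n\tACgt\n' > "$workdir/example.fasta"

# Header lines (starting with ">") are copied unchanged. Every other line
# has leading spaces and tabs stripped and is then converted to uppercase.
awk '/^>/ { print; next } { sub(/^[ \t]+/, ""); print toupper($0) }' \
    "$workdir/example.fasta" > "$workdir/example_clean.fasta"

cat "$workdir/example_clean.fasta"
```

Because awk writes to standard output, the redirect produces the new file and the input is left as it was.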

Hello again,

I just realized that my fix did not actually work, since a file is created, but the file is returned as empty. Can someone please help me fix the original problem?

Hello Again,

Please disregard, I have figured out the problem. I had to give the sed command only the line in question.

sed '67868 s/ //g' and this seems to have worked

Thanks for your code, @Fabs. But when I run sed '67868 s/ //g' -i its_correct_reference_sequences.fasta > its_correct_refer_sequences.fasta or sed '67868 s/^[ \t]*//g' -i its_correct_reference_sequences.fasta > its_correct_refer_sequences.fasta, I always get an empty file. Can you give me some suggestions?

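Remove the -i. It stands for “in place”, so sed edits the file itself and writes nothing to standard output, which is why the redirected output file comes out empty. Without -i, the edited text goes to standard output and the redirect produces the new file as intended.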
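The two working variants look roughly like this. It is a sketch under assumptions: it uses a hypothetical two-line sample file, edits line 2 rather than line 67868, and relies on GNU sed for the `-i` form (BSD/macOS sed expects a suffix argument after `-i`).

```shell
# Hypothetical sample; the sequence on line 2 has a leading space.
workdir=$(mktemp -d)
printf '>seq1\n ACGT\n' > "$workdir/in.fasta"

# Option 1: no -i. sed writes the edited text to standard output, so the
# redirect captures it in a new file and the input file is left untouched.
sed '2 s/^[[:space:]]*//' "$workdir/in.fasta" > "$workdir/out.fasta"

# Option 2: -i and no redirect. sed rewrites the named file itself and
# prints nothing, so there is nothing to redirect.
cp "$workdir/in.fasta" "$workdir/inplace.fasta"
sed -i '2 s/^[[:space:]]*//' "$workdir/inplace.fasta"
```

Either option on its own is fine; combining `-i` with `>` is what leaves the redirect target empty.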

Thanks very much, @Nicholas_Bokulich. It works. I found that the UNITE reference sequences file can be imported smoothly in qiime2-2018.11, but it runs into the “blank” problem in qiime2-2019.10. Is the new version stricter about the format of sequences?

Interesting, thanks for reporting. Some new format validation was added around 2019.10, so it is stricter. Please post the commands you are using to download, format, and import.

Thanks for your reply, @Nicholas_Bokulich. The UNITE reference sequences were downloaded and then the file was unzipped. I used the sh_refs_qiime_ver8_dynamics_s_02.02.2019_dev.fasta file. I used the following command to import the reference sequences in both QIIME 2 versions: qiime tools import --type 'FeatureData[Sequence]' --input-path sh_refs_qiime_ver8_dynamics_s_02.02.2019_dev.fasta --output-path its-refer.qza

Sorry for the delayed response, @whitewind123

I recently updated this tutorial; please give it a try. I was recently able to use it without issue (you may need to update some filepaths if you are using a different UNITE release):
