Training-feature-classifier 2019.10: Error

Fabs · December 31, 2019, 4:02am

Hello,

I have a quick question regarding training-feature-classifier. I had previously trained a classifier using qiime2-2019.4, but since the new update came out, I figured I might as well train a new classifier. I am doing all the work on the cluster, so it should work, but I am getting an error stating:

There was a problem importing sh_refs_qiime_ver8_dynamic_s_02.02.2019_dev.fasta:

  sh_refs_qiime_ver8_dynamic_s_02.02.2019_dev.fasta is not a(n) DNAFASTAFormat file:

  Invalid characters on line 538 (does not match IUPAC characters for a DNA sequence).

This is the code that I ran. I had used it before, but for some reason it is not working now.

If you could please let me know how to fix this problem, I would really appreciate it.

timanix · December 31, 2019, 7:36am

Hi! It looks like you should open your fasta file and manually check which character on line 538 causes the error and does not match the IUPAC characters.

Fabs · January 1, 2020, 10:26pm

Okay, I will go ahead and try that, and I will let you know what I find.

Fabs · January 1, 2020, 10:34pm

After looking at line 548 of the file, I do not see anything wrong with it. All the letters are uppercase and there do not appear to be any oddities. Any other suggestions?

veeraku · January 3, 2020, 10:12am

Hey @Fabs, did you manage to dodge this problem?

I have a similar issue, on a different line though:

There was a problem importing sh_refs_qiime_ver8_dynamic_02.02.2019_dev.fasta:

sh_refs_qiime_ver8_dynamic_02.02.2019_dev.fasta is not a(n) DNAFASTAFormat file:

Invalid characters on line 53104 (does not match IUPAC characters for a DNA sequence).

I don't see anything other than IUPAC characters on line 53104 (picture below):

I am also waiting for suggestions or similar experiences (or how to get past this problem)?

Cheers!

Oddant1 · January 4, 2020, 5:51am

Hello @fabs! Apologies for the delayed reply.

I attempted to replicate your issue by running the bash script you posted, and received the following error on a completely different line from you or @veeraku!

When I pop the file open in vim and go to line 67868, I notice there is a leading space.

This makes me wonder if perhaps you have some trailing whitespace characters on line 548 that are being picked up as non DNA characters, and that's why your import is failing on that line without there being any visibly incorrect characters?
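
One way to test this whitespace hypothesis is to make the invisible characters visible. The following is a minimal sketch, assuming a POSIX shell with sed and awk. The file demo_refs.fasta and line number 4 are hypothetical stand-ins for your own reference file and the line named in the error message. The uppercase IUPAC character set used in the awk pattern is an assumption about what the validator accepts; it is not taken from the QIIME 2 source.

```shell
# Hypothetical demo file; substitute your own FASTA path and line number.
fasta=demo_refs.fasta
printf '>seq1\nACGT\n>seq2\n ACGT\n' > "$fasta"

# Print line 4 in "unambiguous" form: a '$' marks the end of the line and
# tabs appear as \t, so leading or trailing blanks become visible.
sed -n '4l' "$fasta"

# List the line numbers of all sequence lines (not headers) that contain any
# character outside the assumed uppercase IUPAC DNA set.
bad_lines=$(awk '!/^>/ && /[^ACGTURYKMSWBDHVN]/ {print NR}' "$fasta")
echo "$bad_lines"
```

Listing every offending line at once can save some time, since the importer appears to report only one line per run.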

Additionally, when I run the qiime import command without running that awk command on the files, I can successfully import the original file,

and the leading space is not present in the original file.

This leads me to believe there is something in that awk command that is somehow inserting unwanted characters into the file. May I ask why you run that command? Your comment says it is to change the format of the files, but I'm not sure I understand the manner in which it is meant to change the format as the files are already uppercase.

Thank you, @Oddant1

Fabs · January 4, 2020, 11:27am

Hi Oddant1

Thank you for your response.

When I import the files as you suggested, I continue to get an error. I am working with the developer files, so I need to make sure that the files are all in uppercase format, so per a different post, I ran the awk command.

After rerunning the code, I did get the same line number error as you did. I went ahead and removed the leading spaces using sed "s/^[ \t]*//" -i and the file was able to import.

@veeraku (Hope this helps)
This is the code I ran, in case anyone needs it (I was not sure how to pipe it as one line, but it works)

#Create the uppercase file
awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' sh_refs_qiime_ver8_99_s_02.02.2019_dev.fasta > sh_refs_qiime_ver8_99_s_02.02.2019_dev_up.fasta

#Remove the leading white spaces
sed "s/^[ \t]*//" -i sh_refs_qiime_ver8_99_s_02.02.2019_dev_up.fasta > sh_refs_qiime_ver8_99_s_02.02.2019_dev_upperase.fasta
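
For anyone wondering how the uppercase step and the whitespace step might be combined into one pass, here is a minimal sketch, assuming a POSIX awk. The file names demo_dev.fasta and demo_dev_upper.fasta are hypothetical placeholders, and the demo input is made up for illustration. Header lines are printed untouched; every other line has spaces, tabs, and carriage returns removed and is then uppercased.

```shell
# Hypothetical demo input with lowercase bases and stray whitespace.
in=demo_dev.fasta
out=demo_dev_upper.fasta
printf '>seq1 k__Fungi\nacgtn\n>seq2 k__Fungi\n \tACgt \n' > "$in"

# Headers pass through unchanged; sequence lines lose blanks, tabs, and
# carriage returns anywhere on the line, then are converted to uppercase.
awk '/^>/ {print; next} {gsub(/[ \t\r]/, ""); print toupper($0)}' "$in" > "$out"

cat "$out"
```

Because the result is written to a new file through a plain redirect, the original download stays untouched and can be compared against the cleaned copy.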

Fabs · January 6, 2020, 7:30pm

Hello again,

I just realized that my fix did not actually work, since a file is created, but the file is returned as empty. Can someone please help me fix the original problem?

Fabs · January 6, 2020, 8:38pm

Hello Again,

Please disregard, I have figured out the problem. I had to give the sed command only the line in question.

sed '67868 s/ //g' and this seems to have worked

whitewind123 · January 12, 2020, 3:51pm

Thanks for your code @Fabs. But when I run sed '67868 s/ //g' -i its_correct_reference_sequences.fasta > its_correct_refer_sequences.fasta or sed '67868 s/^[ \t]*//g' -i its_correct_reference_sequences.fasta > its_correct_refer_sequences.fasta, I always get an empty file. Can you give me some suggestions?

Nicholas_Bokulich · January 13, 2020, 2:35am

Remove the -i. It stands for "in place": sed edits the input file directly and writes nothing to standard output, so the redirected output file ends up empty. I think removing it should fix this (produce an output as intended).
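
To illustrate the difference, here is a minimal sketch, assuming GNU sed (the -i form without a suffix is GNU-specific). All demo_*.fasta names are hypothetical. It shows the two working variants, then reproduces the empty-file symptom on a copy.

```shell
# Hypothetical demo input with a leading space on the sequence line.
src=demo_in.fasta
printf '>seq1\n ACGT\n' > "$src"

# Variant 1: no -i. sed writes the cleaned text to stdout, which is
# redirected into a new file; the input file is left unchanged.
sed 's/^[[:space:]]*//' "$src" > demo_clean.fasta

# Variant 2: -i with no redirect. sed rewrites the named file itself.
cp "$src" demo_inplace.fasta
sed -i 's/^[[:space:]]*//' demo_inplace.fasta

# The problematic mix: -i plus a redirect. The copy is fixed in place,
# but nothing reaches stdout, so demo_empty.fasta is created empty.
cp "$src" demo_copy.fasta
sed -i 's/^[[:space:]]*//' demo_copy.fasta > demo_empty.fasta
```

In the mixed case the cleaned sequences are not lost; they are in the file that was edited in place, not in the redirect target.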

whitewind123 · January 13, 2020, 8:44am

Thanks very much @Nicholas_Bokulich. It works. I found that the UNITE reference sequences file imports smoothly in qiime2-2018.11, but it runs into the "blank" problem in qiime2-2019.10. Is the new version stricter about the format of sequences?

Nicholas_Bokulich · January 13, 2020, 3:32pm

Interesting, thanks for reporting. Some new format validation was added around 2019.10, so it is stricter. Please post the commands you are using to download, format, and import.

whitewind123 · January 14, 2020, 4:45am

Thanks for your reply @Nicholas_Bokulich. The UNITE reference sequences were downloaded from the PlutoF Biodiversity Platform and then the file was unzipped. I used the sh_refs_qiime_ver8_dynamics_s_02.02.2019_dev.fasta file. I used the following command to import the reference sequences in both qiime versions: qiime tools import --type 'FeatureData[Sequence]' --input-path sh_refs_qiime_ver8_dynamics_s_02.02.2019_dev.fasta --output-path its-refer.qza

Nicholas_Bokulich · January 28, 2020, 12:05am

Sorry for the delayed response, @whitewind123

I recently updated this tutorial. Please give it a try; I was recently able to use it without issue (you may need to update some filepaths if you are using a different UNITE release):