Error in importing reference data set its_12_11_otus.tar.gz

Hello, All,

I am a newbie to QIIME 2 and to programming, and I have a problem importing the reference data set its_12_11_otus.tar.gz.

When I tried to create the classifier file, I got the following error:

Plugin error from feature-classifier:

Invalid character in sequence: b'_'.
Valid characters: ['.', 'H', 'Y', 'K', 'M', 'D', 'A', 'N', 'S', 'T', 'V', 'B', '-', 'W', 'R', 'C', 'G']
Note: Use lowercase if your sequence contains lowercase characters not in the sequence's alphabet.

I even tried this command: tr 'acgt' 'ACGT' < 97_otus.fasta.uppercase.fa > 97_otus.fasta.uppercase.fa.qza. But I still got the same error.

The commands I used were:

qiime tools import --input-path 97_otus.fasta.uppercase.fa --output-path 97_otus.fasta.uppercase.fa.qza --type FeatureData[Sequence]

qiime tools import --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat --input-path 97_otu_taxonomy.txt --output-path 97_otu_taxonomy.txt.qza

qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads 97_otus.fasta.uppercase.fa.qza --i-reference-taxonomy 97_otu_taxonomy.txt.qza --o-classifier 97classifier.qza

Could anyone please give suggestions about how to rectify this error? Thanking you all in anticipation.

Welcome @Asha1!

Uh oh! Sounds like your reference sequences have a '_' character somewhere... my guess is there may even be a bunch of these, possibly used as a gap character. Remove these characters from the file and you should be okay (just make sure that you are not also removing these from the sequence IDs!)

You should figure out what these underscores represent before removing or replacing them.
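One way to see what you are dealing with before removing anything is to list every character on the sequence lines that is not in the valid alphabet from the error message. The following is only a minimal sketch, not part of QIIME 2: the function name list_invalid_chars is made up for illustration, and it assumes a plain (uncompressed) fasta file in which header lines start with '>'.

```shell
#!/usr/bin/env bash

# list_invalid_chars FASTA   (hypothetical helper, not a QIIME 2 command)
# Prints each character found on sequence lines that is outside the
# alphabet listed in the error message, together with how often it occurs.
list_invalid_chars() {
    # Drop header lines, delete every valid character and the newlines,
    # then put each leftover byte on its own line and count duplicates.
    # LC_ALL=C makes the tools work on raw bytes.
    grep -v '^>' "$1" \
        | LC_ALL=C tr -d 'ACGTRYKMSWBDHVN.\n-' \
        | LC_ALL=C fold -b -w 1 \
        | LC_ALL=C sort \
        | LC_ALL=C uniq -c \
        | cat -v   # show control characters in caret notation
}
```

For example, `list_invalid_chars 97_otus.fasta.uppercase.fa` prints one line per offending character with its count. Lowercase letters are listed too, since they are not in the alphabet above, and control characters show up in caret notation (a vertical tab appears as ^K).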

Good try with the tr command! But that will only fix issues with lowercase characters, not with underscores in the sequences.

Try this:

grep '>' 97_otus.fasta.uppercase.fa | grep '_' | wc -l

If that outputs "0", it indicates that your sequence IDs do not contain underscores. If so, you can proceed with the commands below.

grep -v '>' 97_otus.fasta.uppercase.fa | grep '_' | wc -l

That will tell you how many sequence lines contain underscores. If your sequence IDs contain underscores but only a small number of sequences do, you might be able to just search for the underscores and remove them manually.
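If you do end up searching manually, it helps to know exactly where the underscores are. Here is a rough sketch that prints the line number, the ID of the record, and the offending sequence line for every sequence line containing an underscore. The function name find_underscore_lines is hypothetical, and the sketch assumes headers start with '>'.

```shell
#!/usr/bin/env bash

# find_underscore_lines FASTA   (hypothetical helper)
# For every sequence line that contains '_', print:
#   line number <TAB> header of the record <TAB> the sequence line
find_underscore_lines() {
    awk '
        /^>/           { header = $0; next }   # remember the current record ID
        index($0, "_") { printf "%d\t%s\t%s\n", NR, header, $0 }
    ' "$1"
}
```

Running `find_underscore_lines 97_otus.fasta.uppercase.fa | head` shows the first few hits, which may also give a hint about what the underscores were meant to represent.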

If your sequence IDs do not contain underscores, you can just do the following to remove them all:

tr -d '_' < 97_otus.fasta.uppercase.fa > 97_otus.fasta.uppercase.clean.fasta

Or the following to replace them with something else:

tr '_' '-' < 97_otus.fasta.uppercase.fa > 97_otus.fasta.uppercase.clean.fasta
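If your sequence IDs do contain underscores, the tr commands would change the IDs as well. A possible alternative is to edit only the sequence lines and pass the header lines through untouched. This is just a sketch: the function name strip_underscores_from_sequences and the output file name are made up, and it assumes every header line starts with '>'.

```shell
#!/usr/bin/env bash

# strip_underscores_from_sequences FASTA   (hypothetical helper)
# Writes the fasta to stdout with '_' removed from sequence lines only;
# header lines (starting with '>') are printed unchanged.
strip_underscores_from_sequences() {
    awk '
        /^>/ { print; next }          # header: leave the ID as it is
             { gsub(/_/, ""); print } # sequence: delete every underscore
    ' "$1"
}
```

Usage would look like `strip_underscores_from_sequences 97_otus.fasta.uppercase.fa > 97_otus.fasta.uppercase.clean.fasta`. To replace the underscores with a gap character instead of deleting them, change the second argument of gsub from "" to "-".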

Let us know what you figure out!

Dear Dr. Nicholas Bokulich,
Following your suggestion, I ran your commands to fix the error. I am not getting a zero value, and the problem is still there.

$ grep '>' 97_otus.fasta.uppercase.fa | wc -l
55404
$ grep -v '>' 97_otus.fasta.uppercase.fa | wc -l
55404
$ tr -d '_' < 97_otus.fasta.uppercase.fa > 97_otus.fasta.uppercase.clean.fasta
$ grep '>' 97_otus.fasta.uppercase.clean.fasta | wc -l
55404
$ tr '_' '-' < 97_otus.fasta.uppercase.fa > 97_otus.fasta.uppercase.clean.fasta
$ grep '>' 97_otus.fasta.uppercase.clean.fasta | wc -l
55404

Oops, I wrote the commands above too early in the morning! I have edited them; please give the new grep commands above a try and let us know what you see.

Also see the edits I made to your post; notice the backticks that I added, which facilitate nicer display of code blocks.

Thanks!


Dear sir,

Even after using your corrected commands, the same error (plugin error from feature-classifier) came up again. I have attached a snapshot of the error for your reference.
Kindly let me know what I should do next.
My reference database: 97_otus.fasta.gz

@Asha1,
There are clearly many problems with the raw fasta data you are trying to import!

You are receiving the same error — invalid characters in the sequence — but for different characters. I see that you were able to fix the lowercase characters in your fasta. Now the issue is an invisible character \x0b in your file.

Perhaps you opened and modified this file at some point using MS Word or another word processor? That is most likely how these invisible characters were inserted, and if so I recommend downloading the raw data again and starting over, since these characters should not be present in a fasta file. If not, then you will need to search the file for these characters (and likely other invisible characters that do not belong in a fasta file) and remove them either manually or using bash.
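If you go the bash route, something along these lines might help. \x0b is the vertical tab character (octal 013). The sketch below counts the lines that contain any control character and writes a copy of the file with control characters removed, keeping only newlines and tabs. Both function names are hypothetical, and you should check the cleaned file before importing it.

```shell
#!/usr/bin/env bash

# count_control_char_lines FASTA   (hypothetical helper)
# Prints the number of lines containing at least one control character
# (vertical tab, carriage return, tab, ...).
# Note: grep exits with status 1 when the count is 0.
count_control_char_lines() {
    LC_ALL=C grep -c '[[:cntrl:]]' "$1"
}

# strip_control_chars FASTA   (hypothetical helper)
# Writes the file to stdout with control characters deleted.
# Kept: tab (octal 011) and newline (octal 012).
# Deleted: octal 000-010, 013-037 (includes \x0b and carriage return), 177.
strip_control_chars() {
    LC_ALL=C tr -d '\000-\010\013-\037\177' < "$1"
}
```

For example, `count_control_char_lines 97_otus.fasta` reports how many lines are affected, and `strip_control_chars 97_otus.fasta > 97_otus.clean.fasta` writes a cleaned copy (the output name is just an example). Because carriage returns are removed too, this also converts Windows line endings to Unix ones.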

All I can say at this point is — I wish you good luck in hunting these invalid characters!