RIM-DB classifier

Hi
I am struggling to train a classifier for RIM-DB (Rumen and Intestinal Methanogen-DB), because errors occur while importing the FASTA file and the taxonomy file.

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path RIM_DB_14_07.fasta \
  --output-path RIM_Db_14_07otus.qza

I based this command on a question in the QIIME 2 forum.

However, it does not work on my computer.
If any experts can spot a mistake here, please let me know.

Hi @You_y_Choi,

Let's break down these errors:

  1. The "not a DNAFASTAFormat file" error is telling you that the file does not conform to QIIME 2's DNAFASTAFormat. Essentially, the sequence data you import must contain only valid DNA (not RNA) IUPAC nucleotide characters, and they must be upper-case. Check that there are no special characters or lower-case characters in your DNA sequences. Other threads on this forum cover converting sequences to upper-case.

In short, you can use seqkit:

conda install -c bioconda seqkit
seqkit seq db.fasta --upper-case -w 0 > db-upper.fasta

or bioawk:

conda install -c bioconda bioawk
bioawk -c fastx '{print ">" $name;  print toupper($seq)}' db.fasta > db-upper.fasta
  2. For "ID on line 831 is a duplicate of another ID on line 829." and "Taxonomy format feature IDs must be unique.": in QIIME 2, all IDs must be unique. I would also make sure that there are no underscores (_) in the ID names. Some tools and code wrapped by QIIME 2 default to reading the text before the first _ as the ID and discard everything after it. Thus, we recommend that all IDs avoid underscores, to prevent misreading of data labels / IDs.
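
To see which IDs are actually duplicated before editing anything, a plain awk check along these lines may help. This is only a sketch: the function names `fasta_duplicate_ids` and `taxonomy_duplicate_ids` are made up for illustration, the FASTA ID is assumed to be the header text up to the first whitespace, and the taxonomy file is assumed to be tab-separated with the ID in the first column.

```shell
# List IDs that occur more than once in a FASTA file.
# Assumption: the ID is the header text after ">" up to the first whitespace.
fasta_duplicate_ids() {
  awk '/^>/ { id = substr($1, 2); if (++seen[id] == 2) print id }' "$1"
}

# List IDs that occur more than once in the first column of a
# tab-separated taxonomy file.
taxonomy_duplicate_ids() {
  awk -F '\t' '++seen[$1] == 2 { print $1 }' "$1"
}

# Example usage (file names are placeholders):
#   fasta_duplicate_ids db-upper.fasta
#   taxonomy_duplicate_ids taxonomy.tsv
```

Each duplicated ID is printed once, so an empty result suggests the IDs in that file are already unique.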

Thank you, Mike, for your advice.

The taxonomy file imported after I followed your suggestion.

However, the FASTA file still doesn't import, even after running the seqkit step.

(I have tried this in both the 2019.10 and 2021.11 versions of QIIME 2.)

Thanks

It seems there is a U base in the first FASTA sequence (and possibly in others). As the error states, only the characters 'ACGTRYKMSWBDHVN' are allowed.
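
To find every record with a disallowed character rather than only the first one reported, a small awk scan could look like this. It is a sketch, not part of QIIME 2: the function name `fasta_invalid_chars` is hypothetical, and the allowed set is simply the one quoted in the error message.

```shell
# Report sequence lines containing anything other than the allowed
# upper-case DNA IUPAC codes (ACGTRYKMSWBDHVN, as quoted in the error).
# Output columns (tab-separated): sequence ID, line number, offending characters.
fasta_invalid_chars() {
  awk '
    /^>/ { id = substr($1, 2); next }      # remember the current record ID
    {
      bad = $0
      gsub(/[ACGTRYKMSWBDHVN]/, "", bad)   # strip every allowed character
      if (bad != "") print id "\t" "line " NR "\t" bad
    }
  ' "$1"
}

# Example usage (file name is a placeholder):
#   fasta_invalid_chars db-upper.fasta
```

Lower-case bases, U, gaps, and stray carriage returns from Windows line endings would all show up in the third column.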

If you can open the FASTA file in a text editor such as gedit or Notepad++ (depending on your OS), finding and replacing all of the Us with Ts will probably solve it.

There is probably a better way to do this, because any capital Us in the FASTA headers will also be replaced.


To extend @Micro_Biologist's answer, you can run the following additional command via bioawk:

# Convert U to T (after running the previously mentioned commands to convert to upper-case)
bioawk -c fastx '{gsub("U","T",$seq); print ">" $name; print $seq}' db-upper.fasta > db-upper-dna.fasta

Or you can do both in one shot:

bioawk -c fastx '{print ">" $name; gsub(/[Uu]/,"T",$seq); print toupper($seq)}' db.fasta > db-upper-dna.fasta

Thank you, @Micro_Biologist and @SoilRotifer.

Following your suggestion, I changed "U" to "T" and that worked perfectly.

However, the same problem seems to persist. I also checked lines 829 and 831 of the final FASTA file (db-upper-dna.fasta).

Can the same sample ID with a different sequence cause this issue?

I downloaded this FASTA file from "RIM-DB: a taxonomic framework for community structure analysis of methanogenic archaea from the rumen and other intestinal environments" [PeerJ].

I appreciate your kind help.

Hi @Micro_Biologist, as I mentioned earlier in this thread, the IDs must be unique. If they are not, you'll have to modify them in both the sequence and taxonomy files. For example, you can append an incremented number to the end of each ID:

AE008384
AE008384.2
AE008384.3
...
AE010299
AE010299.2
AE010299.3
...
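
One way to apply that renaming without editing by hand is sketched below with plain awk. The function names `fasta_uniquify_ids` and `taxonomy_uniquify_ids` are hypothetical. The sketch assumes the taxonomy file is tab-separated with the ID in the first column, that repeated IDs appear in the same order in both files (so that the second copy in the FASTA corresponds to the second copy in the taxonomy), and that the new names such as AE008384.2 do not already exist in the files.

```shell
# Append ".2", ".3", ... to the second and later occurrences of an ID
# in a FASTA file. The ID is the header text after ">" up to the first
# whitespace. Note: on renamed headers, runs of whitespace in the
# description collapse to single spaces because awk rebuilds the line.
fasta_uniquify_ids() {
  awk '
    /^>/ {
      id = substr($1, 2)
      n = ++seen[id]
      if (n > 1) $1 = ">" id "." n
    }
    { print }
  ' "$1"
}

# Same renaming for the first column of a tab-separated taxonomy file.
taxonomy_uniquify_ids() {
  awk 'BEGIN { FS = OFS = "\t" }
       { n = ++seen[$1]; if (n > 1) $1 = $1 "." n; print }' "$1"
}

# Example usage (file names are placeholders):
#   fasta_uniquify_ids db-upper-dna.fasta > db-unique.fasta
#   taxonomy_uniquify_ids taxonomy.tsv > taxonomy-unique.tsv
```

After renaming, it is worth confirming that the set of IDs in the two output files still matches before importing them.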


Thank you very much for your help @SoilRotifer.

I successfully built the RIM-DB classifier. :grinning:


Yay! I'm glad it worked! :tada:
