How to customise a properly formatted feature classifier?

Summary:
In short, I'm trying to build my own 16s feature classifier, similar to gg2 and silva. But when I tried to import my 16s sequence library, I was told that sequences cannot share the same id. This confuses me because many bacteria carry multiple copies of the 16s gene with different sequences (intragenomic variation).

The commands and feedback are as follows:

qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path 16s.fasta \
--output-path 16s.qza
 
There was a problem importing 16s.fasta:
16s.txt is not a(n) DNAFASTAFormat file:
ID on line 22 is a duplicate of another ID on line 1.
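
To see which ids trigger this error before importing, the headers can be counted directly. This is only a minimal sketch: the function name `list_duplicate_ids` is made up for illustration, and it assumes the id is the first whitespace-delimited token after ">".

```shell
# Print every FASTA id that occurs more than once, followed by its count.
# Assumption: the id is the first whitespace-delimited token after ">".
list_duplicate_ids() {
    awk '/^>/ { id = substr($1, 2); count[id]++ }
         END  { for (id in count) if (count[id] > 1) print id "\t" count[id] }' "$1" | sort
}

# Example: list_duplicate_ids 16s.fasta
```

An empty result would mean every header is already unique.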

My 16s.fasta file and the taxonomy file I'm going to use look roughly like this:

16s.fasta:
>398511
TTTATGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGACTGATGGGAG...
>398511
TTTATGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGACTGATGGGAG...
>511145
AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGA...
>511145
AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGA...

taxonomy:
398511	Bacteria;Bacillota;Bacilli;Bacillales;Bacillaceae;Alkalihalophilus;Alkalihalophilus pseudofirmus
511145	Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli

This information can be obtained in the following manner:

  1. Search ncbi with custom filters and export the list of genome RefSeq accession ids.
  2. Convert each accession id in the list into its ftp directory address.
  3. Download, in batch over ftp, the report.txt (which holds the taxid) and the *_rna_from_genomic.fna.gz file (which holds the 16s sequences).
  4. Unzip the *_rna_from_genomic.fna.gz file and extract the 16s sequences from it.
  5. Use the taxonkit tool to get the standard ncbi classification for each taxid, then compose the taxonomy.txt file containing the taxid and taxonomy information.
398511	Bacteria;Bacillota;Bacilli;Bacillales;Bacillaceae;Alkalihalophilus;Alkalihalophilus pseudofirmus
511145	Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
306537	Bacteria;Actinomycetota;Actinomycetes;Mycobacteriales;Corynebacteriaceae;Corynebacterium;Corynebacterium jeikeium
  6. Replace the string after the ">" in the 16s sequence file with the taxid, so that it matches the taxonomy information in the taxonomy.txt file. The result is the 16s.fasta file.
>511145
AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGA
>306537
TTTATGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAAAGGCTCCTT
>306537
TTTATGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAAAGGCTCCTT
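
Step 4 above (pulling only the 16s records out of an unzipped rna_from_genomic file) might be sketched as follows. The function name `extract_16s` is hypothetical, and the sketch assumes that each header carries a tag like `[product=16S ribosomal RNA]`; if your files label the product differently, the pattern needs adjusting.

```shell
# Keep only FASTA records whose header mentions a 16S rRNA product.
# Assumption: headers contain a tag such as "[product=16S ribosomal RNA]".
# Multi-line sequences are handled: every line after a matching header is
# kept until the next header decides again.
extract_16s() {
    awk '/^>/ { keep = ($0 ~ /product=16S ribosomal RNA/) } keep' "$1"
}

# Example, after unzipping one download to rna.fna (hypothetical file name):
#   extract_16s rna.fna > 16s_only.fna
```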

When I tried to feed these two files into the qiime2 workflow, the error mentioned at the beginning occurred, i.e., duplicate ids are not allowed.

My question is, if multiple 16s rRNA sequences are not allowed for one species (id), how does the feature classifier recognise different 16s rRNA sequences from the same species? If it is allowed to have multiple 16s rRNA sequences for one species (id), then where did my workflow go wrong?

Since I'm stuck on the first step, I also can't verify whether my taxonomy.txt is a file that qiime2 will accept. If there is an error somewhere in what I'm doing, I would be very grateful for some guidance from the friends on the forum. Thanks a lot.

Each sequence must be given a unique ID.

So for example in your case you have multiple sequences derived from the same accession. You could name these like:

306537.1
306537.2
...

Or any unique ID is fine, but presumably you want to be able to link these back to the original accession ID.

Your taxonomy would then also need to have the exact same accession IDs (and likewise requires unique IDs). So, e.g.,

306537.1	Bacteria;Actinomycetota;Actinomycetes;Mycobacteriales;Corynebacteriaceae;Corynebacterium;Corynebacterium jeikeium
306537.2	Bacteria;Actinomycetota;Actinomycetes;Mycobacteriales;Corynebacteriaceae;Corynebacterium;Corynebacterium jeikeium
...
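
If there are many sequences, the renaming can be scripted. The following is only a sketch of one way to do it, not an official QIIME 2 tool: the function names `uniquify_fasta` and `expand_taxonomy` are invented here, and it assumes the taxonomy file is tab-separated with the original ID in the first column.

```shell
# Append a running copy number to each FASTA ID:
#   >306537, >306537  ->  >306537.1, >306537.2
# Anything after the first whitespace in the header is dropped.
uniquify_fasta() {
    awk '/^>/ { id = substr($1, 2); n[id]++; print ">" id "." n[id]; next }
         { print }' "$1"
}

# Write one taxonomy row per renamed ID by looking up the original ID.
#   $1 = renamed FASTA, $2 = original taxonomy (ID <TAB> lineage)
expand_taxonomy() {
    awk -F '\t' '
        NR == FNR { tax[$1] = $2; next }           # first file: taxonomy table
        /^>/ {
            new = substr($1, 2)                    # e.g. 306537.2
            old = new; sub(/\.[0-9]+$/, "", old)   # e.g. 306537
            if (old in tax) print new "\t" tax[old]
        }' "$2" "$1"
}

# Example:
#   uniquify_fasta 16s.fasta > 16s.unique.fasta
#   expand_taxonomy 16s.unique.fasta taxonomy.txt > taxonomy.unique.txt
```

Renamed IDs with no match in the taxonomy file are silently skipped, so it is worth comparing the line counts of the two outputs. Since the taxonomy file has no header row, it would typically be imported as `FeatureData[Taxonomy]` with `--input-format HeaderlessTSVTaxonomyFormat`.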

Good luck!

Thank you! I followed your method and successfully solved this problem!
