Background
Version of qiime2: qiime2-2023.2 installed with conda on an HPC.
I'm following the classifier training tutorial. Below is the code I'm using to import the reference sequences from HOMD into a .qza file. I've also attached the resulting .qza file: ref-seqs.qza (19.3 KB). I can't upload the FASTA file (HOMD_16S_rRNA_RefSeq_V15.22.p9), so here's a hyperlink to the HOMD website for download. (I used Version 15.22, starting at position 9.)
Problem
When I compare the sequences in the .qza file with the original .fasta from HOMD, many sequences are missing. Is there a formatting error that is causing this behavior?
QIIME 2 did not give me any errors when I imported my reference sequences, but the classifier trained with these imported sequences had poor resolution compared to a full-sequence pre-trained SILVA classifier. My results were also very different from those of a colleague who used DADA2 to assign taxonomy with the same database.
Example:
This sequence is present in the original RefSeq file:
81531021 | Methanobrevibacter oralis | HMT-815 | Strain: DSM 7256 | PROKKA: SEQF3102_01675 | Status: Named | Preferred Habitat: Oral | Genome: Genome_GB: LWMU01000001.1
But this sequence does not exist among the DNA sequences in my .qza file.
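For reference, the comparison I describe above can be sketched roughly as follows. This is only a minimal sketch, not part of my pipeline: the function names `fasta_ids` and `missing_ids` are hypothetical helpers, and it assumes the .qza has already been exported back to a plain FASTA file. It treats the first whitespace-delimited token after `>` as the sequence ID and strips carriage returns so that Windows line endings do not change the IDs.

```shell
#!/bin/bash
# Hypothetical helpers for comparing the sequence IDs of two FASTA files.

# Print the unique sequence IDs of a FASTA file, one per line, sorted.
# The ID is taken to be the first whitespace-delimited token after '>'.
# tr removes carriage returns so CRLF line endings do not alter the IDs.
fasta_ids() {
    tr -d '\r' < "$1" | awk '/^>/ { sub(/^>/, "", $1); print $1 }' | sort -u
}

# Print the IDs that occur in the first FASTA file but not in the second.
missing_ids() {
    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }
    fasta_ids "$1" > "$tmp_a"
    fasta_ids "$2" > "$tmp_b"
    # comm -23 keeps only the lines unique to the first (sorted) input.
    comm -23 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}
```

Assuming `qiime tools export --input-path ref-seqs.qza --output-path exported` writes the sequences to `exported/dna-sequences.fasta`, calling `missing_ids "$reFasta.fasta" exported/dna-sequences.fasta` would list the IDs that were lost on import, and `fasta_ids <file> | wc -l` gives the number of unique IDs in either file.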
Code
#!/bin/bash
# file: 00_train-classifier.sbatch
# purpose: train classifier based on database of choice
# input:
# assign parameters: reFasta, reTaxonomy, fPrimer, rPrimer
# $reFasta = path to database reference sequences
# $reTaxonomy = path to database taxonomic classifications
# ${REF_DATABASE} = eHOMD/silva
# output:
# classifier.qza (04_classify-filter)
# load parameters
dos2unix ./config.sh
source ./config.sh
# record time and method name, env vars
echo -e "$(date)"
echo -e "importing reference datasets..."
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path "${reFasta}.fasta" \
  --output-path "${reFasta}.qza"
If there is formatting that I need to change, is there a reproducible way to do this?