Import RefSeq Fasta - Losing Sequences from HOMD database


Version of qiime2: qiime2-2023.2 installed with conda on an HPC.

I'm following the classifier training tutorial. Below is the code I'm using for importing the reference sequences from HOMD to a .qza file. I've also attached the resulting .qza file: ref-seqs.qza (19.3 KB). I can't upload the fasta file: HOMD_16S_rRNA_RefSeq_V15.22.p9 so here's a hyperlink to the HOMD website for download. (I used Version 15.22 starting at position 9).


When I compare the sequences in the .qza file and the original .fasta from HOMD, there are many sequences missing. Is there is a formatting error that is causing this behavior?

While Qiime2 did not give me any errors when I imported my reference sequences, the classifier trained with these imported sequences had poor resolution when compared to a full-sequence pre-trained SILVA classifier. I also had very different results compared to a colleague who used DADA2 to assign taxonomy using the same database.


This sequence is present in the original RefSeq file

81531021 | Methanobrevibacter oralis | HMT-815 | Strain: DSM 7256 | PROKKA: SEQF3102_01675 | Status: Named | Preferred Habitat: Oral | Genome: Genome_GB: LWMU01000001.1

But this sequence not exist in the DNA-sequences in my .qza file.


# file: 00_train-classifier.sbatch
# purpose: train classifier based on database of choice 
# input: 
	# assign parameters: reFasta, reTaxonomy, fPrimer, rPrimer
    # $reFasta = path to database reference sequences
    # $reTaxonomy = path to database taxonomic classifications
		# ${REF_DATABASE} = eHOMD/silva
# output: 
	# classifier.qza (04_classify-filter) 

# load parameters
dos2unix ./
source ./

# record time and method name, env vars
echo -e $(date)

echo -e "importing reference datasets..." 
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path $reFasta.fasta \
  --output-path $reFasta.qza

If there is formatting that I need to change, is there a reproducible way to do this?

Hi @klterwelp ,

You are not losing sequences during importing. I checked this myself just now and the sequence file imports just fine without loss of any sequences.

The file that you shared was imported and then segments were extracted using qiime feature-classifier extract-reads. This was not mentioned in the code block that you shared.

So the reads are being lost during read extraction because the primer only hits some of the sequences. This could be because the primers have low coverage to begin with, but two options to improve:

  1. use degenerate sequences to widen the search
  2. use the RESCRIPt plugin actions trim-alignment or extract-seq-segments to extract sequences for the region that you want instead. These actions allow somewhat more tolerance for trimming based on positions in longer alignments. You need to download this plugin separately, it is not bundled with QIIME 2. This can be particularly useful for partial alignments, e.g., many reads might not hit your forward primer (a 27F variant) because they are in fact not full-length 16S rRNA gene sequences but only partial genes.
  3. don't extract subsequences at all.

Good luck!


Hello @Nicholas_Bokulich! Thank you for your help. :smile: That's definitely the problem, I'm sorry for not adding the rest of my code!

I'll try using the RESCRIPt plugin and the trim-alignment / extract-seq-segments commands. I think these primers may have lower coverage so I predict I'll have better luck with those methods. I'm used to working with 16S V4 data, so I've been learning a lot with these V1-V3 primers.


Yeah... V1-V3 can be tricky and was actually one of the motivating use cases behind adding those actions to RESCRIPt!

Let us know if you have any questions. :grin:


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.