Merging Custom Reference Sequences with rescript reference database

alexkrohn · December 11, 2024, 3:21pm

I'm combining reference sequences generated with Sanger sequencing with those downloaded by rescript, and I'm getting an error saying that one of the reference sequences is not present in the taxonomy.

Error:

'Identifier FISH013_12S_Luxilus_coccogenis_R was reported in taxonomic search results, but was not present in the reference taxonomy.'

When running this command:

qiime feature-classifier classify-consensus-blast \
--i-query r1-results/r1-rep-seqs-dada.qza \
--i-reference-taxonomy 12s-fish-refseq-sanger-filtered-merged-taxonomy.qza \
--i-reference-reads 12s-fish-refseq-sanger-filtered-merged-seqs.qza \
--o-classification r1-results/12s-r1-blasted-taxonomy.qza

12s-fish-refseq-sanger-filtered-merged-taxonomy.qza is the result of merging a headerless taxonomy (attached) with the taxonomy downloaded from NCBI using rescript.

12s-fish-refseq-sanger-filtered-merged-seqs.qza is the result of merging the Sanger FASTA (attached, yes I know it's poor quality sequencing...) with the reference sequences downloaded from NCBI using rescript.

grep reveals that FISH013_12S_Luxilus_coccogenis_R is present in both the taxonomy and the sequences:

unzip -c 12s-fish-refseq-sanger-filtered-merged-taxonomy.qza | grep 'FISH013_12S_Luxilus_coccogenis_R'

>FISH013_12S_Luxilus_coccogenis_R       k__Animalia;p__Chordata;c__Actinopterygii;o__Cypriniformes;f__Cyprinidae;g__Luxilus;s__coccogenis;

unzip -c 12s-fish-refseq-sanger-filtered-merged-seqs.qza | grep -A 1 'FISH013_12S_Luxilus_coccogenis_R'

>FISH013_12S_Luxilus_coccogenis_R
TAGGTAACTTTATTACATTTCGACAGGGGAGAGTGACGGGCGGTGTGTACGCGCCTCAGAGCCGGGTTCAAAAGGACACGCTGTTTCCTTTTTACTACTAAATCCTCCTTCAAGCACTATTTCATGTTGCATATCCGTAGTGTTCTATAATAGAAAATGTAGCCCATTTCTTCCCGCTCCGTACGCTACACCTCGACCTGACGTTCTGGGCTGTGCCCATTTTGCTTACTCTTATTACCTTCACAGGGTAAGCTGACGACGGCGGNATATAGGCAN

Comparing the IDs in the taxonomy file to the headers in the FASTA file using setdiff in R shows no differences either. Both import to QIIME without a problem.

Following suggestions from other searches, I've tried dos2unix, which did not help. The taxonomy and FASTA files were made on a Mac. I assume it's a formatting problem with tabs or returns, but I'm not sure where to start. FISH013_12S_Luxilus_coccogenis_R is in the middle of the files, so some sequences seem to match just fine...

I'm running QIIME 2 2022.2 on Ubuntu 24.04. (I'm using the older version of QIIME 2 because it was the last version that I could get running with my older chipset, Intel Xeon E7.)

12s_sanger_taxonomy.txt (13.5 KB)
12s_sanger_reference_seqs.txt (31.5 KB)

SoilRotifer · December 11, 2024, 5:47pm

Hi @alexkrohn,

The issue is that you forgot to remove the '>' characters from the taxonomy file.
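One way to strip them from the command line, as a minimal sketch. The two-line file below is a made-up stand-in for 12s_sanger_taxonomy.txt (the second ID is hypothetical), and I'm assuming the '>' only ever shows up as the first character of a line:

```shell
# Hypothetical stand-in for 12s_sanger_taxonomy.txt: one line with the stray '>'
# and one line that is already clean (EXAMPLE_ID_2 is made up).
workdir=$(mktemp -d)
printf '>FISH013_12S_Luxilus_coccogenis_R\tk__Animalia;p__Chordata;\n' > "$workdir/taxonomy.txt"
printf 'EXAMPLE_ID_2\tk__Animalia;p__Chordata;\n' >> "$workdir/taxonomy.txt"

# Delete a '>' only when it is the first character of a line.
sed 's/^>//' "$workdir/taxonomy.txt" > "$workdir/taxonomy_fixed.txt"

# Show the ID column after the fix.
cut -f1 "$workdir/taxonomy_fixed.txt"
```

On the real data the same sed expression would read 12s_sanger_taxonomy.txt and write 12s_sanger_taxonomy_fixed.txt. Lines without a leading '>' pass through unchanged, so it should be safe to run on a file that is only partly affected.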

I was able to successfully import like so:

qiime tools import \
    --input-format HeaderlessTSVTaxonomyFormat \
    --type 'FeatureData[Taxonomy]' \
    --input-path 12s_sanger_taxonomy_fixed.txt \
    --output-path 12s_sanger_taxonomy_fixed.qza

qiime tools import \
    --input-format DNAFASTAFormat \
    --type 'FeatureData[Sequence]' \
    --input-path 12s_sanger_reference_seqs.fasta \
    --output-path 12s_sanger_reference_seqs.qza

As I had no data to classify, I decided to make a Naïve Bayes classifier as a test:

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads 12s_sanger_reference_seqs.qza \
    --i-reference-taxonomy 12s_sanger_taxonomy_fixed.qza \
    --o-classifier 12S_nb-classifier.qza

It worked, which means that leaving the '>' in the taxonomy file was the issue: the IDs in the two files did not match. That is, the classifier was looking for FISH013_12S_Luxilus_coccogenis_R in the taxonomy, where the ID was actually stored as >FISH013_12S_Luxilus_coccogenis_R.
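For anyone who wants to check for this kind of ID mismatch before running a classifier, here is a rough sketch. The file contents and the names seqA and seqB are made up for illustration. The key assumption is that the '>' is stripped from the FASTA headers only, while the taxonomy IDs are compared exactly as written:

```shell
export LC_ALL=C   # keep sort and comm using the same byte-wise ordering
workdir=$(mktemp -d)

# Hypothetical stand-ins: seqA carries a stray '>' in the taxonomy file.
printf '>seqA\tk__Animalia;\nseqB\tk__Animalia;\n' > "$workdir/taxonomy.txt"
printf '>seqA\nACGT\n>seqB\nACGT\n' > "$workdir/seqs.fasta"

# Taxonomy IDs: first tab-separated column, left exactly as written.
cut -f1 "$workdir/taxonomy.txt" | sort > "$workdir/tax_ids.txt"

# FASTA IDs: header lines with the '>' removed, up to the first space.
grep '^>' "$workdir/seqs.fasta" | sed 's/^>//' | cut -d' ' -f1 | sort > "$workdir/seq_ids.txt"

# Column 1: IDs only in the taxonomy. Column 2 (tab-indented): IDs only in the FASTA.
comm -3 "$workdir/tax_ids.txt" "$workdir/seq_ids.txt"
```

If the two ID sets match, comm -3 prints nothing. Any line it does print is an ID that appears in only one of the two files.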

-Cheers!
