Issue with Taxonomy Assignment Using Kraken and Custom Database (BOLD, Nemabiome) in QIIME2

SingeunOh · February 7, 2025, 4:02am

Hello everyone,

I’m currently using Kraken2 for taxonomy assignment of shotgun metagenomic sequences within QIIME2. However, I’ve encountered strange results when comparing Kraken’s assignments to those obtained from NCBI BLAST.

I’m using databases that of BOLD (Barcode of Life Database) and Nemabiome. The problem maybe appear to stem from the fact that these databases use different accession number formats and taxonomic ID systems compared to the standard NCBI format.

This is just guessing from the anwer of GPT

Could this discrepancy in accession numbers and taxon IDs be causing incorrect taxonomy assignments in Kraken 2? If so, what would be the best way to resolve this issue and ensure accurate taxonomy assignment in QIIME2 using my custom database?

Or other cause of the problem ?

Any advice or recommendations on how to properly format my custom database for Kraken 2 in QIIME2 would be greatly appreciated!

ebolyen · February 7, 2025, 8:20pm

Hi @SingeunOh,

Could you provide an example of an unusual assignment?

It should be the case that if you use a different reference with different accession schemes to create the database that the Kraken2 report will use those IDs in its output, and the resulting QIIME 2 artifacts will have those ids as the feature-ids. But I haven't personally tested Kraken with non NCBI data, so it is plausible there are hiccups there.

But generally speaking, QIIME 2 doesn't assume that the IDs will be NCBI accessions.

@misialq, have you tried a non-NCBI database with Kraken? I know we've discussed this problem in a hypothetical sense before...

misialq · February 10, 2025, 12:55pm

Hey @SingeunOh, @ebolyen,

Unfortunately, I cannot say I have tried anything else than the NCBI ones... Perhaps an example could help us understand what's going on a bit better, as @ebolyen already pointed out...

SingeunOh · February 12, 2025, 3:53am

Here is a summary of the results:

From the BOLD database, the taxonomic assignment looks as follows:

BOLDSYSTEMS - using this db

Using this, the result is as
100.00 4905 0 R 1 root
99.70 4890 0 R1 131567 cellular organisms
45.71 2242 0 D 2759 Eukaryota
45.63 2238 0 D1 33154 Opisthokonta
45.61 2237 0 K 33208 Metazoa
45.61 2237 0 K1 6072 Eumetazoa
45.61 2237 0 K2 33213 Bilateria
43.41 2129 0 K3 33317 Protostomia
0.12 6 0 K4 2697495 Spiralia
0.12 6 0 K5 1206795 Lophotrochozoa
0.12 6 0 P 6157 Platyhelminthes
0.10 5 0 C 6199 Cestoda
0.10 5 0 C1 6200 Eucestoda
0.10 5 0 O 56708 Tetraphyllidea
0.10 5 0 F 2835625 Rhoptrobothriidae
0.10 5 0 G 2835626 Myzophyllobothrium
0.10 5 0 G1 2835628 Unclassified Myzophyllobothrium
0.10 5 5 S 2835629 Myzophyllobothrium sp. 'cf. nagasawai'

The last result is Myzophyllobothrium sp. 'cf. nagasawai'.

This result corresponds to 5 sequences as follows:

C SRR16510773.5626088 2835629 52 131567:2 2835629:2 0:14
@SRR16510773.5626088 5626088 length=52
CTCTACGCATTTCACCGCTACACCTGGAATTCTACCTTCCTCTCACATACTC
+SRR16510773.5626088 5626088 length=52
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

C SRR16510773.5944974 2835629 54 0:15 2835629:2 131567:3
@SRR16510773.5944974 5944974 length=54
AGAGTATGTGAGAGGAAGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGA
+SRR16510773.5944974 5944974 length=54
AAAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

C SRR16510773.6212720 2835629 53 0:12 2835629:2 131567:2 A:3
@SRR16510773.6212720 6212720 length=53
GTATGTGAGAGGAAGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATC
+SRR16510773.6212720 6212720 length=53
AAAAAEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEAEEEEEE/EE

C SRR16510773.6568304 2835629 66 0:28 2835629:2 131567:2
@SRR16510773.6568304 6568304 length=66
AAACTGTAGAGCTAGAGTATGTGAGAGGAAGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAG
+SRR16510773.6568304 6568304 length=66
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

The sequence from the BOLD database that matches the above sequences is: >2835629:emu_db:1|kraken:taxid|2835629
ATTAATACTCTAGGATANTGGACGTTACTCNCANAATAAGCACCGGCTAACTCTGTGCCA
GCAGCCGCGGTAATACAGAGGGTGCGAGCGTTANTCGGATTTACTGGGCGTAAAGCGTGC
GTAGGCGGCTTNTTAAGTCGGATGTGAAATCCCTGAGCTTAACTTAGGAATNGCATTCGA
TACTGGGAAGCTAGAGTNTGGGAGAGGANGGTAGAATTCCAGGTGTAGCGGTGAAATGCG
TAGAGAT

However, when I run a BLAST search for these 5 sequences on NCBI, the highest-ranking result is g-proteobacteria, rather than the expected taxon.

I believe this discrepancy might stem from the differences between the BOLD database accession numbers and the NCBI accession numbers. I manually aligned the 5 sequences with the BOLD database, and I found that the identity between them is about 94%, which seems to be correct.

The issue appears to be that Kraken2 is assigning a different taxon compared to what is reported in the BOLD database.

For context, I used the qiime moshpit build-kraken-db command to generate the Kraken2 database for QIIME2. However, I did not use the kraken2-build command from the Kraken2 Python package to transform the database. I suspect that this is the reason for the incorrect taxonomic assignment.

Could you please confirm if the issue could be due to the way I built the Kraken2 database in QIIME2? Is there anything I might be missing in the database preparation process?

I appreciate any guidance or suggestions you can provide!

Best regards,

ebolyen · February 14, 2025, 4:22pm

Hi @SingeunOh,

I think I can confirm that qiime mosphit build-kraken-db appears to fetch NCBI taxonomy:

github.com/bokulich-lab/q2-annotate

q2_annotate/kraken2/database.py

main


      
          def _fetch_taxonomy(db_dir: str, threads: int, use_ftp: bool):
              cmd = [
                  "kraken2-build", "--download-taxonomy",
                  "--threads", str(threads), "--db", str(db_dir),
              ]
              cmd.append("--use-ftp") if use_ftp else False
              try:
                  run_command(cmd=cmd, verbose=True)
              except subprocess.CalledProcessError as e:
                  raise Exception(
                      "An error was encountered while downloading taxonomy, "
                      f"(return code {e.returncode}), please inspect "
                      "stdout and stderr to learn more."
                  )

@misialq, I think we need to work out a way to provide taxonomy to custom DB's.

To get moving again, it should be possible to construct the kraken2 database outside of QIIME 2 and then import it (as you would sequence data or a feature-table).

The command will look something like this:

qiime tools import --type 'Kraken2DB' \
   --input-path <your directory> \
   --input-format 'Kraken2DBDirectoryFormat' \
   --output-path <where you want it>

And <your directory> should contain files named:

hash.k2d
opts.k2d
taxo.k2d

Hopefully that's helpful.