Importing Taxonomy into Qiime2

Hi there!

I can’t seem to find consensus on this, and it seems that the format may have changed in different versions. I’m running Qiime2 amplicon v2024.10.

I did some Sanger sequencing of underrepresented sequences for an eDNA metabarcoding project and want to include those in my reference database. I can import the sequences + taxonomy correctly, filter, dereplicate, etc. the sequences + taxonomy, but when I get all the way down to building a classifier, it tells me that the taxonomic levels are uneven. I think this is because of how I imported the Sanger sequencing taxonomies.

The exported taxonomy from the GenBank data downloaded with rescript looks like this:

Feature ID      Taxon
PX238419.1      k__Metazoa; p__Chordata; c__Mammalia; o__Primates; f__Hominidae; g__Homo; s__sapiens
PQ448997.1      k__Metazoa; p__Chordata; c__Aves; o__Anseriformes; f__Anatidae; g__Anser; s__erythropus

My data either looks like this:

Feature ID      Taxon   Unnamed Column 1        Unnamed Column 2        Unnamed Column 3        Unnamed Column 4        Unnamed Column 5        Unnamed Column 6        Unnamed Column 7
FISH027_12S_Alosa_aestivalis    k__Animalia;    p__Chordata;    c__Actinopterygii;      o__Clupeiformes;        f__Alosidae;    g__Alosa;       s__aestivalis;  
FISH002_12S_Campostoma_anomalum k__Animalia;    p__Chordata;    c__Actinopterygii;      o__Cypriniformes;       f__Cyprinidae;  g__Campostoma;  s__anomalum;

Or this:

Feature ID      Taxon  
FISH027_12S_Alosa_aestivalis    k__Animalia;p__Chordata;c__Actinopterygii;      o__Clupeiformes;f__Alosidae;g__Alosa;s__aestivalis;  
FISH002_12S_Campostoma_anomalum k__Animalia;p__Chordata;c__Actinopterygii;o__Cypriniformes;f__Cyprinidae;g__Campostoma;s__anomalum;

I get the former when the Taxon column is space- or tab-delimited, and the latter when there are no spaces after ; in my input taxonomy. Here is the raw, headerless taxonomy that I use for importing (with no spaces):

FISH027_12S_Alosa_aestivalis    k__Animalia;p__Chordata;c__Actinopterygii;o__Clupeiformes;f__Alosidae;g__Alosa;s__aestivalis;
FISH002_12S_Campostoma_anomalum k__Animalia;p__Chordata;c__Actinopterygii;o__Cypriniformes;f__Cyprinidae;g__Campostoma;s__anomalum;

And the command I use to import it:

qiime tools import --input-path 12s_sanger_taxonomy_consenus.txt \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--output-path 12s_sanger_taxonomy.qza

What is the correct way to import taxonomy so it can be added to a classifier with the rescript data?

(I will gladly follow this up with a feature request to give warnings when you merge taxonomies with differing levels!)

Hi @alexkrohn,

I think your approach with the headerless taxonomy is fine. I can explain the behavior you are seeing in the former case:

The intention of the format is to have a single column for the taxonomy and support additional metadata. These additional columns might be indicated on the semantic type with a % Properties('my_column') where my_column is somehow interesting and useful to a tool or plugin.

But the primary data of FeatureData[Taxonomy] lives in those first two named columns of Feature ID and Taxon which is the "specification" of the format so to speak.

So when you have tab-separated columns, it sees the jagged rows as additional columns. That ends up looking sensible enough, but it won't be interpreted correctly, because only the Taxon column will be used. (The space character doing this as well is a bit odd, so I would want to reproduce it before claiming much.)
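
To illustrate the jagged-row effect described above, here is a minimal sketch in plain Python. It only splits two hypothetical, shortened rows on tabs; it is not QIIME 2's actual parser, and the feature ID and taxonomy strings are made up for the example.

```python
# Hypothetical rows: the first keeps the whole taxonomy in one field,
# the second has a tab after each semicolon.
single_field = "FISH027\tk__Animalia;p__Chordata;c__Actinopterygii"
tab_separated = "FISH027\tk__Animalia;\tp__Chordata;\tc__Actinopterygii;"


def split_fields(line):
    """Split one TSV line into its tab-delimited fields."""
    return line.rstrip("\n").split("\t")


fields_ok = split_fields(single_field)
fields_jagged = split_fields(tab_separated)

# Two fields: a feature ID and a single taxonomy string.
print(len(fields_ok), fields_ok[1])
# Four fields: only the second one would land in a "Taxon" column.
print(len(fields_jagged), fields_jagged[1])
```

In the second row, everything after the first rank ends up in extra columns, which matches the "Unnamed Column" headers in the example output earlier in the thread.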

Sorry, I lost your actual question for all of the exciting examples you posted.

Could you post the specific error (and traceback ideally)? I'm trying to find such a phrase in q2-feature-classifier but am coming up short.

Hi @evangeL !

Thanks for the reply. What is the correct format for the second column in a headerless TSV? If there are no spaces between levels (after ;), the Taxon is imported as a single string; if there is a space or tab, it’s imported as multiple columns.

I think this is enough to reproduce the error. Using the data found in this folder, and the code below, you should get the error below. I get this error regardless of whether the Taxon column in the taxonomy txt is tab delimited, space delimited or with nothing after ;.

(FWIW, this is not my usual pipeline, which involves filtering, dereplicating, extracting sequences with my primer, etc. But this does reproduce the error without all those steps.)

# Download NCBI 12S data from all vertebrates
qiime rescript get-ncbi-data --p-query "(12S OR 12S s-rRNA) AND 1:20000[SLEN] AND txid7742[orgn]" \
 --o-sequences 12s-unfiltered-seqs.qza \
 --o-taxonomy 12s-taxonomy-unfiltered.qza \
 --p-n-jobs 10

# Import manual Sanger sequencing data
qiime tools import --input-path 12s_sanger_fishes_consenus.fa --type 'FeatureData[Sequence]' --output-path 12s_sanger_reference_seqs.qza

qiime tools import --input-path 12s_sanger_taxonomy_consensus_no_Rs.txt --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat --output-path 12s_sanger_taxonomy.qza


# Merge seqs and taxa with downloaded NCBI data
qiime feature-table merge-seqs --i-data 12s_sanger_reference_seqs.qza 12s-unfiltered-seqs.qza \
 --o-merged-data 12s-fish-refseq-and-sanger.qza

qiime feature-table merge-taxa --i-data 12s_sanger_taxonomy.qza 12s-taxonomy-unfiltered.qza \
 --o-merged-data 12s-fish-refseq-sanger-filtered-merged-taxonomy.qza


# Make a classifier
qiime rescript evaluate-fit-classifier --i-sequences 12s-fish-refseq-and-sanger.qza \
 --o-classifier test \
 --i-taxonomy 12s-fish-refseq-sanger-filtered-merged-taxonomy.qza \
 --o-evaluation test_eval \
 --o-observed-taxonomy test_obs_tax

Error:

Traceback (most recent call last):
  File "/home/tangled/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/q2cli/commands.py", line 530, in __call__
    results = self._execute_action(
  File "/home/tangled/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/q2cli/commands.py", line 608, in _execute_action
    results = action(**arguments)
  File "<decorator-gen-807>", line 2, in evaluate_fit_classifier
  File "/home/tangled/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/qiime2/sdk/action.py", line 299, in bound_callable
    outputs = self._callable_executor_(
  File "/home/tangled/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/qiime2/sdk/action.py", line 651, in _callable_executor_
    outputs = self._callable(ctx, **view_args)
  File "/home/tangled/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/rescript/cross_validate.py", line 33, in evaluate_fit_classifier
    taxa, seq_ids = _validate_cross_validate_inputs(taxonomy, sequences)
  File "/home/tangled/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/rescript/cross_validate.py", line 138, in _validate_cross_validate_inputs
    _validate_even_rank_taxonomy(taxa)
  File "/home/tangled/miniconda3/envs/qiime2-amplicon-2024.10/lib/python3.10/site-packages/rescript/cross_validate.py", line 329, in _validate_even_rank_taxonomy
    raise ValueError('Taxonomic label depth is uneven. All taxonomies '
ValueError: Taxonomic label depth is uneven. All taxonomies must have the same number of semicolon-delimited ranks. The following features are too short: FISH002_12S_Campostoma_anomalum, FISH002_12S_Campostoma_anomalum_R, FISH003_12S_Cottus_sp., FISH003_12S_Cottus_sp._R, FISH005_12S_Ichthyomyzon_greeleyi, FISH005_12S_Ichthyomyzon_greeleyi_R, FISH008_12S_Clinostomus_Smoky, FISH008_12S_Clinostomus_Smoky_R, FISH009_12S_Hypentelium_nigricans, FISH009_12S_Hypentelium_nigricans_R, FISH012_12S_Notropis_telescopus, FISH012_12S_Notropis_telescopus_R, FISH013_12S_Luxilus_coccogenis, FISH013_12S_Luxilus_coccogenis_R, FISH014_12S_Notropis_leuciodus, FISH014_12S_Notropis_leuciodus_R, FISH015_12S_Percina_squamata, FISH015_12S_Percina_squamata_R, FISH016_12S_Salvelinus_fontinalis, FISH016_12S_Salvelinus_fontinalis_R, FISH019_12S_Oncorhynchus_mykiss, FISH019_12S_Oncorhynchus_mykiss_R, FISH020_12S_Etheostoma_gutselli, FISH020_12S_Etheostoma_gutselli_R, FISH022_12S_Notropis_spectrunculus, FISH022_12S_Notropis_spectrunculus_R, FISH024_12S_Salmo_trutta, FISH024_12S_Salmo_trutta_R, FISH027_12S_Alosa_aestivalis, FISH027_12S_Alosa_aestivalis_R, FISH031_12S_Notropis_photogenis, FISH031_12S_Notropis_photogenis_R, FISH033_12S_Moxostoma_duquesnei, FISH033_12S_Moxostoma_duquesnei_R, FISH034_12S_Etheostoma_zonale, FISH034_12S_Etheostoma_zonale_R, FISH035_12S_Phenacobius_crassilabrum, FISH035_12S_Phenacobius_crassilabrum_R, FISH036_12S_Etheostoma_vulneratum, FISH036_12S_Etheostoma_vulneratum_R, FISH039_12S_Percina_aurantiaca, FISH039_12S_Percina_aurantiaca_R, FISH040_12S_Micropterus_dolomieu, FISH040_12S_Micropterus_dolomieu_R, FISH042_12S_Ictalurus_punctatus, FISH042_12S_Ictalurus_punctatus_R, FISH044_12S_Etheostoma_chlorobranchium, FISH044_12S_Etheostoma_chlorobranchium_R, FISH046_12S_Nocomis_micropogon, FISH046_12S_Nocomis_micropogon_R, FISH049_12S_Catostomus_commersonii, FISH049_12S_Catostomus_commersonii_R, FISH050_12S_Percina_evides, FISH050_12S_Percina_evides_R, 
FISH051_12S_Rhinichthys_cataractae, FISH051_12S_Rhinichthys_cataractae_R, FISH052_12S_Lepomis_macrochirus, FISH052_12S_Lepomis_macrochirus_R, FISH053_12S_Perca_flavescens, FISH053_12S_Perca_flavescens_R, FISH054_12S_Micropterus_nigricans, FISH054_12S_Micropterus_nigricans_R, FISH055_12S_Pylodictis_olivaris, FISH055_12S_Pylodictis_olivaris_R, FISH056_12S_Erimyzon_oblongus, FISH056_12S_Erimyzon_oblongus_R, FISH059_12S_Cyprinella_monacha, FISH059_12S_Cyprinella_monacha_R, FISH061_12S_Noturus_baileyi, FISH061_12S_Noturus_baileyi_R, FISH066_12S_Noturus_flavipinnis, FISH066_12S_Noturus_flavipinnis_R, FISH070_12S_Noturus_flavus, FISH070_12S_Noturus_flavus_R, FISH073_12S_Percina_caprodes, FISH073_12S_Percina_caprodes_R, FISH075_12S_Noturus_eleutherus, FISH075_12S_Noturus_eleutherus_R, FISH078_12S_Hybopsis_amblops, FISH078_12S_Hybopsis_amblops_R, FISH083_12S_Notropis_rubricroceus, FISH083_12S_Notropis_rubricroceus_R, FISH087_12S_Notemigonus_crysoleucas, FISH087_12S_Notemigonus_crysoleucas_R, FISH091_12S_Lepomis_cyanellus, FISH091_12S_Lepomis_cyanellus_R, FISH093_12S_Etheostoma_blennioides, FISH093_12S_Etheostoma_blennioides_R, FISH097_12S_Lepomis_auritus, FISH097_12S_Lepomis_auritus_R, FISH099_12S_Etheostoma_rufilineatum, FISH099_12S_Etheostoma_rufilineatum_R, FISH103_12S_Micropterus_punctulatus, FISH103_12S_Micropterus_punctulatus_R, FISH105_12S_Lepomis_gulosus, FISH105_12S_Lepomis_gulosus_R, FISH106_12S_Rhinichthys_obtusus, FISH106_12S_Rhinichthys_obtusus_R, FISH108_12S_Cyprinella_galactura, FISH108_12S_Cyprinella_galactura_R, FISH113_12S_Percina_burtoni, FISH113_12S_Percina_burtoni_R, SFRH_12S_Moxostoma_Sicklefin, SFRH_12S_Moxostoma_Sicklefin_R

Hey @alexkrohn ,

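It looks like the taxonomy that you are importing has an extra semicolon at the end of each row, e.g.,

FISH027_12S_Alosa_aestivalis    k__Animalia;p__Chordata;c__Actinopterygii;o__Clupeiformes;f__Alosidae;g__Alosa;s__aestivalis;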

This is creating an empty 8th level after importing. Compare closely to the NCBI taxonomy generated by RESCRIPt and you will see that this hanging punctuation is not there.

The Taxon column should be a single semicolon-delimited string.
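
One way to clean a headerless taxonomy file before import is sketched below. This is a minimal illustration rather than anything QIIME 2 or RESCRIPt provides: `normalize_taxon` and `normalize_line` are hypothetical helper names, and joining ranks with "; " is an assumption made to match the RESCRIPt-exported example earlier in the thread.

```python
def normalize_taxon(taxon):
    """Return a '; '-joined taxonomy string with no empty ranks.

    Note: this drops every empty rank, including the one created by a
    trailing semicolon.
    """
    ranks = [rank.strip() for rank in taxon.split(";")]
    ranks = [rank for rank in ranks if rank]
    return "; ".join(ranks)


def normalize_line(line):
    """Normalize one headerless 'feature-id<TAB>taxonomy' line.

    Any extra tab-separated fields are folded back into the taxonomy.
    """
    feature_id, *rest = line.rstrip("\n").split("\t")
    return feature_id.strip() + "\t" + normalize_taxon(";".join(rest))


raw = (
    "FISH027_12S_Alosa_aestivalis\t"
    "k__Animalia;p__Chordata;c__Actinopterygii;o__Clupeiformes;"
    "f__Alosidae;g__Alosa;s__aestivalis;"
)
cleaned = normalize_line(raw)
print(cleaned)
```

Applying `normalize_line` to every line of the taxonomy file and writing the result to a new file should give a two-column TSV with seven ranks per row, which can then be imported with the same `qiime tools import` command as before.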

Let us know if that does the trick!

:smiling_face_with_tear:

After all that, a trailing semicolon was my downfall! That totally fixed it. Thanks @Nicholas_Bokulich !

I would happily submit a feature request that imported taxonomies with one level trigger a warning, or that merging taxonomies with differing levels triggers a warning. Should I add that to the Developer Discussion category, or is that too much work for an edge case?

I am not sure — it has crossed our minds before but we never made even levels a requirement for the format because maybe there are use cases when an uneven taxonomy would be desired? This is why we ultimately decided to catch this with warnings in plugins that require it, rather than enforcing this in the format itself (so that it would be caught during import or when an artifact is generated). It’s certainly not too much work, it has always just been a philosophical question of whether taxonomies must be even-ranked… my gut tells me no.

On the other hand, we could make this a property so that some actions explicitly require an evenly ranked taxonomy, others do not. Catching this with a warning is much less work in the end I think…
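
As a rough idea of what such a warning could look like, here is a minimal sketch. It is not RESCRIPt's or QIIME 2's implementation: `warn_if_uneven` is a hypothetical helper, the taxonomy is assumed to be a plain dict of feature ID to taxonomy string, and the "expected" depth is simply assumed to be the most common one.

```python
import warnings
from collections import Counter


def rank_depth(taxon):
    """Count semicolon-delimited ranks; a trailing ';' adds an empty rank."""
    return len(taxon.split(";"))


def warn_if_uneven(taxonomy):
    """Warn about features whose rank depth differs from the most common depth.

    taxonomy: dict mapping feature ID -> taxonomy string (an assumption).
    Returns the sorted list of feature IDs with an unexpected depth.
    """
    depths = {fid: rank_depth(taxon) for fid, taxon in taxonomy.items()}
    if not depths:
        return []
    expected, _ = Counter(depths.values()).most_common(1)[0]
    odd = sorted(fid for fid, depth in depths.items() if depth != expected)
    if odd:
        warnings.warn(
            f"{len(odd)} feature(s) do not have {expected} ranks: "
            + ", ".join(odd)
        )
    return odd


merged = {
    "PX238419.1": "k__Metazoa; p__Chordata; c__Mammalia; o__Primates; "
                  "f__Hominidae; g__Homo; s__sapiens",
    "PQ448997.1": "k__Metazoa; p__Chordata; c__Aves; o__Anseriformes; "
                  "f__Anatidae; g__Anser; s__erythropus",
    "FISH027_12S_Alosa_aestivalis": "k__Animalia;p__Chordata;c__Actinopterygii;"
                                    "o__Clupeiformes;f__Alosidae;g__Alosa;"
                                    "s__aestivalis;",
}
odd = warn_if_uneven(merged)
print(odd)
```

A warning like this leaves uneven taxonomies usable for actions that do not care about rank depth, while still flagging the problem at merge or import time.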

Glad we could solve your issue! Happy QIIME 2-ing!

Yes please!

This is exactly the kind of 'small' problem that causes a lot of 'small' issues for a large number of people!
