KeyError with RESCRIPt dereplicate and UNITE database

emmlemore · January 23, 2024, 2:44pm

I think the mismatch of the files occurred prior to the dereplication step. Can you provide the commands you ran prior to this? For example, I noticed that your taxonomy file is named tax_noSH_derep.qza. What commands were used to generate this file?

I followed the tutorial here first: How to train a UNITE classifier using RESCRIPt

qiime rescript get-unite-data \
    --p-version 9.0 \
    --p-taxon-group eukaryotes \
    --p-cluster-id dynamic \
    --p-singletons \
    --output-dir unite_rescript \
    --verbose &> get_unite_data_verbose.log & disown

###removing sequences with unhelpful taxonomy
qiime taxa filter-seqs \
    --p-exclude Fungi_sp,mycota_sp,mycetes_sp \
    --i-taxonomy taxonomy.qza \
    --i-sequences sequences.qza \
    --o-filtered-sequences sequences_filtered.qza

###removing the specific accessions as annotated within UNITE (the outputted taxonomy file is tax_noSH.qza)
qiime rescript edit-taxonomy \
    --i-taxonomy taxonomy.qza \
    --o-edited-taxonomy tax_noSH.qza \
    --p-search-strings ';sh__.*' \
    --p-replacement-strings '' \
    --p-use-regex
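
To show what that search string is meant to do to a single taxonomy label, here is a minimal sketch. It uses sed as a stand-in for the regex replacement, so it only illustrates the pattern ';sh__.*' and is not how RESCRIPt runs it internally. The function name strip_sh is hypothetical.

```shell
# Illustration only: sed stands in for the regex replacement requested from edit-taxonomy.
# strip_sh is a hypothetical helper name, not part of QIIME 2 or RESCRIPt.
strip_sh() {
    # Remove ';sh__' and everything after it from one taxonomy string.
    printf '%s\n' "$1" | sed 's/;sh__.*//'
}

# Example with the tail end of a UNITE label:
strip_sh 'g__Eukaryota_gen_Incertae_sedis;s__Eukaryota_sp;sh__SH1089862.09FU'
# prints: g__Eukaryota_gen_Incertae_sedis;s__Eukaryota_sp
```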

And then I started the curation using your tutorial starting with the first dereplication step:

qiime rescript dereplicate \
    --i-sequences sequences_filtered.qza \
    --i-taxa tax_noSH.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences sequences_filtered_derep.qza \
    --o-dereplicated-taxa tax_noSH_derep.qza

I also looked at the two files, and you are correct that SH1089862.09FU_UDB0271834_reps is missing from the taxonomy file. I manually deleted that sequence from the FASTA file because it matched unhelpful taxonomy:

k__Eukaryota_kgd_Incertae_sedis;p__Eukaryota_phy_Incertae_sedis;c__Eukaryota_cls_Incertae_sedis;o__Eukaryota_ord_Incertae_sedis;f__Eukaryota_fam_Incertae_sedis;g__Eukaryota_gen_Incertae_sedis;s__Eukaryota_sp;sh__SH1089862.09FU

I then ran it again, only to encounter the same problem with another sequence.
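
Rather than deleting sequences one at a time, it may be quicker to list every mismatched ID in one pass. Below is a rough sketch that assumes both artifacts have been exported with qiime tools export, giving a dna-sequences.fasta and a taxonomy.tsv whose first column is the feature ID under a header row. The function name missing_from_taxonomy and the export directory names are hypothetical.

```shell
# Hypothetical helper: print IDs found in a FASTA file but absent from a taxonomy TSV.
# Assumes the TSV has one header row and the feature ID in its first column.
missing_from_taxonomy() {
    fasta="$1"
    tsv="$2"
    # First pass (TSV): remember every feature ID, skipping the header row.
    # Second pass (FASTA): for each header line, strip '>' and report unseen IDs.
    awk 'NR == FNR { if (FNR > 1) seen[$1] = 1; next }
         /^>/ { id = substr($1, 2); if (!(id in seen)) print id }' "$tsv" "$fasta"
}

# Possible usage (directory names are placeholders):
# qiime tools export --input-path sequences_filtered.qza --output-path exported_seqs
# qiime tools export --input-path tax_noSH.qza --output-path exported_tax
# missing_from_taxonomy exported_seqs/dna-sequences.fasta exported_tax/taxonomy.tsv
```

If this prints nothing, every sequence ID has a taxonomy entry. Any IDs it does print are candidates for the same error from dereplicate.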