I think the mismatch of the files occurred prior to the dereplication step. Can you provide the commands you ran prior to this? For example, I noticed that your taxonomy file is named tax_noSH_derep.qza. What commands were used to generate this file?
I followed the tutorial here first: How to train a UNITE classifier using RESCRIPt
qiime rescript get-unite-data \
--p-version 9.0 \
--p-taxon-group eukaryotes \
--p-cluster-id dynamic \
--p-singletons \
--output-dir unite_rescript \
--verbose &> get_unite_data_verbose.log & disown
###removing sequences with unhelpful taxonomy
qiime taxa filter-seqs \
--p-exclude Fungi_sp,mycota_sp,mycetes_sp \
--i-taxonomy taxonomy.qza \
--i-sequences sequences.qza \
--o-filtered-sequences sequences_filtered.qza
###removing the specific accessions as annotated within UNITE
###(tax_noSH.qza is the outputted taxonomy file)
qiime rescript edit-taxonomy \
--i-taxonomy taxonomy.qza \
--o-edited-taxonomy tax_noSH.qza \
--p-search-strings ';sh__.*' \
--p-replacement-strings '' \
--p-use-regex
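To illustrate what that search/replacement pair is meant to do, here is a small sketch that applies the same pattern with sed to the taxonomy string quoted later in this post. It is only an approximation of the edit-taxonomy step (sed stands in for RESCRIPt here), and the variable names are hypothetical.

```shell
# Example UNITE taxonomy string taken from this thread.
taxon='k__Eukaryota_kgd_Incertae_sedis;p__Eukaryota_phy_Incertae_sedis;c__Eukaryota_cls_Incertae_sedis;o__Eukaryota_ord_Incertae_sedis;f__Eukaryota_fam_Incertae_sedis;g__Eukaryota_gen_Incertae_sedis;s__Eukaryota_sp;sh__SH1089862.09FU'

# Drop ';sh__' and everything after it, mirroring the search string
# ';sh__.*' and the empty replacement string used above.
trimmed=$(printf '%s\n' "$taxon" | sed 's/;sh__.*$//')

# Print the taxonomy string with the species-hypothesis rank removed.
printf '%s\n' "$trimmed"
```

The feature ID itself is untouched by this kind of edit; only the taxonomy string changes.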
And then I started the curation using your tutorial starting with the first dereplication step:
qiime rescript dereplicate \
--i-sequences sequences_filtered.qza \
--i-taxa tax_noSH.qza \
--p-mode 'uniq' \
--p-threads 8 \
--o-dereplicated-sequences sequences_filtered_derep.qza \
--o-dereplicated-taxa tax_noSH_derep.qza
Also, I looked at the two files, and you are correct that SH1089862.09FU_UDB0271834_reps is missing from the taxonomy file. I went in manually and deleted the FASTA sequence because it matched the unhelpful taxonomy k__Eukaryota_kgd_Incertae_sedis;p__Eukaryota_phy_Incertae_sedis;c__Eukaryota_cls_Incertae_sedis;o__Eukaryota_ord_Incertae_sedis;f__Eukaryota_fam_Incertae_sedis;g__Eukaryota_gen_Incertae_sedis;s__Eukaryota_sp;sh__SH1089862.09FU
I then ran it again, only to encounter the same problem with another sequence.
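Since deleting one sequence at a time only surfaces the next mismatch, it may be quicker to list every unmatched ID in one pass. The following is a minimal sketch, not part of RESCRIPt: it assumes both artifacts were first exported with qiime tools export (which writes dna-sequences.fasta and taxonomy.tsv), and missing_from_taxonomy, exported_seqs, and exported_tax are hypothetical names chosen for illustration.

```shell
# Assumed preparation (requires QIIME 2; output directory names are hypothetical):
#   qiime tools export --input-path sequences_filtered.qza --output-path exported_seqs
#   qiime tools export --input-path tax_noSH.qza --output-path exported_tax

# Print every sequence ID found in a FASTA file but absent from a taxonomy TSV.
# Usage: missing_from_taxonomy dna-sequences.fasta taxonomy.tsv
missing_from_taxonomy() {
    fasta_ids=$(mktemp)
    tax_ids=$(mktemp)

    # Sequence IDs: header lines without the leading '>' or any description.
    grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//' \
        | LC_ALL=C sort -u > "$fasta_ids"

    # Taxonomy IDs: first tab-separated column, skipping the header row.
    tail -n +2 "$2" | cut -f 1 | LC_ALL=C sort -u > "$tax_ids"

    # comm -23 keeps lines that occur only in the first sorted file.
    LC_ALL=C comm -23 "$fasta_ids" "$tax_ids"

    rm -f "$fasta_ids" "$tax_ids"
}

# Example call once the exports exist:
#   missing_from_taxonomy exported_seqs/dna-sequences.fasta exported_tax/taxonomy.tsv
```

An empty result would suggest that every sequence has a taxonomy entry; swapping the two comm inputs (or using comm -13) would show the reverse case, taxonomy entries with no matching sequence.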