Mismatch between taxonomy and sequence data downloaded from NCBI using qiime rescript get-ncbi-data

Hello!
I am trying to download COI data from NCBI with this code:

qiime rescript get-ncbi-data \
 --p-query '(cytochrome c oxidase subunit I[gene] OR cytochrome oxidase subunit 1[gene] OR cytochrome oxidase subunit I[gene] OR COX1[gene] OR CO1[gene] OR COI[gene] OR COXI[gene] NOT environmental sample[Title] NOT environmental samples[Title] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title] NOT txid2[ORGN] NOT txid2157[ORGN] NOT txid10239[ORGN])' \
 --verbose --p-logging-level INFO \
 --p-n-jobs 8 \
 --o-sequences CO1_sequences.qza \
 --o-taxonomy CO1_taxonomy.qza

I then exported CO1_sequences.qza and CO1_taxonomy.qza using the commands below:

qiime tools export \
  --input-path  CO1_sequences.qza \
  --output-path exported-seq-co1

qiime tools export \
  --input-path  CO1_taxonomy.qza \
  --output-path exported-taxonomy-co1

Subsequently, I converted the .fasta file in the exported-seq-co1 folder to .tsv format and compared it with the taxonomy data in the exported-taxonomy-co1 folder. I discovered that not all sequence IDs (accessions) present in CO1_sequences are found in CO1_taxonomy, and I am uncertain about the reason for this discrepancy.
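A quick first check for this kind of discrepancy is whether the two exports contain the same number of records at all. The following is a minimal sketch, not part of RESCRIPt: `count_records` is a hypothetical helper name, and it assumes the exports are a FASTA file and a tab-separated taxonomy file with a single header row.

```shell
# Hypothetical helper (not a QIIME 2 / RESCRIPt command): count the records
# in an exported FASTA file and an exported taxonomy TSV.
# Usage: count_records <sequences.fasta> <taxonomy.tsv>
count_records() {
    local fasta="$1" taxonomy="$2"
    local n_seq n_tax

    # Every FASTA record starts with a '>' header line.
    n_seq="$(grep -c '^>' "$fasta")"
    # Assumes the TSV has exactly one header row; count the rows after it.
    n_tax="$(awk 'NR > 1 { n++ } END { print n + 0 }' "$taxonomy")"

    printf 'sequences\t%s\ntaxonomy\t%s\n' "$n_seq" "$n_tax"
}
```

If the two counts differ, one of the files is probably incomplete; if they match, the IDs themselves still need to be compared.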

Thanks for helping me!

Hi @Si_Vivy,

Thank you for providing your commands. Very helpful! :slight_smile:

This thread may help explain the issue:

-Mike

Hi @Si_Vivy,

I was able to successfully run your command without issue via QIIME 2 version 2025.4. That is, both files contained the same IDs.

I ran:

qiime rescript get-ncbi-data \
 --p-query '(cytochrome c oxidase subunit I[gene] OR cytochrome oxidase subunit 1[gene] OR cytochrome oxidase subunit I[gene] OR COX1[gene] OR CO1[gene] OR COI[gene] OR COXI[gene] NOT environmental sample[Title] NOT environmental samples[Title] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title] NOT txid2[ORGN] NOT txid2157[ORGN] NOT txid10239[ORGN])' \
 --verbose --p-logging-level INFO \
 --p-n-jobs 8 \
 --o-sequences CO1_sequences.qza \
 --o-taxonomy CO1_taxonomy.qza

I was then able to successfully run the following command, which should fail if the IDs in the two files do not match:

qiime rescript dereplicate \
    --i-sequences CO1_sequences.qza \
    --i-taxa CO1_taxonomy.qza \
    --p-threads 10 \
    --output-dir CO1_derep  \
    --verbose

To further confirm that the sequence and taxonomy files contained the same IDs, I ran the following:

qiime tools export \
    --input-path CO1_sequences.qza \
    --output-path CO1_sequences_export

qiime tools export \
    --input-path CO1_taxonomy.qza \
    --output-path CO1_taxonomy_export 

# grab the IDs from each file and save them
grep '^>' CO1_sequences_export/dna-sequences.fasta | sed 's/^>//' > seq_ids.txt
tail -n +2 CO1_taxonomy_export/taxonomy.tsv | cut -f 1 > tax_ids.txt

# run diff to see if there are any differences (the IDs should be in the same order):
diff seq_ids.txt tax_ids.txt

The diff command revealed no differences between the IDs in the FASTA and taxonomy files. My guess is that a network issue (or something similar) left your files incomplete, and/or you are running an older version of QIIME 2 / RESCRIPt. :man_shrugging:
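Because diff is sensitive to line order, an order-independent comparison can be useful when the two files might list their IDs differently. Below is a minimal sketch using sort and comm; `compare_ids` is a hypothetical helper name rather than anything provided by QIIME 2, and it assumes the taxonomy TSV has one header row with the ID in the first column.

```shell
# Hypothetical helper (not a QIIME 2 / RESCRIPt command): list the IDs that
# appear in only one of the two exported files, regardless of their order.
# Usage: compare_ids CO1_sequences_export/dna-sequences.fasta \
#                    CO1_taxonomy_export/taxonomy.tsv
compare_ids() {
    local fasta="$1" taxonomy="$2"
    local tmpdir
    tmpdir="$(mktemp -d)"

    # FASTA headers start with '>'; keep only the ID before any whitespace.
    grep '^>' "$fasta" | sed 's/^>//' | awk '{ print $1 }' \
        | LC_ALL=C sort -u > "$tmpdir/seq_ids.txt"
    # Skip the header row (assumed to be a single line), keep the first column.
    tail -n +2 "$taxonomy" | cut -f 1 \
        | LC_ALL=C sort -u > "$tmpdir/tax_ids.txt"

    # comm -23 prints lines unique to the first file, comm -13 to the second.
    LC_ALL=C comm -23 "$tmpdir/seq_ids.txt" "$tmpdir/tax_ids.txt" \
        | awk '{ print "only_in_sequences\t" $0 }'
    LC_ALL=C comm -13 "$tmpdir/seq_ids.txt" "$tmpdir/tax_ids.txt" \
        | awk '{ print "only_in_taxonomy\t" $0 }'

    rm -rf "$tmpdir"
}
```

Empty output means both files contain exactly the same set of IDs; any line printed names an ID that is missing from the other file.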

Hi Mike,

Thanks so much for your reply; the problem was solved when I used version 2025.4.

vivy
