Hello!
I am trying to download COI data from NCBI with this code:
qiime rescript get-ncbi-data \
--p-query '(cytochrome c oxidase subunit I[gene] OR cytochrome oxidase subunit 1[gene] OR cytochrome oxidase subunit I[gene] OR COX1[gene] OR CO1[gene] OR COI[gene] OR COXI[gene] NOT environmental sample[Title] NOT environmental samples[Title] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title] NOT txid2[ORGN] NOT txid2157[ORGN] NOT txid10239[ORGN])' \
--verbose --p-logging-level INFO \
--p-n-jobs 8 \
--o-sequences CO1_sequences.qza \
--o-taxonomy CO1_taxonomy.qza
I then exported CO1_sequences.qza and CO1_taxonomy.qza using the code below:
Subsequently, I converted the .fasta file in the exported-seq-co1 folder to .tsv format and compared it with the taxonomy data in the exported-taxonomy-co1 folder. I discovered that not all sequence IDs (accessions) present in CO1_sequences are found in CO1_taxonomy, and I am uncertain about the reason for this discrepancy.
I was able to successfully run your command without issue via QIIME 2 version 2025.4. That is, both files contained the same IDs.
I ran:
qiime rescript get-ncbi-data \
--p-query '(cytochrome c oxidase subunit I[gene] OR cytochrome oxidase subunit 1[gene] OR cytochrome oxidase subunit I[gene] OR COX1[gene] OR CO1[gene] OR COI[gene] OR COXI[gene] NOT environmental sample[Title] NOT environmental samples[Title] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title] NOT txid2[ORGN] NOT txid2157[ORGN] NOT txid10239[ORGN])' \
--verbose --p-logging-level INFO \
--p-n-jobs 8 \
--o-sequences CO1_sequences.qza \
--o-taxonomy CO1_taxonomy.qza
Then I was able to successfully run the following command, which should fail if the IDs in the two files do not match:
To further confirm that the sequence and taxonomy files contained the same IDs, I ran the following:
qiime tools export \
--input-path CO1_sequences.qza \
--output-path CO1_sequences_export
qiime tools export \
--input-path CO1_taxonomy.qza \
--output-path CO1_taxonomy_export
# grab IDs and save to file
grep '^>' CO1_sequences_export/dna-sequences.fasta | sed 's/^>//' > seq_ids.txt
tail -n +2 CO1_taxonomy_export/taxonomy.tsv | cut -d $'\t' -f 1 > tax_ids.txt
# run diff to see if there are any differences (the IDs should be in the same order):
diff seq_ids.txt tax_ids.txt
The diff command revealed no differences between the IDs in the FASTA and taxonomy files. I am guessing that there was a network issue, or something else that caused your files to be incomplete, and/or that you are using an older version of QIIME 2 / RESCRIPt?
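Since diff only works cleanly when both ID lists are in the same order, an order-independent check may be more useful if your files do differ. The following is a minimal sketch, not part of QIIME 2 or RESCRIPt: the function name ids_only_in is made up for illustration, and it assumes the seq_ids.txt and tax_ids.txt files created above, with one ID per line.

```shell
# ids_only_in FIRST SECOND
# Hypothetical helper: print the IDs that occur in FIRST but not in SECOND,
# regardless of the order of the lines in either file.
ids_only_in() {
    tmp_a=$(mktemp)
    tmp_b=$(mktemp)
    # comm requires sorted input; sort -u also collapses duplicate IDs.
    # LC_ALL=C keeps the sort order and comm's expectations consistent.
    LC_ALL=C sort -u "$1" > "$tmp_a"
    LC_ALL=C sort -u "$2" > "$tmp_b"
    # -2 drops lines unique to SECOND, -3 drops lines common to both,
    # leaving only the lines unique to FIRST.
    LC_ALL=C comm -23 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}

# Example usage (assumes the ID files from the steps above exist):
#   ids_only_in seq_ids.txt tax_ids.txt   # sequences with no taxonomy entry
#   ids_only_in tax_ids.txt seq_ids.txt   # taxonomy entries with no sequence
```

If both calls print nothing, the two files contain the same set of IDs. Any accession that is printed is present in one file only, and looking those up individually may help narrow down whether the download was cut short.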